update ocean, sea ice every 1 day #1219

Open
wants to merge 4 commits into
base: main
34 changes: 23 additions & 11 deletions .buildkite/pipeline.yml
@@ -67,16 +67,6 @@ steps:
- group: "Unit Tests"
steps:

- label: "MPI Checkpointer unit tests"
key: "checkpointer_mpi_tests"
command: "srun julia --color=yes --project=test/ test/mpi_tests/checkpointer_mpi_tests.jl"
timeout_in_minutes: 20
env:
CLIMACOMMS_CONTEXT: "MPI"
agents:
slurm_ntasks: 2
slurm_mem: 16GB

- label: "MPI Utilities unit tests"
key: "utilities_mpi_tests"
command: "srun julia --color=yes --project=test/ test/utilities_tests.jl"
@@ -97,6 +87,7 @@ steps:
agents:
slurm_ntasks: 1
slurm_gres: "gpu:1"
slurm_mem: 24GB

- group: "GPU: experiments/ClimaEarth/ unit tests and global bucket"
steps:
@@ -109,6 +100,27 @@ steps:
slurm_gres: "gpu:1"
slurm_mem: 20GB

- group: "ClimaEarth test"
steps:
- label: "ClimaEarth test"
key: "restarts"
command: "julia --color=yes --project=experiments/ClimaEarth/ experiments/ClimaEarth/test/runtests.jl"
agents:
slurm_mem: 16GB

- label: "MPI restarts"
key: "mpi_restarts"
command: "srun julia --color=yes --project=experiments/ClimaEarth/ experiments/ClimaEarth/test/restart.jl"
env:
CLIMACOMMS_CONTEXT: "MPI"
timeout_in_minutes: 40
soft_fail:
- exit_status: -1
- exit_status: 255
agents:
slurm_ntasks: 2
slurm_mem: 32G

- group: "Integration Tests"
steps:
# SLABPLANET EXPERIMENTS
@@ -218,7 +230,7 @@ steps:
CLIMACOMMS_CONTEXT: "MPI"
agents:
slurm_ntasks: 4
- slurm_mem_per_cpu: 8GB
+ slurm_mem_per_cpu: 12GB

# short high-res performance test
- label: "Unthreaded AMIP FINE" # also reported by longruns with a flame graph
31 changes: 31 additions & 0 deletions NEWS.md
@@ -6,6 +6,23 @@ ClimaCoupler.jl Release Notes

### ClimaCoupler features

#### Shared component `dt` can be overwritten for individual components
Previously, the user had to either specify a shared `dt` for all
component models or specify a timestep for every component model
(`dt_atmos`, `dt_ocean`, `dt_seaice`, `dt_land`). If fewer than four
model-specific timesteps were provided, they were discarded and
`dt` was used uniformly instead. After this PR, any model-specific
timesteps that are provided are used for those models, and the shared
`dt` is used for every model that does not have a more specific timestep.
This makes choosing timesteps simpler and lets us set specific `dt`s
only for the models we're interested in.
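
A minimal sketch of the fallback rule described above, assuming the
configuration is available as a dictionary. The helper
`resolve_component_dts` is hypothetical and is not part of `ClimaCoupler`;
only the option names (`dt`, `dt_atmos`, `dt_ocean`, `dt_seaice`, `dt_land`)
come from this note.

```julia
# Hypothetical helper (not a ClimaCoupler function): illustrates the rule that
# a model-specific timestep wins and the shared `dt` fills in the rest.
function resolve_component_dts(config::AbstractDict)
    shared_dt = config["dt"]
    component_names = ("atmos", "ocean", "seaice", "land")
    # Look up `dt_<name>` and fall back to the shared `dt` when it is absent
    return Dict(name => get(config, "dt_$name", shared_dt) for name in component_names)
end

# Only the ocean gets its own timestep; every other model uses the shared one
config = Dict("dt" => "150secs", "dt_ocean" => "300secs")
dts = resolve_component_dts(config)
```

With this input, the ocean entry keeps its own value and the remaining three
models inherit the shared `dt`.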

This PR also changes the prescribed ocean and sea ice simulations
to update the stored SST/SIC once per simulated day. The monthly
input data is now interpolated to daily values instead of
to every timestep.
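
To illustrate what a daily update schedule means in practice, the sketch
below counts how often a stored field would be refreshed when the refresh is
tied to the calendar day rather than to the timestep. The function
`count_daily_updates` is hypothetical and only uses the `Dates` standard
library; it is not how `ClimaCoupler` implements the schedule.

```julia
import Dates

# Hypothetical helper: counts refreshes when the stored field is updated only
# on the first timestep that falls on a new calendar day.
function count_daily_updates(start_date::Dates.DateTime, dt::Dates.Period, nsteps::Integer)
    last_update_day = nothing
    updates = 0
    for step in 0:(nsteps - 1)
        today = Dates.Date(start_date + step * dt)
        if today != last_update_day
            # A new day has started: this is where SST/SIC would be refreshed
            updates += 1
            last_update_day = today
        end
    end
    return updates
end

# 576 steps of 300 seconds span two simulated days
n_updates = count_daily_updates(Dates.DateTime(2000, 1, 1), Dates.Second(300), 576)
```

In this example the field is refreshed twice over 576 timesteps, rather than
576 times.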

#### Add default `get_field` methods for surface models PR[#1210](https://github.com/CliMA/ClimaCoupler.jl/pull/1210)
Add default methods for `get_field` methods that are commonly
not extended for surface models. These return reasonable default
@@ -21,6 +38,20 @@ TOA radiation and net precipitation are added only if conservation is enabled.
The coupler fields are also now stored as a ClimaCore Field of NamedTuples,
rather than as a NamedTuple of ClimaCore Fields.

#### Restart simulations with JLD2 files PR[#1179](https://github.com/CliMA/ClimaCoupler.jl/pull/1179)

`ClimaCoupler` can now use `JLD2` files to save the state and cache of its model
components, allowing it to restart from saved checkpoints. Some restrictions
apply:

- The number of MPI processes has to remain the same across checkpoints
- Restart files are generally not portable across machines, Julia versions, and package versions
- Adding or changing component models will probably require adding or changing code

Please refer to the
[documentation](https://clima.github.io/ClimaCoupler.jl/dev/checkpointer/) for
more information.

#### Remove extra `get_field` functions PR[#1203](https://github.com/CliMA/ClimaCoupler.jl/pull/1203)
Removes the `get_field` functions for `air_density` for all models, which
were unused except for the `BucketSimulation` method, which is replaced by a
4 changes: 3 additions & 1 deletion Project.toml
@@ -8,6 +8,7 @@ ClimaComms = "3a4d1b5c-c61d-41fd-a00a-5873ba7a1b0d"
ClimaCore = "d414da3d-4745-48bb-8d80-42e94e092884"
ClimaUtilities = "b3f4f4ca-9299-4f7f-bd9b-81e1242a7513"
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
JLD2 = "033835bb-8acc-5ee8-8aae-3f567f8a3819"
Logging = "56ddb016-857b-54e1-b83d-db4d58db5568"
SciMLBase = "0bca4576-84f4-4d90-8ffe-ffa030f20462"
StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
@@ -16,9 +17,10 @@ Thermodynamics = "b60c26fb-14c3-4610-9d3e-2d17fe7ff00c"

[compat]
ClimaComms = "0.6.2"
- ClimaCore = "0.14.23"
+ ClimaCore = "0.14.25"
ClimaUtilities = "0.1.22"
Dates = "1"
JLD2 = "0.5.11"
Logging = "1"
SciMLBase = "2.11"
StaticArrays = "1.6"
2 changes: 1 addition & 1 deletion config/ci_configs/amip_component_dts.yml
@@ -3,7 +3,7 @@ co2: "maunaloa"
dt_atmos: "150secs"
dt_cpl: "150secs"
dt_land: "50secs"
- dt_ocean: "30secs"
+ dt_ocean: "300secs"
dt_rad: "1hours"
dt_save_to_sol: "1days"
dt_seaice: "37.5secs"
131 changes: 128 additions & 3 deletions docs/src/checkpointer.md
@@ -1,12 +1,137 @@
# Checkpointer

This module contains general functions for logging the model states and restarting simulations. The `Checkpointer` uses `ClimaCore.InputOutput` infrastructure, which allows it to handle arbitrarily distributed logging and restart setups.
## How to save and restart from checkpoints

`ClimaCoupler` supports saving and reading simulation checkpoints. This is
useful to split a long simulation into smaller, more manageable chunks.

Checkpoints are a mix of HDF5 and JLD2 files and are typically saved in a
`checkpoints` folder in the simulation output. See
[`Utilities.setup_output_dirs`](@ref) for more information.

!!! note "Known limitations"

    - The number of MPI processes has to remain the same across checkpoints
    - Restart files are generally not portable across machines, Julia versions, and package versions
    - Adding or changing component models will probably require adding or changing code

### Saving checkpoints

If you are running a model (such as AMIP), chances are that you can enable
checkpointing just by setting a command-line argument: the `checkpoint_dt`
option controls how frequently a checkpoint is produced.

If your model does not come with this option already, you can checkpoint the
simulation by adding a callback that calls the
[`Checkpointer.checkpoint_sims`](@ref) function.

For example, to add a callback that checkpoints every hour of simulated time,
assuming you have already defined a `start_date`:
```julia
import Dates

import ClimaCoupler: Checkpointer, TimeManager
import ClimaDiagnostics.Schedules: EveryCalendarDtSchedule

# Trigger once per hour of simulated time, counted from `start_date`
schedule_checkpoint = EveryCalendarDtSchedule(Dates.Hour(1); start_date)
checkpoint_callback = TimeManager.Callback(schedule_checkpoint, Checkpointer.checkpoint_sims)

# In the coupling loop:
TimeManager.maybe_trigger_callback(checkpoint_callback, coupled_simulation, time)
```

### Reading checkpoints

There are two ways to restart a simulation from checkpoints. By default,
`ClimaCoupler` tries to find suitable checkpoints and uses them automatically.
Alternatively, you can specify a directory `restart_dir` and a simulation time
`restart_t` to restart from the files saved in that directory at that
time. If the model you are running supports writing checkpoints via a command-line
argument, it will probably also support reading them. In this case, the
arguments `restart_dir` and `restart_t` identify the path of the top-level
directory containing all the checkpoint files and the simulated time in seconds.

If the model does not support directly reading a checkpoint, the `Checkpointer`
module provides a straightforward way to add this feature.
[`Checkpointer.restart!`](@ref) takes a coupled simulation, a `restart_dir`, and
a `restart_t` and overwrites the content of the coupled simulation with what is
in the checkpoint.

## Developer notes

In theory, the state of the component models should fully determine the state of
the coupled simulation and one should be able to restart a coupled simulation
just by using the states of the component models. Unfortunately, this is
currently not the case in `ClimaCoupler`. The main reason for this is the
complex interdependencies between component models and within `ClimaAtmos` which
make the initialization step inconsistent. For example, in a coupled simulation,
the surface albedo should be determined by the surface models and used by the
atmospheric model for radiative transfer, but `ClimaAtmos` also tries to set the
surface albedo (since it has to do so when run in standalone mode). In addition
to this, `ClimaAtmos` has a large cache that has internal interdependencies that
are hard to disentangle, and changing a field might require changing some other
field in a different part of the cache. As a result, it is not easy for
`ClimaCoupler` to consistently do initialization from a cold state. To conclude,
restarting a simulation exclusively using the states of the component models is
currently impossible.

Given that restarting a simulation from the state alone is impossible, `ClimaCoupler`
needs to save both the states and the caches. Let us review how we use
`ClimaCore.InputOutput` and the `JLD2` package to accomplish this.

`ClimaCore.InputOutput` provides a lossless way to save the content of certain
`ClimaCore` objects to HDF5 files. Objects saved in this way are not tied to a
particular computing device or configuration. When running with MPI,
`ClimaCore.InputOutput` also writes these files efficiently in parallel.

Unfortunately, `ClimaCore.InputOutput` only supports certain objects, such as
`Field`s and `Space`s, but the cache in component models is more complex than
this and contains complex objects with highly stateful quantities (e.g., C
pointers). Because of this, model states are saved to HDF5 but caches must be
saved to JLD2 files.

`JLD2` allows us to save more complex objects without writing specific
serialization methods for every struct. This is a big step
forward, but there are still several challenges that need to be solved:
1. `JLD2` does not support CUDA natively. To work around this, we have to move
everything onto the CPU first. Then, when the data is read back, we have to
move it back to the GPU.
2. `JLD2` does not support MPI natively. To work around this, each process writes
its own `jld2` checkpoint and reads it back. This introduces the constraint that
the number of MPI processes cannot change across restarts.
3. Some quantities are best not saved and read (for example, anything with
pointers). For this, we write a recursive function that traverses the cache
and only restores quantities of a certain type (typically, `ClimaCore`
objects).

Point 3 adds a significant amount of code and requires component models to
specify how their cache has to be restored.
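
The selective traversal in point 3 can be sketched as follows. This is a
simplified illustration, not the `ClimaCoupler` implementation: the function
`restore_arrays!` is hypothetical, and plain numeric arrays stand in for the
`ClimaCore` objects that the real code restores.

```julia
# Hypothetical helper: walks two objects with the same layout and copies only
# numeric arrays from `saved` into `current`. Everything else (here, a symbol
# standing in for a pointer-like handle) keeps its current value.
function restore_arrays!(current, saved)
    if current isa AbstractArray{<:Number}
        current .= saved
    else
        # Recurse into the fields of structs, tuples, and named tuples.
        # Leaves without fields (numbers, symbols, ...) are skipped.
        for name in fieldnames(typeof(current))
            restore_arrays!(getfield(current, name), getfield(saved, name))
        end
    end
    return nothing
end

current = (; temperature = zeros(3), file_handle = :live_handle, land = (; albedo = zeros(2)))
saved = (; temperature = [280.0, 285.0, 290.0], file_handle = :stale_handle, land = (; albedo = [0.2, 0.3]))
restore_arrays!(current, saved)
```

After the call, the arrays in `current` hold the saved values, while
`file_handle` is left untouched because it is not one of the restored types.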

If you are adding a component model, you have to extend the
```
Checkpointer.get_model_prog_state
Checkpointer.get_model_cache
Checkpointer.restore_cache!
```
methods.

`ClimaCoupler` moves objects to the CPU with `Adapt.adapt(Array, x)`. `Adapt`
traverses the object recursively, and proper `Adapt` methods have to be defined
for every object involved in the chain. The easiest way to do this is using the
`Adapt.@adapt_structure` macro, which defines a recursive `adapt` method for the given
object.
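
A small sketch of this pattern, assuming the `Adapt` package is installed.
`SurfaceCache` is a hypothetical struct used only for illustration. On a
CPU-only machine the fields are already `Array`s, so the call mainly shows how
the recursive traversal is wired up.

```julia
import Adapt

# Hypothetical cache struct, used here only to illustrate the macro
struct SurfaceCache{A}
    albedo::A
    temperature::A
end

# Defines how `Adapt.adapt` rebuilds a `SurfaceCache` from adapted fields
Adapt.@adapt_structure SurfaceCache

cache = SurfaceCache([0.1, 0.2], [280.0, 290.0])

# Every field is adapted to `Array` and a new `SurfaceCache` is constructed
cpu_cache = Adapt.adapt(Array, cache)
```

With GPU-backed fields, the same call would return a `SurfaceCache` whose
fields live in CPU memory, which is the form that can be written to a JLD2 file.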

Types to watch for:
- `MPI`-related objects (e.g., `MPICommsContext`)
- `TimeVaryingInputs` (because they contain `NCDatasets`, which contain pointers
to files)

## Checkpointer API

```@docs
ClimaCoupler.Checkpointer.get_model_prog_state
ClimaCoupler.Checkpointer.restart_model_state!
ClimaCoupler.Checkpointer.checkpoint_model_state
ClimaCoupler.Checkpointer.get_model_cache
ClimaCoupler.Checkpointer.restart!
ClimaCoupler.Checkpointer.checkpoint_sims
ClimaCoupler.Checkpointer.t_start_from_checkpoint
```