CliMA · juliasloan25 · Mar 1, 2025 · Mar 3, 2025 · Mar 6, 2025 · Mar 6, 2025
diff --git a/.buildkite/pipeline.yml b/.buildkite/pipeline.yml
@@ -67,16 +67,6 @@ steps:
   - group: "Unit Tests"
     steps:
 
-      - label: "MPI Checkpointer unit tests"
-        key: "checkpointer_mpi_tests"
-        command: "srun julia --color=yes --project=test/ test/mpi_tests/checkpointer_mpi_tests.jl"
-        timeout_in_minutes: 20
-        env:
-          CLIMACOMMS_CONTEXT: "MPI"
-        agents:
-          slurm_ntasks: 2
-          slurm_mem: 16GB
-
       - label: "MPI Utilities unit tests"
         key: "utilities_mpi_tests"
         command: "srun julia --color=yes --project=test/ test/utilities_tests.jl"
@@ -97,6 +87,7 @@ steps:
         agents:
           slurm_ntasks: 1
           slurm_gres: "gpu:1"
+          slurm_mem: 24GB
 
   - group: "GPU: experiments/ClimaEarth/ unit tests and global bucket"
     steps:
@@ -109,6 +100,27 @@ steps:
           slurm_gres: "gpu:1"
           slurm_mem: 20GB
 
+  - group: "ClimaEarth test"
+    steps:
+      - label: "ClimaEarth test"
+        key: "restarts"
+        command: "julia --color=yes --project=experiments/ClimaEarth/ experiments/ClimaEarth/test/runtests.jl"
+        agents:
+          slurm_mem: 16GB
+
+      - label: "MPI restarts"
+        key: "mpi_restarts"
+        command: "srun julia --color=yes --project=experiments/ClimaEarth/ experiments/ClimaEarth/test/restart.jl"
+        env:
+          CLIMACOMMS_CONTEXT: "MPI"
+        timeout_in_minutes: 40
+        soft_fail:
+          - exit_status: -1
+          - exit_status: 255
+        agents:
+          slurm_ntasks: 2
+          slurm_mem: 32GB
+
   - group: "Integration Tests"
     steps:
       # SLABPLANET EXPERIMENTS
@@ -218,7 +230,7 @@ steps:
           CLIMACOMMS_CONTEXT: "MPI"
         agents:
           slurm_ntasks: 4
-          slurm_mem_per_cpu: 8GB
+          slurm_mem_per_cpu: 12GB
 
       # short high-res performance test
       - label: "Unthreaded AMIP FINE" # also reported by longruns with a flame graph

diff --git a/NEWS.md b/NEWS.md
@@ -6,6 +6,23 @@ ClimaCoupler.jl Release Notes
 
 ### ClimaCoupler features
 
+#### Shared component `dt` can be overwritten for individual components
+Previously, we required that the user either specify a shared `dt` to be
+used by all component models, or specify values for all component models
+(`dt_atmos`, `dt_ocean`, `dt_seaice`, `dt_land`). If fewer than 4
+model-specific timesteps were provided, they would be discarded and
+`dt` would be used uniformly instead. After this PR, if a user provides
+fewer than 4 model-specific timesteps, they will be used for those models,
+and the generic `dt` will be used for any models that don't have a more
+specific timestep.
+This makes choosing the timesteps simpler and allows us to easily set
+specific `dt`s only for the models we're interested in.
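+
+A minimal sketch of the fallback rule described above. `resolve_dts` and the
+dictionary-based config are hypothetical names used for illustration, not the
+actual `ClimaCoupler` implementation, and the sketch assumes a shared `dt` is
+always provided:
+```julia
+# Hypothetical helper: use the model-specific timestep when it is given,
+# otherwise fall back to the shared `dt` (assumed to always be present).
+function resolve_dts(config::Dict{String, String})
+    components = ("atmos", "ocean", "seaice", "land")
+    return Dict(c => get(config, "dt_$c", config["dt"]) for c in components)
+end
+
+# Only the land model gets its own timestep; the others use the shared `dt`.
+dts = resolve_dts(Dict("dt" => "150secs", "dt_land" => "50secs"))
+```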
+
+This PR also changes the prescribed ocean and sea ice simulations
+to update the stored SST/SIC based on a daily schedule. Now, the
+input data will be interpolated from monthly to daily instead of
+to every timestep.
+
 #### Add default `get_field` methods for surface models PR[#1210](https://github.com/CliMA/ClimaCoupler.jl/pull/1210)
 Add default methods for `get_field` methods that are commonly
 not extended for surface models. These return reasonable default
@@ -21,6 +38,20 @@ TOA radiation and net precipitation are added only if conservation is enabled.
 The coupler fields are also now stored as a ClimaCore Field of NamedTuples,
 rather than as a NamedTuple of ClimaCore Fields.
 
+#### Restart simulations with JLD2 files PR[#1179](https://github.com/CliMA/ClimaCoupler.jl/pull/1179)
+
+`ClimaCoupler` can now use `JLD2` files to save the state and cache of its
+component models, allowing it to restart from saved checkpoints. Some restrictions
+apply:
+
+- The number of MPI processes has to remain the same across checkpoints
+- Restart files are generally not portable across machines, Julia versions, and package versions
+- Adding or changing component models will probably require adding or changing code
+
+Please refer to the
+[documentation](https://clima.github.io/ClimaCoupler.jl/dev/checkpointer/) for
+more information.
+
 #### Remove extra `get_field` functions PR[#1203](https://github.com/CliMA/ClimaCoupler.jl/pull/1203)
 Removes the `get_field` functions for `air_density` for all models, which
 were unused except for the `BucketSimulation` method, which is replaced by a

diff --git a/Project.toml b/Project.toml
@@ -8,6 +8,7 @@ ClimaComms = "3a4d1b5c-c61d-41fd-a00a-5873ba7a1b0d"
 ClimaCore = "d414da3d-4745-48bb-8d80-42e94e092884"
 ClimaUtilities = "b3f4f4ca-9299-4f7f-bd9b-81e1242a7513"
 Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
+JLD2 = "033835bb-8acc-5ee8-8aae-3f567f8a3819"
 Logging = "56ddb016-857b-54e1-b83d-db4d58db5568"
 SciMLBase = "0bca4576-84f4-4d90-8ffe-ffa030f20462"
 StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
@@ -16,9 +17,10 @@ Thermodynamics = "b60c26fb-14c3-4610-9d3e-2d17fe7ff00c"
 
 [compat]
 ClimaComms = "0.6.2"
-ClimaCore = "0.14.23"
+ClimaCore = "0.14.25"
 ClimaUtilities = "0.1.22"
 Dates = "1"
+JLD2 = "0.5.11"
 Logging = "1"
 SciMLBase = "2.11"
 StaticArrays = "1.6"

diff --git a/config/ci_configs/amip_component_dts.yml b/config/ci_configs/amip_component_dts.yml
@@ -3,7 +3,7 @@ co2: "maunaloa"
 dt_atmos: "150secs"
 dt_cpl: "150secs"
 dt_land: "50secs"
-dt_ocean: "30secs"
+dt_ocean: "300secs"
 dt_rad: "1hours"
 dt_save_to_sol: "1days"
 dt_seaice: "37.5secs"

diff --git a/docs/src/checkpointer.md b/docs/src/checkpointer.md
@@ -1,12 +1,137 @@
 # Checkpointer
 
-This module contains general functions for logging the model states and restarting simulations. The `Checkpointer` uses `ClimaCore.InputOutput` infrastructure, which allows it to handle arbitrarily distributed logging and restart setups.
+## How to save and restart from checkpoints
+
+`ClimaCoupler` supports saving and reading simulation checkpoints. This is
+useful to split a long simulation into smaller, more manageable chunks.
+
+Checkpoints are a mix of HDF5 and JLD2 files and are typically saved in a
+`checkpoints` folder in the simulation output. See
+[`Utilities.setup_output_dirs`](@ref) for more information.
+
+!!! warning "Known limitations"
+
+    - The number of MPI processes has to remain the same across checkpoints
+    - Restart files are generally not portable across machines, Julia versions, and package versions
+    - Adding or changing component models will probably require adding or changing code
+
+### Saving checkpoints
+
+If you are running a model (such as AMIP), chances are that you can enable
+checkpointing just by setting a command-line argument: the `checkpoint_dt`
+option controls how frequently a checkpoint should be produced.
+
+If your model does not come with this option already, you can checkpoint the
+simulation by adding a callback that calls the
+[`Checkpointer.checkpoint_sims`](@ref) function.
+
+For example, to add a callback to checkpoint every hour of simulated time,
+assuming you have a `start_date`:
+```julia
+import Dates
+
+import ClimaCoupler: Checkpointer, TimeManager
+import ClimaDiagnostics.Schedules: EveryCalendarDtSchedule
+
+schedule_checkpoint = EveryCalendarDtSchedule(Dates.Hour(1); start_date)
+checkpoint_callback = TimeManager.Callback(schedule_checkpoint, Checkpointer.checkpoint_sims)
+
+# In the coupling loop:
+TimeManager.maybe_trigger_callback(checkpoint_callback, coupled_simulation, time)
+```
+
+### Reading checkpoints
+
+There are two ways to restart a simulation from checkpoints. By default,
+`ClimaCoupler` tries to find suitable checkpoints and uses them automatically.
+Alternatively, you can specify a directory `restart_dir` and a simulation time
+`restart_t` and restart from files saved in the given directory at the given
+time. If the model you are running supports writing checkpoints via command-line
+argument, it will probably also support reading them. In this case, the
+arguments `restart_dir` and `restart_t` identify the path of the top-level
+directory containing all the checkpoint files and the simulated time in seconds.
+
+If the model does not support directly reading a checkpoint, the `Checkpointer`
+module provides a straightforward way to add this feature.
+[`Checkpointer.restart!`](@ref) takes a coupled simulation, a `restart_dir`, and
+a `restart_t` and overwrites the content of the coupled simulation with what is
+in the checkpoint.
+
+## Developer notes
+
+In theory, the state of the component models should fully determine the state of
+the coupled simulation and one should be able to restart a coupled simulation
+just by using the states of the component models. Unfortunately, this is
+currently not the case in `ClimaCoupler`. The main reason for this is the
+complex interdependencies between component models and within `ClimaAtmos` which
+make the initialization step inconsistent. For example, in a coupled simulation,
+the surface albedo should be determined by the surface models and used by the
+atmospheric model for radiation transfer, but `ClimaAtmos` also tries to set the
+surface albedo (since it has to do so when run in standalone mode). In addition
+to this, `ClimaAtmos` has a large cache that has internal interdependencies that
+are hard to disentangle, and changing a field might require changing some other
+field in a different part of the cache. As a result, it is not easy for
+`ClimaCoupler` to consistently do initialization from a cold state. To conclude,
+restarting a simulation exclusively using the states of the component models is
+currently impossible.
+
+Given that restarting a simulation from the state is impossible, `ClimaCoupler`
+needs to save the states and the caches. Let us review how we use the
+`ClimaCore.InputOutput` module and the `JLD2` package to accomplish this.
+
+`ClimaCore.InputOutput` provides a lossless way to save the content of certain
+`ClimaCore` objects to HDF5 files. Objects saved in this way are not tied to a
+particular computing device or configuration. When running with MPI,
+`ClimaCore.InputOutput` files are also written efficiently in parallel.
+
+Unfortunately, `ClimaCore.InputOutput` only supports certain objects, such as
+`Field`s and `Space`s, but the cache in component models is more complex than
+this and contains complex objects with highly stateful quantities (e.g., C
+pointers). Because of this, model states are saved to HDF5 but caches must be
+saved to JLD2 files.
+
+`JLD2` allows us to save more complex objects without writing specific
+serialization methods for every struct. `JLD2` allows us to take a big step
+forward, but there are still several challenges that need to be solved:
+1. `JLD2` does not support CUDA natively. To work around this, we have to move
+  everything onto the CPU first. Then, when the data is read back, we have to
+  move it back to the GPU.
+2. `JLD2` does not support MPI natively. To work around this, each process writes
+  its own `jld2` checkpoint and reads it back. This introduces the constraint that
+  the number of MPI processes cannot change across restarts.
+3. Some quantities are best not saved and read (for example, anything with
+  pointers). For this, we write a recursive function that traverses the cache
+  and only restores quantities of a certain type (typically, `ClimaCore`
+  objects).
+
+Point 3 adds a significant amount of code and requires component models to
+specify how their cache has to be restored.
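+
+The selective, recursive restore can be sketched as follows. This is a toy
+illustration under stated assumptions, not the actual `restore_cache!`
+implementation: `Restorable` is a hypothetical stand-in for a `ClimaCore`
+object, and `restore!` is a hypothetical helper name.
+```julia
+# Hypothetical stand-in for a `ClimaCore` object whose values should be restored.
+struct Restorable
+    data::Vector{Float64}
+end
+
+# Fallback: anything that is not explicitly handled (for example, pointers or
+# file handles) is left untouched.
+restore!(live, saved) = nothing
+
+# Restorable leaves: copy the saved values into the live object, in place.
+restore!(live::Restorable, saved::Restorable) = copyto!(live.data, saved.data)
+
+# Containers: recurse entry by entry.
+function restore!(live::Union{NamedTuple, Tuple}, saved::Union{NamedTuple, Tuple})
+    foreach(restore!, live, saved)
+end
+
+live = (; T = Restorable([0.0, 0.0]), io = stdout, nested = (; q = Restorable([0.0])))
+saved = (; T = Restorable([1.0, 2.0]), io = nothing, nested = (; q = Restorable([3.0])))
+restore!(live, saved)
+# `live.T` and `live.nested.q` now hold the saved values; `live.io` is unchanged.
+```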
+
+If you are adding a component model, you have to extend the
+```
+Checkpointer.get_model_prog_state
+Checkpointer.get_model_cache
+Checkpointer.restore_cache!
+```
+methods.
+
+`ClimaCoupler` moves objects to the CPU with `Adapt.adapt(Array, x)`. `Adapt`
+traverses the object recursively, and proper `Adapt` methods have to be defined
+for every object involved in the chain. The easiest way to do this is using the
+`Adapt.@adapt_structure` macro, which defines a recursive `Adapt.adapt_structure`
+method for the given object.
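+
+As a minimal sketch, assuming a hypothetical `ToyCache` struct (not a type
+from `ClimaCoupler`), this is how `Adapt.@adapt_structure` makes a custom
+struct work with `Adapt.adapt`:
+```julia
+import Adapt
+
+# Hypothetical cache struct; it is parametric so that the array type of its
+# fields can change when the struct is adapted.
+struct ToyCache{A}
+    temperature::A
+    albedo::A
+end
+
+# Defines `Adapt.adapt_structure` for `ToyCache`, adapting each field in turn.
+Adapt.@adapt_structure ToyCache
+
+cache = ToyCache(Float32[280, 290], Float32[0.1, 0.2])
+# Rebuilds `ToyCache` with every field converted to an `Array`. On the CPU the
+# fields already are `Array`s; GPU arrays would be copied to host memory.
+cpu_cache = Adapt.adapt(Array, cache)
+```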
+
+Types to watch for:
+- `MPI`-related objects (e.g., `MPICommsContext`)
+- `TimeVaryingInputs` (because they contain `NCDatasets`, which contain pointers
+  to files)
 
 ## Checkpointer API
 
 ```@docs
     ClimaCoupler.Checkpointer.get_model_prog_state
-    ClimaCoupler.Checkpointer.restart_model_state!
-    ClimaCoupler.Checkpointer.checkpoint_model_state
+    ClimaCoupler.Checkpointer.get_model_cache
+    ClimaCoupler.Checkpointer.restart!
     ClimaCoupler.Checkpointer.checkpoint_sims
+    ClimaCoupler.Checkpointer.t_start_from_checkpoint
 ```