Add ADIOS2 IO checkpoint support #173

ia267 · 2025-04-29T14:24:42Z

This PR adds checkpoint/restart and snapshot functionality to x3d2 using ADIOS2. This enables efficient parallel I/O for large-scale simulations.

Implementation details:

adios2_io.f90 - ADIOS2 wrapper module
checkpoint_io.f90 - checkpoint manager that integrates with the solver
Added build system integration with both system and built-in ADIOS2 options
Added a case to test ADIOS2 read and write
Modified base case to write and read checkpoints

Build system changes:

Added ENABLE_ADIOS2 CMake option
Added USE_SYSTEM_ADIOS2 to control using system vs. built-in ADIOS2

Future work:

Copilot

Copilot reviewed 2 out of 16 changed files in this pull request and generated no comments.

Files not reviewed (14)

CMakeLists.txt: Language not supported
cmake/adios2/FindADIOS2.cmake: Language not supported
cmake/adios2/downloadBuildAdios2.cmake.in: Language not supported
docs/source/user/advanced_build.rst: Language not supported
docs/source/user/index.rst: Language not supported
docs/source/user/input_file.rst: Language not supported
src/CMakeLists.txt: Language not supported
src/adios2/checkpoint_dummy.f90: Language not supported
src/case/base_case.f90: Language not supported
src/common.f90: Language not supported
src/config.f90: Language not supported
src/solver.f90: Language not supported
tests/CMakeLists.txt: Language not supported
tests/test_adios2_read_write.f90: Language not supported

semi-h

I did a quick review and focused mainly on minor things, the API, and the general strategy. I'll do another one focusing on the design in the following days. For now, I think the separation between adios2 and the checkpoint manager is very good in terms of design and can potentially allow a writer other than adios2 in the future.

semi-h · 2025-05-06T15:07:01Z

src/case/base_case.f90

@@ -90,10 +95,20 @@ subroutine case_init(self, backend, mesh, host_allocator)

    self%solver = init(backend, mesh, host_allocator)

-    call self%initial_conditions()
+    self%checkpoint_mgr = create_checkpoint_manager(MPI_COMM_WORLD)


We can add the following in m_checkpoint_manager

interface checkpoint_manager_t
  module procedure create_checkpoint_manager
end interface checkpoint_manager_t

And then this line would be

self%checkpoint_mgr = checkpoint_manager_t(MPI_COMM_WORLD)

which also lets us avoid the extra use m_checkpoint_manager, only: create_checkpoint_manager import.
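A minimal, self-contained sketch of the constructor-style interface suggested above. The module name m_checkpoint_manager_sketch, the integer comm component (a stand-in for an MPI communicator handle), and the get_comm accessor are hypothetical and only serve the illustration; the real manager in the PR holds different state.

```fortran
module m_checkpoint_manager_sketch
  implicit none
  private
  public :: checkpoint_manager_t

  type :: checkpoint_manager_t
    private
    integer :: comm = -1  ! hypothetical stand-in for an MPI communicator
  contains
    procedure :: get_comm
  end type checkpoint_manager_t

  ! A generic interface with the same name as the type acts as a
  ! user-defined constructor.
  interface checkpoint_manager_t
    module procedure create_checkpoint_manager
  end interface checkpoint_manager_t

contains

  function create_checkpoint_manager(comm) result(mgr)
    integer, intent(in) :: comm
    type(checkpoint_manager_t) :: mgr

    mgr%comm = comm
  end function create_checkpoint_manager

  integer function get_comm(self)
    class(checkpoint_manager_t), intent(in) :: self

    get_comm = self%comm
  end function get_comm

end module m_checkpoint_manager_sketch

program constructor_demo
  ! Only the type name is imported; the factory function stays private.
  use m_checkpoint_manager_sketch, only: checkpoint_manager_t
  implicit none
  type(checkpoint_manager_t) :: mgr

  mgr = checkpoint_manager_t(7)
  if (mgr%get_comm() /= 7) error stop 'constructor did not store comm'
  print *, 'comm handle stored:', mgr%get_comm()
end program constructor_demo
```

Because the generic name matches the type name, callers need a single import for both the type and its constructor.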

semi-h · 2025-05-06T15:12:10Z

src/case/base_case.f90

+    call self%checkpoint_mgr%handle_restart(self%solver, MPI_COMM_WORLD)

+    if (.not. self%checkpoint_mgr%is_restart()) call self%initial_conditions()


It seems that handle_restart does nothing if the simulation is not restarting, so it would be better to put these two calls in a single if/else block based on is_restart().
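A sketch of the single if/else block, using a mock manager so that it compiles on its own. The restart component, the argument-free handle_restart, and the local initial_conditions routine are simplifications for illustration, not the signatures used in the PR.

```fortran
module m_restart_sketch
  implicit none

  type :: checkpoint_manager_t
    logical :: restart = .false.  ! mock flag; the PR reads this from config
  contains
    procedure :: is_restart
    procedure :: handle_restart
  end type checkpoint_manager_t

contains

  logical function is_restart(self)
    class(checkpoint_manager_t), intent(in) :: self

    is_restart = self%restart
  end function is_restart

  subroutine handle_restart(self)
    class(checkpoint_manager_t), intent(in) :: self

    print *, 'restoring state from checkpoint'
  end subroutine handle_restart

end module m_restart_sketch

program restart_demo
  use m_restart_sketch, only: checkpoint_manager_t
  implicit none
  type(checkpoint_manager_t) :: mgr

  ! Exactly one of the two branches runs, which makes the intent explicit.
  if (mgr%is_restart()) then
    call mgr%handle_restart()
  else
    call initial_conditions()
  end if

contains

  subroutine initial_conditions()
    print *, 'applying initial conditions'
  end subroutine initial_conditions

end program restart_demo
```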

semi-h · 2025-05-06T15:16:05Z

src/solver.f90

@@ -123,6 +124,7 @@ function init(backend, mesh, host_allocator) result(solver)
    solver%n_iters = solver_cfg%n_iters
    solver%n_output = solver_cfg%n_output
    solver%ngrid = product(solver%mesh%get_global_dims(VERT))
+    solver%current_iter = 0


current_iter is already initialised to 0 when declared, so this line can be removed.

semi-h · 2025-05-06T16:30:27Z

src/adios2/checkpoint_io.f90

+      end if
+    end if
+
+    file = self%adios2_writer%open(filename, adios2_mode_write, comm_to_use)


Wouldn't the file be overwritten by this call, or at least the data inside be overwritten by the following write_data calls? I'm asking because just above you check whether the file exists and delete it manually if so.
Also, this is something we should do later on, but @slaizet was mentioning something like keeping the last checkpoint until a new one is successfully written, so that if something goes wrong with writing after deleting we don't lose it.
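One possible way to keep the last checkpoint until the new one is complete is to write to a temporary name and rename it only after the write succeeds. The sketch below uses a plain Fortran file rather than ADIOS2, hypothetical file names, and assumes a POSIX system where mv is available; ADIOS2 output may be a directory rather than a single file depending on the engine, so the rename step would need adapting.

```fortran
program safe_checkpoint_sketch
  implicit none
  ! Hypothetical file names, used only for this illustration.
  character(len=*), parameter :: final_name = 'checkpoint_sketch.dat'
  character(len=*), parameter :: temp_name = 'checkpoint_sketch.dat.tmp'
  integer :: unit, ios, stat

  ! 1. Write the new checkpoint under a temporary name. Any existing
  !    checkpoint under final_name is not touched at this stage.
  open (newunit=unit, file=temp_name, status='replace', &
        action='write', iostat=ios)
  if (ios /= 0) error stop 'could not open temporary checkpoint'

  write (unit, *, iostat=ios) 42  ! placeholder payload
  close (unit)
  if (ios /= 0) error stop 'write failed, previous checkpoint kept'

  ! 2. Only after a successful write, replace the old checkpoint.
  !    Assumes a POSIX shell with mv.
  call execute_command_line('mv -f '//temp_name//' '//final_name, &
                            exitstat=stat)
  if (stat /= 0) error stop 'rename failed, previous checkpoint kept'

  print *, 'checkpoint written to ', final_name
end program safe_checkpoint_sketch
```

In a parallel run, only one rank would perform the rename, after all ranks have finished writing.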

semi-h · 2025-05-06T16:34:03Z

src/adios2/checkpoint_io.f90

+    call self%generate_coordinates( &
+      solver, file, shape_dims, start_dims, count_dims &
+      )


Can we avoid writing coordinates to each snapshot file? We have u, v, w, and adding x, y, z coordinates would double the total size of a snapshot. With a structured format, ParaView only requires three 1D arrays with dimensions (nx), (ny), (nz) to construct the entire field, which saves quite a lot compared to an (nx, ny, nz) sized array per coordinate.

Also, the allocation/deallocation inside generate_coordinates would raise the memory requirement by 3 units each time we write, and this can trigger an out-of-memory error if the simulation barely fits.
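To make the size difference concrete, the sketch below builds the three 1D coordinate arrays and compares their element count with that of three full (nx, ny, nz) coordinate arrays. The grid sizes and the uniform spacings dx, dy, dz are made-up values for illustration.

```fortran
program coordinate_storage_sketch
  use iso_fortran_env, only: int64, real64
  implicit none
  ! Hypothetical grid size and uniform spacing, chosen for illustration.
  integer, parameter :: nx = 64, ny = 32, nz = 16
  real(real64), parameter :: dx = 0.1_real64
  real(real64), parameter :: dy = 0.2_real64
  real(real64), parameter :: dz = 0.4_real64
  real(real64) :: x(nx), y(ny), z(nz)
  integer(int64) :: n_full, n_axes
  integer :: i

  ! One 1D array per axis is enough to describe a structured grid.
  x = [(real(i - 1, real64)*dx, i=1, nx)]
  y = [(real(i - 1, real64)*dy, i=1, ny)]
  z = [(real(i - 1, real64)*dz, i=1, nz)]

  ! Elements needed for three full 3D coordinate arrays vs. three axes.
  n_full = 3_int64*nx*ny*nz
  n_axes = int(nx + ny + nz, int64)

  if (n_axes >= n_full) error stop 'unexpected: 1D axes are not smaller'
  print *, 'elements in three 3D coordinate arrays:', n_full
  print *, 'elements in three 1D axis arrays:      ', n_axes
end program coordinate_storage_sketch
```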

semi-h · 2025-05-06T16:49:07Z

src/adios2/checkpoint_io.f90

+    do i_field = 1, size(field_names)
+      host_field => get_field_data(solver, trim(field_names(i_field)))
+      if (.not. associated(host_field)) cycle
+
+      call write_single_field(trim(field_names(i_field)), host_field)
+    end do


I think it would be better to change the arguments of the write_fields subroutine a little to make it more flexible, for example when outputting a field other than u/v/w. Instead of passing solver and implicitly handling u/v/w via solver%u/v/w, we can pass an array of field_t, the same size as field_names, such that each entry corresponds to the name of an individual field. Then it will be easy to output scalar fields or any other field we're interested in just by changing the arguments.
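A rough sketch of what such an interface could look like. The field_t type here is a minimal mock, and field_ptr_t is a hypothetical wrapper introduced only because Fortran needs a derived type to hold an array of pointers; the print statement stands in for the actual ADIOS2 write.

```fortran
module m_write_fields_sketch
  implicit none

  ! Minimal mock of a field; the real field_t in x3d2 is richer.
  type :: field_t
    real, allocatable :: data(:, :, :)
  end type field_t

  ! Hypothetical wrapper so that fields can be passed as an array.
  type :: field_ptr_t
    type(field_t), pointer :: ptr => null()
  end type field_ptr_t

contains

  subroutine write_fields(field_names, fields)
    character(len=*), intent(in) :: field_names(:)
    type(field_ptr_t), intent(in) :: fields(:)
    integer :: i

    if (size(field_names) /= size(fields)) then
      error stop 'field_names and fields must have the same size'
    end if

    do i = 1, size(field_names)
      if (.not. associated(fields(i)%ptr)) cycle
      ! Placeholder for the actual write of one named field.
      print '(a, a, i0, a)', trim(field_names(i)), ': ', &
        size(fields(i)%ptr%data), ' values'
    end do
  end subroutine write_fields

end module m_write_fields_sketch

program write_fields_demo
  use m_write_fields_sketch, only: field_t, field_ptr_t, write_fields
  implicit none
  type(field_t), target :: u, phi
  type(field_ptr_t) :: fields(2)

  allocate (u%data(4, 4, 4), phi%data(4, 4, 4))
  u%data = 0.0
  phi%data = 1.0

  ! A velocity component and a scalar field go through the same call.
  fields(1)%ptr => u
  fields(2)%ptr => phi
  call write_fields([character(len=3) :: 'u', 'phi'], fields)
end program write_fields_demo
```

With this shape, the caller decides which fields are written, and the routine no longer needs to know about the solver.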

ia267 and others added 18 commits April 28, 2025 20:44

add adios2 to build system + simple test case

94c6b36

modify cmake to build adios if not found locally

6726bcc

create adios2 moduel for io

06b2de0

refactored test case following adios2 base class

73adaa9

implement writing field data in adios2

a39bd34

add checkpoint changes

b663a9e

implement checkpoint verification

8866420

add snapshots that can be viewed in paraview

2cb682f

enable cuda for adios2

085b50c

fix issue with restart after rebasing

c154be6

handle builds without adios2

0f1bdf0

use dp instead of real64 and create i8 parameter

fd99223

add striding for snapshots

33fd025

choose between using system-installed adios2 or a build from source

55074be

fix issue with striding for snapshots

e91a0ad

format after fprettify

4330ded

add checkpoints to user docs

b4bed5c

remove duplicated error handling

8f0f364

ia267 added this to the Implement ADIOS2-based IO milestone Apr 29, 2025

fix formatting issues from fprettify

4d1c913

ia267 self-assigned this Apr 29, 2025

ia267 added the infrastructure Software infrastructure label Apr 29, 2025

ia267 requested review from semi-h and Copilot April 30, 2025 09:09

Copilot AI reviewed Apr 30, 2025

semi-h reviewed May 6, 2025
