Skip to content

Add ADIOS2 IO checkpoint support #173

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

ia267
Copy link
Collaborator

@ia267 ia267 commented Apr 29, 2025

This PR add checkpoint/restart and snapshot functionality to x3d2 using ADIOS2. This enables efficient parallel I/O for large-scale simulations.

Implementation details:

  • adios2_io.f90 - ADIOS2 wrapper module
  • checkpoint_io.f90 - checkpoint manager that integrates with the solver
  • Added build system integration with both system and built-in ADIOS2 options
  • Added a case to test ADIOS2 read and write
  • Modified base case to write and read checkpoints

Build system changes:

  • Added ENABLE_ADIOS2 CMake option
  • Added USE_SYSTEM_ADIOS2 to control using system vs. built-in ADIOS2

Future work:

@ia267 ia267 added this to the Implement ADIOS2-based IO milestone Apr 29, 2025
@ia267 ia267 self-assigned this Apr 29, 2025
@ia267 ia267 added the infrastructure Software infrastructure label Apr 29, 2025
@ia267 ia267 requested review from semi-h and Copilot April 30, 2025 09:09
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 2 out of 16 changed files in this pull request and generated no comments.

Files not reviewed (14)
  • CMakeLists.txt: Language not supported
  • cmake/adios2/FindADIOS2.cmake: Language not supported
  • cmake/adios2/downloadBuildAdios2.cmake.in: Language not supported
  • docs/source/user/advanced_build.rst: Language not supported
  • docs/source/user/index.rst: Language not supported
  • docs/source/user/input_file.rst: Language not supported
  • src/CMakeLists.txt: Language not supported
  • src/adios2/checkpoint_dummy.f90: Language not supported
  • src/case/base_case.f90: Language not supported
  • src/common.f90: Language not supported
  • src/config.f90: Language not supported
  • src/solver.f90: Language not supported
  • tests/CMakeLists.txt: Language not supported
  • tests/test_adios2_read_write.f90: Language not supported

Copy link
Member

@semi-h semi-h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a quick review an focused mainly on minor things, API, and general strategy. I'll do another one focusing on the design in the following days. For now I think the separation between adios2 and the checkpoint manager is very good in terms of design and can potantially allow a separate writer than adios2 in the future.

@@ -90,10 +95,20 @@ subroutine case_init(self, backend, mesh, host_allocator)

self%solver = init(backend, mesh, host_allocator)

call self%initial_conditions()
self%checkpoint_mgr = create_checkpoint_manager(MPI_COMM_WORLD)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add the following in m_checkpoint_manager

  interface checkpoint_manager_t
    module procedure create_checkpoint_manager
  end interface checkpoint_manager_t

And then this line would be

self%checkpoint_mgr = checkpoint_manager_t(MPI_COMM_WORLD)

and help us avoid use m_checkpoint_manager, only :: create_checkpoint_manager bit.

Comment on lines +99 to +101
call self%checkpoint_mgr%handle_restart(self%solver, MPI_COMM_WORLD)

if (.not. self%checkpoint_mgr%is_restart()) call self%initial_conditions()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like handle_restart does nothing if the simulation is not restarting, so it will be better to have these two in a single if else block based on is_restart().

@@ -123,6 +124,7 @@ function init(backend, mesh, host_allocator) result(solver)
solver%n_iters = solver_cfg%n_iters
solver%n_output = solver_cfg%n_output
solver%ngrid = product(solver%mesh%get_global_dims(VERT))
solver%current_iter = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is already initialised to 0 when declared so this line can be removed.

end if
end if

file = self%adios2_writer%open(filename, adios2_mode_write, comm_to_use)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't the file overwritten with this call or at least the data inside would be overwritten with the following write_data calls? I'm wondering because just above you inquire if the file exists and delete manually if so.
Also this is what we should do later on but @slaizet was mentioning something like keeping the last checkpoint until a new one is successfully written to make sure if something goes wrong with writing after deleting we don't lose it.

Comment on lines +320 to +322
call self%generate_coordinates( &
solver, file, shape_dims, start_dims, count_dims &
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid writing coordinates to each snapshot file? We have u, v, w and together with this x, y, z coordinates it would double the total size of a snapshot. With structured format paraview only requires 3x 1D arrays with dimensions (nx), (ny), (nz) individually to construct the entire field, saving quite a lot compared to a (nx, ny, nz) sized array.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also allocation/deallocation inside generate_coordinates would kick the memory requirement +3 units each time we write and this can trigger out of memory if the simulation barely fits.

Comment on lines +591 to +596
do i_field = 1, size(field_names)
host_field => get_field_data(solver, trim(field_names(i_field)))
if (.not. associated(host_field)) cycle

call write_single_field(trim(field_names(i_field)), host_field)
end do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to change the arguments of write_fields subroutine a little to make it more functional, for example when outputting a field other than u/v/w. Instead of passing solver and implicitly handling u/v/w via solver%u/v/w, we can pass an array of field_t, sized same as field_names such that they correspond to the name of each individual field. Then it will be easy to output scalar fields or any other field we're interested in just by changing the arguments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
infrastructure Software infrastructure
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants