HDF5: Explicit control over chunking (#1591)
* Chunking specification per dataset, explicit specification

Still need to filter out the warnings better

* JSON internal

* Properly warn about unused items

* Maybe expose this publicly?

* CI Fixes

* Documentation

* Testing

* Revert "Maybe expose this publicly?"

This reverts commit f00baa7.

* Remove todo comment
franzpoeschel authored Feb 26, 2024
1 parent a0eca32 commit 30e5bde
Showing 8 changed files with 277 additions and 111 deletions.
1 change: 1 addition & 0 deletions docs/source/backends/hdf5.rst
@@ -65,6 +65,7 @@ Any file object greater than or equal in size to threshold bytes will be aligned

``OPENPMD_HDF5_CHUNKS``: this sets defaults for data chunking via `H5Pset_chunk <https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_chunk.htm>`__.
Chunking generally improves performance and only needs to be disabled in corner-cases, e.g. when heavily relying on independent, parallel I/O that non-collectively declares data records.
+ Alternatively (or additionally), the chunk size can be specified explicitly per dataset, via the JSON/TOML configuration passed to ``resetDataset()``/``reset_dataset()``.

``OPENPMD_HDF5_COLLECTIVE_METADATA``: this is an option to enable collective MPI calls for HDF5 metadata operations via `H5Pset_all_coll_metadata_ops <https://support.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetAllCollMetadataOps>`__ and `H5Pset_coll_metadata_write <https://support.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetCollMetadataWrite>`__.
By default, this optimization is enabled as it has proven to provide performance improvements.
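To illustrate the per-dataset specification documented above: the chunk size goes into the options string of the ``Dataset`` that is handed to ``resetDataset()``. A minimal sketch, mirroring the example this commit adds to ``examples/5_write_parallel.cpp`` (the record component ``rc`` and the extent are illustrative):

// Declare a dataset with an explicit 10 x 100 chunk size,
// using an inline TOML options string.
Dataset dataset = Dataset(
    determineDatatype<float>(),
    {1000, 300},
    R"(
[hdf5.dataset]
chunks = [10, 100]
)");
rc.resetDataset(dataset); // rc is some openPMD::RecordComponent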
5 changes: 4 additions & 1 deletion docs/source/details/backendconfig.rst
@@ -183,12 +183,15 @@ A full configuration of the HDF5 backend:
.. literalinclude:: hdf5.json
:language: json

- All keys found under ``hdf5.dataset`` are applicable globally (future: as well as per dataset).
+ All keys found under ``hdf5.dataset`` are applicable globally as well as per dataset.
Explanation of the individual keys:

* ``hdf5.dataset.chunks``: This key contains options for data chunking via `H5Pset_chunk <https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_chunk.htm>`__.
The default is ``"auto"``, which selects the chunk size heuristically.
``"none"`` can be used to disable chunking.

+ An explicit chunk size can be specified as a list of positive integers, e.g. ``hdf5.dataset.chunks = [10, 100]``. Note that this specification should only be used per-dataset, e.g. in ``resetDataset()``/``reset_dataset()``.

Chunking generally improves performance and only needs to be disabled in corner-cases, e.g. when heavily relying on independent, parallel I/O that non-collectively declares data records.
* ``hdf5.vfd.type`` selects the HDF5 virtual file driver.
Currently available are:
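The same keys can instead be applied globally by passing the configuration when opening the ``Series``. A minimal sketch (the file name is illustrative):

// Heuristic chunking as the global default for every HDF5 dataset
// in this Series, given as an inline JSON options string.
Series series(
    "data_%T.h5",
    Access::CREATE,
    R"({"hdf5": {"dataset": {"chunks": "auto"}}})");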
8 changes: 7 additions & 1 deletion examples/5_write_parallel.cpp
@@ -54,6 +54,9 @@ type = "subfiling"
ioc_selection = "every_nth_rank"
stripe_size = 33554432
stripe_count = -1
+ [hdf5.dataset]
+ chunks = "auto"
)";

// open file for writing
@@ -81,7 +84,10 @@ stripe_count = -1
// example 1D domain decomposition in first index
Datatype datatype = determineDatatype<float>();
Extent global_extent = {10ul * mpi_size, 300};
- Dataset dataset = Dataset(datatype, global_extent);
+ Dataset dataset = Dataset(datatype, global_extent, R"(
+ [hdf5.dataset]
+ chunks = [10, 100]
+ )");

if (0 == mpi_rank)
cout << "Prepared a Dataset of size " << dataset.extent[0] << "x"
2 changes: 1 addition & 1 deletion include/openPMD/IO/HDF5/HDF5IOHandlerImpl.hpp
@@ -118,9 +118,9 @@ class HDF5IOHandlerImpl : public AbstractIOHandlerImpl
#endif

json::TracingJSON m_config;
+ std::optional<nlohmann::json> m_buffered_dataset_config;

private:
- std::string m_chunks = "auto";
struct File
{
std::string name;
4 changes: 4 additions & 0 deletions include/openPMD/auxiliary/JSON_internal.hpp
@@ -91,6 +91,7 @@ namespace json
* @return nlohmann::json const&
*/
nlohmann::json const &getShadow() const;
+ nlohmann::json &getShadow();

/**
* @brief Invert the "shadow", i.e. a copy of the original JSON value
@@ -247,5 +248,8 @@
*/
nlohmann::json &
merge(nlohmann::json &defaultVal, nlohmann::json const &overwrite);

+ nlohmann::json &filterByTemplate(
+     nlohmann::json &defaultVal, nlohmann::json const &positiveMask);
} // namespace json
} // namespace openPMD
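``filterByTemplate`` is declared above without a documenting comment. Below is a minimal sketch of the mask-based filtering that the name and the ``positiveMask`` parameter suggest; an illustration under our own assumptions, not the actual implementation:

#include <nlohmann/json.hpp>

// Sketch: keep in `value` only those object keys that also occur in
// `mask`, recursing into nested objects. The real filterByTemplate()
// may differ, e.g. in its in-place semantics or array handling.
nlohmann::json filterByMaskSketch(
    nlohmann::json const &value, nlohmann::json const &mask)
{
    if (!value.is_object() || !mask.is_object())
    {
        return value; // non-objects pass through unchanged
    }
    nlohmann::json result = nlohmann::json::object();
    for (auto it = value.begin(); it != value.end(); ++it)
    {
        if (auto m = mask.find(it.key()); m != mask.end())
        {
            result[it.key()] = filterByMaskSketch(it.value(), *m);
        }
    }
    return result;
}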