Output Struct Overhaul #445

Open: wants to merge 18 commits into v4-prep
Conversation

steven-murray
Member

@steven-murray steven-murray commented Dec 4, 2024

Summary

This changes the output structure interface to be simpler and more streamlined.

It is quite a comprehensive set of changes that touches a lot of things on the Python side. I'll try to list as many as I can here for easy reference:

Arrays and Backend mapping

  • New arrays.py module that implements an Array object. This object knows about the shape and dtype of an array, without necessarily having it instantiated, but also knows how to instantiate it, pass it to C, and keeps track of the ArrayState.
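As a rough illustration of the idea (names and details hypothetical, not the actual py21cmfast implementation), such a lazy array descriptor might look like:

```python
import numpy as np

class LazyArray:
    """Illustrative sketch: knows its shape/dtype up front, allocates on demand,
    and tracks a simple state string standing in for the real ArrayState."""

    def __init__(self, shape: tuple, dtype=np.float32):
        self.shape = shape
        self.dtype = np.dtype(dtype)
        self.state = "uninitialized"
        self._value = None

    def initialize(self) -> np.ndarray:
        """Allocate the underlying numpy array the first time it is needed."""
        if self._value is None:
            self._value = np.zeros(self.shape, dtype=self.dtype)
            self.state = "initialized"
        return self._value
```

The point is that shape/dtype metadata is available (for validation, or for passing pointers to C) without paying the memory cost until the array is actually used.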

OutputStructs

  • The OutputStruct is now an attrs class. More importantly, all of the arrays that it needs to handle are defined directly on the class as Array parameters, making it easier to track them.
  • Each output struct now has a .new() classmethod that instantiates it from an InputParameters object, getting the shape/dtype info (and which arrays need to be present) from the inputs.
  • The downside to the above way of managing the C/Python/Disk interface with Array objects is that the attributes of the OutputStruct are no longer numpy arrays, so you can't do, for example, np.mean(ics.lowres_density) any more. This is smoothed over a bit by new get() and set() methods specifically for the arrays, so you can do np.mean(ics.get('lowres_density')). This has the added advantage of transparently loading the array from disk if it exists there. Note that on a Coeval object, any field of any OutputStruct can be accessed directly via attribute name, as an array.
  • I've also taken all the caching and I/O management out of the OutputStruct class, instead moving it to the new io subpackage.
  • There's a new _compat_hash attribute on each OutputStruct that tells it the level of input-hash required.
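A minimal sketch of the resulting interface, using a plain dataclass in place of attrs and a hypothetical MiniOutputStruct (the real class derives shapes/dtypes from an InputParameters object):

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class MiniOutputStruct:
    """Toy analogue of the new OutputStruct: arrays live behind get()/set()
    rather than being plain numpy-array attributes."""
    arrays: dict = field(default_factory=dict)

    @classmethod
    def new(cls, shape):
        # In py21cmfast, shape/dtype info comes from the InputParameters.
        return cls(arrays={"lowres_density": np.zeros(shape)})

    def get(self, name: str) -> np.ndarray:
        """Return the named array (the real method can also load it from disk)."""
        return self.arrays[name]

    def set(self, name: str, value) -> None:
        self.arrays[name] = np.asarray(value)
```

Usage then looks like np.mean(ics.get('lowres_density')) rather than np.mean(ics.lowres_density).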

Caching / IO of single-fields (OutputStruct)

  • The new io.caching module implements classes/functions for dealing with the cache. I think this is a bit more intuitive than in previous versions.
  • The OutputCache object has methods for introspecting a particular cache (defined by some directory the user gives at runtime) and reading/writing OutputStructs to it.
  • The RunCache manages full runs (i.e. all boxes belonging to a full redshift-evolved simulation), allowing simple determination of which cache files are present, and which haven't yet been run (useful for checkpointing).
  • The CacheConfig class simply defines a namespace for defining which boxes to write to cache during a larger run (coeval/lightcone).
  • The cache_tools module has been removed as it is redundant with the above module.
  • All the reading/writing of HDF5 boxes has moved to io/h5.py, and so is separated from the OutputStruct class definitions themselves. This might facilitate implementing different cache formats in the future. The file format is also slightly different (I think it's slightly better now -- the format is specified in the docstring of the module, so you can check).
  • There is also a mechanism now for being able to read files written by older versions of the code, so we can maintain explicit backwards compatibility with older outputs.
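The cache-introspection idea can be sketched with a hypothetical mini cache (class and method names illustrative, not the real OutputCache/RunCache API):

```python
from pathlib import Path
import tempfile

class MiniCache:
    """Illustrative stand-in: a directory of files keyed by struct kind and
    redshift, with simple introspection of what has/hasn't been computed."""

    def __init__(self, direc: str):
        self.direc = Path(direc)

    def path_for(self, kind: str, redshift: float) -> Path:
        return self.direc / f"{kind}_z{redshift:.2f}.h5"

    def exists(self, kind: str, redshift: float) -> bool:
        return self.path_for(kind, redshift).exists()

    def missing(self, kind: str, redshifts) -> list:
        """Which redshifts still need to be run -- useful for checkpointing."""
        return [z for z in redshifts if not self.exists(kind, z)]
```

This is the kind of "which cache files are present, and which haven't yet been run" question the RunCache answers for a full redshift-evolved simulation.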

Single-Field Computations

  • The single_field module is a lot simpler. I have moved most of the boilerplate logic to a class-style decorator in _param_config.
  • This new decorator checks redshift consistency, input parameter consistency, manages the cache and sets the current redshift appropriately given all inputs.
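A toy version of such a class-style decorator, assuming a simplified inputs dict (the real decorator in _param_config does considerably more, including parameter-consistency and cache management):

```python
import functools

class check_redshift_consistency:
    """Class-style decorator sketch: validate the requested redshift against
    the inputs before running a single-field computation."""

    def __call__(self, func):
        @functools.wraps(func)
        def wrapper(*, redshift, inputs, **kw):
            # Illustrative check: the redshift must be one of the node redshifts.
            if redshift not in inputs["node_redshifts"]:
                raise ValueError(f"z={redshift} is not a node redshift")
            return func(redshift=redshift, inputs=inputs, **kw)
        return wrapper

@check_redshift_consistency()
def compute_field(*, redshift, inputs):
    # Stand-in for a single-field computation.
    return {"redshift": redshift}
```

Centralising these checks in one decorator is what lets each single-field function shed its copy of the boilerplate.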

Lightcone / Coeval

  • I refactored some re-used code in run_coeval and run_lightcone into a set of external functions: evolve_perturb_halos and _redshift_loop_generator.
  • The Coeval and Lightcone objects are much more slim now. I removed the ability to "gather" the cached files associated with a coeval/lc, instead relying on the improved caching module to let people deal with their full-run caches.
  • Also, to read a Coeval/Lightcone you now use Coeval.from_file instead of Coeval.read(), which I think is more intuitive.

Configuration

  • I actually think we should generally move away from package-wide configuration, because it always causes trouble. I haven't removed the module itself here because it's slightly outside the scope of the PR, but I did remove the "regenerate" and "write" configuration options, and removed all places where the config was used.
  • We will have to think about how to re-implement all the functionality we had in the config (e.g. number of sigfigs for the cache). Probably most of this can be put directly into new objects (like the CacheConfig).

Other Stuff

  • I've removed any documentation or caching references to "global params". These are now to be treated as almost purely read-only (and we should move towards them being completely removed soon).
  • I moved the definition of InputParameters from param_config to inputs just because I was getting circular imports.

Meta-info:

  • These changes break strict backwards-compatibility

Issues Solved

@steven-murray steven-murray marked this pull request as ready for review December 14, 2024 00:51

codecov bot commented Dec 24, 2024

Codecov Report

Attention: Patch coverage is 78.69393% with 323 lines in your changes missing coverage. Please review.

Project coverage is 76.88%. Comparing base (5930245) to head (6b894c8).
Report is 3 commits behind head on v4-prep.

Files with missing lines Patch % Lines
src/py21cmfast/io/caching.py 54.43% 74 Missing and 3 partials ⚠️
src/py21cmfast/wrapper/outputs.py 82.63% 47 Missing and 19 partials ⚠️
src/py21cmfast/io/h5.py 71.52% 31 Missing and 12 partials ⚠️
src/py21cmfast/drivers/_param_config.py 82.94% 19 Missing and 10 partials ⚠️
src/py21cmfast/drivers/coeval.py 79.85% 21 Missing and 6 partials ⚠️
src/py21cmfast/wrapper/inputs.py 84.96% 15 Missing and 8 partials ⚠️
src/py21cmfast/wrapper/arrays.py 75.00% 9 Missing and 7 partials ⚠️
src/py21cmfast/drivers/lightcone.py 90.35% 5 Missing and 6 partials ⚠️
src/py21cmfast/drivers/single_field.py 85.33% 7 Missing and 4 partials ⚠️
src/py21cmfast/cli.py 55.55% 4 Missing ⚠️
... and 6 more
Additional details and impacted files
@@             Coverage Diff             @@
##           v4-prep     #445      +/-   ##
===========================================
- Coverage    79.56%   76.88%   -2.69%     
===========================================
  Files           24       27       +3     
  Lines         3803     3747      -56     
  Branches       647      611      -36     
===========================================
- Hits          3026     2881     -145     
- Misses         558      648      +90     
+ Partials       219      218       -1     


Contributor

@daviesje daviesje left a comment

Looks great! I had a few questions and minor points, but this looks like a huge improvement.

pf = pf2
_bt = None
hb = hb2
st = st2
Contributor

How are we separating the node redshifts from the output redshifts here? Previously, we only created the coevals on the outputs and only updated the previous snapshot on the nodes.

Member Author

Hmmm I think I may have slightly broken this. The idea is still to only evolve based on the node redshifts, but to yield on every redshift (either out_redshift or node_redshift). Currently, it looks like I might be evolving on everything, so I should check and fix that.

Member Author

Yeah, I had to fix this. I just put an if z in inputs.node_redshifts: check in before updating the "current" boxes.

v
for k, v in outputs.items()
if not k.startswith("previous_") and not k.startswith("descendant_")
]:
Contributor

I haven't got to the single fields yet, but I'm curious how this works with the XraySourceBox, which needs the whole HaloBox history

Member Author

Yes, I'd appreciate a closer look at that, as it's not something I'm as familiar with.

Contributor

It looks like at the moment it won't check the HaloBox redshifts, since it is passed as a list(OutputStruct) instead of an OutputStruct. One could add a third option here, e.g. a history_ prefix or similar, which takes a list and needs one entry for each progenitor z.

Member Author

OK so the way I dealt with this is to get two lists of output structs passed to the function:

  1. All output structs that are top-level
  2. All output structs that are either top-level OR exist inside a list of such structs (by setting recurse=True).

The first list gets checked for redshift compatibility, while the second list gets checked for parameter compatibility etc.
So there are no checks that the list of boxes follows the z-grid specified. We could add that as well later if we want.
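The two-list collection described above can be sketched as follows (function and key names hypothetical):

```python
def collect_structs(outputs: dict, recurse: bool = False) -> list:
    """Gather output structs passed to a compute function: top-level entries,
    plus (when recurse=True) entries nested inside lists, such as a HaloBox
    history passed to XraySourceBox."""
    found = []
    for value in outputs.values():
        if isinstance(value, list):
            if recurse:
                found.extend(value)  # structs inside a list of structs
        else:
            found.append(value)      # top-level struct
    return found
```

The recurse=False list would then be checked for redshift compatibility, and the recurse=True list for parameter compatibility.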

if descendant_halos is None:
descendant_halos = HaloField(
descendant_halos = HaloField.new(
redshift=0.0,
inputs=inputs,
dummy=True,
)
Contributor

This reminds me that in the backend, the sampling from grid/descendants is controlled by the redshift of this object being <=0.

Member Author

I wonder if having a .dummy() constructor method would be neater (auto-setting the redshift to, say, -1 and setting dummy=True).

Member Author

I implemented this.

elif perturbed_field:
inputs = perturbed_field[0].inputs

if not out_redshifts and not perturbed_field and not inputs.node_redshifts:
Contributor

What do we want to happen when out_redshifts is None and we have some inputs with node redshifts?

Member Author

The current behaviour (and I'm happy to discuss this) is to yield on all node redshifts and out_redshifts. So as long as at least one of them is non-empty, everything is fine.

Contributor

Reminder: output either the list of out_redshifts Coeval objects with the yield, or a boolean is_in_output flag.

Member Author

I've done the latter, so that memory use is reduced. In the generate_coeval function, two things are now yielded: the Coeval and a bool representing whether the redshift is in out_redshifts. From run_coeval, only the final list of coevals is returned.
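The yield contract can be illustrated with a stripped-down stand-in for generate_coeval/run_coeval (real signatures differ):

```python
def generate_items(redshifts, out_redshifts):
    """Sketch of the generator contract: yield (result, is_in_output) so
    callers can discard non-output snapshots and keep memory low."""
    for z in redshifts:
        yield {"z": z}, z in out_redshifts

def run_items(redshifts, out_redshifts):
    # Mirrors run_coeval: keep only the requested output redshifts.
    return [c for c, keep in generate_items(redshifts, out_redshifts) if keep]
```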


@classmethod
def new(cls, x: dict | InputStruct | None = None, **kwargs):
"""
Contributor

I should update this docstring

Member Author

I updated it modestly -- is that what you were thinking?

default_input_struct.check_output_compatibility([example_ib])

default_input_struct.check_output_compatibility([perturbed_field])
# def test_inputstruct_outputs(
Contributor

Do we want to rewrite this test to cover the compatibility checks?

Member Author

Probably a good idea. I can't remember what all I've covered in tests now, but will have another think tomorrow.

@daviesje
Contributor

daviesje commented Dec 31, 2024

One of the current failing tests (the macOS 3.12) is the same issue of workflows losing one of the temp directories, but there's a GSL error in the Ubuntu 3.12 run which I haven't seen before. Nikos found something similar when running the database. I'm curious what's causing this, since sometimes just rerunning makes it work again. I don't think it has much to do with this PR, but we should look into it.


if lib.photon_cons_allocated:
lib.FreePhotonConsMemory()


def run_coeval(**kwargs) -> list[Coeval]: # noqa: D103
return [coeval for coeval, in_nodes in generate_coeval(**kwargs) if in_nodes]
Contributor

I think this is in_outputs rather than in_nodes?

# the last one.
minimum_node = len(inputs.node_redshifts) - 1

if minimum_node < 0 or inputs.flag_options.USE_HALO_FIELD:
Contributor

This might be apparent later, but why are we starting at the beginning every time with the halos? Is it so the HaloBox array passed to XraySourceBox always gets populated? It might be a good idea (not necessarily in this PR) to extend this function to run through all three redshift loops (perturbed field, halos, forward loop) to find where we stopped last time and gather all the necessary OutputStructs.

# def is_partial(self):
# """Whether the cache is complete down to some redshift, but not the last z."""
# z, idx = self.get_completed_redshift()
# return idx == len(self.inputs.node_redshifts) - 1
Contributor

We don't need this anymore?

def initial(cls, inputs: InputParameters = InputParameters(random_seed=1)):
"""Create a dummy instance with the given inputs."""
return cls.new(inputs=inputs, redshift=-1.0, initial=True)

Contributor

I just wanted to double-check that not having the same InputParameters won't mess anything up, either in the parameter checks or the backend. Since all the calls to these constructors are made without the inputs kwarg, I'm guessing it doesn't matter, but in that case should we just force the input parameters to be InputParameters(random_seed=1) and make sure they are never called/used?

@@ -260,7 +274,7 @@ def rectlcn(

@pytest.fixture(scope="session")
def lc(rectlcn, ic, cache, default_input_struct_lc):
*_, lc = exhaust_lightcone(
*_, lc = run_lightcone(
Contributor

I didn't know you could unpack like this, cool
