Output Struct Overhaul #445
base: v4-prep
Conversation
Codecov Report
Attention: Patch coverage is

@@           Coverage Diff            @@
##           v4-prep     #445     +/- ##
===========================================
- Coverage    79.56%   76.88%   -2.69%
===========================================
  Files           24       27       +3
  Lines         3803     3747      -56
  Branches       647      611      -36
===========================================
- Hits          3026     2881     -145
- Misses         558      648      +90
+ Partials       219      218       -1

View full report in Codecov by Sentry.
Looks great! I had a few questions and minor points but this looks like a huge improvement
src/py21cmfast/drivers/coeval.py
Outdated
pf = pf2
_bt = None
hb = hb2
st = st2
How are we separating the node redshifts from the output redshifts here? Previously we only created the coevals on the output redshifts and only updated the previous snapshot on the nodes.
Hmmm, I think I may have slightly broken this. The idea is still to evolve based only on the node redshifts, but to yield on every redshift (either `out_redshift` or `node_redshift`). Currently, it looks like I might be evolving on everything, so I should check and fix that.
Yeah, I had to fix this. I just put an `if z in inputs.node_redshifts:` check in before updating the "current" boxes.
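For reference, here is a minimal sketch of the control flow agreed above — evolve only on node redshifts, but yield at every redshift. All names are invented stand-ins, not the actual py21cmfast implementation:

```python
def redshift_loop(node_redshifts, out_redshifts):
    """Yield (z, is_output) for every redshift, descending in z.

    Illustrative only: `snapshot` stands in for computing the coeval
    boxes at z; only node redshifts update the "current" boxes used
    to evolve to the next step.
    """
    all_z = sorted(set(node_redshifts) | set(out_redshifts), reverse=True)
    current = None
    for z in all_z:
        snapshot = ("boxes-at", z)  # stand-in for the computed boxes
        if z in node_redshifts:
            current = snapshot  # evolve only on the node redshifts
        yield z, (z in out_redshifts)

# Yields at 20, 15, 12, 10; only 12 and 10 are flagged as outputs.
results = [z for z, is_out in redshift_loop([20.0, 15.0, 10.0], [12.0, 10.0]) if is_out]
```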
    v
    for k, v in outputs.items()
    if not k.startswith("previous_") and not k.startswith("descendant_")
]:
I haven't got to the single fields yet, but I'm curious how this works with the XraySourceBox, which needs the whole HaloBox history
Yes, I'd appreciate a closer look at that, as it's not something I'm as familiar with.
It looks like at the moment it won't check the `HaloBox` redshifts, since it is passed as a `list(OutputStruct)` instead of an `OutputStruct`. One could add a third option here for a `history_` or similar which takes a list and needs one entry for each progenitor z.
OK, so the way I dealt with this is to get two lists of output structs passed to the function:
- All output structs that are top-level.
- All output structs that are either top-level OR exist inside a list of such structs (by setting `recurse=True`).

The first list gets checked for redshift compatibility, while the second list gets checked for parameter compatibility etc.
So there are no checks that the list of boxes follows the z-grid specified. We could add that as well later if we want.
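A toy sketch of that two-pass gathering (the names `gather_structs` and `FakeStruct` are invented for illustration; the real code inspects the function's actual output-struct arguments):

```python
class FakeStruct:
    """Stand-in for an OutputStruct (illustrative only)."""

    def __init__(self, name):
        self.name = name


def gather_structs(call_kwargs, recurse=False):
    """Collect structs; with recurse=True, also look inside lists of structs."""
    found = []
    for v in call_kwargs.values():
        if isinstance(v, FakeStruct):
            found.append(v)  # top-level struct
        elif recurse and isinstance(v, list):
            found.extend(x for x in v if isinstance(x, FakeStruct))
    return found


call_kwargs = {
    "perturbed_field": FakeStruct("pf"),
    "haloboxes": [FakeStruct("hb1"), FakeStruct("hb2")],  # e.g. a HaloBox history
}
top_level = gather_structs(call_kwargs)                 # checked for redshift compatibility
everything = gather_structs(call_kwargs, recurse=True)  # checked for parameter compatibility
```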
if descendant_halos is None:
-    descendant_halos = HaloField(
+    descendant_halos = HaloField.new(
        redshift=0.0,
        inputs=inputs,
        dummy=True,
    )
This reminds me that in the backend, the sampling from grid/descendants is controlled by the redshift of this object being <=0.
I wonder if having a `.dummy()` constructor method would be neater (auto-setting the redshift to, say, -1 and setting `dummy=True`).
I implemented this.
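For illustration, a stand-in version of such a convenience constructor (invented names, not the real `HaloField`; the backend behaviour referenced above is that a redshift <= 0 signals sampling from the grid rather than from descendants):

```python
class Field:
    """Stand-in for an OutputStruct like HaloField (illustrative only)."""

    def __init__(self, redshift, dummy=False):
        self.redshift = redshift
        self.dummy = dummy

    @classmethod
    def dummy_box(cls):
        # Auto-set a negative redshift so backend code that checks
        # `redshift <= 0` treats this as a descendant-free dummy box.
        return cls(redshift=-1.0, dummy=True)


d = Field.dummy_box()
```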
src/py21cmfast/drivers/coeval.py
Outdated
elif perturbed_field:
    inputs = perturbed_field[0].inputs

if not out_redshifts and not perturbed_field and not inputs.node_redshifts:
What do we want to happen when `out_redshifts` is None and we have some inputs with node redshifts?
The current behaviour (and I'm happy to discuss this) is to yield on all node redshifts and `out_redshifts`. So as long as at least one of them is non-empty, everything is fine.
Reminder to output either the list of Coeval objects at the `out_redshifts` with the yield, or a boolean `is_in_output` flag.
I've done the latter, so that memory is reduced. In the `generate_coeval` function, two things are now returned: the `Coeval` and a bool representing whether the redshift is in `out_redshifts`. From `run_coeval`, only the final list of coevals is returned.
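A schematic of this generator/wrapper split (the names and bodies are illustrative stand-ins for `generate_coeval`/`run_coeval`, not the actual implementation):

```python
def generate_items(redshifts, out_redshifts):
    """Yield (object, is_output) pairs so callers can stream results."""
    for z in redshifts:
        coeval = {"redshift": z}  # stand-in for a computed Coeval
        yield coeval, z in out_redshifts


def run_items(redshifts, out_redshifts):
    # Only the flagged objects are kept; unflagged ones can be
    # garbage-collected as the generator advances, reducing memory.
    return [c for c, is_out in generate_items(redshifts, out_redshifts) if is_out]


kept = run_items([20.0, 15.0, 10.0], [15.0, 10.0])
```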
@classmethod
def new(cls, x: dict | InputStruct | None = None, **kwargs):
    """
I should update this docstring
I updated it modestly -- is that what you were thinking?
default_input_struct.check_output_compatibility([example_ib])

default_input_struct.check_output_compatibility([perturbed_field])
# def test_inputstruct_outputs(
Do we want to rewrite this test to cover the compatibility checks?
Probably a good idea. I can't remember everything I've covered in tests now, but will have another think tomorrow.
While one of the current failing tests (the macOS 3.12) is the same issue of workflows losing one of the temp directories, there's a GSL error in the Ubuntu 3.12 which I haven't seen before. Nikos found something similar when running the database. I'm curious what's causing this, since sometimes just rerunning makes it work again. I don't think it has much to do with this PR, but we should look into it.
if lib.photon_cons_allocated:
    lib.FreePhotonConsMemory()


def run_coeval(**kwargs) -> list[Coeval]:  # noqa: D103
    return [coeval for coeval, in_nodes in generate_coeval(**kwargs) if in_nodes]
I think this is `in_outputs` rather than `in_nodes`?
# the last one.
minimum_node = len(inputs.node_redshifts) - 1

if minimum_node < 0 or inputs.flag_options.USE_HALO_FIELD:
This might be apparent later, but why are we starting at the beginning every time with the halos? Is it so the HaloBox array passed to XraySourceBox always gets populated? It may be a good idea (not necessarily in this PR) to extend this function to run through all three redshift loops (perturbed field, halos, forward loop) to find where we stopped last time and gather all the necessary OutputStructs.
# def is_partial(self):
#     """Whether the cache is complete down to some redshift, but not the last z."""
#     z, idx = self.get_completed_redshift()
#     return idx == len(self.inputs.node_redshifts) - 1
We don't need this anymore?
def initial(cls, inputs: InputParameters = InputParameters(random_seed=1)):
    """Create a dummy instance with the given inputs."""
    return cls.new(inputs=inputs, redshift=-1.0, initial=True)
I just wanted to double-check that not having the same InputParameters won't mess anything up, either in the parameter checks or the backend. Since all the calls to these constructors are without the `inputs` kwarg, I'm guessing it doesn't matter, but in that case should we just force the input parameters to be `InputParameters(random_seed=1)` and make sure that they are never called/used?
@@ -260,7 +274,7 @@ def rectlcn(

@pytest.fixture(scope="session")
def lc(rectlcn, ic, cache, default_input_struct_lc):
-    *_, lc = exhaust_lightcone(
+    *_, lc = run_lightcone(
I didn't know you could unpack like this, cool
Summary

This changes the output structure interface to be simpler and more streamlined.

It is quite a comprehensive set of changes that touch a lot of things on the python side. I'll try to list as many as I can here for easy reference:

Arrays and Backend mapping

- New `arrays.py` module that implements an `Array` object. This object knows about the shape and dtype of an array, without necessarily having it instantiated, but also knows how to instantiate it, pass it to C, and keeps track of the `ArrayState`.

OutputStructs

- `OutputStruct` is now an `attrs` class. More importantly, all of the arrays that it needs to handle are defined directly on the class as `Array` parameters, making it easier to track them.
- New `.new()` classmethod that instantiates it from an `InputParameters` object, getting the shape/dtype info (and which arrays need to be present) from the `inputs`.
- A side effect of the `Array` objects is that the attributes of the `OutputStruct` are no longer numpy arrays, and so you can't do for example `np.mean(ics.lowres_density)` any more. This is smoothed over a bit by new `get()` and `set()` methods specifically for the arrays, so you can do `np.mean(ics.get('lowres_density'))`. This has the added advantage of transparently loading the array from disk if it exists there. Note that on a Coeval object, any field of any OutputStruct can be accessed directly via attribute name, as an array.
- Removed the IO logic from the `OutputStruct` class, instead moving it to the new `io` subpackage.
- New `_compat_hash` attribute on each `OutputStruct` that tells it the level of input-hash required.

Caching / IO of single-fields (OutputStruct)

- The new `io.caching` module implements classes/functions for dealing with the cache. I think this is a bit more intuitive than in previous versions.
- An `OutputCache` object has methods for introspecting a particular cache (defined by some directory the user gives at runtime) and reading/writing OutputStructs to it.
- A `RunCache` manages full runs (i.e. all boxes belonging to a full redshift-evolved simulation), allowing simple determination of which cache files are present, and which haven't yet been run (useful for checkpointing).
- A `CacheConfig` class simply defines a namespace for defining which boxes to write to cache during a larger run (coeval/lightcone).
- The `cache_tools` module has been removed as it is redundant with the above module.
- Reading/writing happens in `io/h5.py`, and so is separated from the OutputStruct class definitions themselves. This might facilitate implementing different cache formats in the future. The file format is also slightly different (I think it's slightly better now -- the format is specified in the docstring of the module, so you can check).

Single-Field Computations

- The `single_field` module is a lot simpler. I have moved most of the boiler-plate logic to a class-style decorator in `_param_config`.

Lightcone / Coeval

- Refactored `run_coeval` and `run_lightcone` into a set of external functions: `evolve_perturb_halos` and `_redshift_loop_generator`.
- `Coeval` and `Lightcone` objects are much more slim now. I removed the ability to "gather" the cached files associated with a coeval/lc, instead relying on the improved caching module to let people deal with their full-run caches.
- Use `Coeval.from_file` instead of `Coeval.read()`, which I think is more intuitive.

Configuration

- Cache writing during a run is now configured via `CacheConfig`.

Other Stuff

- Moved `InputParameters` from `param_config` to `inputs` just because I was getting circular imports.

Meta-info:

Issues Solved
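As a toy illustration of the `get()`/`set()` accessor pattern described in the summary (class names here are invented; the real `Array` class also handles dtypes, C mapping, and on-disk loading):

```python
class Array:
    """Knows its shape without necessarily holding data (sketch)."""

    def __init__(self, shape):
        self.shape = shape
        self.value = None  # not instantiated until needed


class OutputStructSketch:
    """Stand-in OutputStruct whose array attributes are Array objects."""

    def __init__(self):
        self._arrays = {"lowres_density": Array(shape=(4,))}

    def get(self, name):
        arr = self._arrays[name]
        if arr.value is None:
            # Stand-in for allocating, or transparently loading from disk.
            arr.value = [0.0] * arr.shape[0]
        return arr.value

    def set(self, name, value):
        self._arrays[name].value = list(value)


ics = OutputStructSketch()
ics.set("lowres_density", [1.0, 2.0, 3.0, 4.0])
# The analogue of np.mean(ics.get('lowres_density')):
mean = sum(ics.get("lowres_density")) / len(ics.get("lowres_density"))
```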