
Export data from katdal to zarr #315

Merged: 39 commits merged into master from katdal-export on Mar 28, 2024

Conversation

@sjperkins (Member) commented Mar 25, 2024

@ludwigschwardt you might find this interesting

  • Tests added / passed

    $ py.test -v -s daskms/tests

    If the pep8 tests fail, the quickest way to correct
    this is to run autopep8 and then flake8 and
    pycodestyle to fix the remaining issues.

    $ pip install -U autopep8 flake8 pycodestyle
    $ autopep8 -r -i daskms
    $ flake8 daskms
    $ pycodestyle daskms
    
  • Fully documented, including HISTORY.rst for all changes
    and one of the docs/*-api.rst files for new API

    To build the docs locally:

    $ pip install -r requirements.readthedocs.txt
    $ cd docs
    $ READTHEDOCS=True make html
    

@sjperkins (Member Author)

@kvanqa and I tried this with an actual katdal MVFv4 RDB link and it just worked!

@sjperkins (Member Author)

Here is an example with https://archive-gw-1.kat.ac.za/1711249692/1711249692_sdp_l0.full.rdb. Interestingly, xarray's estimate of the dataset size matches the archive's estimate of the MS size that mvftoms.py would produce.

Based on two conversions, we're seeing roughly a 30% reduction in size for the zarr dataset compared to the MS: for example, the zarr output below is 11.9GB on disk vs an estimated 17.4GB for the MS (a sketch for checking this follows the test output).

$ poetry run py.test -s -vvv -k test_chunkstore daskms
not stress and not optional and not applications
============================================================ test session starts =============================================================
platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.4.0 -- /home/simon/.cache/pypoetry/virtualenvs/dask-ms-jCyuTJVk-py3.10/bin/python
cachedir: .pytest_cache
rootdir: /home/simon/code/dask-ms
collected 324 items / 322 deselected / 2 skipped / 2 selected                                                                                

daskms/experimental/katdal/tests/test_chunkstore.py::test_chunkstore[output.zarr-True-False] <xarray.Dataset> Size: 17GB
Dimensions:          (row: 59508, uvw: 3, chan: 4096, corr: 4)
Dimensions without coordinates: row, uvw, chan, corr
Data variables: (12/20)
    TIME             (row) float64 476kB dask.array<chunksize=(1653,), meta=np.ndarray>
    ANTENNA1         (row) int32 238kB dask.array<chunksize=(1653,), meta=np.ndarray>
    ANTENNA2         (row) int32 238kB dask.array<chunksize=(1653,), meta=np.ndarray>
    FEED1            (row) int32 238kB dask.array<chunksize=(1653,), meta=np.ndarray>
    FEED2            (row) int32 238kB dask.array<chunksize=(1653,), meta=np.ndarray>
    DATA_DESC_ID     (row) int32 238kB dask.array<chunksize=(1653,), meta=np.ndarray>
    ...               ...
    EXPOSURE         (row) float64 476kB dask.array<chunksize=(1653,), meta=np.ndarray>
    UVW              (row, uvw) float64 1MB dask.array<chunksize=(1653, 3), meta=np.ndarray>
    DATA             (row, chan, corr) complex64 8GB dask.array<chunksize=(1653, 256, 4), meta=np.ndarray>
    FLAG             (row, chan, corr) bool 975MB dask.array<chunksize=(1653, 256, 4), meta=np.ndarray>
    WEIGHT_SPECTRUM  (row, chan, corr) float32 4GB dask.array<chunksize=(1653, 256, 4), meta=np.ndarray>
    SIGMA_SPECTRUM   (row, chan, corr) float32 4GB dask.array<chunksize=(1653, 256, 4), meta=np.ndarray>
PASSED
daskms/experimental/katdal/tests/test_chunkstore.py::test_chunkstore[output.zarr-False-False] <xarray.Dataset> Size: 17GB
Dimensions:          (time: 36, baseline: 1653, uvw: 3, chan: 4096, corr: 4)
Dimensions without coordinates: time, baseline, uvw, chan, corr
Data variables: (12/20)
    TIME             (time, baseline) float64 476kB dask.array<chunksize=(1, 1653), meta=np.ndarray>
    ANTENNA1         (time, baseline) int32 238kB dask.array<chunksize=(1, 1653), meta=np.ndarray>
    ANTENNA2         (time, baseline) int32 238kB dask.array<chunksize=(1, 1653), meta=np.ndarray>
    FEED1            (time, baseline) int32 238kB dask.array<chunksize=(1, 1653), meta=np.ndarray>
    FEED2            (time, baseline) int32 238kB dask.array<chunksize=(1, 1653), meta=np.ndarray>
    DATA_DESC_ID     (time, baseline) int32 238kB dask.array<chunksize=(1, 1653), meta=np.ndarray>
    ...               ...
    EXPOSURE         (time, baseline) float64 476kB dask.array<chunksize=(1, 1653), meta=np.ndarray>
    UVW              (time, baseline, uvw) float64 1MB dask.array<chunksize=(1, 1653, 3), meta=np.ndarray>
    DATA             (time, baseline, chan, corr) complex64 8GB dask.array<chunksize=(1, 1653, 256, 4), meta=np.ndarray>
    FLAG             (time, baseline, chan, corr) bool 975MB dask.array<chunksize=(1, 1653, 256, 4), meta=np.ndarray>
    WEIGHT_SPECTRUM  (time, baseline, chan, corr) float32 4GB dask.array<chunksize=(1, 1653, 256, 4), meta=np.ndarray>
    SIGMA_SPECTRUM   (time, baseline, chan, corr) float32 4GB dask.array<chunksize=(1, 1653, 256, 4), meta=np.ndarray>
PASSED
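
As a rough cross-check of the 11.9GB vs 17.4GB figures, here is a minimal sketch, assuming a single dataset was written to output.zarr; the helper function is illustrative, not part of dask-ms:

```python
import os
import xarray as xr

def on_disk_size(path):
    """Sum the sizes of all files under a zarr store directory."""
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, names in os.walk(path)
        for name in names
    )

ds = xr.open_zarr("output.zarr")       # lazily open the exported store
in_memory = ds.nbytes                  # uncompressed, in-memory estimate (~17GB above)
on_disk = on_disk_size("output.zarr")  # compressed size on disk (~11.9GB above)
print(f"zarr store is {1 - on_disk / in_memory:.0%} smaller than the in-memory estimate")
```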

@sjperkins (Member Author)

We tested that applying the L1 calibration solutions works to first order (a basic apply succeeded).
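
For context, this is roughly how the calibrated data can be inspected; a sketch only, assuming katdal's applycal option and that "l1" selects the L1 products (the exact product string may differ):

```python
import katdal

# Ask katdal to apply the L1 calibration solutions on the fly
# ("l1" is assumed to select all L1 products here).
ds = katdal.open(
    "https://archive-gw-1.kat.ac.za/1711249692/1711249692_sdp_l0.full.rdb",
    applycal="l1",
)

# A first-order check: pull a few calibrated (time, chan, corrprod) dumps.
vis = ds.vis[:10]
```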

@sjperkins (Member Author)

@ludwigschwardt I did the work here because it seemed more convenient, but we can certainly discuss moving it to other locations if that makes sense.

@sjperkins (Member Author)

@david-macmahon might find this interesting too

@sjperkins (Member Author)

tricolour, QuartiCal and pfb-clean all appear to partition Measurement Sets as follows: ["FIELD_ID", "DATA_DESC_ID", "SCAN_NUMBER"].

I'm going to reproduce this partitioning logic in this export and just want to sanity check this before I do so.

@bennahugo @JSKenyon @landmanbester Let me know if there's another desirable partition (I can't think of one offhand).
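
For reference, this is how the same partitioning looks when reading an existing Measurement Set with dask-ms (a sketch; observation.ms is a placeholder path):

```python
from daskms import xds_from_ms

# One dataset per (FIELD_ID, DATA_DESC_ID, SCAN_NUMBER) combination,
# mirroring the partitioning used by tricolour, QuartiCal and pfb-clean.
datasets = xds_from_ms(
    "observation.ms",
    group_cols=["FIELD_ID", "DATA_DESC_ID", "SCAN_NUMBER"],
)

# The grouping column values are attached to each dataset.
for ds in datasets:
    print(ds.FIELD_ID, ds.DATA_DESC_ID, ds.SCAN_NUMBER)
```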

@JSKenyon (Collaborator)

I think that SCAN_NUMBER isn't required by QuartiCal and may not be desirable for 1GC, i.e. if you want to calibrate over all the scans simultaneously.

@landmanbester (Collaborator)

> I think that SCAN_NUMBER isn't required by QuartiCal and may not be desirable for 1GC, i.e. if you want to calibrate over all the scans simultaneously.

It's much easier to concatenate by scan than to split back into scans though, isn't it?

@sjperkins (Member Author)

> I think that SCAN_NUMBER isn't required by QuartiCal and may not be desirable for 1GC, i.e. if you want to calibrate over all the scans simultaneously.

> It's much easier to concatenate by scan than to split back into scans though, isn't it?

Actually, the code currently generates datasets per scan, field and spw (DDID, really), but concatenates them all to form one large dataset. I did this for exploratory purposes.

I'll go with ["FIELD_ID", "DATA_DESC_ID", "SCAN_NUMBER"] for now.
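
A small self-contained sketch of that recombination step with xarray (synthetic toy data, not the actual export code):

```python
import numpy as np
import xarray as xr

# Two toy per-scan datasets that share the chan and corr dimensions.
scan1 = xr.Dataset({"DATA": (("row", "chan", "corr"), np.zeros((4, 8, 2), np.complex64))})
scan2 = xr.Dataset({"DATA": (("row", "chan", "corr"), np.zeros((6, 8, 2), np.complex64))})

# Concatenating along row folds the partitions into one large dataset;
# splitting back out later requires grouping on SCAN_NUMBER et al.
combined = xr.concat([scan1, scan2], dim="row")
assert combined.sizes["row"] == 10
```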

@sjperkins (Member Author)

Once this PR is merged, the following should make it possible to export a SARAO archive link directly to zarr format, without first converting to MSv2 via mvftoms.py:

$ pip install dask-ms[katdal]
$ dask-ms katdal import <rdb-link>
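
The exported store should then be readable with dask-ms's experimental zarr reader; a sketch, with output.zarr standing in for whatever path the CLI writes:

```python
from daskms.experimental.zarr import xds_from_zarr

# Lazily read the exported datasets back; expect one element per
# (FIELD_ID, DATA_DESC_ID, SCAN_NUMBER) partition.
datasets = xds_from_zarr("output.zarr")

for ds in datasets:
    print(ds)  # inspect dimensions and data variables per partition
```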

@sjperkins (Member Author)

Capture Block 1711437619 (small, at 90MB) is useful for a quick sanity check at the BRP.

@sjperkins sjperkins merged commit 32e866b into master Mar 28, 2024
@sjperkins sjperkins deleted the katdal-export branch March 28, 2024 13:38