
Export data from katdal to zarr #315

Merged: 39 commits merged into master from katdal-export on Mar 28, 2024

Conversation

@sjperkins (Member) commented Mar 25, 2024

@ludwigschwardt you might find this interesting

  • Tests added / passed

    $ py.test -v -s daskms/tests

    If the pep8 tests fail, the quickest way to correct
    this is to run autopep8 and then flake8 and
    pycodestyle to fix the remaining issues.

    $ pip install -U autopep8 flake8 pycodestyle
    $ autopep8 -r -i daskms
    $ flake8 daskms
    $ pycodestyle daskms
    
  • Fully documented, including HISTORY.rst for all changes
    and one of the docs/*-api.rst files for new API

    To build the docs locally:

    $ pip install -r requirements.readthedocs.txt
    $ cd docs
    $ READTHEDOCS=True make html
    

@sjperkins (Member Author)

@kvanqa and I tried this with an actual katdal MVFv4 RDB link and it just worked!

@sjperkins (Member Author)

Here is an example with https://archive-gw-1.kat.ac.za/1711249692/1711249692_sdp_l0.full.rdb. Interestingly, xarray's estimate of the dataset size matches the archive's estimate of the MS size that mvftoms.py would produce.

Based on two conversions, we're seeing roughly a 30% reduction in size for the zarr dataset compared to the MS: for example, the zarr output below is 11.9GB on disk vs an estimated 17.4GB for the MS (a sketch for checking this follows the test output).

$ poetry run py.test -s -vvv -k test_chunkstore daskms
not stress and not optional and not applications
============================================================ test session starts =============================================================
platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.4.0 -- /home/simon/.cache/pypoetry/virtualenvs/dask-ms-jCyuTJVk-py3.10/bin/python
cachedir: .pytest_cache
rootdir: /home/simon/code/dask-ms
collected 324 items / 322 deselected / 2 skipped / 2 selected                                                                                

daskms/experimental/katdal/tests/test_chunkstore.py::test_chunkstore[output.zarr-True-False] <xarray.Dataset> Size: 17GB
Dimensions:          (row: 59508, uvw: 3, chan: 4096, corr: 4)
Dimensions without coordinates: row, uvw, chan, corr
Data variables: (12/20)
    TIME             (row) float64 476kB dask.array<chunksize=(1653,), meta=np.ndarray>
    ANTENNA1         (row) int32 238kB dask.array<chunksize=(1653,), meta=np.ndarray>
    ANTENNA2         (row) int32 238kB dask.array<chunksize=(1653,), meta=np.ndarray>
    FEED1            (row) int32 238kB dask.array<chunksize=(1653,), meta=np.ndarray>
    FEED2            (row) int32 238kB dask.array<chunksize=(1653,), meta=np.ndarray>
    DATA_DESC_ID     (row) int32 238kB dask.array<chunksize=(1653,), meta=np.ndarray>
    ...               ...
    EXPOSURE         (row) float64 476kB dask.array<chunksize=(1653,), meta=np.ndarray>
    UVW              (row, uvw) float64 1MB dask.array<chunksize=(1653, 3), meta=np.ndarray>
    DATA             (row, chan, corr) complex64 8GB dask.array<chunksize=(1653, 256, 4), meta=np.ndarray>
    FLAG             (row, chan, corr) bool 975MB dask.array<chunksize=(1653, 256, 4), meta=np.ndarray>
    WEIGHT_SPECTRUM  (row, chan, corr) float32 4GB dask.array<chunksize=(1653, 256, 4), meta=np.ndarray>
    SIGMA_SPECTRUM   (row, chan, corr) float32 4GB dask.array<chunksize=(1653, 256, 4), meta=np.ndarray>
PASSED
daskms/experimental/katdal/tests/test_chunkstore.py::test_chunkstore[output.zarr-False-False] <xarray.Dataset> Size: 17GB
Dimensions:          (time: 36, baseline: 1653, uvw: 3, chan: 4096, corr: 4)
Dimensions without coordinates: time, baseline, uvw, chan, corr
Data variables: (12/20)
    TIME             (time, baseline) float64 476kB dask.array<chunksize=(1, 1653), meta=np.ndarray>
    ANTENNA1         (time, baseline) int32 238kB dask.array<chunksize=(1, 1653), meta=np.ndarray>
    ANTENNA2         (time, baseline) int32 238kB dask.array<chunksize=(1, 1653), meta=np.ndarray>
    FEED1            (time, baseline) int32 238kB dask.array<chunksize=(1, 1653), meta=np.ndarray>
    FEED2            (time, baseline) int32 238kB dask.array<chunksize=(1, 1653), meta=np.ndarray>
    DATA_DESC_ID     (time, baseline) int32 238kB dask.array<chunksize=(1, 1653), meta=np.ndarray>
    ...               ...
    EXPOSURE         (time, baseline) float64 476kB dask.array<chunksize=(1, 1653), meta=np.ndarray>
    UVW              (time, baseline, uvw) float64 1MB dask.array<chunksize=(1, 1653, 3), meta=np.ndarray>
    DATA             (time, baseline, chan, corr) complex64 8GB dask.array<chunksize=(1, 1653, 256, 4), meta=np.ndarray>
    FLAG             (time, baseline, chan, corr) bool 975MB dask.array<chunksize=(1, 1653, 256, 4), meta=np.ndarray>
    WEIGHT_SPECTRUM  (time, baseline, chan, corr) float32 4GB dask.array<chunksize=(1, 1653, 256, 4), meta=np.ndarray>
    SIGMA_SPECTRUM   (time, baseline, chan, corr) float32 4GB dask.array<chunksize=(1, 1653, 256, 4), meta=np.ndarray>
PASSED
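
As a rough cross-check of the 11.9GB vs 17.4GB figures, here is a minimal sketch, assuming a single dataset was written to output.zarr; the helper function is illustrative, not part of dask-ms:

```python
import os
import xarray as xr

def on_disk_size(path):
    """Sum the sizes of all files under a zarr store directory."""
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, names in os.walk(path)
        for name in names
    )

ds = xr.open_zarr("output.zarr")       # lazily open the exported store
in_memory = ds.nbytes                  # uncompressed, in-memory estimate (~17GB above)
on_disk = on_disk_size("output.zarr")  # compressed size on disk (~11.9GB above)
print(f"zarr store is {1 - on_disk / in_memory:.0%} smaller than the in-memory estimate")
```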

@sjperkins (Member Author)

We tested that applying the L1 calibration solutions works to first order (a basic apply succeeded).
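
For context, this is roughly how the calibrated data can be inspected; a sketch only, assuming katdal's applycal option and that "l1" selects the L1 products (the exact product string may differ):

```python
import katdal

# Ask katdal to apply the L1 calibration solutions on the fly
# ("l1" is assumed to select all L1 products here).
ds = katdal.open(
    "https://archive-gw-1.kat.ac.za/1711249692/1711249692_sdp_l0.full.rdb",
    applycal="l1",
)

# A first-order check: pull a few calibrated (time, chan, corrprod) dumps.
vis = ds.vis[:10]
```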

@sjperkins (Member Author)

@ludwigschwardt I did the work here because it seemed more convenient, but we can certainly discuss moving it to other locations if that makes sense.

@sjperkins (Member Author)

@david-macmahon might find this interesting too

@sjperkins (Member Author)

tricolour, QuartiCal and pfb-clean all appear to partition Measurement Sets as follows: ["FIELD_ID", "DATA_DESC_ID", "SCAN_NUMBER"].

I'm going to reproduce this partitioning logic in this export and just want to sanity check this before I do so.

@bennahugo @JSKenyon @landmanbester Let me know if there's another desirable partition (I can't think of one offhand).
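
For reference, this is how the same partitioning looks when reading an existing Measurement Set with dask-ms (a sketch; observation.ms is a placeholder path):

```python
from daskms import xds_from_ms

# One dataset per (FIELD_ID, DATA_DESC_ID, SCAN_NUMBER) combination,
# mirroring the partitioning used by tricolour, QuartiCal and pfb-clean.
datasets = xds_from_ms(
    "observation.ms",
    group_cols=["FIELD_ID", "DATA_DESC_ID", "SCAN_NUMBER"],
)

# The grouping column values are attached to each dataset.
for ds in datasets:
    print(ds.FIELD_ID, ds.DATA_DESC_ID, ds.SCAN_NUMBER)
```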

@JSKenyon (Collaborator)

I think that SCAN_NUMBER isn't required by QuartiCal and may not be desirable for 1GC, i.e. if you want to calibrate over all the scans simultaneously.

@landmanbester (Collaborator)

> I think that SCAN_NUMBER isn't required by QuartiCal and may not be desirable for 1GC, i.e. if you want to calibrate over all the scans simultaneously.

It's much easier to concatenate by scan than to split back into scans though, isn't it?

@sjperkins (Member Author)

> I think that SCAN_NUMBER isn't required by QuartiCal and may not be desirable for 1GC, i.e. if you want to calibrate over all the scans simultaneously.

> It's much easier to concatenate by scan than to split back into scans though, isn't it?

Actually, the code currently generates datasets per scan, field and spw (DDID, really), but concatenates them all to form one large dataset. I did this for exploratory purposes.

I'll go with ["FIELD_ID", "DATA_DESC_ID", "SCAN_NUMBER"] for now.
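
A small self-contained sketch of that recombination step with xarray (synthetic toy data, not the actual export code):

```python
import numpy as np
import xarray as xr

# Two toy per-scan datasets that share the chan and corr dimensions.
scan1 = xr.Dataset({"DATA": (("row", "chan", "corr"), np.zeros((4, 8, 2), np.complex64))})
scan2 = xr.Dataset({"DATA": (("row", "chan", "corr"), np.zeros((6, 8, 2), np.complex64))})

# Concatenating along row folds the partitions into one large dataset;
# splitting back out later requires grouping on SCAN_NUMBER et al.
combined = xr.concat([scan1, scan2], dim="row")
assert combined.sizes["row"] == 10
```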

@sjperkins (Member Author)

Once this PR is merged, the following should make it possible to export a SARAO archive link directly to zarr format, without first converting to MSv2 via mvftoms.py:

$ pip install dask-ms[katdal]
$ dask-ms katdal import <rdb-link>
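
The exported store should then be readable with dask-ms's experimental zarr reader; a sketch, with output.zarr standing in for whatever path the CLI writes:

```python
from daskms.experimental.zarr import xds_from_zarr

# Lazily read the exported datasets back; expect one element per
# (FIELD_ID, DATA_DESC_ID, SCAN_NUMBER) partition.
datasets = xds_from_zarr("output.zarr")

for ds in datasets:
    print(ds)  # inspect dimensions and data variables per partition
```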

@sjperkins (Member Author)

Capture Block 1711437619 (small, at 90MB) is useful for a quick sanity check at the BRP.

@sjperkins sjperkins merged commit 32e866b into master Mar 28, 2024
@sjperkins sjperkins deleted the katdal-export branch March 28, 2024 13:38