Export data from katdal to zarr #315
Conversation
@kvanqa and I tried this with an actual katdal MVFv4 RDB link and it just worked!
Based on two conversions, we're seeing roughly a 30% reduction in size for the zarr dataset compared to the MS. For example, the zarr output below is 11.9GB on disk vs an estimated 17.4GB uncompressed.

```
$ poetry run py.test -s -vvv -k test_chunkstore -m "not stress and not optional and not applications" daskms
============================================================ test session starts =============================================================
platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.4.0 -- /home/simon/.cache/pypoetry/virtualenvs/dask-ms-jCyuTJVk-py3.10/bin/python
cachedir: .pytest_cache
rootdir: /home/simon/code/dask-ms
collected 324 items / 322 deselected / 2 skipped / 2 selected
daskms/experimental/katdal/tests/test_chunkstore.py::test_chunkstore[output.zarr-True-False] <xarray.Dataset> Size: 17GB
Dimensions: (row: 59508, uvw: 3, chan: 4096, corr: 4)
Dimensions without coordinates: row, uvw, chan, corr
Data variables: (12/20)
TIME (row) float64 476kB dask.array<chunksize=(1653,), meta=np.ndarray>
ANTENNA1 (row) int32 238kB dask.array<chunksize=(1653,), meta=np.ndarray>
ANTENNA2 (row) int32 238kB dask.array<chunksize=(1653,), meta=np.ndarray>
FEED1 (row) int32 238kB dask.array<chunksize=(1653,), meta=np.ndarray>
FEED2 (row) int32 238kB dask.array<chunksize=(1653,), meta=np.ndarray>
DATA_DESC_ID (row) int32 238kB dask.array<chunksize=(1653,), meta=np.ndarray>
... ...
EXPOSURE (row) float64 476kB dask.array<chunksize=(1653,), meta=np.ndarray>
UVW (row, uvw) float64 1MB dask.array<chunksize=(1653, 3), meta=np.ndarray>
DATA (row, chan, corr) complex64 8GB dask.array<chunksize=(1653, 256, 4), meta=np.ndarray>
FLAG (row, chan, corr) bool 975MB dask.array<chunksize=(1653, 256, 4), meta=np.ndarray>
WEIGHT_SPECTRUM (row, chan, corr) float32 4GB dask.array<chunksize=(1653, 256, 4), meta=np.ndarray>
SIGMA_SPECTRUM (row, chan, corr) float32 4GB dask.array<chunksize=(1653, 256, 4), meta=np.ndarray>
PASSED
daskms/experimental/katdal/tests/test_chunkstore.py::test_chunkstore[output.zarr-False-False] <xarray.Dataset> Size: 17GB
Dimensions: (time: 36, baseline: 1653, uvw: 3, chan: 4096, corr: 4)
Dimensions without coordinates: time, baseline, uvw, chan, corr
Data variables: (12/20)
TIME (time, baseline) float64 476kB dask.array<chunksize=(1, 1653), meta=np.ndarray>
ANTENNA1 (time, baseline) int32 238kB dask.array<chunksize=(1, 1653), meta=np.ndarray>
ANTENNA2 (time, baseline) int32 238kB dask.array<chunksize=(1, 1653), meta=np.ndarray>
FEED1 (time, baseline) int32 238kB dask.array<chunksize=(1, 1653), meta=np.ndarray>
FEED2 (time, baseline) int32 238kB dask.array<chunksize=(1, 1653), meta=np.ndarray>
DATA_DESC_ID (time, baseline) int32 238kB dask.array<chunksize=(1, 1653), meta=np.ndarray>
... ...
EXPOSURE (time, baseline) float64 476kB dask.array<chunksize=(1, 1653), meta=np.ndarray>
UVW (time, baseline, uvw) float64 1MB dask.array<chunksize=(1, 1653, 3), meta=np.ndarray>
DATA (time, baseline, chan, corr) complex64 8GB dask.array<chunksize=(1, 1653, 256, 4), meta=np.ndarray>
FLAG (time, baseline, chan, corr) bool 975MB dask.array<chunksize=(1, 1653, 256, 4), meta=np.ndarray>
WEIGHT_SPECTRUM (time, baseline, chan, corr) float32 4GB dask.array<chunksize=(1, 1653, 256, 4), meta=np.ndarray>
SIGMA_SPECTRUM (time, baseline, chan, corr) float32 4GB dask.array<chunksize=(1, 1653, 256, 4), meta=np.ndarray>
PASSED
```
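For a rough check of the on-disk vs logical size comparison above, here is a minimal Python sketch; `output.zarr` is assumed to be the store written by the test, and the directory-walking helper is an illustrative assumption rather than anything in the test suite:

```python
# Minimal sketch, not part of the test suite: compare the zarr store's
# on-disk (compressed) size against the logical size xarray reports.
# "output.zarr" is assumed to be the store written by the test above.
import os

from daskms.experimental.zarr import xds_from_zarr


def disk_size(path):
    """Sum the sizes of all files under a zarr store directory."""
    return sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(path)
        for f in files
    )


datasets = xds_from_zarr("output.zarr")
logical = sum(ds.nbytes for ds in datasets)  # uncompressed bytes
physical = disk_size("output.zarr")          # compressed bytes on disk

print(f"{physical / 1e9:.1f}GB on disk vs {logical / 1e9:.1f}GB logical "
      f"({1 - physical / logical:.0%} saving)")
```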
We tested that applying the L1 calibration solutions works to first order (a basic apply succeeded).
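For readers unfamiliar with the terminology, a "basic apply" of diagonal per-antenna gains amounts to the first-order correction sketched below; the array names, shapes and random data are illustrative assumptions, not the code used in this PR:

```python
# Illustrative sketch of a first-order diagonal gain apply:
# corrected_pq = vis_pq / (g_p * conj(g_q)) per channel and correlation.
import numpy as np

nrow, nchan, ncorr, nant = 10, 16, 4, 4
rng = np.random.default_rng(42)

vis = rng.normal(size=(nrow, nchan, ncorr)) + 1j * rng.normal(size=(nrow, nchan, ncorr))
gains = rng.normal(size=(nant, nchan, ncorr)) + 1j * rng.normal(size=(nant, nchan, ncorr))
ant1 = rng.integers(0, nant, size=nrow)  # first antenna per row
ant2 = rng.integers(0, nant, size=nrow)  # second antenna per row

# Divide out the gain of each antenna in the baseline
corrected = vis / (gains[ant1] * np.conj(gains[ant2]))
```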
@ludwigschwardt I did the work here because it seemed more convenient, but we can certainly discuss moving it to other locations if that makes sense.
@david-macmahon might find this interesting too.
tricolour, QuartiCal and pfb-clean all appear to partition Measurement Sets over `FIELD_ID`, `DATA_DESC_ID` and `SCAN_NUMBER`. I'm going to reproduce this partitioning logic in this export and just want to sanity check it before I do so. @bennahugo @JSKenyon @landmanbester Let me know if there's another desirable partition (I can't think of one offhand).
I think that …
It's much easier to concatenate by scan than to split back into scans though, isn't it?
Actually, the code currently generates datasets per scan, field and spw (DDID, really), but concatenates them all to form one large dataset. I did this for exploratory purposes. I'll go with the partitioning described above, then.
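For reference, this is the grouping dask-ms already supports when reading an MS. A sketch of the equivalent read, assuming the conventional grouping columns (`test.ms` is a placeholder path):

```python
# Sketch: partition an MS per (FIELD_ID, DATA_DESC_ID, SCAN_NUMBER),
# the grouping used by tricolour, QuartiCal and pfb-clean.
from daskms import xds_from_ms

datasets = xds_from_ms(
    "test.ms",  # placeholder path
    group_cols=["FIELD_ID", "DATA_DESC_ID", "SCAN_NUMBER"],
)

# Each dataset carries its grouping values as attributes
for ds in datasets:
    print(ds.FIELD_ID, ds.DATA_DESC_ID, ds.SCAN_NUMBER)
```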
Once this PR is merged, it should be possible to export a SARAO archive link directly to zarr format, without first converting to MSv2:

```
$ pip install dask-ms[katdal]
$ dask-ms katdal import <rdb-link>
```
Capture Block 1711437619 (a small 90MB dataset) is useful for a quick sanity check at the BRP.
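A quick way to eyeball such an export, assuming the CLI above wrote its output to `output.zarr` (the path is an assumption):

```python
# Open the exported store and check dimensions, dtypes and chunking
from daskms.experimental.zarr import xds_from_zarr

for ds in xds_from_zarr("output.zarr"):
    print(ds)
```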
@ludwigschwardt you might find this interesting
- [ ] Tests added / passed

  If the pep8 tests fail, the quickest way to correct this is to run `autopep8` and then `flake8` and `pycodestyle` to fix the remaining issues.

- [ ] Fully documented, including `HISTORY.rst` for all changes and one of the `docs/*-api.rst` files for new API

  To build the docs locally: