Establish new standard of compressed netCDF output, retaining datetime64[ns] format for time #63

Closed
ryjombari opened this issue Feb 18, 2025 · 8 comments
Labels: enhancement (New feature or request), HMB-gen

Comments

@ryjombari
Collaborator

Our example compressed netCDF format was tested by multiple people at NOAA / NCEI. The only issue they reported was that the time format changed from datetime64[ns] in the uncompressed file to this in the compressed file:

time
Size: 1440x1
Dimensions: time
Datatype: int64
Attributes:
units = "minutes since 2022-07-21 00:00:00"
calendar = "proleptic_gregorian"

Can we:

  1. modify the compression routine to keep datetime64[ns] format for time in the compressed files, and
  2. make compressed netCDF our new standard for output from pbp?
@danellecline
Collaborator

@ryjombari and I revisited this.

The compressed file, as read with Python, does not show any change in the time format:

import xarray as xr
chc = xr.open_dataset('CH01_20220721_quality_flag_compressed.nc')
time=chc["time"]
time.coords
Coordinates:
  * time     (time) datetime64[ns] 12kB 2022-07-21 ... 2022-07-21T23:59:00

Versions of the libraries we are using:

xarray==2025.1.2
numpy==1.26.4
netCDF4==1.7.2
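
For context on how a tool can report int64 with "minutes since ..." while xarray reports datetime64[ns]: netCDF stores time as numbers plus CF `units` and `calendar` attributes, and xarray decodes them on read by default. Below is a minimal in-memory sketch of that decoding. It does not touch our files; the 1440-value axis is rebuilt from the attributes quoted in the first comment, so treat it as an illustration only.

```python
import numpy as np
import xarray as xr

# Time as a non-decoding reader would see it: integers plus CF attributes.
raw = xr.Dataset(
    coords={
        "time": (
            "time",
            np.arange(1440, dtype="int64"),
            {
                "units": "minutes since 2022-07-21 00:00:00",
                "calendar": "proleptic_gregorian",
            },
        )
    }
)

# Apply CF decoding, as xr.open_dataset does by default.
decoded = xr.decode_cf(raw)

print(raw["time"].dtype)
print(decoded["time"].dtype)
print(decoded["time"].values[-1])
```

Opening a file with `xr.open_dataset(path, decode_times=False)` should show the raw integer form in the same way.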

@danellecline
Collaborator

Here are the two files we tested - both the raw and the compressed versions: Archive.zip

@ryjombari
Collaborator Author

ryjombari commented Mar 6, 2025

Update –

Danelle and I confirmed that the compressed netCDF file retains the datetime64[ns] time format, and that the compressed netCDF file provided to NCEI for testing actually held time as datetime64[ns].

So, the issue reported by a tester must have been due to the software they used to read the netCDF file. (They would have had the same issue with the uncompressed netCDF.) Here is what they wrote:

"I did have to make a change to accommodate the datetime formatting (minutes since XXX), but other than that they seem to load fine! Data from the compressed and uncompressed NetCDF files are identical after I read them in, so the compression seems like a great idea."
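
For readers whose software does not decode CF time automatically, the change the tester describes probably amounts to something like the sketch below. It uses only the standard library; `decode_minutes_since` is a hypothetical helper name, and it handles only the exact "minutes since YYYY-MM-DD HH:MM:SS" form shown earlier in this thread, not CF units in general.

```python
from datetime import datetime, timedelta


def decode_minutes_since(values, units):
    """Convert integer offsets to datetimes for a 'minutes since <epoch>' string."""
    prefix = "minutes since "
    if not units.startswith(prefix):
        raise ValueError(f"unsupported units: {units!r}")
    # Parse the reference epoch that follows the prefix.
    epoch = datetime.strptime(units[len(prefix):], "%Y-%m-%d %H:%M:%S")
    return [epoch + timedelta(minutes=int(v)) for v in values]


times = decode_minutes_since(range(1440), "minutes since 2022-07-21 00:00:00")
print(times[0], times[-1])  # 2022-07-21 00:00:00 2022-07-21 23:59:00
```

For anything beyond this simple case, a CF-aware library (for example `cftime.num2date`) is likely the better choice.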

@danellecline
Collaborator

@carueda OK to move forward with adding the compressed versions in PBP. Here is a code snippet that was used to write the compressed version.

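from pathlib import Path

import xarray as xr


def write_compressed_netcdf(ds: xr.Dataset, out_file: Path) -> None:
    """Write ds to out_file as NETCDF4, compressing variables with 2+ dimensions."""
    enc = {}
    for k in ds.data_vars:
        # Skip scalar and 1-D variables; only 2-D and higher are compressed.
        if ds[k].ndim < 2:
            continue
        enc[k] = {
            "zlib": True,        # deflate compression
            "complevel": 3,      # compression level (valid range 0-9)
            "fletcher32": True,  # per-chunk checksum
            # Half of each dimension per chunk; floor at 1 so a length-1
            # dimension does not produce an invalid chunk size of 0.
            "chunksizes": tuple(max(1, n // 2) for n in ds[k].shape),
        }
    ds.to_netcdf(out_file, format="NETCDF4", engine="h5netcdf", encoding=enc)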
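
To make the effect of those settings concrete, here is a small sketch that builds the same kind of encoding dictionary from plain shapes, without needing xarray or a file. `preview_encoding` and the variable names and shapes are hypothetical, chosen only for illustration, and the floor of 1 on each chunk size is an added assumption so that a length-1 dimension never yields a zero chunk.

```python
def preview_encoding(shapes):
    """Build per-variable encoding for a mapping of name -> shape tuple."""
    enc = {}
    for name, shape in shapes.items():
        if len(shape) < 2:
            continue  # scalar and 1-D variables get no encoding entry
        enc[name] = {
            "zlib": True,
            "complevel": 3,
            "fletcher32": True,
            "chunksizes": tuple(max(1, n // 2) for n in shape),
        }
    return enc


# Hypothetical variable names and shapes, for illustration only.
enc = preview_encoding({"psd": (1440, 1000), "time": (1440,)})
print(enc["psd"]["chunksizes"])  # (720, 500)
print("time" in enc)             # False
```

With each dimension halved, a 2-D variable ends up in roughly four chunks, which keeps chunk counts low for daily files of this size.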

@carueda
Member

carueda commented Mar 7, 2025

Thanks @danellecline @ryjombari :

So, in conclusion:

  • by default, HMB gen will generate the netcdf with compression
  • we add a new flag (say --no-netcdf-compression for the CLI, and corresponding parameter for the API), to allow the user to opt out of compression

Correct?

@danellecline
Collaborator

danellecline commented Mar 7, 2025

@carueda

we add a new flag (say --no-netcdf-compression for the CLI, and corresponding parameter for the API), to allow the user to opt out of compression

IMHO that is up to @ryjombari .

I think it's safe to conclude that multiple readers can read the compressed format, but I can see why keeping the option to save uncompressed files could also be helpful for backwards compatibility.

@danellecline added the enhancement and HMB-gen labels Mar 7, 2025
@ryjombari
Collaborator Author

@carueda I think you made the right call on this first pass:

  • compress by default
  • allow the user to opt out
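
A rough sketch of how such an opt-out flag could behave, assuming an argparse-based CLI. This is not the actual PBP implementation; the parser name and the `compression_enabled` helper are hypothetical, and only the flag name `--no-netcdf-compression` comes from this thread.

```python
import argparse


def build_parser():
    """Hypothetical parser showing only the compression opt-out flag."""
    parser = argparse.ArgumentParser(prog="hmb-gen-sketch")
    parser.add_argument(
        "--no-netcdf-compression",
        action="store_true",
        help="Write uncompressed NetCDF (compression is on by default)",
    )
    return parser


def compression_enabled(argv):
    """Return True unless the user passed the opt-out flag."""
    args = build_parser().parse_args(argv)
    return not args.no_netcdf_compression


print(compression_enabled([]))                           # True
print(compression_enabled(["--no-netcdf-compression"]))  # False
```

Using a negative `store_true` flag keeps compression as the default without requiring any change to existing invocations.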

Standing by to test on gizo whenever we are ready... and crank out the most recent deployments of MB05 and CH01.

@carueda
Member

carueda commented Mar 8, 2025

OK, I've merged #65, which added the NetCDF compression plus the CLI option --no-netcdf-compression. I then added API mechanisms for the same effect and updated the CLI documentation at https://docs.mbari.org/pbp/pbp-hmb-gen/. With that, I decided to publish a new release: https://pypi.org/project/mbari-pbp/.

Some notes:

  • Though we have some other items in the works, I went ahead with a release to verify that it works as usual
  • The CHANGELOG.md in the repo gives details about how to programmatically disable compression. It is still a TODO to describe the API in general at https://docs.mbari.org/pbp/

@carueda closed this as completed Mar 8, 2025