-
Notifications
You must be signed in to change notification settings - Fork 39
Extend refspec support to [path] entries (without offset/length) #187
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ref = ZarrToZarr(url).translate()
When I saw this line I thought your PR was going to add another possible filetype to open_virtual_dataset
that called kerchunk.zarr.ZarrToZarr
. My response was going to be that I actually don't want to add any further dependency on kerchunk because we're trying to depend on kerchunk as little as possible (see #87).
But it seems your PR doesn't do that yet, instead it adds a useful little workaround that I didn't know was possible! Amazing.
To support reading Zarr v2 with open_virtual_dataset
I would prefer that (in a follow-up PR) we wrote code to manually parse the contents of a zarr v2 store ourselves, using the new ManifestArray.from_array
constructor for maximum efficiency.
path = path_and_byte_range_info[0] | ||
offset = 0 | ||
length = UPath(path).stat().st_size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This trick might be really useful in general, because it seems kerchunk regularly returns a single string path instead of a tuple here. See for example #186 (comment)
def from_kerchunk( | ||
cls, path_and_byte_range_info: tuple[str] | tuple[str, int, int] | ||
) -> "ChunkEntry": | ||
if len(path_and_byte_range_info) == 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could be an isinstance(..., str)
check instead perhaps? Kerchunk's docs on its specification indicate that it should return either a bare str
or a tuple[str, int, int]
. We can't predict when kerchunk will randomly decide to return the former, but we should be as explicit as possible when checking to see if that is what has happened. (I'm worried that there may be other situations in which kerchunk returns a length-1 something that isn't actually a string...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooh, excellent catch! Kerchunk doesn't like to inline Zarr variable data, effectively masking the case you point out.
I just lazily turned it into a NotImplementedError
for now since it would require a bit more thought. (Probably best to follow up in a subsequent PR.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, when does the inline case come up here exactly? I feel like it is also relevant for #119?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a simple reproducer:
import kerchunk.zarr
import xarray as xr
ds = xr.Dataset({}, coords={"time": [0, 1, 2]})
ds.to_zarr("/tmp/ds.zarr", mode="w")
ref = kerchunk.zarr.ZarrToZarr("/tmp/ds.zarr").translate()
ref["refs"]["time/0"] # ['file:///tmp/ds.zarr/time/0']
I think it's a kerchunk bug that Zarrs are never inlined, even when inline_threshold
is set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a kerchunk bug that Zarrs are never inlined, even when inline_threshold is set.
That doesn't surprise me - we already had to deal with fsspec/kerchunk#445.
But we actually don't use kerchunk's inlining feature at all in virtualizarr - instead if we want eager access to some variable we just load it as a numpy array-backed xarray variable instead (
https://virtualizarr.readthedocs.io/en/latest/usage.html#loading-variables). So we just set inline_threshold=0
whenever we call any kerchunk reader internally.
But I'm a little confused - that's all separate from how kerchunk will sometimes return a str
instead of tuple[str, int, int]
. Your example ref["refs"]["time/0"] # ['file:///tmp/ds.zarr/time/0']
is not an example of inlining, it's just kerchunk using a different way to implicitly represent the byte range reference to an entire chunk.
Also as I said above I don't think we should call kerchunk.zarr.ZarrToZarr
inside virtualizarr.open_virtual_dataset
, we should rewrite that function ourselves (but not in this PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, sorry, I gave you an example for the wrong case.
I don't actually have any particular case in mind where data is inlined. I was assuming that inlined data is possible because kerchunk does it. Also I was hoping that I'd be able to use VirtualiZarr to open references that I've already kerchunked, in which case it would certainly occur.
If the only intended use of kerchunk is internal, should I remove that check? Or at least reword the text so that it's not a TODO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I was hoping that I'd be able to use VirtualiZarr to open references that I've already kerchunked, in which case it would certainly occur.
Currently the only intended use case of kerchunk is internal, but this case:
Also I was hoping that I'd be able to use VirtualiZarr to open references that I've already kerchunked, in which case it would certainly occur.
we would like to support, which is being (intermittently) worked on in #119 and recently #186.
I think that the minimal version of this PR does not need to handle kerchunk-inlined data though, and we should merge this useful PR before worrying about that.
virtualizarr/manifests/manifest.py
Outdated
@@ -285,7 +299,7 @@ def to_zarr_json(self, filepath: str) -> None: | |||
@classmethod | |||
def _from_kerchunk_chunk_dict(cls, kerchunk_chunk_dict) -> "ChunkManifest": | |||
chunkentries = { | |||
cast(ChunkKey, k): ChunkEntry.from_kerchunk(v).dict() | |||
cast(ChunkKey, k): ChunkEntry.from_kerchunk(tuple(v)).dict() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if you don't add this tuple()
call then my suggestion of using isinstance(..., str)
above will work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya, with my last commit the validation takes care of this.
class ChunkDictEntry(TypedDict): | ||
path: str | ||
offset: int | ||
length: int | ||
|
||
|
||
ChunkDict = NewType("ChunkDict", dict[ChunkKey, ChunkDictEntry]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nice and probably what I should have used in the first place if I was better with python typing 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should do a similar change for any of the other types in the codebase...
@TomNicholas, I think I'm ready for another round of review, whenever you are. |
return ChunkManifest(entries=cast(ChunkDict, chunkentries)) | ||
def _from_kerchunk_chunk_dict( | ||
cls, | ||
kerchunk_chunk_dict: Dict[ChunkKey, str | tuple[str] | tuple[str, int, int]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kerchunk_chunk_dict: Dict[ChunkKey, str | tuple[str] | tuple[str, int, int]], | |
kerchunk_chunk_dict: dict[ChunkKey, str | tuple[str] | tuple[str, int, int]], |
Pretty sure in a recent enough version of python you can just use the built-in dict
for typing, same as how you're using tuple
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this looks like a mistake, but strangely it's necessary. I added an explanatory comment to prevent others from attempting this obvious change.
for k, v in kerchunk_chunk_dict.items(): | ||
if isinstance(v, (str, bytes)): | ||
raise NotImplementedError("TODO: handle inlined data") | ||
elif not isinstance(v, (tuple, list)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this incorrect typing? You allow v
to be a list
, then pass the list into a method which only accepts tuple
s (i.e. ChunkEntry.from_kerchunk()
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pulling some really sketchy tricks with typing here.
You allow
v
to be alist
Actually, according to my type hint for kerchunk_chunk_dict
, v
should be a str
or tuple
but never a list
. Thus checking if v
is a list should be vacuous but not technically incorrect.
My rationale is that kerchunk can and probably will give us weird types, and we should be robust with handling them at runtime. While ChunkEntry.from_kerchunk
expects tuple
s it can still handle lists if need be, even though I don't tell the typing system. Thus I simply don't want to fail in case a list makes its way to this point.
So basically what I'm saying is, "I want to raise a type error if I don't get a tuple
, but if I happen to get a list
I'll continue, pretending as if it were a tuple."
Do you find this an acceptable strategy, or would you like to change it?
Ideally we'd add types to kerchunk, but I don't have time for that right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay! So it's intentional.
I think this is fine - it still deals fine with the types the hints claim it will, plus you're saying that it will also silently succeed on some other types too.
Ideally we'd add types to kerchunk, but I don't have time for that right now.
Yep. In the long term the plan is to re-implement kerchunk's readers without going via the intermediate kerchunk reference format, at which point we will hopefully be able to remove all the code in kerchunk.py
anyway.
Sorry for the slow review @maresb ! Just some very small comments here. |
No worries, no rush, and thanks so much for maintaining!!! |
I think I'm all caught up with your comments. I don't think you're asking for a change regarding the Please let me know if you'd like any further changes, or if you'd like me to rebase. Thanks @TomNicholas! |
I think this looks good, thank you @maresb ! |
I'm trying to open a Zarr v2. I couldn't find a good obvious way to do this, so my attempt was:
This fails here in unpacking because kerchunk produces reference entries of the form
[path]
instead of[path, length, offset]
.I was wondering if I could work around this by
stat
ing the chunks, and sure enough it worked! (See my second commit.)I'd love to get some feedback on this, including if there's a better way to open v2 zarrs.
In preparation for my second commit I needed to clean up some types. To make things more precise and clean, I tell a minor lie: that the
[path, offset, length]
lists are tuples. This was done in the spirit of duck typing, and I'd argue is more semantically accurate. (Not to mention that tuples serialize to lists in JSON.)docs/releases.rst
api.rst