Allow passing config to default_object_store #564
base: develop
Conversation
```diff
-def default_object_store(filepath: str) -> ObjectStore:
+def default_object_store(
+    filepath: str, storage_config: ObjectStoreOptions | None = None
+) -> ObjectStore:
+    import obstore as obs
```
I'd like to replace this function with `obs.store.from_url()`, but the upstream version doesn't seem to auto-infer the region. Am I missing anything @kylebarron? xref #561
It's correct that `obstore.store.from_url` does not auto-infer the S3 region. This is an artifact of AWS-hosted S3 requiring the region while non-AWS-hosted S3-compatible stores don't require one. E.g. it supports `r2.cloudflarestorage.com`-style URLs, but those don't have a region. So it's up to the user to pass in the region if required.
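For AWS-hosted buckets, region inference is possible without credentials: S3 includes the bucket's region in the `x-amz-bucket-region` header of a HEAD response to the bucket endpoint, even when the response itself is a 403. A rough, stdlib-only sketch (the helper names, default region, and error handling are illustrative assumptions, not obstore behavior):

```python
# Rough sketch of S3 region inference. AWS returns the bucket's region
# in the ``x-amz-bucket-region`` response header of a HEAD request to
# the bucket endpoint, even for unauthenticated (403) responses.
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def bucket_endpoint(bucket: str) -> str:
    """Global S3 endpoint whose responses carry x-amz-bucket-region."""
    return f"https://{bucket}.s3.amazonaws.com"


def infer_region(bucket: str, default: str = "us-east-1") -> str:
    """Infer the region of an AWS-hosted bucket via a HEAD request."""
    request = Request(bucket_endpoint(bucket), method="HEAD")
    try:
        headers = urlopen(request).headers
    except HTTPError as err:
        # Error responses (403/404) still include the header.
        headers = err.headers
    return headers.get("x-amz-bucket-region") or default
```

This is the same trick boto-style clients use to locate a bucket before signing requests against the right regional endpoint.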
Here's something I whipped together, a quick port of this file, so you can infer the region only for S3-native URLs:
```python
from __future__ import annotations

from enum import Enum
from urllib.parse import urlparse

from obstore.store import (
    AzureStore,
    HTTPStore,
    LocalStore,
    MemoryStore,
    ObjectStore,
    S3Store,
)


class ObjectStoreScheme(Enum):
    AZURE = "azure"
    FILE = "file"
    S3 = "s3"
    S3Like = "s3like"
    MEMORY = "memory"
    HTTP = "http"


def create_store(url_str: str, config: dict) -> ObjectStore:
    scheme, bucket, path = parse_url(url_str)
    if scheme == ObjectStoreScheme.S3:
        region = infer_region(...)
        return S3Store(...)
    elif scheme == ObjectStoreScheme.S3Like:
        # Don't infer region
        return S3Store(...)
    elif scheme == ObjectStoreScheme.AZURE:
        return AzureStore(...)
    elif scheme == ObjectStoreScheme.FILE:
        return LocalStore(...)
    elif scheme == ObjectStoreScheme.MEMORY:
        return MemoryStore(...)
    elif scheme == ObjectStoreScheme.HTTP:
        return HTTPStore(...)
    else:
        raise ValueError(f"Unsupported URL scheme: {scheme}")


def parse_url(url_str: str) -> tuple[ObjectStoreScheme, str | None, str]:
    # Returns (scheme, bucket, path)
    url = urlparse(url_str)
    if url.scheme == "file":
        return ObjectStoreScheme.FILE, None, url.path
    if url.scheme == "memory":
        return ObjectStoreScheme.MEMORY, None, url.path
    if url.scheme in ["s3", "s3a"]:
        assert url.netloc, f"Expected bucket in s3:// url, got: {url_str}"
        return ObjectStoreScheme.S3, url.netloc, url.path
    if url.scheme == "gs":
        assert url.netloc, f"Expected bucket in gs:// url, got: {url_str}"
        return ObjectStoreScheme.S3Like, url.netloc, url.path
    if url.scheme in ["az", "adl", "azure", "abfs", "abfss"]:
        assert url.netloc, f"Expected bucket in azure url, got: {url_str}"
        return ObjectStoreScheme.AZURE, url.netloc, url.path
    if url.scheme == "http":
        return ObjectStoreScheme.HTTP, None, url.path
    if url.scheme == "https":
        if url.netloc.endswith("amazonaws.com"):
            if url.netloc.startswith("s3"):
                region = url.netloc.split(".", maxsplit=2)[1]
                # TODO: return region from this fn
                bucket, _, path = url.path.removeprefix("/").partition("/")
                return ObjectStoreScheme.S3, bucket, path
            else:
                return ObjectStoreScheme.S3, None, url.path
        if url.netloc.endswith("r2.cloudflarestorage.com"):
            return ObjectStoreScheme.S3Like, None, url.path
        if url.netloc.endswith("blob.core.windows.net") or url.netloc.endswith(
            "dfs.core.windows.net"
        ):
            return ObjectStoreScheme.AZURE, None, url.path
        return ObjectStoreScheme.HTTP, None, url.path
    raise ValueError(f"Unrecognized url: {url_str}")
```
Perhaps you'd want `create_store` to take in separate config dicts, for better typing and so you can define configs for multiple different stores at once. Something like:
```python
def create_store(
    url_str: str,
    s3_config: S3Config | None = None,
    azure_config: AzureConfig | None = None,
    gcs_config: GCSConfig | None = None,
) -> ObjectStore: ...
```
And then you infer the region only if it's not manually passed in that `s3_config` dict.
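That "infer only when absent" rule boils down to a small dict merge. A minimal sketch (the function name and plain-dict config are illustrative, not obstore's API):

```python
# Illustrative sketch: fall back to an inferred region only when the
# user-supplied s3_config does not already name one.
from __future__ import annotations


def with_region(s3_config: dict | None, inferred_region: str) -> dict:
    """Merge an inferred region into a user config without overriding it."""
    config = dict(s3_config or {})
    # setdefault leaves an explicit user-provided region untouched.
    config.setdefault("region", inferred_region)
    return config
```

Keeping the merge in one place also means the (potentially slow) network lookup for the region can be skipped entirely whenever the user has already supplied one.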
@TomNicholas I need to figure out how to handle the different kwargs for the kerchunk vs. virtualizarr HDF backends in the tests, but I think this is ready for you to try.
The tests are harder to sort out than I expected, because even an obstore-based reader will use fsspec when automatically determining the filetype. I'm going to pivot to developer docs now, which will help us design storage option configuration. In the meantime @TomNicholas you probably want to specify the