
Allow passing config to default_object_store #564


Open

maxrjones wants to merge 9 commits into develop

Conversation

maxrjones (Member)

No description provided.

-def default_object_store(filepath: str) -> ObjectStore:
+def default_object_store(
+    filepath: str, storage_config: ObjectStoreOptions | None = None
+) -> ObjectStore:
     import obstore as obs
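
Assuming ObjectStoreOptions is dict-like (its exact shape is defined in this PR), and taking an obstore region option as a representative key, a call under the new signature would look roughly like:

store = default_object_store(
    "s3://my-bucket/data.nc",
    storage_config={"region": "us-west-2"},  # illustrative option, not from this PR
)
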
maxrjones (Member Author)

I'd like to replace this function with obs.store.from_url() but the upstream version doesn't seem to auto-infer the region. Am I missing anything @kylebarron? xref #561

kylebarron (Contributor)

It's correct that obstore.store.from_url does not auto-infer the S3 region. This is an artifact of AWS-hosted S3 requiring a region while non-AWS S3-compatible stores don't require one. E.g. it supports r2.cloudflarestorage.com-style URLs, but those don't have a region. So it's up to the user to pass in the region if required.
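
For example (the bucket and region here are hypothetical; if I recall the from_url signature correctly, extra store options such as region can be passed as keyword arguments):

import obstore as obs

store = obs.store.from_url("s3://my-bucket/some/prefix/", region="us-west-2")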

kylebarron (Contributor) commented on Apr 24, 2025

Here's something I whipped together, a quick port of this file, so you can infer the region only for s3-native URLs:

from __future__ import annotations

import urllib.error
import urllib.request
from enum import Enum
from typing import TYPE_CHECKING
from urllib.parse import urlparse

from obstore.store import AzureStore, HTTPStore, LocalStore, MemoryStore, S3Store

if TYPE_CHECKING:
    from obstore.store import ObjectStore


class ObjectStoreScheme(Enum):
    AZURE = "azure"
    FILE = "file"
    S3 = "s3"
    S3_LIKE = "s3like"  # S3-compatible endpoints where the region shouldn't be inferred
    MEMORY = "memory"
    HTTP = "http"


def infer_region(bucket: str) -> str:
    # One way to look up a bucket's region: AWS reports it in the
    # x-amz-bucket-region header of a HEAD request against the bucket
    # endpoint, even on 403 responses.
    req = urllib.request.Request(f"https://{bucket}.s3.amazonaws.com", method="HEAD")
    try:
        with urllib.request.urlopen(req) as resp:
            region = resp.headers.get("x-amz-bucket-region")
    except urllib.error.HTTPError as err:
        region = err.headers.get("x-amz-bucket-region")
    if region is None:
        raise ValueError(f"Could not infer region for bucket: {bucket}")
    return region


def create_store(url_str: str, config: dict | None = None) -> ObjectStore:
    config = dict(config or {})
    scheme, bucket, path, region = parse_url(url_str)
    if scheme == ObjectStoreScheme.S3:
        # Infer the region for AWS-native S3 URLs, but only if the URL
        # didn't encode one and the caller didn't pass one in config.
        if "region" not in config:
            config["region"] = region or infer_region(bucket)
        return S3Store(bucket, **config)
    elif scheme == ObjectStoreScheme.S3_LIKE:
        # Don't infer region
        return S3Store(bucket, **config)
    elif scheme == ObjectStoreScheme.AZURE:
        return AzureStore(bucket, **config)
    elif scheme == ObjectStoreScheme.FILE:
        return LocalStore(path)
    elif scheme == ObjectStoreScheme.MEMORY:
        return MemoryStore()
    elif scheme == ObjectStoreScheme.HTTP:
        return HTTPStore.from_url(url_str, **config)
    else:
        raise ValueError(f"Unsupported URL scheme: {scheme}")


def parse_url(url_str: str) -> tuple[ObjectStoreScheme, str | None, str, str | None]:
    # Returns (scheme, bucket, path, region); region is only known for
    # region-qualified https:// S3 URLs.
    url = urlparse(url_str)
    if url.scheme == "file":
        return ObjectStoreScheme.FILE, None, url.path, None

    if url.scheme == "memory":
        return ObjectStoreScheme.MEMORY, None, url.path, None

    if url.scheme in ["s3", "s3a"]:
        assert url.netloc, f"Expected bucket in s3:// url, got: {url_str}"
        return ObjectStoreScheme.S3, url.netloc, url.path, None

    if url.scheme == "gs":
        # Treated as S3-compatible in this sketch; a fuller version would use GCSStore.
        assert url.netloc, f"Expected bucket in gs:// url, got: {url_str}"
        return ObjectStoreScheme.S3_LIKE, url.netloc, url.path, None

    if url.scheme in ["az", "adl", "azure", "abfs", "abfss"]:
        assert url.netloc, f"Expected bucket in azure url, got: {url_str}"
        return ObjectStoreScheme.AZURE, url.netloc, url.path, None

    if url.scheme == "http":
        return ObjectStoreScheme.HTTP, None, url.path, None

    if url.scheme == "https":
        if url.netloc.endswith("amazonaws.com"):
            if url.netloc.startswith("s3"):
                # Path-style URL like https://s3.us-east-1.amazonaws.com/bucket/key
                region = url.netloc.split(".", maxsplit=2)[1]
                bucket, path = url.path.removeprefix("/").split("/", maxsplit=1)
                return ObjectStoreScheme.S3, bucket, path, region
            else:
                return ObjectStoreScheme.S3, None, url.path, None
        if url.netloc.endswith("r2.cloudflarestorage.com"):
            return ObjectStoreScheme.S3_LIKE, None, url.path, None
        if url.netloc.endswith("blob.core.windows.net") or url.netloc.endswith(
            "dfs.core.windows.net"
        ):
            return ObjectStoreScheme.AZURE, None, url.path, None

        return ObjectStoreScheme.HTTP, None, url.path, None

    raise ValueError(f"Unrecognized url: {url_str}")
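
Usage would then look something like this (bucket and region are hypothetical):

store = create_store("s3://my-bucket/some/prefix", config={"region": "us-west-2"})

Since the config already carries a region here, create_store skips the inference entirely.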

Perhaps you'd want create_store to take in separate config dicts for better typing, and so you can define configs for multiple different stores at once. Something like:

def create_store(
    url_str: str,
    s3_config: S3Config | None = None,
    azure_config: AzureConfig | None = None,
    gcs_config: GCSConfig | None = None,
) -> ObjectStore: ...

And then you would only infer the region if it's not manually passed in that s3_config dict:
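
A sketch of that conditional inference, reusing parse_url, ObjectStoreScheme, and the hypothetical infer_region helper from the port above (only the S3 branch shown):

def create_store(
    url_str: str,
    s3_config: S3Config | None = None,
    azure_config: AzureConfig | None = None,
    gcs_config: GCSConfig | None = None,
) -> ObjectStore:
    scheme, bucket, path, region = parse_url(url_str)
    if scheme == ObjectStoreScheme.S3:
        s3_config = dict(s3_config or {})
        if "region" not in s3_config:
            # Infer only when the caller didn't set it explicitly
            s3_config["region"] = region or infer_region(bucket)
        return S3Store(bucket, **s3_config)
    # ... remaining branches as in the sketch above, each consuming its own config
    raise ValueError(f"Unsupported URL scheme: {scheme}")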

maxrjones (Member Author)

@TomNicholas I need to figure out how to handle the different kwargs for kerchunk vs virtualizarr HDF backends in the tests, but I think this is ready for you to try

maxrjones (Member Author)

The tests are harder to sort out than I expected because even an obstore-based reader will use fsspec when automatically determining the filetype. I'm going to pivot to developer docs now, which will help us design the storage options configuration. In the meantime, @TomNicholas, you probably want to specify the filetype explicitly.
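
For example (a sketch; "hdf5" is one of virtualizarr's FileType values, and the exact open_virtual_dataset signature may differ on this branch):

from virtualizarr import open_virtual_dataset

vds = open_virtual_dataset(
    "s3://my-bucket/data.nc",
    filetype="hdf5",  # skips the fsspec-based filetype auto-detection
)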
