Add streaming support for huggingface parquet dataset similar to chunk stream #502

Open
bhimrazy opened this issue Mar 7, 2025 · 0 comments · May be fixed by #505
Labels
enhancement New feature or request

Comments

bhimrazy (Collaborator) commented Mar 7, 2025

🚀 Feature Request

Add Streaming Support for Hugging Face Parquet Datasets Similar to Chunk-Based Streaming

Motivation

Currently, the entire dataset must be downloaded before it can be indexed, so streaming cannot begin until the download finishes. This is inefficient, especially for datasets with a large number of chunks.

Pitch

The proposed solution is to read only the metadata of each Parquet file using byte-range requests, so the index can be built quickly without downloading the data itself. Streaming can then proceed in the same way as LitData's chunk-based streaming.
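
To make the byte-range idea concrete, here is a minimal sketch, not LitData's implementation. It relies on the Parquet file layout, in which the file ends with the footer metadata, a 4-byte little-endian footer length, and the magic bytes `PAR1`. The names `read_parquet_footer`, `make_range_reader`, and the `read_range(start, length)` callable are hypothetical; against a remote file, `read_range` would be backed by an HTTP request with a `Range` header, while here it is backed by an in-memory byte string so the example runs offline.

```python
import struct
from typing import Callable, List, Tuple

PARQUET_MAGIC = b"PAR1"
TAIL_SIZE = 8  # 4-byte little-endian footer length + 4-byte magic


def read_parquet_footer(read_range: Callable[[int, int], bytes], file_size: int) -> bytes:
    """Fetch the raw footer bytes with two small ranged reads.

    `read_range(start, length)` is a hypothetical callable that returns
    `length` bytes starting at offset `start`.
    """
    if file_size < len(PARQUET_MAGIC) + TAIL_SIZE:
        raise ValueError("file is too small to be a Parquet file")

    # First read: the last 8 bytes hold the footer length and the magic.
    tail = read_range(file_size - TAIL_SIZE, TAIL_SIZE)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic at end of file")
    (footer_len,) = struct.unpack("<I", tail[:4])

    footer_start = file_size - TAIL_SIZE - footer_len
    if footer_start < len(PARQUET_MAGIC):
        raise ValueError("footer length exceeds file size")

    # Second read: only the footer itself, never the column data.
    return read_range(footer_start, footer_len)


def make_range_reader(blob: bytes, log: List[Tuple[int, int]]) -> Callable[[int, int], bytes]:
    """In-memory stand-in for a remote ranged read; records each request."""
    def read_range(start: int, length: int) -> bytes:
        log.append((start, length))
        return blob[start:start + length]
    return read_range


# Build a fake file with the Parquet framing (the footer body is a placeholder,
# not real Thrift-encoded metadata).
fake_footer = b"thrift-encoded-metadata-placeholder"
fake_file = (
    PARQUET_MAGIC
    + b"\x00" * 10_000  # stands in for the row-group data
    + fake_footer
    + struct.pack("<I", len(fake_footer))
    + PARQUET_MAGIC
)

requests_log: List[Tuple[int, int]] = []
footer = read_parquet_footer(make_range_reader(fake_file, requests_log), len(fake_file))
bytes_fetched = sum(length for _, length in requests_log)
print(footer == fake_footer, bytes_fetched, len(fake_file))
```

The sketch stops at the raw footer bytes. Decoding them into row-group and column information would still need a Parquet library, and an actual implementation may fetch a larger tail in a single request to save a round trip.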

Reference: Introduction to Parquet Format

@bhimrazy bhimrazy added the enhancement New feature or request label Mar 7, 2025
@bhimrazy bhimrazy linked a pull request Mar 10, 2025 that will close this issue
4 tasks