huggingface/pyspark_huggingface

Hugging Face x Spark

Spark Data Source for Hugging Face Datasets

A Spark Data Source for accessing 🤗 Hugging Face Datasets:

  • Stream datasets from Hugging Face as Spark DataFrames
  • Select subsets and splits, apply projection and predicate filters
  • Save Spark DataFrames as Parquet files to Hugging Face
  • Fully distributed
  • Authentication via huggingface-cli login or tokens
  • Compatible with Spark 4 (with auto-import)
  • Backport for Spark 3.5, 3.4 and 3.3

Installation

pip install pyspark_huggingface

Usage

Load a dataset (here stanfordnlp/imdb):

import pyspark_huggingface
df = spark.read.format("huggingface").load("stanfordnlp/imdb")

Save to Hugging Face:

# Login with huggingface-cli login
df.write.format("huggingface").save("username/my_dataset")
# Or pass a token manually
df.write.format("huggingface").option("token", "hf_xxx").save("username/my_dataset")
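To keep the token out of source code, it can be read from the environment instead. The sketch below assumes the token is stored in the `HF_TOKEN` environment variable; `token_options` and `save_to_hub` are hypothetical helper names used for illustration and are not part of pyspark_huggingface.

```python
import os
from typing import Mapping, Optional


def token_options(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return {"token": ...} if HF_TOKEN is set, else an empty dict.

    An empty dict passes no token option, so authentication falls back
    to the login cached by `huggingface-cli login`.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    return {"token": token} if token else {}


def save_to_hub(df, repo_id: str) -> None:
    """Write a Spark DataFrame to a Hugging Face dataset repository."""
    # The import enables the data source on Spark 3.x; it is harmless on Spark 4.
    import pyspark_huggingface  # noqa: F401

    df.write.format("huggingface").options(**token_options()).save(repo_id)
```

Calling `save_to_hub(df, "username/my_dataset")` then behaves like the two variants above, picking the token variant only when `HF_TOKEN` is set.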

Advanced

Select a split:

test_df = (
    spark.read.format("huggingface")
    .option("split", "test")
    .load("stanfordnlp/imdb")
)

Select a subset/config:

test_df = (
    spark.read.format("huggingface")
    .option("config", "sample-10BT")
    .load("HuggingFaceFW/fineweb-edu")
)

Filter columns and rows (especially efficient for Parquet datasets):

df = (
    spark.read.format("huggingface")
    .option("filters", '[("language_score", ">", 0.99)]')
    .option("columns", '["text", "language_score"]')
    .load("HuggingFaceFW/fineweb-edu")
)
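Because both options are passed as strings, it can be convenient to build them from Python objects rather than writing the literals by hand. This is a minimal sketch: `projection_options` is a hypothetical helper, and it assumes the data source parses the `filters` value as a Python literal, as the tuple syntax in the example above suggests.

```python
import json
from typing import Any, Sequence, Tuple


def projection_options(
    columns: Sequence[str],
    filters: Sequence[Tuple[str, str, Any]],
) -> dict:
    """Serialize a column list and filter tuples into option strings."""
    return {
        # A JSON list of strings is also a valid Python list literal.
        "columns": json.dumps(list(columns)),
        # repr keeps the tuple syntax, e.g. [('language_score', '>', 0.99)].
        "filters": repr([tuple(f) for f in filters]),
    }


opts = projection_options(
    ["text", "language_score"],
    [("language_score", ">", 0.99)],
)
```

The resulting dictionary can be passed in one call with `spark.read.format("huggingface").options(**opts).load("HuggingFaceFW/fineweb-edu")`.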

Backport

While the Data Source API was introduced in Spark 4, this package includes a backport for older versions.

Importing pyspark_huggingface patches the PySpark reader and writer to add the "huggingface" data source. It is compatible with PySpark 3.5, 3.4 and 3.3:

>>> import pyspark_huggingface
huggingface datasource enabled for pyspark 3.x.x (backport from pyspark 4)

The import is only necessary on Spark 3.x to enable the backport. Spark 4 automatically imports pyspark_huggingface as soon as it is installed, and registers the "huggingface" data source.
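Code that runs on both Spark 3.x and Spark 4 can simply keep the import unconditionally, as in the usage example. If you prefer to make the distinction explicit, the version check can be sketched as follows; `needs_backport` and `enable_huggingface_source` are hypothetical names, not part of the package.

```python
def needs_backport(pyspark_version: str) -> bool:
    """Return True for PySpark versions below 4, where the import is required."""
    major = int(pyspark_version.split(".")[0])
    return major < 4


def enable_huggingface_source() -> None:
    """Import pyspark_huggingface only when running on Spark 3.x."""
    import pyspark

    if needs_backport(pyspark.__version__):
        # Importing patches the reader and writer to add "huggingface".
        import pyspark_huggingface  # noqa: F401
```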
