A Spark Data Source for accessing 🤗 Hugging Face Datasets:
- Stream datasets from Hugging Face as Spark DataFrames
- Select subsets and splits, apply projection and predicate filters
- Save Spark DataFrames as Parquet files to Hugging Face
- Fully distributed
- Authentication via `huggingface-cli login` or tokens
- Compatible with Spark 4 (with auto-import)
- Backport for Spark 3.5, 3.4 and 3.3
```
pip install pyspark_huggingface
```
Load a dataset (here stanfordnlp/imdb):

```python
import pyspark_huggingface

df = spark.read.format("huggingface").load("stanfordnlp/imdb")
```
Save to Hugging Face:

```python
# Login with huggingface-cli login
df.write.format("huggingface").save("username/my_dataset")

# Or pass a token manually
df.write.format("huggingface").option("token", "hf_xxx").save("username/my_dataset")
```
Select a split:

```python
test_df = (
    spark.read.format("huggingface")
    .option("split", "test")
    .load("stanfordnlp/imdb")
)
```
Select a subset/config:

```python
df = (
    spark.read.format("huggingface")
    .option("config", "sample-10BT")
    .load("HuggingFaceFW/fineweb-edu")
)
```
Filter columns and rows (especially efficient for Parquet datasets):

```python
df = (
    spark.read.format("huggingface")
    .option("filters", '[("language_score", ">", 0.99)]')
    .option("columns", '["text", "language_score"]')
    .load("HuggingFaceFW/fineweb-edu")
)
```
While the Data Source API was introduced in Spark 4, this package includes a backport for older versions.
Importing `pyspark_huggingface` patches the PySpark reader and writer to add the "huggingface" data source. It is compatible with PySpark 3.5, 3.4 and 3.3:

```python
>>> import pyspark_huggingface
huggingface datasource enabled for pyspark 3.x.x (backport from pyspark 4)
```

The import is only necessary on Spark 3.x to enable the backport. Spark 4 automatically imports `pyspark_huggingface` as soon as it is installed, and registers the "huggingface" data source.
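As a toy illustration of the monkey-patch pattern such a backport relies on (not the package's actual code; every name below is a stand-in), adding a custom format can look like wrapping the reader's `format` method:

```python
# Toy sketch of the backport's monkey-patch pattern: wrap a reader's
# `format` method to intercept a custom format name. All classes here
# are invented stand-ins, not pyspark_huggingface's real implementation.

class FakeReader:
    """Stand-in for a Spark DataFrameReader."""
    def format(self, name):
        self.source = name
        return self

def enable_huggingface(reader_cls):
    """Patch reader_cls.format to recognize the 'huggingface' source."""
    original_format = reader_cls.format

    def patched_format(self, name):
        if name == "huggingface":
            # A real backport would swap in the Hugging Face data source here.
            self.source = "huggingface (backport)"
            return self
        # Any other format falls through to the original behavior.
        return original_format(self, name)

    reader_cls.format = patched_format

enable_huggingface(FakeReader)
reader = FakeReader().format("huggingface")
print(reader.source)  # huggingface (backport)
```

Other formats are left untouched, which is why the patch is safe to apply globally at import time.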