Partitioned Export with COPY Generates Excessively Many Small Files within Each Partition #152

Open
aby-kuruvilla-clear opened this issue Feb 27, 2025 · 0 comments

Comments

@aby-kuruvilla-clear

DuckDB version: 1.1.3 and 1.2.0

When using the COPY command to export data to Parquet with a partition key (e.g., batch_id), DuckDB writes multiple very small files (approximately 50–100 KB each) into each partition directory. In our use case, exporting around 100 million line items (10 crore) in batches of 5,000 line items each, this results in roughly 20,000 partition directories, each containing several small file fragments instead of a single file.
Current output

\batch_id=1000\
     records_0.parquet (30 KB)
     records_1.parquet (70 KB)

Expected output

\batch_id=1000\
     records_0.parquet (100 KB)

Command used to export :

COPY my_schema.my_table 
TO '/path/to/export' 
(FORMAT 'parquet', PARTITION_BY (batch_id), OVERWRITE_OR_IGNORE, FILENAME_PATTERN 'records_{i}');
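
To quantify the fragmentation described above, a small script along these lines could count the Parquet files in each partition directory after the export. This is only a sketch: the helper names `count_files_per_partition` and `fragmented_partitions` are hypothetical, and it assumes the export root contains one `batch_id=...` directory per partition, as in the output listing above.

```python
from collections import Counter
from pathlib import Path


def count_files_per_partition(export_root: str, suffix: str = ".parquet") -> Counter:
    """Map each partition directory name (e.g. 'batch_id=1000') to its file count.

    Hypothetical helper: assumes every data file sits directly inside its
    partition directory somewhere below export_root.
    """
    counts: Counter = Counter()
    for path in Path(export_root).rglob(f"*{suffix}"):
        if path.is_file():
            counts[path.parent.name] += 1
    return counts


def fragmented_partitions(counts: Counter) -> dict:
    """Return only the partitions that contain more than one file."""
    return {name: n for name, n in counts.items() if n > 1}
```

Calling `fragmented_partitions(count_files_per_partition('/path/to/export'))` would list every partition that ended up with more than one file, which gives a concrete measure of how far the result is from the expected one file per batch_id.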