perf: Reduce memory consumption for WARC reads and improve estimates (#3935)
This PR makes the following changes to `read_warc`:
- Reduces memory consumption
- Adds `WARC-Identified-Payload-Type` as an extracted metadata column
- Improves stats estimation for scan tasks that read WARC
## Reduced memory consumption
When reading a single Common Crawl file, the compressed file is typically 1GB
and decompresses to roughly 5GB of data.
Before this PR, Resident Set Size peaks at `5.15GB` while heap size peaks at
`10.98GB`:

After this PR, Resident Set Size peaks at `4.3GB` while heap size peaks
at `6.6GB`, which is more in line with expectations:

## Additional `WARC-Identified-Payload-Type` metadata column
To make filtering WARC records easier, we extract
`WARC-Identified-Payload-Type` from the record metadata into its own column.
Since this header is optional, the column is often null.
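For reference, a minimal sketch of what extracting this header from a record's raw header block looks like (the function name is hypothetical, not Daft's internal API):

```python
from typing import Optional

def payload_type(header_block: str) -> Optional[str]:
    """Return the optional WARC-Identified-Payload-Type header value from
    a record's raw header block, or None when absent (the common case)."""
    for line in header_block.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() == "warc-identified-payload-type":
            return value.strip()
    return None
```

With the value surfaced as a column, records can be filtered by payload type without parsing the full metadata blob.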
## Stats estimation
A single Common Crawl .warc.gz file is typically 1GB in size, but takes
up ~5GB of memory once decompressed.
For a .warc.gz file with `145,717` records, before this PR we would
estimate:
```
Stats = { Approx num rows = 9,912,769, Approx size bytes = 914.63 MiB,
Accumulated selectivity = 1.00 }
```
After this PR, we now estimate:
```
Stats = { Approx num rows = 167,773, Approx size bytes = 4.34 GiB, Accumulated
selectivity = 1.00 }
```
which is much closer to reality.
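The improved numbers follow from estimating with WARC-appropriate constants. A simplified sketch of the idea (the function names and the exact constants here are illustrative assumptions, not the PR's code):

```python
def estimate_in_memory_bytes(compressed_bytes: int,
                             decompression_ratio: float = 5.0) -> int:
    """Estimate decompressed in-memory size from on-disk size, assuming
    an average gzip ratio (~5x for Common Crawl .warc.gz files)."""
    return int(compressed_bytes * decompression_ratio)

def estimate_num_rows(compressed_bytes: int,
                      avg_compressed_record_bytes: int = 6_400) -> int:
    """Estimate record count from file size, assuming an average
    compressed record size (illustrative constant)."""
    return compressed_bytes // avg_compressed_record_bytes
```

For a 1GiB file, this style of estimate lands in the right order of magnitude for both row count and memory footprint, rather than off by ~70x as before.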
### Estimations with pushdowns
When doing `daft.read_warc("file.warc.gz").select("Content-Length")`, we
estimate `1.32 MiB` and in reality store `1.13 MiB`.
When doing
`daft.read_warc("cc-original.warc.gz").select("warc_content")`, we
estimate `4.39 GiB` and in reality store `3.82 GiB`.