Description
Feature Request / Improvement
Hi, apologies in advance for the rather broad scope.

I am using `pyiceberg` to store datasets partitioned in two dimensions, and I use basic filtering to retrieve specific parts of the space, e.g. `(x = 0 AND y = 0) OR (x = 1 AND y = 1) OR (x = -1 AND y = 2)`. Cardinality is relatively modest, a little over 200 partitions, and depending on the dataset, the data size ranges from under a MiB to over 10 GiB.

I have noticed that when I retrieve the whole dataset but still supply the corresponding filter using only `AND`, `OR` and `=` operators, I pay a rather large overhead, and that overhead shrinks considerably if I factorize the filter into two `IN` operators, i.e. `x IN (0, 1, 2, ...) AND y IN (-5, -4, -3, ...)`.
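
For concreteness, the two filter shapes expressed as `pyiceberg` expression objects look roughly like the sketch below. The catalog and table names are hypothetical placeholders; the expression classes and `scan(row_filter=...)` come from `pyiceberg`'s public API.

```python
from functools import reduce

from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import And, EqualTo, In, Or

# Hypothetical catalog and table names, for illustration only.
catalog = load_catalog("default")
table = catalog.load_table("my_namespace.my_table")

pairs = [(0, 0), (1, 1), (-1, 2)]

# Shape 1: one (x = ... AND y = ...) conjunction per partition, chained with OR.
or_of_ands = reduce(Or, (And(EqualTo("x", x), EqualTo("y", y)) for x, y in pairs))

# Shape 2: the factorized form using two IN lists (a superset of the pairs
# above whenever the x and y values are not perfectly correlated).
in_filter = And(In("x", {x for x, _ in pairs}), In("y", {y for _, y in pairs}))

# Either expression can be passed as a row filter to a table scan.
arrow_table = table.scan(row_filter=in_filter).to_arrow()
```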
For the smallest datasets I manage, the difference is quite noticeable. To load the full dataset:

- without filters: 15s
- with `x IN (...) AND y IN (...)`: 45s
- with `(x = ... AND y = ...) OR (x = ... AND y = ...) OR ...`: 8m30s
I am planning to optimize this on my side by factorizing my filters as much as possible (while still working around #1937 / apache/arrow#46183), but I am wondering whether we could also optimize things a little on `pyiceberg`'s side.
Most importantly, given that my datasets are partitioned on `x` and `y`, little time is actually spent applying the filter; most of the overhead appears to come from pre-processing the expression objects themselves.
To highlight this, I produced a flamegraph for two runs, one using only the `=` comparison operator and the other using `IN` operators instead (both filters ultimately covering the same partitions). Using the search functionality of the resulting SVG, I highlighted the samples related to the `expressions` package of `pyiceberg`:


Note how the run that uses `=` spent 83.5% of its CPU time in code related to `expressions`, whereas the run that uses `IN` spent only 6.4% in the same parts of `pyiceberg`.
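
In case it helps reproduce the comparison without a sampling profiler, a rough `cProfile`-based sketch for isolating the planning step (reusing the hypothetical `table` and the two filters from the sketch above) might look like this; `plan_files()` is where partition pruning happens, so profiling it separates the expression-handling overhead from the actual data reads.

```python
import cProfile
import pstats

def plan_with(filter_expr):
    # Materialize the file scan tasks so the planning work actually runs.
    return list(table.scan(row_filter=filter_expr).plan_files())

profiler = cProfile.Profile()
profiler.enable()
plan_with(or_of_ands)  # swap in in_filter for the factorized variant
profiler.disable()

# Restrict the output to frames from pyiceberg's expressions package.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats("pyiceberg/expressions")
```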