Description
Feature Request / Improvement
Hi, apologies in advance for the rather broad scope.

I am using `pyiceberg` to store datasets partitioned in two dimensions, and I use basic filtering to retrieve specific parts of the space, e.g. `(x = 0 AND y = 0) OR (x = 1 AND y = 1) OR (x = -1 AND y = 2)`. Cardinality is relatively modest, a little over 200 partitions, and depending on the dataset, the data size ranges from under a MiB to over 10 GiB.

I have noticed that when I retrieve the whole dataset but still supply the corresponding filter using only `AND`, `OR` and `=` operators, I pay a rather large overhead, and that overhead shrinks considerably if I factorize the filter into two `IN` operators, i.e. `x IN (0, 1, 2, ...) AND y IN (-5, -4, -3, ...)`.
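
For concreteness, the two filter shapes expressed as `pyiceberg` expression objects look roughly like the sketch below. The catalog and table names are hypothetical placeholders; the expression classes and `scan(row_filter=...)` come from `pyiceberg`'s public API.

```python
from functools import reduce

from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import And, EqualTo, In, Or

# Hypothetical catalog and table names, for illustration only.
catalog = load_catalog("default")
table = catalog.load_table("my_namespace.my_table")

pairs = [(0, 0), (1, 1), (-1, 2)]

# Shape 1: one (x = ... AND y = ...) conjunction per partition, chained with OR.
or_of_ands = reduce(Or, (And(EqualTo("x", x), EqualTo("y", y)) for x, y in pairs))

# Shape 2: the factorized form using two IN lists (a superset of the pairs
# above whenever the x and y values are not perfectly correlated).
in_filter = And(In("x", {x for x, _ in pairs}), In("y", {y for _, y in pairs}))

# Either expression can be passed as a row filter to a table scan.
arrow_table = table.scan(row_filter=in_filter).to_arrow()
```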
For the smallest datasets I manage, the difference is quite noticeable. To load the full dataset:

- without filters: 15s
- with `x IN (...) AND y IN (...)`: 45s
- with `(x = ... AND y = ...) OR (x = ... AND y = ...) OR ...`: 8m30s
I am planning to optimize this on my side by factorizing my filters as much as possible (while still working around #1937 / apache/arrow#46183), but I am wondering whether we could also optimize things a little on `pyiceberg`'s side.
Most importantly, given that my datasets are partitioned on `x` and `y`, little time is actually spent applying the filter; most of the overhead appears to come from pre-processing the expression objects themselves.
To highlight this, I produced a flamegraph for two runs, one using only the `=` comparison operator and the other using `IN` operators instead (both filters ultimately covering the same partitions). Using the search functionality of the resulting SVG, I highlighted the samples related to the `expressions` package of `pyiceberg`:


Note how the run that uses `=` spent 83.5% of its CPU time in code related to `expressions`, whereas the run that uses `IN` spent only 6.4% in the same parts of `pyiceberg`.
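
In case it helps reproduce the comparison without a sampling profiler, a rough `cProfile`-based sketch for isolating the planning step (reusing the hypothetical `table` and the two filters from the sketch above) might look like this; `plan_files()` is where partition pruning happens, so profiling it separates the expression-handling overhead from the actual data reads.

```python
import cProfile
import pstats

def plan_with(filter_expr):
    # Materialize the file scan tasks so the planning work actually runs.
    return list(table.scan(row_filter=filter_expr).plan_files())

profiler = cProfile.Profile()
profiler.enable()
plan_with(or_of_ands)  # swap in in_filter for the factorized variant
profiler.disable()

# Restrict the output to frames from pyiceberg's expressions package.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats("pyiceberg/expressions")
```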