Skip to content

Improve performance of filter pre-processing #2037

Open
@ypsah

Description

@ypsah

Feature Request / Improvement

Hi, apologies in advance for the rather broad scope.

I am using pyiceberg to store datasets partitioned in two dimensions and am using basic filtering to retrieve specific parts of the space, e.g. (x = 0 AND y = 0) OR (x = 1 AND y = 1) OR (x = -1 and y = 2). Cardinality is relatively modest with a little over 200 partitions and depending on the dataset, data size can range from under a MiB to over 10GiB.

I have noticed that when I try to retrieve the whole dataset, but still supply the corresponding filter using only AND, OR and = operators, I pay a rather large overhead, and that overhead gets proportionally better if I factorize the filter using two IN operators, i.e. x in (0, 1, 2, ...) AND y IN (-5, -4, -3, ...).

For the smallest datasets I manage, the difference is quite noticeable. To load the full dataset:

  • without filters: 15s
  • with x IN (...) AND y IN (...): 45s
  • with (x = ... AND y = ...) OR (x = ... AND y = ...) OR ...: 8m30s

I am planning on optimizing this on my side as much as possible, by factorizing my filters as much as possible (while still working around #1937 / apache/arrow#46183), but I am wondering if we might be able to optimize things a little bit on pyiceberg's side as well.

Most importantly, given my datasets are partitioned on x and y, little time is effectively spent applying the filter, and it appears that most of the overhead is coming from pre-processing the expression objects themselves.

To highlight this, I produced a flamegraph for two runs, one using only the = comparison operator and another using IN operators instead (both filters covering the same partitions eventually), and using the search functionality of the resulting svg, I highlighted the samples that related to the expression package of pyiceberg:

py-spy record -o profile.svg -- python reproducer-equalto.py py-spy record -o profile.svg -- python reproducer-equalto.py

Note how the flamegraph for the run that uses = spent 83.5% of its CPU time in code related to expressions, whereas the run that uses IN only spent 6.4% in the same parts of pyiceberg.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions