[Enh]: allow slicing LazyFrame #2389


Open
gdementen opened this issue Apr 15, 2025 · 15 comments
Labels
enhancement New feature or request

Comments

@gdementen

gdementen commented Apr 15, 2025

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

I am currently creating adapters so that my Qt-based UI (https://github.com/larray-project/larray-editor/) becomes a lot more generic instead of being only useful for our own project. The goal is to make it able to display any grid-like structure (arrays, dataframes, or even some files).

Please describe the purpose of the new feature or describe the problem to solve.

For obvious performance reasons, it is much faster to only compute what is actually visible on the screen, so my adapters try to do exactly that. I implemented that easily using the direct Polars API for both DataFrame and LazyFrame (and Ibis Table) with great results. However, when using the Narwhals Lazy API (on top of a Polars LazyFrame), I have not found any direct way to select specific rows other than the top rows. I had to resort to an ugly workaround which works but makes things uncomfortably slow:

# data is the Narwhals LazyFrame being displayed; v_start/v_stop and
# h_start/h_stop delimit the rows and columns visible on screen
wh_index = data.with_row_index('_index')
filter_ = (nw.col('_index') >= v_start) & (nw.col('_index') < v_stop)
# .select also implicitly drops _index
lazy_sub_df = wh_index.filter(filter_).select(columns[h_start:h_stop])

I know of https://narwhals-dev.github.io/narwhals/basics/order_dependence/ and the fact that LazyFrame.tail() is deprecated, but I would advocate for implementing slicing anyway. After all, even if one cannot guarantee the order (unless an explicit sort was used) and the order is arbitrary, it makes sense to paginate the result. Another argument is that all SQL engines I know of support some kind of offset command for pagination (whether via limit/offset or offset ... fetch, which is standard SQL AFAIK). Some SQL engines require sorting for offset to work, but in my (limited) experience this is rather the exception than the rule.

Suggest a solution if possible.

One option might be to allow slicing only if the query includes a sort operation. It would not help my specific case though.

If you have tried alternatives, please describe them below.

No response

Additional information that may help us understand your needs.

No response

@MarcoGorelli
Member

thanks @gdementen for the request

could you clarify how you do this for polars please? using .head / tail?

@MarcoGorelli MarcoGorelli added the enhancement New feature or request label Apr 15, 2025
@gdementen
Author

Polars supports slicing LazyFrames directly, so this is enough:

data[v_start:v_stop].select(columns[h_start:h_stop])

@MarcoGorelli
Member

thanks, i hadn't realised that slice was supported in LazyFrame.__getitem__

how do you paginate the results? do you use LazyFrame.__getitem__ multiple times and then collect each? if so i worry that that would involve doing repeated calculations

would an offset argument in LazyFrame.head suffice for you?

@MarcoGorelli
Member

Just to demonstrate why I have mixed feelings about Polars allowing this:

I'm going to make a UDF (func) which just prints out its input. This is just to see which elements get processed.

We'll then load up a lazyframe, which we know has 5 rows.

In [6]: import io, polars as pl

In [7]: def func(x):
   ...:     print(f'x is {x}')
   ...:     return x

In [8]: lf = pl.scan_csv(io.StringIO('a\n1\n2\n3\n4\n5'))

In [9]: result = lf.with_columns(pl.col('a').map_elements(func, return_dtype=pl.Int64)).sort('a')

Now, what happens when you run result[:2].collect(), and then result[2:4].collect()? You may expect that Polars is running the UDF for the first two elements and then for the next two. But no. It re-runs it for all elements on all collect statements:

In [10]: result[:2].collect()
x is 1
x is 2
x is 3
x is 4
x is 5
Out[10]:
shape: (2, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
└─────┘

In [11]: result[2:4].collect()
x is 1
x is 2
x is 3
x is 4
x is 5
Out[11]:
shape: (2, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 3   │
│ 4   │
└─────┘

So, before changing what Narwhals allows, I just want to be very careful that we're not making it too easy for users to "shoot themselves in the foot" 😄

@gdementen
Author

how do you paginate the results? do you use LazyFrame.__getitem__ multiple times and then collect each? if so i worry that that would involve doing repeated calculations

I am indeed calling __getitem__ multiple times, but only when the user actually scrolls (and I have a buffer which is larger than the screen so that it does not happen too often).

would an offset argument in LazyFrame.head suffice for you?

Sure.

@gdementen
Author

Now, what happens when you run result[:2].collect(), and then result[2:4].collect()? You may expect that Polars is running the UDF for the first two elements and then for the next two. But no. It re-runs it for all elements on all collect statements:

That's because of the sort. It does not do that if there are only pure element-wise operations & filters AFAICT.

@MarcoGorelli
Member

That's right, it only happens if there are operations which block slice pushdown. But, if you're displaying a lazyframe provided by the user, then you have no control over whether they've applied any such operation, right?

I am concerned about allowing this in Narwhals, which is intended as a stricter / safer subset of the Polars API

I'm inclined to decline then, and instead suggest:

  • you collect before slicing
  • we wait until Polars / other lazy frame backends offer some API which streams through a lazyframe in chunks. I'm not aware of any such API at the moment

@gdementen
Author

would an offset argument in LazyFrame.head suffice for you?

Now that I think of it, I don't think it's a good idea because then the Narwhals API would no longer be a subset of the Polars API, just a different API.

@MarcoGorelli
Member

true, but i think it's ok to add extra keyword arguments where necessary so long as there's no conflict with what's in Polars

@gdementen
Author

That's right, it only happens if there are operations which block slice pushdown. But, if you're displaying a lazyframe provided by the user, then you have no control over whether they've applied any such operation, right?

Indeed. That's a price I am ready to pay for the other benefits it gives though. It might be possible to detect any such operation in the query plan and display a warning before collecting such a lazyframe but I am not there yet (won't be for a long time as this is basically a one-man project).

I am concerned about allowing this in Narwhals, which is intended as a stricter / safer subset of the Polars API

Ok, I understand that.

I'm inclined to decline then, and instead suggest:

* you `collect` before slicing

That wouldn't work for my main/initial use case (to display larger-than-RAM datasets loaded via scan_xxx). Allowing any lazyframe (including simple expressions on it) was just an afterthought which worked nicely as an added bonus, but you raise an interesting point. To be honest, I am content with the result I got when using the Polars API directly. I just thought Narwhals would expand the supported cases for "free".

* we wait for Polars / other lazy frames offer some API with which streams through lazyframe in chunks. I'm not aware of any such API at the moment

That would be interesting for pagination, but would not really help my use case (infinite scroll) where I need to be able to access a random chunk in the middle.

@MarcoGorelli
Member

MarcoGorelli commented Apr 15, 2025

thanks for explaining!

i'm keen to understand the use-case more

say you want to support duckdb. in that case, showing `from table select * limit 10` and then `from table select * limit 10 offset 10` doesn't guarantee that you showed the next 10 rows after the first 10 - in fact, even `from table select * limit 10` can give different answers if you run it multiple times

to get deterministic results, then as you said, you'd need to sort. but then:

  • using sort (`.sort(...)[10:20].collect()`) would end up with Polars repeatedly carrying out the same operations ([Enh]: allow slicing LazyFrame #2389 (comment)). I don't know if it would be the same for SQL engines
  • slicing without the sort (`[10:20].collect()`) would essentially be like repeatedly sampling 10 elements for SQL engines like duckdb

it might help to speak about this over a call to understand what to do?

@gdementen
Author

i'm keen to understand the use-case more

Thanks a lot for taking that time, it's really appreciated.

say you want to support duckdb. in that case, showing `from table select * limit 10` and then `from table select * limit 10 offset 10` doesn't guarantee that you showed the next 10 rows after the first 10 - in fact, even `from table select * limit 10` can give different answers if you run it multiple times

I didn't know SQL ordering was that "unreliable". I knew it was undefined (i.e., the database engine chooses whatever order it wants), but I always assumed it would be consistent between two calls. Now I realize that the databases I work with are all "almost read-only" (read often, updated extremely rarely), which makes that assumption appear to be true, but as soon as the database is updated in any way it breaks (and I can now imagine cases -- e.g. a multithreaded engine -- where it breaks even without any modification to the database). Thanks for pointing this out to me, and I am sorry for my ignorance.

With all that said, I will probably avoid using my code in the general case (and only use it in more controlled cases) as you convinced me it's not a good idea to do this (too many unknowns) but I still think Narwhals should offer some slicing functionality at least when an order by is present, because AFAIK, all the underlying engines offer that functionality.

Whether that is efficient for my particular use-case or not should be irrelevant for you 😉.
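To illustrate the "slicing is fine once an order is fixed" point, a minimal sketch using stdlib sqlite3 (standing in for any SQL backend): with an ORDER BY on a unique key, LIMIT/OFFSET pagination is deterministic across calls.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER, val TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(i, f"row{i}") for i in range(10)])

def page(offset, size=3):
    # ORDER BY a unique key makes each page well-defined
    rows = con.execute(
        "SELECT id FROM t ORDER BY id LIMIT ? OFFSET ?", (size, offset)
    ).fetchall()
    return [r[0] for r in rows]

print(page(3))             # [3, 4, 5]
print(page(3) == page(3))  # True: repeatable, unlike the unsorted case
```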

@gdementen
Author

it might help to speak about this over a call to understand what to do?

If you feel that helps, I am available all day tomorrow.

@MarcoGorelli
Member

but as soon as the database is updated in any way it breaks

i think it breaks even if the database isn't updated?

e.g.

In [53]: duckdb.sql("""
    ...: from assets.parquet
    ...: select
    ...:     sum(price)
    ...:     over (partition by symbol order by date rows between 5 preceding and current row)
    ...:     as price
    ...: limit 10
    ...: offset 20
    ...: """)
Out[53]:
┌────────────────────┐
│       price        │
│       double       │
├────────────────────┤
│            221.111 │
│            222.128 │
│ 223.26300000000003 │
│            224.672 │
│            224.584 │
│ 224.06500000000003 │
│            224.799 │
│ 224.93599999999998 │
│            225.014 │
│            224.408 │
├────────────────────┤
│      10 rows       │
└────────────────────┘

In [54]: duckdb.sql("""
    ...: from assets.parquet
    ...: select
    ...:     sum(price)
    ...:     over (partition by symbol order by date rows between 5 preceding and current row)
    ...:     as price
    ...: limit 10
    ...: offset 20
    ...: """)
Out[54]:
┌───────────────────┐
│       price       │
│      double       │
├───────────────────┤
│           711.569 │
│           712.867 │
│           718.608 │
│           721.159 │
│ 717.2420000000001 │
│           715.743 │
│ 717.8090000000001 │
│           716.566 │
│           714.463 │
│           713.982 │
├───────────────────┤
│      10 rows      │
└───────────────────┘

In [55]: duckdb.sql("""
    ...: from assets.parquet
    ...: select
    ...:     sum(price)
    ...:     over (partition by symbol order by date rows between 5 preceding and current row)
    ...:     as price
    ...: limit 10
    ...: offset 20
    ...: """)
Out[55]:
┌───────────────────┐
│       price       │
│      double       │
├───────────────────┤
│           820.964 │
│           820.198 │
│           826.316 │
│           830.922 │
│             823.4 │
│           818.117 │
│           821.014 │
│ 816.3889999999999 │
│            809.79 │
│           807.738 │
├───────────────────┤
│      10 rows      │
└───────────────────┘

I still think Narwhals should offer some slicing functionality at least when an order by is present, because AFAIK, all the underlying engines offer that functionality.

Sure, will think about it

@gdementen
Author

i think it breaks even if the database isn't updated?

Indeed, but I assume what you see is because duckdb is multithreaded by default. I suppose it evaluates different partitions in different threads. I don't think this is really relevant to the issue at hand, but it is still very interesting to satisfy my intellectual curiosity 😉.
