-
Notifications
You must be signed in to change notification settings - Fork 140
[Enh]: allow slicing LazyFrame #2389
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
thanks @gdementen for the request could you clarify how you do this for polars please? using |
Polars supports slicing LazyFrames directly, so this is enough: data[v_start:v_stop].select(columns[h_start:h_stop]) |
thanks, i hadn't realised that how do you paginate the results? do you use would an |
Just to demonstrate why I have mixed feelings about Polars allowing this: I'm going to make a UDF ( We'll then load up a lazyframe, which we know has 5 rows. In [8]: lf = pl.scan_csv(io.StringIO('a\n1\n2\n3\n4\n5'))
In [9]: result = lf.with_columns(pl.col('a').map_elements(func, return_dtype=pl.Int64)).sort('a') Now, what happens when you run In [10]: result[:2].collect()
x is 1
x is 2
x is 3
x is 4
x is 5
Out[10]:
shape: (2, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 1 │
│ 2 │
└─────┘
In [11]: result[2:4].collect()
x is 1
x is 2
x is 3
x is 4
x is 5
Out[11]:
shape: (2, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 3 │
│ 4 │
└─────┘ So, before changing what Narwhals allows, I just want to be very careful that we're not making it too easy for users to "shoot themselves in the foot" 😄 |
I am indeed calling
Sure. |
That's because of the sort. It does not do that if there are only pure element-wise operations & filters AFAICT. |
That's right, it only happens if there are operations which block slice pushdown. But, if you're displaying a lazyframe provided by the user, then you have no control over whether they've applied any such operation, right? I am concerned about allowing this in Narwhals, which is intended as a stricter / safer subset of the Polars API I'm inclined to decline then, and instead suggest:
|
Now that I think of it, I don't think it's a good idea because then the Narwhals API would no longer be a subset of the Polars API, just a different API. |
true, but i think it's ok to add extra keyword arguments where necessary so long as there's no conflict with what's in Polars |
Indeed. That's a price I am ready to pay for the other benefits it gives though. It might be possible to detect any such operation in the query plan and display a warning before collecting such a lazyframe but I am not there yet (won't be for a long time as this is basically a one-man project).
Ok, I understand that.
That wouldn't work for my main/initial usecase (to display larger-than-RAM datasets loaded via scan_xxx). Allowing any lazyframe (including simple expressions on it) was just an after thought which worked nicely as an added bonus, but you raise an interesting point. To be honest, I am content with the result I got when using the Polars API directly. I just thought Narwhals would expand the supported cases for "free".
Would be interesting for pagination, but would not really help my use case (infini-scroll) where I need to be able to access a random chunk in the middle. |
thanks for explaining! i'm keen to understand the use-case more say you want to support duckdb. in that case, showing to get deterministic results, then as you said, you'd need to sort. but then:
it might help to speak about this over a call to understand what to do? |
Thanks a lot for taking that time, it's really appreciated.
I didn't know sql ordering was that "unreliable". I knew it was undefined (i.e., the database engine choose whatever order it wants), but I always assumed it would be consistent between two calls. Now I realize that the databases I work with are all "almost readonly" (read often, updated extremely rarely), which make that assumption appear to be true, but as soon as the database is updated in anyway it breaks (and I can now imagine cases -- e.g. multithreaded engine -- where it breaks even without any modification to the database). Thanks for pointing this to me, and I am sorry for my ignorance. With all that said, I will probably avoid using my code in the general case (and only use it in more controlled cases) as you convinced me it's not a good idea to do this (too many unknowns) but I still think Narwhals should offer some slicing functionality at least when an order by is present, because AFAIK, all the underlying engines offer that functionality. Whether that is efficient for my particular use-case or not should be irrelevant for you 😉. |
If you feel that helps, I am available all day tomorrow. |
i think it breaks even if the database isn't updated? e.g. In [53]: duckdb.sql("""
...: from assets.parquet
...: select
...: sum(price)
...: over (partition by symbol order by date rows between 5 preceding and current row)
...: as price
...: limit 10
...: offset 20
...: """)
Out[53]:
┌────────────────────┐
│ price │
│ double │
├────────────────────┤
│ 221.111 │
│ 222.128 │
│ 223.26300000000003 │
│ 224.672 │
│ 224.584 │
│ 224.06500000000003 │
│ 224.799 │
│ 224.93599999999998 │
│ 225.014 │
│ 224.408 │
├────────────────────┤
│ 10 rows │
└────────────────────┘
In [54]: duckdb.sql("""
...: from assets.parquet
...: select
...: sum(price)
...: over (partition by symbol order by date rows between 5 preceding and current row)
...: as price
...: limit 10
...: offset 20
...: """)
Out[54]:
┌───────────────────┐
│ price │
│ double │
├───────────────────┤
│ 711.569 │
│ 712.867 │
│ 718.608 │
│ 721.159 │
│ 717.2420000000001 │
│ 715.743 │
│ 717.8090000000001 │
│ 716.566 │
│ 714.463 │
│ 713.982 │
├───────────────────┤
│ 10 rows │
└───────────────────┘
In [55]: duckdb.sql("""
...: from assets.parquet
...: select
...: sum(price)
...: over (partition by symbol order by date rows between 5 preceding and current row)
...: as price
...: limit 10
...: offset 20
...: """)
Out[55]:
┌───────────────────┐
│ price │
│ double │
├───────────────────┤
│ 820.964 │
│ 820.198 │
│ 826.316 │
│ 830.922 │
│ 823.4 │
│ 818.117 │
│ 821.014 │
│ 816.3889999999999 │
│ 809.79 │
│ 807.738 │
├───────────────────┤
│ 10 rows │
└───────────────────┘
Sure, will think about it |
Indeed but I assume what you see is because duckdb is multithreaded by default. I suppose it evaluates different partitions by different threads. I don't think this is really relevant to the issue at hand but it still very interesting to satisfy my intellectual curiosity 😉. |
We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?
I am currently creating adapters so that my Qt-based UI (https://github.com/larray-project/larray-editor/) becomes a lot more generic instead of being only useful for our own project. The goal is to make it able to display any grid-like structure (arrays, dataframes, or even some files).
Please describe the purpose of the new feature or describe the problem to solve.
For obvious performance reasons, it is much faster to only compute what is actually visible on the screen, so my adapters try to do exactly that. I implemented that easily using the direct Polars API for both DataFrame and LazyFrame (and Ibis Table) with great result. However, when using the Narwhals Lazy API (on top of a Polars LazyFrame), I have not found any direct way to select specific rows other than top rows. I had to resort to an ugly workaround which works but makes things uncomfortably slow :
I know of https://narwhals-dev.github.io/narwhals/basics/order_dependence/ and the fact that lazyFrame.tail() is deprecated but I would advocate for implementing slicing anyway. After all, even if one cannot guarantee the order (unless an explicit sort was used) and the order is arbitrary, it makes sense to paginate the result. Another argument is that all SQL engines I know of support some kind of offset command for pagination (whether via limit/offset or offset fetch which is standard SQL AFAIK). Some SQL engine require sorting, for offset to work, but in my (limited) experience this is rather the exception than the rule.
Suggest a solution if possible.
One option might be to allow slicing only if the query includes a sort operation. It would not help my specific case though.
If you have tried alternatives, please describe them below.
No response
Additional information that may help us understand your needs.
No response
The text was updated successfully, but these errors were encountered: