api: formalise expression expansion in group-by #2225
@MarcoGorelli hope it's okay that I edited the description. This way we get the "Listed in #XXXX" thing.
I think the last example would be surprising behavior of the column.
A separate thought (and a divergence from the Polars API) would be to include a special selector to include/exclude the group keys. The default behavior could be to exclude the group_by keys from the result set when using other selectors:

```python
# df has columns a, b, c
df.group_by('a').agg(nw.all().sum())                      # has b, c in result
df.group_by('a').agg((nw.all() | nw.group_keys()).sum())  # has a, b, c in result
```

This would help keep us away from a static argument in
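A hypothetical sketch of the expansion semantics a `group_keys()` selector could give: the function name `expand_all` and its `include_group_keys` flag are illustrative only, not part of any real narwhals API.

```python
# Hypothetical sketch: how a `group_keys()`-style selector could toggle
# whether an `all()` expansion inside `agg` covers the group-by keys.
def expand_all(columns, group_keys, include_group_keys=False):
    """Return the columns an `all()`-style selector would expand to."""
    if include_group_keys:
        return list(columns)
    keys = set(group_keys)
    return [c for c in columns if c not in keys]

# df has columns a, b, c and is grouped by 'a'
print(expand_all(["a", "b", "c"], ["a"]))                           # ['b', 'c']
print(expand_all(["a", "b", "c"], ["a"], include_group_keys=True))  # ['a', 'b', 'c']
```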
I also find it surprising, and I'm inclined to consider this a bug in Polars.
Another one to think about is M:1, e.g.

```python
df.group_by('a').agg(nw.sum_horizontal(nw.all()).sum().name.suffix('_agg'))
```

Here Polars considers all columns (even the grouped-by keys) in the inner expression.
I think having

```python
import polars as pl
import duckdb

print(
    f"{pl.__version__ = }",      # 1.25.2
    f"{duckdb.__version__ = }",  # 1.2.1
    sep='\n',
)

df = pl.DataFrame({
    'a': [*'xxyyz'],
    'b': [1, 2, 3, 4, 5],
    'c': [7, 8, 9, 10, 11],
})

## Group-by & 1:1 expressions
# polars deviates from traditional SQL group-by behavior because the grouped
# key is inserted into the results by default
print(  # 'a' is in output
    df
    .group_by('a')
    .agg(pl.col('b', 'c').sum()),
)
print(  # 'a' is not in output, unless explicitly requested
    duckdb.sql('''
        select sum(b), sum(c)
        from df
        group by "a"
    ''')
)

## Group-by & selectors (1:M expressions)
# the grouping column often cannot be treated in the same fashion as the value
# columns, typically due to a semantic difference in the columns or a strict
# data type difference. Polars avoids this issue by excluding the grouping
# column from the computation, then inserting it later on
print(  # 'a' is in the output, but its sum was not computed;
    df  # this indicates that `pl.all()` was expanded to NOT include 'a'
    .group_by('a')
    .agg(pl.all().sum()),
)
# duckdb does not perform this exclusion and thus raises an error due to sum(VARCHAR)
# print(
#     duckdb.sql('''
#         select sum(columns(*)),
#         from df
#         group by "a"
#     ''')
# )

## Group-by & selectors (M:1 expressions)
# Polars fully respects the selector in this case, though this feels almost
# inconsistent with the behavior above, as `pl.all()` is being expanded
# differently in these cases
print(  # 'a' is in the output, AND is used in the computation
    # I was actually surprised type promotion happened here. This may be a bug?
    df
    .group_by('a')
    .agg(
        result=pl.sum_horizontal(pl.all())
    ),
)
# duckdb exhibits the same behavior, but raises instead of automatically casting:
# "cannot create a list of types VARCHAR and BIGINT"
# print(
#     duckdb.sql('''
#         select list_value(*columns(*)),
#         from df
#         group by "a"
#     ''')
# )
```

So it seems like the following rules exist for these tools for operations within `agg`:
I think the underlying cause of confusion here is that Polars includes the grouping column in the result set by default, thus ensuring the result set is always
Thinking about this more, I don't think that's the issue. The issue is what happens with multi-output expressions, and how they get expanded. In binary comparisons, we already only allow multi-output expressions on the left-hand side. So, this should be tractable.
A few more thoughts: in `ExprMetadata`, we could track
These can change according to the following rules:
By the time we get to
This has a few properties that I like:
@MarcoGorelli, in relation to these rules: what subset of expressions do you think could be permitted in `agg`? I know it isn't possible yet, but I thought it might be a nice exercise in dogfooding the categories.
Expressions to
For the latter part, we should probably also track `n_closed_windows`.
So then, instead of

```python
if arg._metadata.n_open_windows > 0:
```

we check

```python
if arg._metadata.n_open_windows > arg._metadata.n_closed_windows:
```

Then

Does this make sense? Happy for you to do this.
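As a sanity check on the counting scheme described above, here is a tiny standalone sketch. The field names follow this discussion; the real narwhals `ExprMetadata` carries more fields and may differ.

```python
# Standalone sketch of open/closed window counting: an expression like
# `cum_sum` opens a window, and a subsequent `over(order_by=...)` closes it.
# Only when open > closed is a window still "dangling".
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ExprMetadata:
    n_open_windows: int = 0
    n_closed_windows: int = 0

meta = ExprMetadata()
# something like `cum_sum` opens a window:
meta = replace(meta, n_open_windows=meta.n_open_windows + 1)
print(meta.n_open_windows > meta.n_closed_windows)   # True: still open, would be rejected in agg
# a subsequent `.over(order_by=...)` closes it:
meta = replace(meta, n_closed_windows=meta.n_closed_windows + 1)
print(meta.n_open_windows > meta.n_closed_windows)   # False: balanced, now fine
```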
Really appreciate the detail, thanks @MarcoGorelli!
Mostly, the only bit I'm missing is the *why* for this part:

> Open/Close

but the *how* seems sound 🙂
Oh I'm interested, don't be silly, there's no demand here 😅 Your comment reminds me of (#2078 (comment)) and I think it could be helpful here. The API change could look like this:

**`ExprMetadata.with_window_open()`**

```python
metadata: ExprMetadata

# Current
metadata.with_extra_open_window()
# Proposed
metadata.with_window_open()
```

**`ExprMetadata.with_window_close()`**

```python
# Current
n_open_windows = metadata.n_open_windows
if _order_by is not None and metadata.kind.is_window():
    n_open_windows -= 1
next_meta = ExprMetadata(
    kind,
    n_open_windows=n_open_windows,
    expansion_kind=metadata.expansion_kind,
)
# Proposed
metadata.with_window_close()
```

**`ExprMetadata.is_window_open()`**

```python
# Current
if metadata.n_open_windows > 0:
if metadata.n_open_windows > metadata.n_closed_windows:
# Proposed
if metadata.is_window_open():
```

Those methods could still work internally with `n_open_windows`/`n_closed_windows`.
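For concreteness, a minimal sketch of how those three proposed methods could hang together. This is illustrative only; the real `ExprMetadata` carries more fields (e.g. `kind`, `expansion_kind`) and the method names are the proposal above, not merged API.

```python
# Sketch of the proposed ExprMetadata window helpers, backed internally by
# the existing open/closed counters.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ExprMetadata:
    n_open_windows: int = 0
    n_closed_windows: int = 0

    def with_window_open(self) -> "ExprMetadata":
        return replace(self, n_open_windows=self.n_open_windows + 1)

    def with_window_close(self) -> "ExprMetadata":
        return replace(self, n_closed_windows=self.n_closed_windows + 1)

    def is_window_open(self) -> bool:
        return self.n_open_windows > self.n_closed_windows

meta = ExprMetadata().with_window_open()
print(meta.is_window_open())                      # True
print(meta.with_window_close().is_window_open())  # False
```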
> if arg._metadata.n_open_windows > 0:

Lines 1590 to 1596 in b563ba7:

```python
n_open_windows = self._metadata.n_open_windows
if flat_order_by is not None and self._metadata.kind.is_window():
    n_open_windows -= 1
current_meta = self._metadata
next_meta = ExprMetadata(
    kind,
    n_open_windows=n_open_windows,
```

narwhals/narwhals/_expression_parsing.py, lines 312 to 313 in b563ba7:

```python
if arg._metadata.n_open_windows:
    result_n_open_windows += 1
```
Bringing it all the way back to:

> and would have to have had no windows in their history

I'm not sure if that part is captured yet. I imagine the check is whether either `n_(open|closed)_windows` is non-zero?
We need to formalise the rules around expression expansion in `agg` when `expr` isn't a simple single-column expression. The rules aren't totally clear in Polars either: see `pl.col('a', 'b')` vs `pl.nth(0, 1)` vs `pl.all()` in pola-rs/polars#21773, where I noted that `agg(pl.col('a', 'b').sum())` differs from `agg(pl.nth(0, 1).sum())`.

It looks to me like the rule maybe is:

> When expressions in `agg` are expanded out, group-by keys are excluded, unless they are selected explicitly by name.

For example:

- `group_by('a').agg(nw.all().sum())`: column `'a'` should be excluded from the expansion
- `group_by('a').agg(nw.selectors.by_type(nw.Float32).sum())`: column `'a'` should be excluded from the expansion
- `group_by('a').agg(nw.col('a', 'b').sum())`: column `'a'` should be included in the expansion
- `group_by('a').agg(nw.nth(0, 1).sum())`: column `'a'` should be excluded from the expansion

Regardless of what Polars does, we may want to think about what we'd consider a good rule, as there's no guarantee that Polars would remain stable here anyway.