Skip to content

[Enh]: Support a more general set of arrays from pyarrow #2237

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
1 task
matthewwardrop opened this issue Mar 18, 2025 · 12 comments
Open
1 task

[Enh]: Support a more general set of arrays from pyarrow #2237

matthewwardrop opened this issue Mar 18, 2025 · 12 comments
Labels
pyarrow Issue is related to pyarrow backend upstream issue

Comments

@matthewwardrop
Copy link

matthewwardrop commented Mar 18, 2025

Please describe the purpose of the new feature or describe the problem to solve.

Currently, *.from_native support pyarrow chunked arrays, and no other pyarrow array types. But narwhals operations on pyarrow tables sometimes result in arrays of type pyarrow.StringArray or other high-level Array types. I made one small changed in the pyarrow ingestion code to allow for instances of pyarrow.Array instead of ChunkedArray and everything worked fine. Is there are reason that you restricted things to ChunkedArray?

Suggest a solution if possible.

Be a bit more leniant during pyarrow series ingestion.

Tracking

@MarcoGorelli
Copy link
Member

thanks @matthewwardrop for the report!

But narwhals operations on pyarrow tables sometimes result in arrays of type pyarrow.StringArray or other high-level Array types

could you give an example please? that doesn't sound intentional

Is there are reason that you restricted things to ChunkedArray?

I just think of it as the "Series" equivalent to pyarrow.Table - if you get a column out of a table, it's a ChunkedArray

@matthewwardrop
Copy link
Author

Hi @MarcoGorelli ! I don't have a full reproducer yet, I'm afraid. But I had some code that looked like:

series: narwhals.Series  # backend = pyarrow
top_values: list[str] = ["Sydey", ...]
overflow_value: str = "Other"

out = (
    series.alias(name)
    .to_frame()
    .select(
        narwhals.when(narwhals.col(name).is_in(top_values))
        .then(narwhals.col(name))
        .otherwise(narwhals.lit(overflow_value))
    )[name]

And out would sometimes be a pyarrow.StringArray rather than a pyarrow.ChunkedArray. I'll see if I can get a full reproducer working.

@MarcoGorelli
Copy link
Member

thanks - is this still the case on the latest Narwhals version? i remember we did have an issue related to this some time ago but it should have been fixed

@matthewwardrop
Copy link
Author

matthewwardrop commented Mar 18, 2025

Hi again @MarcoGorelli ,

I tested on the latest version of narwhals (1.31) and can confirm that the following code still outputs StringArray types:

import narwhals
import numpy
import pandas
import pyarrow

df = pyarrow.Table.from_pandas(
    pandas.DataFrame({"field": numpy.random.choice(["A", "B", "C", None], 1000)})
)

n = narwhals.from_native(df)

(
    n["field"].to_arrow()
)  # pyarrow.StringArray

When I tried to then import this back through the system via narwhals.from_native() this fails unless I further patch narwhals to accept Array types as I indicated above. I think the error is in the final .to_arrow() method, since the narwhals Series in the line above does report that the backing dataset is a ChunkedArray.

@matthewwardrop
Copy link
Author

matthewwardrop commented Mar 18, 2025

Yeah... the .combine_chunks() in the to_arrow() method appears to be the culprit... but it's not 100% clear to me that this behaviour is particularly bad. However, it would be nice to be able to do .from_native() for any type that gets output by the helper methods.

@MarcoGorelli
Copy link
Member

ah yes, to_arrow outputs an PyArrow Array, as that's what Polars does

I'd suggest either using to_native, or calling pa.chunked_array on the result of to_arrow

@matthewwardrop
Copy link
Author

Is it reasonable, per the original topic, to have narwhals upcast for the user? Or is that out of scope for the .from_native() call?

@matthewwardrop
Copy link
Author

matthewwardrop commented Mar 18, 2025

For more context, this is related to the alignment of namespaces I posted in a previous issue (#2193). I want to be able to cast values of one type to another. Typically that means doing: nw.from_native(<df/series>.to_<backend>())... but that doesn't work just for the pyarrow backend. I can add some epicycles to the conversion logic, but it does seem suboptimal :). And we need to avoid polars because we do process forking, and polars does not like that.

@MarcoGorelli
Copy link
Member

thanks for explaining

I think this is out-of-scope for from_native, but I'm wondering if we should add a combine_chunks: bool argument to to_arrow. In fact I'm a bit surprised that Polars always returns pyarrow.Array...

Would that work for you / make things any easier?

@dangotbanned

This comment has been minimized.

@MarcoGorelli
Copy link
Member

nice, thanks

we could just be one step ahead of Polars and immediately return ChunkedArray 😄 😎

@matthewwardrop
Copy link
Author

Hi @MarcoGorelli ! It definitely seems like a good idea to avoid unnecessary rechunking. Adding the option or waiting until something similar lands upstream seems somewhat reasonable. It's a bit of a surprise that dropping the chunking is the default behaviour :). Thanks for your prompt responses (and apologies for the delays in my own!).

@dangotbanned dangotbanned added pyarrow Issue is related to pyarrow backend upstream issue labels Mar 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pyarrow Issue is related to pyarrow backend upstream issue
Projects
None yet
Development

No branches or pull requests

3 participants