[Enh]: Support a more general set of arrays from pyarrow #2237

matthewwardrop · 2025-03-18T02:43:38Z

Please describe the purpose of the new feature or describe the problem to solve.

Currently, *.from_native support pyarrow chunked arrays, and no other pyarrow array types. But narwhals operations on pyarrow tables sometimes result in arrays of type pyarrow.StringArray or other high-level Array types. I made one small changed in the pyarrow ingestion code to allow for instances of pyarrow.Array instead of ChunkedArray and everything worked fine. Is there are reason that you restricted things to ChunkedArray?

Suggest a solution if possible.

Be a bit more leniant during pyarrow series ingestion.

Tracking

Dataframe to_arrow causes silent rechunk pola-rs/polars#21738

The text was updated successfully, but these errors were encountered:

MarcoGorelli · 2025-03-18T08:19:46Z

thanks @matthewwardrop for the report!

But narwhals operations on pyarrow tables sometimes result in arrays of type pyarrow.StringArray or other high-level Array types

could you give an example please? that doesn't sound intentional

Is there are reason that you restricted things to ChunkedArray?

I just think of it as the "Series" equivalent to pyarrow.Table - if you get a column out of a table, it's a ChunkedArray

matthewwardrop · 2025-03-18T18:06:28Z

Hi @MarcoGorelli ! I don't have a full reproducer yet, I'm afraid. But I had some code that looked like:

series: narwhals.Series  # backend = pyarrow
top_values: list[str] = ["Sydey", ...]
overflow_value: str = "Other"

out = (
    series.alias(name)
    .to_frame()
    .select(
        narwhals.when(narwhals.col(name).is_in(top_values))
        .then(narwhals.col(name))
        .otherwise(narwhals.lit(overflow_value))
    )[name]

And out would sometimes be a pyarrow.StringArray rather than a pyarrow.ChunkedArray. I'll see if I can get a full reproducer working.

MarcoGorelli · 2025-03-18T18:57:39Z

thanks - is this still the case on the latest Narwhals version? i remember we did have an issue related to this some time ago but it should have been fixed

matthewwardrop · 2025-03-18T20:22:15Z

Hi again @MarcoGorelli ,

I tested on the latest version of narwhals (1.31) and can confirm that the following code still outputs StringArray types:

import narwhals
import numpy
import pandas
import pyarrow

df = pyarrow.Table.from_pandas(
    pandas.DataFrame({"field": numpy.random.choice(["A", "B", "C", None], 1000)})
)

n = narwhals.from_native(df)

(
    n["field"].to_arrow()
)  # pyarrow.StringArray

When I tried to then import this back through the system via narwhals.from_native() this fails unless I further patch narwhals to accept Array types as I indicated above. I think the error is in the final .to_arrow() method, since the narwhals Series in the line above does report that the backing dataset is a ChunkedArray.

matthewwardrop · 2025-03-18T20:28:33Z

Yeah... the .combine_chunks() in the to_arrow() method appears to be the culprit... but it's not 100% clear to me that this behaviour is particularly bad. However, it would be nice to be able to do .from_native() for any type that gets output by the helper methods.

MarcoGorelli · 2025-03-18T20:44:53Z

ah yes, to_arrow outputs an PyArrow Array, as that's what Polars does

I'd suggest either using to_native, or calling pa.chunked_array on the result of to_arrow

matthewwardrop · 2025-03-18T20:47:07Z

Is it reasonable, per the original topic, to have narwhals upcast for the user? Or is that out of scope for the .from_native() call?

matthewwardrop · 2025-03-18T20:50:15Z

For more context, this is related to the alignment of namespaces I posted in a previous issue (#2193). I want to be able to cast values of one type to another. Typically that means doing: nw.from_native(<df/series>.to_<backend>())... but that doesn't work just for the pyarrow backend. I can add some epicycles to the conversion logic, but it does seem suboptimal :). And we need to avoid polars because we do process forking, and polars does not like that.

MarcoGorelli · 2025-03-19T15:05:00Z

thanks for explaining

I think this is out-of-scope for from_native, but I'm wondering if we should add a combine_chunks: bool argument to to_arrow. In fact I'm a bit surprised that Polars always returns pyarrow.Array...

Would that work for you / make things any easier?

MarcoGorelli · 2025-03-19T15:37:39Z

nice, thanks

we could just be one step ahead of Polars and immediately return ChunkedArray 😄 😎

matthewwardrop · 2025-03-25T18:26:23Z

Hi @MarcoGorelli ! It definitely seems like a good idea to avoid unnecessary rechunking. Adding the option or waiting until something similar lands upstream seems somewhat reasonable. It's a bit of a surprise that dropping the chunking is the default behaviour :). Thanks for your prompt responses (and apologies for the delays in my own!).

This comment has been minimized.

Sign in to view

dangotbanned added pyarrow Issue is related to pyarrow backend upstream issue labels Mar 25, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Enh]: Support a more general set of arrays from pyarrow #2237

[Enh]: Support a more general set of arrays from pyarrow #2237

matthewwardrop commented Mar 18, 2025 •

edited by dangotbanned

Loading

MarcoGorelli commented Mar 18, 2025

matthewwardrop commented Mar 18, 2025

MarcoGorelli commented Mar 18, 2025

matthewwardrop commented Mar 18, 2025 •

edited

Loading

matthewwardrop commented Mar 18, 2025 •

edited

Loading

MarcoGorelli commented Mar 18, 2025

matthewwardrop commented Mar 18, 2025

matthewwardrop commented Mar 18, 2025 •

edited

Loading

MarcoGorelli commented Mar 19, 2025

This comment has been minimized.

MarcoGorelli commented Mar 19, 2025

matthewwardrop commented Mar 25, 2025

[Enh]: Support a more general set of arrays from pyarrow #2237

[Enh]: Support a more general set of arrays from pyarrow #2237

Comments

matthewwardrop commented Mar 18, 2025 • edited by dangotbanned Loading

Please describe the purpose of the new feature or describe the problem to solve.

Suggest a solution if possible.

Tracking

MarcoGorelli commented Mar 18, 2025

matthewwardrop commented Mar 18, 2025

MarcoGorelli commented Mar 18, 2025

matthewwardrop commented Mar 18, 2025 • edited Loading

matthewwardrop commented Mar 18, 2025 • edited Loading

MarcoGorelli commented Mar 18, 2025

matthewwardrop commented Mar 18, 2025

matthewwardrop commented Mar 18, 2025 • edited Loading

MarcoGorelli commented Mar 19, 2025

This comment has been minimized.

MarcoGorelli commented Mar 19, 2025

matthewwardrop commented Mar 25, 2025

matthewwardrop commented Mar 18, 2025 •

edited by dangotbanned

Loading

matthewwardrop commented Mar 18, 2025 •

edited

Loading

matthewwardrop commented Mar 18, 2025 •

edited

Loading

matthewwardrop commented Mar 18, 2025 •

edited

Loading