# PDEP-10: Add pyarrow as a required dependency #52711

Merged on Jul 30, 2023, with 40 commits (the view below shows changes from 21 commits):
- `89a3a3b` Start pdep 10 (mroeschke, Apr 14, 2023)
- `cf88b43` Merge remote-tracking branch 'upstream/main' into pdep/pyarrow (mroeschke, Apr 17, 2023)
- `dafa709` finish drawbacks, fix other sections (mroeschke, Apr 17, 2023)
- `5e1fbd1` Add number (mroeschke, Apr 17, 2023)
- `44a3321` our current version is 7 not 6 (mroeschke, Apr 17, 2023)
- `ea9f5e3` Merge remote-tracking branch 'upstream/main' into pdep/pyarrow (mroeschke, Apr 18, 2023)
- `fbd1aa0` Clarify and fix typo (mroeschke, Apr 18, 2023)
- `6d667b4` Update web/pandas/pdeps/0010-required-pyarrow-dependency.md (phofl, Apr 21, 2023)
- `bed5f0b` Update web/pandas/pdeps/0010-required-pyarrow-dependency.md (phofl, Apr 21, 2023)
- `12622bb` Update web/pandas/pdeps/0010-required-pyarrow-dependency.md (phofl, Apr 21, 2023)
- `864b8d1` Add string as a preferential pyarrow type (mroeschke, Apr 21, 2023)
- `2d4f4fd` Add metric about number of pyarrow import checks (mroeschke, Apr 21, 2023)
- `bb332ca` Clarify with actual call (mroeschke, Apr 21, 2023)
- `a8275fa` Clarify with actual call (mroeschke, Apr 21, 2023)
- `1148007` Merge remote-tracking branch 'upstream/main' into pdep/pyarrow (mroeschke, Apr 28, 2023)
- `b406dc1` Address some comments (mroeschke, Apr 28, 2023)
- `ecc4d5b` Update 0010-required-pyarrow-dependency.md (phofl, Apr 28, 2023)
- `ec1c0e3` Update 0010-required-pyarrow-dependency.md (phofl, Apr 28, 2023)
- `23eb251` add Patrick as an author, remove constraint on only bumping during ma… (mroeschke, Apr 28, 2023)
- `dd7c62a` Merge remote-tracking branch 'upstream/main' into pdep/pyarrow (mroeschke, May 9, 2023)
- `2ddd82a` Change required proposal for 3.0 to be version requiring pyarrow & st… (mroeschke, May 9, 2023)
- `3c54d22` Merge remote-tracking branch 'upstream/main' into pdep/pyarrow (mroeschke, May 9, 2023)
- `1b60fbb` Address typos (mroeschke, May 9, 2023)
- `70cdf74` Merge branch 'main' into pdep/pyarrow (mroeschke, May 24, 2023)
- `14602a6` Merge branch 'main' into pdep/pyarrow (mroeschke, Jun 1, 2023)
- `2cfb92f` Merge branch 'main' into pdep/pyarrow (mroeschke, Jun 9, 2023)
- `e0e406c` Merge branch 'main' into pdep/pyarrow (mroeschke, Jun 20, 2023)
- `f047032` Update 0010-required-pyarrow-dependency.md (phofl, Jul 2, 2023)
- `ed28c04` Update web/pandas/pdeps/0010-required-pyarrow-dependency.md (phofl, Jul 3, 2023)
- `99de932` Update 0010-required-pyarrow-dependency.md (phofl, Jul 4, 2023)
- `99fd739` Update 0010-required-pyarrow-dependency.md (phofl, Jul 4, 2023)
- `9384bc7` Update 0010-required-pyarrow-dependency.md (phofl, Jul 4, 2023)
- `c3beeb3` Update 0010-required-pyarrow-dependency.md (phofl, Jul 4, 2023)
- `8347e83` improve structure, list user benefits more clearly, add faq (MarcoGorelli, Jul 5, 2023)
- `d740403` restore little demo (MarcoGorelli, Jul 5, 2023)
- `959873e` remove masked part, note that pyarrow dtyeps will likely be ready by 3 (MarcoGorelli, Jul 5, 2023)
- `f936280` Merge pull request #26 from MarcoGorelli/pdep10-amendments (mroeschke, Jul 6, 2023)
- `2db0037` Update 0010-required-pyarrow-dependency.md (phofl, Jul 13, 2023)
- `c2b8cfe` Merge branch 'main' into pdep/pyarrow (mroeschke, Jul 25, 2023)
- `4e05151` Update 0010-required-pyarrow-dependency.md (phofl, Jul 30, 2023)
## web/pandas/pdeps/0010-required-pyarrow-dependency.md (151 additions, 0 deletions)
# PDEP-10: PyArrow as a required dependency for default string inference implementation

- Created: 17 April 2023
- Status: Under discussion
- Discussion: [#52711](https://github.com/pandas-dev/pandas/pull/52711)
[#52509](https://github.com/pandas-dev/pandas/issues/52509)
- Authors: [Matthew Roeschke](https://github.com/mroeschke)
[Patrick Hoefler](https://github.com/phofl)
- Revision: 1

## Abstract

This PDEP proposes that:

- PyArrow becomes a runtime dependency starting with pandas 3.0
- Starting with pandas 3.0, the minimum supported version of PyArrow is version 7.
- When the minimum version of PyArrow is bumped, it will be raised to the highest PyArrow version that has
been released for at least 2 years.
- Starting in pandas 2.1, pandas will raise a ``FutureWarning`` when it needs to infer string data, warning that
the inferred data type will become `ArrowDtype` with `pyarrow.string` instead of `object` in the future
**Comment (Member):**

don't we want to go further, and raise a FutureWarning upon `import pandas` if pyarrow isn't installed, warning that in the future it will become a required dependency?

**Comment (Contributor):**

> don't we want to go further, and raise a FutureWarning upon `import pandas` if pyarrow isn't installed, warning that in the future it will become a required dependency?

I agree with Marco here. I'd also suggest that if we go that route, the message points to a Github issue where we can gather feedback

**Comment (Member):**

Yeah a feedback issue is a very good idea.

**Comment (Contributor):**

I think feedback is also a great idea, but isn't raising a warning on import so soon after just releasing 2.0 for the next major release counterproductive for the whole user experience? Not aware of any other solution but I think this might cause a lot of frustrations.

**Comment (Contributor):**

> I think feedback is also a great idea, but isn't raising a warning on import so soon after just releasing 2.0 for the next major release counterproductive for the whole user experience? Not aware of any other solution but I think this might cause a lot of frustrations.

And the frustration can be solved by the user installing pyarrow. If they don't want to do that, we'll get the feedback and maybe have to back off on making it a requirement if we get lots of frustrated users.

**Comment (Contributor):**

> So, almost every regular pandas user will get an intimidating warning just by importing the module, feeling they did something wrong when they didn't.

That depends on how we word the warning. If we say something like "You better install pyarrow now or everything will break", that will scare them. If we say something like "Starting with pandas 3.0, pyarrow will become a required installed dependency for pandas. Install it now to identify any potential issues and to remove this warning. Report issues to https://github.com/pandas-dev/pandas/issues/xxxxx" I don't think the latter is intimidating.

Having said that, I think the specifics of when this warning will appear should be detailed as part of this PDEP.

**Comment (Member):**

To be fair, "just" warning when inferring string data will result in practically every user seeing the warning anyway, so maybe that's enough (if I understand correctly?)

So, if I install pandas 2.1.0 and don't have pyarrow installed, then `pd.Series(['foo'])` would raise a `FutureWarning` telling me that in the future the default will be a pyarrow string dtype, and that to opt in to the new behaviour I need to install pyarrow and set `dtype='string[pyarrow]'`? Whereas if I did have pyarrow installed, then the warning would just say to set `dtype='string[pyarrow]'`?

Setting `dtype=` everywhere to silence the warning could be quite a lot of work; maybe there's a simpler way for users to opt in to this?

**Comment (Member):**

I wouldn't personally warn for that either. Afaik there is no change in behavior when changing the data type of strings to be pyarrow. While we let users see and choose the type, I think it's more of an implementation detail than anything the user should care about.

We will be writing in the documentation, blogs... about the change for advanced users to know. But for most pandas users it's a change they don't care about, and I don't think we should be annoying them by showing warnings, or asking them to be explicit with data types.

**Comment (Contributor):**

@MarcoGorelli this is a good point and compromise. Since one of the arguments in favour of this PDEP regards string treatment, this seems like a good place to put the warning and relay the message about feedback in a GH issue.
It also allows a pandas global variable to suppress warnings such as this, rather than having to rely on an environment variable to suppress an 'on import' warning.

**Comment (Member, author):**

Right now I am leaning towards @MarcoGorelli's suggestion of only warning when pandas needs to perform string inference in 2.1. I also think it's a good middle ground: it warns that pyarrow will be required and shows where it will make a difference in 3.0.

- Starting in pandas 3.0, the default type inferred for string data will be `ArrowDtype` with `pyarrow.string`
instead of `object`

## Background

PyArrow is an optional dependency of pandas that provides a wide range of supplemental features to pandas:

- Since pandas 0.21.0, PyArrow provided I/O reading functionality for Parquet
- Since pandas 1.2.0, pandas integrated PyArrow into the `ExtensionArray` interface to provide an
optional string data type backed by PyArrow
- Since pandas 1.4.0, PyArrow provided I/O reading functionality for CSV
- Since pandas 1.5.0, pandas provided an `ArrowExtensionArray` and `ArrowDtype` to support all PyArrow
data types within the `ExtensionArray` interface
- Since pandas 2.0.0, all I/O readers have the option to return PyArrow-backed data types, and many methods
now utilize PyArrow compute functions to
accelerate operations on PyArrow-backed data in pandas, notably for string and datetime types.

As of pandas 2.0, one can feasibly utilize PyArrow as an alternative data representation to NumPy with advantages such as:

1. Consistent `NA` support for all data types
2. Broader support of data types such as `decimal`, `date` and nested types
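
To make these two points concrete, here is a minimal sketch (assuming pandas >= 2.0 with pyarrow installed; the example values are arbitrary):

```python
from decimal import Decimal

import pandas as pd
import pyarrow as pa

# Missing values are consistently pd.NA, regardless of the type
pd.Series([1.5, None], dtype=pd.ArrowDtype(pa.float64()))

# Types with no native NumPy equivalent, such as decimal and nested lists
pd.Series([Decimal("1.10"), Decimal("2.50")], dtype=pd.ArrowDtype(pa.decimal128(10, 2)))
pd.Series([[1, 2], [3]], dtype=pd.ArrowDtype(pa.list_(pa.int64())))
```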

Additionally, when users pass string data into pandas constructors without specifying a data type, the resulting data type
is `object`. With pyarrow string support available since pandas 1.2.0, requiring pyarrow for 3.0 will allow pandas to default
the inferred type to the more efficient pyarrow string type.

```python
In [1]: import pandas as pd

In [2]: pd.Series(["a"]).dtype
Out[2]: dtype('O')
```
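
The pyarrow-backed behavior this proposal would make the default can already be opted into explicitly (a minimal sketch, assuming pyarrow is installed):

```python
In [3]: import pyarrow as pa

In [4]: pd.Series(["a"], dtype=pd.ArrowDtype(pa.string())).dtype
Out[4]: string[pyarrow]
```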

## Motivation

While all the functionality described in the previous section is currently optional, PyArrow has significant
integration into many areas of pandas. Our roadmap notes that pandas strives for better Apache Arrow
interoperability [^1], and many projects [^2], within or beyond the Python ecosystem, are adopting or interacting with
the Arrow format. Making PyArrow a required dependency therefore provides an additional signal of confidence in the Arrow
ecosystem to pandas users.

Additionally, requiring PyArrow would simplify related development within pandas and improve on functionality
that is currently implemented with NumPy but better suited to PyArrow, including:

- Avoiding runtime checks for whether PyArrow is available before performing PyArrow object inference during constructor or indexing operations (a sketch of this guard pattern appears after this list)
**Comment (Member):**

Are there any small code samples we can add to drive this point home? I think still we would make a runtime determination whether to return a pyarrow or numpy-backed object even if both are installed, no?

**Comment (MarcoGorelli, Member, Jul 3, 2023):**

not sure this comment by Will has been addressed (unless I missed it?)

to make it easier to find: the link is here, and says:

> Are there any small code samples we can add to drive this point home? I think still we would make a runtime determination whether to return a pyarrow or numpy-backed object even if both are installed, no?


- Removing redundant functionality:
  - fastparquet engine in `read_parquet`
  - potentially simplifying the `read_csv` logic (needs more investigation)

- Avoiding NumPy `object` data types by default for analogous types that have native PyArrow support, such as:
  - decimal
  - binary
  - nested types (list or dict data)
  - strings
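
As promised above, a hypothetical sketch of the runtime guard pattern that a required pyarrow would remove (illustrative only, not pandas' actual internal code):

```python
import numpy as np
import pandas as pd

# Today: every pyarrow-dependent code path sits behind a runtime check
try:
    import pyarrow as pa

    HAS_PYARROW = True
except ImportError:
    HAS_PYARROW = False


def inferred_string_dtype():
    """Dtype used for inferred string data (hypothetical helper)."""
    if HAS_PYARROW:  # this branch disappears once pyarrow is required
        return pd.ArrowDtype(pa.string())
    return np.dtype(object)
```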

Out of that group of types, strings offer the most advantages for users. They use significantly less memory and are faster:
**Comment (Member):**

Haven't kept up with this, but how are the plans to add the new numpy string dtype (xref #47884 ) going to affect the rationale here?

I would assume performance of the numpy string dtype would be on par with the pyarrow one.

**Comment (Member):**

this is still years away #52711 (comment)

I can't remember the perf comparison - @ngoldbaum do you want to comment here?

**Comment (Contributor):**

The linked comment said that numpy strings are available "within a year or so".
This does not seem dissimilar to the pandas 3.0 release date now proposed here?

**Comment (Member):**

I interpreted that as "ready within numpy" - adding in extra time to make them available in pandas, plus accounting for Hofstadter's Law, "years away" seems realistic

> Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.

(Nathan - we discussed timelines before, but I didn't write them down so have forgotten them, apologies)

**Comment (Contributor):**

> this is still years away

I hope it doesn't take that long!

> (Nathan - we discussed timelines before, but I didn't write them down so have forgotten them, apologies)

The earliest pandas could officially support the dtype I'm working on is after the release of NumPy 2.0 - currently scheduled for January 2024. This assumes the new dtype API is available for downstream use in NumPy 2.0 without needing to set an environment variable. I'm hoping to start shipping experimental support in pandas behind the environment variable after NumPy 1.25 is released this summer, as that version of NumPy will hopefully have a version of the new dtype API that is usable for pandas' needs. The version in NumPy 1.24 is broken and is missing a lot of features we've added since that release, unfortunately.

**Comment (ngoldbaum, Contributor, Jul 3, 2023):**

> How does the ExtensionArray on your fork compare to the string[pyarrow] implementation?

The memory usage should be comparable with pyarrow strings. Both are storing UTF8 bytestreams internally. I don't know offhand if arrow uses the small string optimization (storing the string content in the space normally reserved for a pointer to the string). It's difficult to compare memory usage exactly since the operating system facilities for this only allow you to measure the peak memory usage of a process and not all allocations necessarily use Python's allocation tracking machinery. I'm hoping to do a more careful memory usage benchmark as part of the NEP I'm writing.

The main difference in the storage is that right now I'm using individual heap allocations for each string array entry. Arrow just does a single allocation for all the array entries and has a secondary array of offsets to find the data for each string array element. I've thought a bit about following that approach, but it would mean we would have to either disallow mutating string arrays or have pathological behavior where enlarging a single array element could cause the entire array to get reallocated. It would also be nice to be able to use the small string optimization; Arrow's approach with an array of offsets would make that more difficult.

For performance, do you mean for string manipulation operations like case folding or padding? In principle NumPy could add string ufuncs that would allow for fast implementations, but right now NumPy doesn't have a namespace for that. Currently, all the comparison operators are implemented as ufuncs, but no other string functionality is. There are string manipulation functions in the np.char namespace, but they just do a for loop over the array elements and call string functions on the scalars.

I don't want to promise that string ufuncs definitely will happen in the future, but there's no real technical blockers, just social ones. NumPy doesn't currently have any ufuncs that only make sense with string data, so some thought needs to go into where in the namespace they should go. It will also require a decent amount of implementation work to add the functions, although mostly just tedious C coding.

Overall the goal is to facilitate a straightforward transition from workflows that used object string arrays while enabling possible performance improvements in the future that are currently impossible with object string arrays.
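
For reference, the offsets-based layout described above is visible directly on a pyarrow string array (a small illustration added here, not part of the thread):

```python
import pyarrow as pa

arr = pa.array(["foo", "ba"])
# A string array carries a validity bitmap, an int32 offsets buffer,
# and one contiguous data buffer holding all the UTF-8 bytes
validity, offsets, data = arr.buffers()
print(data.to_pybytes())  # b'fooba'
```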

**Comment (Member):**

Generally, pandas is moving away from mutability in some sense (the CoW adoption), so that isn't very high on my priority list.

While a storage-efficient string dtype is nice, this is kind of pointless if the operations aren't fast from a pandas PoV. One of the biggest advantages of Arrow is that we can reduce memory, but also that most operations are significantly faster; depending on what you are doing, it can be an order of magnitude.

I am referring mostly to stuff like the str accessor but also things like factorization etc.

So even if NumPy strings are ready in around a year (or some other time period), that's not helpful for us as long as NumPy does not ship fast algorithms on top of it.

Sorry if this sounds harsh, that wasn't my intention. But having the string dtype without algorithms gets us only half the way compared to what PyArrow does, so this isn't a compelling argument to avoid making Arrow strings the default.

**Comment (Member):**

At a minimum, a fast regex engine could potentially help, as some of the str accessor functions were (maybe still are) implemented using regex for string[pyarrow] where the functions did not exist in PyArrow (or the minimum version supported at the time).

**Comment (Member):**

Are you suggesting to implement this in pandas? That's something I personally don't have any interest in doing, and I would also be at least -0 on adding it for the time being. Having this stuff in Arrow is nice since it reduces maintenance burden and also gives better test coverage, since more libraries will depend on it.

**Comment (Member):**

I suspect a regex engine would be implemented in NumPy, and then any str accessor functions not implemented in NumPy could be implemented using either regex or object fallback in pandas (just like we did for PyArrow initially).


**Performance:**

```python
import string
import random

import pandas as pd


def random_string() -> str:
    return "".join(random.choices(string.printable, k=random.randint(10, 100)))


ser_object = pd.Series([random_string() for _ in range(1_000_000)])
ser_string = ser_object.astype("string[pyarrow]")
```

PyArrow-backed strings are significantly faster than NumPy object strings:

*str.len*

```python
In[1]: %timeit ser_object.str.len()
118 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In[2]: %timeit ser_string.str.len()
24.2 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

*str.startswith*

```python
In[3]: %timeit ser_object.str.startswith("a")
136 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In[4]: %timeit ser_string.str.startswith("a")
11 ms ± 19.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Another advantage is I/O. PyArrow engines in pandas can provide a significant speedup. Currently, the data
are cast to NumPy dtypes after reading, which requires a round trip when converting back to PyArrow strings explicitly
and hinders performance.
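
A minimal sketch of avoiding that round trip today by opting in explicitly (assuming pandas >= 2.0; `"data.csv"` is a placeholder path):

```python
import pandas as pd

# Parse with the pyarrow engine and keep pyarrow-backed dtypes,
# instead of casting to NumPy and converting back afterwards
df = pd.read_csv("data.csv", engine="pyarrow", dtype_backend="pyarrow")
```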

**Memory:**

PyArrow-backed strings use significantly less memory. Dask developers investigated this [here](https://www.coiled.io/blog/pyarrow-strings-in-dask-dataframes).

Short summary: PyArrow strings required 1/3 of the original memory.
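
This is easy to check locally with the series from the performance setup above (a rough illustration; exact ratios depend on the data):

```python
# Compare the memory footprint of the two representations
mem_object = ser_object.memory_usage(deep=True)
mem_string = ser_string.memory_usage(deep=True)
print(mem_object / mem_string)  # object strings typically use around 3x more
```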


## Drawbacks

Including PyArrow would naturally increase the installation size of pandas. For example, when installing pandas and PyArrow
using pip from wheels, numpy and pandas are about `70MB`, and PyArrow is around `120MB`. This increase in installation size would
have negative implications when using pandas in space-constrained development or deployment environments such as AWS Lambda.

Additionally, if a user is installing pandas in an environment where wheels are not available through `pip install` or `conda install`,
the user will need to also build Arrow C++ and related dependencies when installing from source. These environments include:

- Alpine Linux (commonly used as a base for Docker containers)
- WASM (pyodide and pyscript)
- Python development versions

Lastly, pandas development and releases will need to be mindful of PyArrow's development and release cadence. For example, when
supporting a newly released Python version, pandas will need to wait for PyArrow wheels to be available for that Python version
before releasing a new pandas version.

### PDEP-10 History

- 17 April 2023: Initial version

[^1]: <https://pandas.pydata.org/docs/development/roadmap.html#apache-arrow-interoperability>
[^2]: <https://arrow.apache.org/powered_by/>