Improve support for pandas Extension Arrays (#10301) #10380
Conversation
Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
-def preprocess_types(t):
-    if isinstance(t, str | bytes):
-        return type(t)
-    elif isinstance(dtype := getattr(t, "dtype", t), np.dtype) and (
-        np.issubdtype(dtype, np.str_) or np.issubdtype(dtype, np.bytes_)
-    ):
+def maybe_promote_to_variable_width(
+    array_or_dtype: np.typing.ArrayLike | np.typing.DTypeLike,
+) -> np.typing.ArrayLike | np.typing.DTypeLike:
+    if isinstance(array_or_dtype, str | bytes):
+        return type(array_or_dtype)
+    elif isinstance(
+        dtype := getattr(array_or_dtype, "dtype", array_or_dtype), np.dtype
+    ) and (np.issubdtype(dtype, np.str_) or np.issubdtype(dtype, np.bytes_)):
         # drop the length from numpy's fixed-width string dtypes, it is better to
         # recalculate
         # TODO(keewis): remove once the minimum version of `numpy.result_type` does this
         # for us
         return dtype.type
     else:
-        return t
+        return array_or_dtype
This diff looks ugly, but it's simply renaming the fn + its argument.
Note that this does not promote to variable-width strings (which would be the new string dtype, numpy.dtypes.StringDType) but rather drops the width of an existing fixed-width string dtype to force numpy to recalculate the width. The aim is to avoid truncating a Python string object.
Additionally, it might be good to keep the "type" in the name of the function, since it only operates on string dtypes.
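A minimal sketch of the truncation hazard being avoided, using only standard numpy behavior (not code from this PR):

import numpy as np

sized = np.dtype("U2")
print(sized.type)  # <class 'numpy.str_'> -- the fixed width is dropped

# assigning a longer Python string into a fixed-width array silently truncates:
a = np.empty(1, dtype=sized)
a[0] = "abcdef"
print(a[0])  # 'ab'

# with the width dropped, result_type recalculates it from the inputs:
print(np.result_type(sized.type, np.array(["abcdef"])))  # dtype('<U6')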
except TypeError:
    # passing individual objects to xp.result_type means NEP-18 implementations won't have
    # a chance to intercept special values (such as NA) that numpy core cannot handle
    pass
Only this except block is new. The rest of this fn was lifted as-is from result_type() below.
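To make that concrete, here is a minimal NEP-18 sketch (the class and its promotion rule are hypothetical, not from this PR) showing why a handler only fires when given the array itself rather than individual objects pulled out of it:

import numpy as np

class NAAwareArray:
    # hypothetical duck array with a custom promotion rule
    def __init__(self, values):
        self.values = np.asarray(values, dtype=object)

    def __array_function__(self, func, types, args, kwargs):
        # np.result_type dispatches here only when handed this array;
        # unwrapping it into individual scalars bypasses the hook entirely
        if func is np.result_type:
            return np.dtype(object)
        return NotImplemented

print(np.result_type(NAAwareArray([1, None])))  # object, via our handler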
if any(is_extension_array_dtype(x) for x in scalars_or_arrays):
    extension_array_types = [
        x.dtype for x in scalars_or_arrays if is_extension_array_dtype(x)
    ]
    if len(extension_array_types) == len(scalars_or_arrays) and all(
        isinstance(x, type(extension_array_types[0])) for x in extension_array_types
    ):
        return scalars_or_arrays
    raise ValueError(
        "Cannot cast arrays to shared type, found"
        f" array types {[x.dtype for x in scalars_or_arrays]}"
    )
Because we now provide an array-api implementation of np.result_type, we no longer need these special cases (which were far too special, IMO; the cases where we raised ValueError are perfectly valid).
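For context, a small example of a mixture the removed branch rejected (one extension array plus one plain numpy array), even though pandas itself can find a common dtype for it; Int64 is the expected outcome on current pandas:

import numpy as np
import pandas as pd

# not *all* inputs are extension arrays, so the old code raised ValueError
mixed = [pd.array([1, 2], dtype="Int64"), np.array([3, 4], dtype="int64")]

# yet pandas promotes this combination to a shared nullable dtype just fine:
print(pd.concat([pd.Series(x) for x in mixed]).dtype)  # Int64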
Thanks for taking this on, it's great work!
Can you extract out any non-EA changes to duck_array_ops.py and dtypes.py and make a separate PR please? It will be far easier to review then.
I am happy to open a separate PR (and will for other reasons); this part is actually pretty clean in terms of what it does, but there are other issues, so I will open one.
@dcherian I would like to hear from @richard-berg about this. Aside from these new ops, everything else in this PR appears to be directly relevant.
By using pandas (maybe private) machinery directly, this PR seems to handle a much more broad set of use-cases than what we used to, and integrates with existing xarray functionality internally in a valid way (via duck array ops / NEP 18). I never felt confident enough/it necessary to go "this deep," especially with duck array ops, but I think this change is super nice and clean and really makes things clear so I'm a fan.
The big question is what the status of this internal pandas machinery is.
I'll give @richard-berg a day or two to reply, and then start fiddling around with the PR a bit to confirm some of these things myself.
Big thanks @richard-berg because this is quite clean and nice!
from pandas.api.types import is_scalar as pd_is_scalar
from pandas.core.dtypes.astype import astype_array_safe
from pandas.core.dtypes.cast import find_result_type
from pandas.core.dtypes.concat import concat_compat
So one reason I was not aware of some of these functions is that they are internal to pandas and do not appear in the public documentation.
What is the policy on this? Are these functions "public" in some other sense in terms of stability?
Preferably not, if we can avoid it. I could see it being unavoidable here, though. Is there an upstream policy around public API for extension arrays?
I don't see any tests for the behavior it is intended to catch, i.e., type promotion, so I'm not entirely sure whether the difference between this PR and the one I have (#10304) is meaningful in this sense, although I could imagine concatenating an int64 and an int32 yielding (ideally) an int64. I'll try to come up with a test.
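A sketch of such a test (the names and the expected promotion are assumptions, not code from either PR):

import pandas as pd
import xarray as xr

def test_concat_extension_array_promotion() -> None:
    # concatenating Int32 with Int64 should ideally promote to Int64
    # instead of falling back to object
    a = pd.Series(pd.array([1, 2], dtype="Int32"), index=[0, 1]).to_xarray()
    b = pd.Series(pd.array([3, 4], dtype="Int64"), index=[2, 3]).to_xarray()
    combined = xr.concat([a, b], dim="index")
    assert combined.dtype == pd.Int64Dtype()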
Is there an upstream policy around public API for extension arrays?
I have never seen anything, and these functions are not array-api specific (although I guess most everything in pandas is at this point). Maybe I'm unaware of something, but the question isn't specific enough. What sort of policy would you envision finding? Rules about concatenation/type casting?
from pandas.api.types import is_scalar as pd_is_scalar
from pandas.core.dtypes.astype import astype_array_safe
from pandas.core.dtypes.cast import find_result_type
from pandas.core.dtypes.concat import concat_compat
Can we rely on these functions being present? Or are they considered private API?
Can we rely on these functions being present? Or are they considered private API?
I think it is general python practice to say "documented = public" and these are not documented in the public pandas API.
I was just asking if you guys knew anything because pandas maintainers lurk around these parts and might have had something different to say.
I try not to be too strong in my statements because I don't know everything, especially about other people's libraries, but my instinct here was simply "these are not public, so we probably shouldn't use them" (see https://peps.python.org/pep-0008/#public-and-internal-interfaces as a reference).
I was just leaving the possibility open that maybe they knew something, or perhaps had an issue, that would contradict my "documented = public" assumption in this specific case.
    or not subok
    or order != "K"
):
    return NotImplemented
        raise TypeError(f"{array} is not a pandas ExtensionArray.")
    self.array = array

    self._add_ops_dunders()

def _add_ops_dunders(self):
This is unrelated to the fix for reindex, I believe?
elif is_extension_array_dtype(dtype):
    # data may or may not be an ExtensionArray, so we can't rely on
    # np.asarray to call our NEP-18 handler; gotta hook it ourselves
    converted = PandasExtensionArray(as_extension_array(data, dtype))
Suggested change:
-    converted = PandasExtensionArray(as_extension_array(data, dtype))
+    converted = PandasExtensionArray(np.asarray(data, dtype))

Isn't this what as_extension_array is meant for, really, i.e., to be used by the suggested API above?
@@ -29,6 +54,97 @@ def __extension_duck_array__issubdtype(
     return False  # never want a function to think a pandas extension dtype is a subtype of numpy


 @implements("astype")  # np.astype was added in 2.1.0, but we only require >=1.24
What does this mean practically for xarray? Do we need a shim here?
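If a shim were needed, a minimal version-guarded fallback (a sketch, not code from this PR) could look like:

import numpy as np

if hasattr(np, "astype"):  # newer numpy exposes the array-api function
    def astype(x, dtype, /, *, copy=True):
        return np.astype(x, dtype, copy=copy)
else:  # older numpy: fall back to the method every ndarray already has
    def astype(x, dtype, /, *, copy=True):
        return x.astype(dtype, copy=copy)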
My biggest comment is that it appears the reindexing test was not working. For example, __extension_duck_array__result_type was not filtering out NaNs from the scalars, and categoricals cannot have NaNs as a type. So I think this needs more testing.
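For reference, the constraint being described, shown with standard pandas behavior:

import numpy as np
import pandas as pd

# NaN may appear as a *value* (a missing entry) in a categorical...
print(pd.Categorical(["a", np.nan]).categories)  # Index(['a'], dtype='object')

# ...but never as a category itself:
try:
    pd.CategoricalDtype(categories=["a", np.nan])
except ValueError as e:
    print(e)  # Categorical categories cannot be null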
    assert x.index.dtype == y.index.dtype == np.dtype("int64")


def test_reindex_categorical(self) -> None:
    index1 = pd.Categorical(["a", "b", "c"])
I believe this test was actually going through a different codepath than the one intended. Categorical indices are handled differently, if I'm not mistaken.
At the moment:

@pytest.mark.parametrize(
    "fill_value,extension_array",
    [
        pytest.param("a", pd.Categorical([pd.NA, "a", "b"]), id="categorical"),
    ],
)
def test_fillna_extension_array(self, fill_value, extension_array) -> None:
    srs: pd.Series = pd.Series(index=np.array([1, 2, 3]), data=extension_array)
    da = srs.to_xarray()
    filled = da.fillna(fill_value)
    assert filled.dtype == srs.dtype
    assert (filled.values == np.array([fill_value, *(srs.values[1:])])).all()
fails on this branch, because the current code only handles the types of scalars and we really need the scalars themselves for result_type; otherwise we end up casting to object too aggressively.
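numpy's own promotion rules show the same scalar-vs-type distinction, which may help illustrate the point:

import numpy as np

# the scalar *value* participates in promotion and fits into uint8...
print(np.result_type(np.uint8, 1))  # uint8

# ...but its *type* alone forces the wider cast:
print(np.result_type(np.uint8, np.int64))  # int64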
I'm working on a branch now that has some more tests, is a bit more minimal, and doesn't use private APIs.
The core ideas here are:
- np.result_type(*arrays_or_dtypes). This unlocks arbitrary N-ary operations on ExtensionArrays without loss of type info (as found in pre-2024 releases) or blowing up due to lack of EA-specific implementations (as documented in Regression in DataArrays created from Pandas #10301); see the sketch below.
- pd.Series
Minor refactors & bugfixes are documented inline.
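As a hedged sketch of what the first idea enables (the Int64 outcome is the intended behavior, not a guarantee):

import pandas as pd

# a nullable extension array survives an operation that needs result_type,
# instead of degrading to object
s = pd.Series(pd.array([1, pd.NA, 3], dtype="Int64"), index=[0, 1, 2])
da = s.to_xarray()
print(da.fillna(2).dtype)  # ideally Int64, not object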
- whats-new.rst
- api.rst