Unused categories leads to wrong color and legend #189

hadim · 2025-02-26T16:03:51Z

When a column contains unused categories, the scatter plot colors and legend are wrong. Calling `df["cat"].cat.remove_unused_categories()` on the column resolves the issue, and I wonder whether jscatter should call it as well.

See the code to reproduce:

import pandas as pd
import jscatter
import numpy as np


def keep_largest_categories(column: pd.Series, threshold: int) -> pd.Series:
    """Keep only the categories whose value count is at least `threshold`; set the rest to NaN."""

    # 0. Make a copy of the column
    column = column.copy()

    # 1. Calculate the value counts for the specified column
    category_counts = column.value_counts()

    # 2. Identify categories with counts below the threshold
    low_count_categories = category_counts[category_counts < threshold].index

    # 3. Create a boolean mask to identify rows where the category is in low_count_categories
    mask = column.isin(low_count_categories)

    # 4. Use the mask to set values in the specified column to NaN
    column[mask] = None

    return column


n = 50
categories = [f"cat_{i}" for i in range(300)]
categories = pd.Categorical(categories)

df = pd.DataFrame({
    "x": np.random.rand(n),
    "y": np.random.rand(n),
    "cat": np.random.choice(categories, size=n),
})

df["cat"] = df["cat"].astype("category")

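# "cat" is sampled from 300 possible categories
# now keep only the categories that occur at least twice; rarer values become NaN,
# but their categories stay on the column as unused categories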
df["cat"] = keep_largest_categories(df["cat"], 2)

# if you don't remove the unused categories, the colors and legend will be wrong
# df["cat"] = df["cat"].cat.remove_unused_categories()

scatter = jscatter.Scatter(
    data=df,
    x="x",
    y="y",
    color_by="cat",
    size=10,
    legend=True,
    tooltip=True,
    tooltip_properties=["cat"],
    tooltip_size="large",
    height=200,
    width=500,
    legend_size="large",
)

scatter.show()

without `df["cat"].cat.remove_unused_categories()`

with `df["cat"].cat.remove_unused_categories()`
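
For context, here is a minimal pandas-only sketch (no jscatter involved) of what "unused categories" means: after values are masked out, the declared categories and their integer codes stay as they were until `remove_unused_categories()` is called. The toy column is hypothetical, and whether jscatter derives its colors and legend from the declared categories is an assumption, not something confirmed in this thread.

```python
import pandas as pd

# Hypothetical toy column: three declared categories, "b" is never used.
s = pd.Series(pd.Categorical(["a", "c", "c", None], categories=["a", "b", "c"]))

declared = list(s.cat.categories)   # still includes the unused "b"
codes = s.cat.codes.tolist()        # integer codes; -1 marks missing values

# remove_unused_categories() returns a new Series with re-numbered codes.
trimmed = s.cat.remove_unused_categories()
trimmed_declared = list(trimmed.cat.categories)
trimmed_codes = trimmed.cat.codes.tolist()

print(declared, codes)
print(trimmed_declared, trimmed_codes)
```

Note that the codes of the used categories have a gap before trimming and are contiguous afterwards.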

hadim · 2025-02-26T16:16:42Z

Note that applying remove_unused_categories() to the column and assigning the result back, as in the workaround above, changes the input dataframe, which might not be the desired behavior.

Ideally, the coloring should work without having to call remove_unused_categories().
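
Until that is the case, one possible workaround is to trim the categories on a copy that is used only for plotting. This is a sketch under stated assumptions: `without_unused_categories` is a hypothetical helper name, not part of jscatter or pandas, and the toy dataframe stands in for the one from the reproduction.

```python
import pandas as pd


def without_unused_categories(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a new DataFrame whose `column` has its unused categories dropped.

    Hypothetical helper for illustration; the caller's DataFrame is left as-is.
    """
    # DataFrame.assign returns a new DataFrame instead of modifying df in place.
    return df.assign(**{column: df[column].cat.remove_unused_categories()})


# Toy data: "b" is declared as a category but never used.
df = pd.DataFrame(
    {"cat": pd.Categorical(["a", "c", None], categories=["a", "b", "c"])}
)
plot_df = without_unused_categories(df, "cat")

print(list(df["cat"].cat.categories))       # original column keeps all categories
print(list(plot_df["cat"].cat.categories))  # plotting copy has only the used ones
```

`plot_df` can then be passed as `data=` to `jscatter.Scatter` in the reproduction above, while `df` keeps its original categories.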

flekschas added the bug (Something isn't working) label on Feb 26, 2025
