Unused categories leads to wrong color and legend #189

Open
hadim opened this issue Feb 26, 2025 · 1 comment
Labels
bug Something isn't working

hadim commented Feb 26, 2025

When a column contains unused categories, the scatter plot colors and legend are wrong. Calling df["cat"].cat.remove_unused_categories() resolves the issue. Should jscatter call it as well?

See the code to reproduce:

import pandas as pd
import jscatter
import numpy as np


def keep_largest_categories(column: pd.Series, threshold: int) -> pd.Series:
    """Keep only the categories whose value count is at least a threshold."""

    # 0. Make a copy of the column
    column = column.copy()

    # 1. Calculate the value counts for the specified column
    category_counts = column.value_counts()

    # 2. Identify categories with counts below the threshold
    low_count_categories = category_counts[category_counts < threshold].index

    # 3. Create a boolean mask to identify rows where the category is in low_count_categories
    mask = column.isin(low_count_categories)

    # 4. Use the mask to set values in the specified column to NaN
    column[mask] = None

    return column


n = 50
categories = [f"cat_{i}" for i in range(300)]
categories = pd.Categorical(categories)

df = pd.DataFrame({
    "x": np.random.rand(n),
    "y": np.random.rand(n),
    "cat": np.random.choice(categories, size=n),
})

df["cat"] = df["cat"].astype("category")

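# "cat" has one category per distinct sampled value (at most n of the 300)
# now keep only the categories that occur at least twice; the rest become NaN
# but remain registered as unused categories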
df["cat"] = keep_largest_categories(df["cat"], 2)

# if you don't remove the unused categories, the colors and legend will be wrong
# df["cat"] = df["cat"].cat.remove_unused_categories()

scatter = jscatter.Scatter(
    data=df,
    x="x",
    y="y",
    color_by="cat",
    size=10,
    legend=True,
    tooltip=True,
    tooltip_properties=["cat"],
    tooltip_size="large",
    height=200,
    width=500,
    legend_size="large",
)

scatter.show()

Without df["cat"].cat.remove_unused_categories():

[Image]

With df["cat"].cat.remove_unused_categories():

[Image]
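
For readers who want to see the pandas side of this in isolation, here is a minimal sketch using a toy Series (not the data above). It only shows that masked-out values leave their categories behind and that the integer codes shift once unused categories are dropped. How jscatter maps categories to colors internally is not shown in this issue, so no claim is made about that here.

```python
import pandas as pd

# Three declared categories; after masking, only two are still in use.
s = pd.Series(["a", "b", "c", "c"], dtype="category")
s[s == "b"] = None  # "b" no longer occurs, but it stays a declared category

declared = list(s.cat.categories)   # categories registered on the dtype
used = list(s.dropna().unique())    # values that actually occur
codes_before = s.cat.codes.tolist() # codes index into the declared categories

# remove_unused_categories() returns a new Series and leaves s untouched
cleaned = s.cat.remove_unused_categories()
codes_after = cleaned.cat.codes.tolist()

print(declared, used)
print(codes_before, codes_after)
```

Missing values get the code -1, and "c" moves from code 2 to code 1 once "b" is dropped.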
hadim commented Feb 26, 2025

Note that calling remove_unused_categories() and assigning the result back to the column modifies the input dataframe, which might not be the desired behavior.

Ideally, the coloring should work without having to call remove_unused_categories().
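
Until then, one possible workaround that leaves the input dataframe alone is to trim the categories on a derived dataframe. This is a sketch with a toy dataframe; `plot_df` is a name introduced here for illustration, and it relies only on `DataFrame.assign` returning a new dataframe.

```python
import pandas as pd

# Toy dataframe whose "cat" column declares a category ("b") that never occurs.
df = pd.DataFrame({
    "x": [0.1, 0.2, 0.3],
    "y": [0.3, 0.2, 0.1],
    "cat": pd.Categorical(["a", "c", "c"], categories=["a", "b", "c"]),
})

# assign() builds a new dataframe, so df keeps its original categories.
plot_df = df.assign(cat=df["cat"].cat.remove_unused_categories())

print(list(df["cat"].cat.categories))       # original column, untouched
print(list(plot_df["cat"].cat.categories))  # trimmed column for plotting
```

`plot_df` can then be passed as `data=` to `jscatter.Scatter` in place of `df`, as in the reproduction above.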

flekschas added the bug label Feb 26, 2025