
Remove unused levels from categorical unique_id #473

Merged: 1 commit merged into Nixtla:main on Apr 21, 2023

Conversation

@nickto (Contributor) commented Apr 21, 2023

Summary

Check if `unique_id` is categorical and drop unused levels.
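A minimal sketch of what such a check could look like (the helper name `drop_unused_id_levels` and the assumption that `unique_id` is the DataFrame index are illustrative, not the actual statsforecast implementation):

```python
import pandas as pd

def drop_unused_id_levels(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: if the unique_id index is categorical,
    drop levels that no longer appear in the data."""
    if isinstance(df.index.dtype, pd.CategoricalDtype):
        df = df.copy()
        df.index = df.index.remove_unused_categories()
    return df
```

Non-categorical indices pass through unchanged, so the check is safe to apply unconditionally.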

Why?

If `unique_id` is categorical and some levels have been filtered out, `value_counts()` in `_grouped_array_from_df()` still returns those levels, but with counts of 0.

`_grouped_array_from_df()`, however, does not expect zero counts; it therefore makes sense to drop unused levels from a categorical index.
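This pandas behavior can be reproduced directly, independent of statsforecast:

```python
import pandas as pd

# A categorical Series remembers all declared levels even after filtering,
# so value_counts() reports the removed level with a count of 0.
s = pd.Series(["a", "a", "b"], dtype="category")
s = s[s != "b"]  # drop all rows for level "b"

print(s.value_counts())
# "b" still appears, with count 0

print(s.cat.remove_unused_categories().value_counts())
# after dropping unused levels, "b" is gone
```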

This situation arises, for example, when we

  1. Use a categorical `unique_id`, e.g., to save memory when the dataset is large.
  2. Sub-sample only some of the `unique_id` values, e.g., for faster development, because the dataset is large.

Feel free to use your judgment on whether this edge case is common enough to justify an additional validation.

Test

Example code that fails before the fix:

```python
from datetime import date, timedelta

import numpy as np
import pandas as pd

from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA

T = 100
N_GROUPS = 3
N_SAMPLE = 2

np.random.seed(2023)

# Generate data
df = pd.DataFrame({
    "unique_id": np.repeat([chr(ord("a") + i) for i in range(N_GROUPS)], T),
    "ds": [date(year=2023, month=1, day=1) + timedelta(days=i) for i in range(T)] * N_GROUPS,
    "y": np.random.normal(0, 1, (N_GROUPS * T)),
})

# Cast `unique_id` column to category type
df["unique_id"] = df["unique_id"].astype("category")

# Sample only some of the unique IDs
unique_ids = list(set(df["unique_id"]))
sampled_ids = np.random.choice(unique_ids, N_SAMPLE, replace=False)
df = df.loc[df["unique_id"].isin(sampled_ids)]

# Forecast
sf = StatsForecast(
    df=df,
    models=[AutoARIMA()],
    freq="D",
)
sf.forecast(h=5)
```

Running it throws the following error, which does not help identify the cause of the problem:

ValueError: zero-size array to reduction operation minimum which has no identity
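The message itself comes from NumPy: a zero-count group yields an empty array slice, and calling a min reduction on it fails. A minimal reproduction, independent of statsforecast:

```python
import numpy as np

empty = np.array([])
try:
    empty.min()  # a reduction over a zero-size array has no identity element
except ValueError as e:
    print(e)  # zero-size array to reduction operation minimum which has no identity
```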


@CLAassistant commented Apr 21, 2023: All committers have signed the CLA.

@AzulGarza AzulGarza self-requested a review April 21, 2023 18:47
@AzulGarza (Member) commented:

Hey @nickto! This was a problem we had a long time ago. Thank you very much for solving it! 🙌

LGTM

@AzulGarza AzulGarza merged commit 672015c into Nixtla:main Apr 21, 2023
@AzulGarza (Member) commented:

@all-contributors please add @nickto for code

@allcontributors (Contributor) commented:

@FedericoGarza

I've put up a pull request to add @nickto! 🎉
