
Remove unused levels from categorical unique_id #473

Merged: 1 commit merged into Nixtla:main on Apr 21, 2023

Conversation

@nickto (Contributor) commented Apr 21, 2023

Summary

Check if `unique_id` is categorical and drop unused levels.
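A minimal sketch of what such a check could look like (the helper name `drop_unused_id_levels` and the assumption that `unique_id` is the DataFrame index are illustrative, not the actual statsforecast implementation):

```python
import pandas as pd

def drop_unused_id_levels(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: if the unique_id index is categorical,
    drop levels that no longer appear in the data."""
    if isinstance(df.index.dtype, pd.CategoricalDtype):
        df = df.copy()
        df.index = df.index.remove_unused_categories()
    return df
```

Non-categorical indices pass through unchanged, so the check is safe to apply unconditionally.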

Why?

If `unique_id` is categorical and some levels have been filtered out, `value_counts()` in `_grouped_array_from_df()` still returns those levels, but with counts of 0.

`_grouped_array_from_df()`, however, does not expect zero counts; it therefore makes sense to drop unused levels from a categorical index.
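This pandas behavior can be reproduced directly, independent of statsforecast:

```python
import pandas as pd

# A categorical Series remembers all declared levels even after filtering,
# so value_counts() reports the removed level with a count of 0.
s = pd.Series(["a", "a", "b"], dtype="category")
s = s[s != "b"]  # drop all rows for level "b"

print(s.value_counts())
# "b" still appears, with count 0

print(s.cat.remove_unused_categories().value_counts())
# after dropping unused levels, "b" is gone
```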

This situation arises, for example, when we

  1. Use a categorical `unique_id`, e.g., to save memory when the dataset is large.
  2. Sub-sample only some of the `unique_id` values, e.g., for faster development, because the dataset is large.

Feel free to use your judgment on whether this edge case is common enough to justify an additional validation.

Test

Example code that fails before the fix:

```python
from datetime import date, timedelta

import numpy as np
import pandas as pd

from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA

T = 100
N_GROUPS = 3
N_SAMPLE = 2

np.random.seed(2023)

# Generate data
df = pd.DataFrame({
    "unique_id": np.repeat([chr(ord("a") + i) for i in range(N_GROUPS)], T),
    "ds": [date(year=2023, month=1, day=1) + timedelta(days=i) for i in range(T)] * N_GROUPS,
    "y": np.random.normal(0, 1, (N_GROUPS * T)),
})

# Cast `unique_id` column to category type
df["unique_id"] = df["unique_id"].astype("category")

# Sample only some of the unique IDs
unique_ids = list(set(df["unique_id"]))
sampled_ids = np.random.choice(unique_ids, N_SAMPLE, replace=False)
df = df.loc[df["unique_id"].isin(sampled_ids)]

# Forecast
sf = StatsForecast(
    df=df,
    models=[AutoARIMA()],
    freq="D",
)
sf.forecast(h=5)
```

Running it throws the following error, which does not help identify the cause of the problem:

ValueError: zero-size array to reduction operation minimum which has no identity
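The message itself comes from NumPy: a zero-count group yields an empty array slice, and calling a min reduction on it fails. A minimal reproduction, independent of statsforecast:

```python
import numpy as np

empty = np.array([])
try:
    empty.min()  # a reduction over a zero-size array has no identity element
except ValueError as e:
    print(e)  # zero-size array to reduction operation minimum which has no identity
```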


@CLAassistant commented Apr 21, 2023: All committers have signed the CLA.

@AzulGarza AzulGarza self-requested a review April 21, 2023 18:47
@AzulGarza (Member) commented:

Hey @nickto! This was a problem we had a long time ago. Thank you very much for solving it! 🙌

LGTM

@AzulGarza AzulGarza merged commit 672015c into Nixtla:main Apr 21, 2023
@AzulGarza (Member) commented:

@all-contributors please add @nickto for code

@allcontributors (Contributor) commented:

@FedericoGarza

I've put up a pull request to add @nickto! 🎉
