Fixes for large datasets #229

Merged: 10 commits merged into Nixtla:main on Sep 7, 2023

Conversation

@mcsqr (Contributor) commented Aug 23, 2023

Some minor but somewhat helpful changes:

  • It avoids invoking S.values, which creates a dense matrix from S when sparse_s = True in the aggregate function. This prevents OOM errors on very large datasets (after the change I've used the aggregate function successfully on a dataset of 400k time series). A sketch of the memory issue follows this list.
  • It also removes a warning caused by the deprecated sparse argument, while keeping backwards compatibility with older scikit-learn versions.
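
For context, here is a minimal sketch of the memory issue (the pandas setup is illustrative, not the library's actual code):

import numpy as np
import pandas as pd

# A tall, mostly-zero summing matrix stored with a sparse dtype,
# standing in for S when sparse_s = True.
S = pd.DataFrame(np.eye(4, dtype=np.float32)).astype(pd.SparseDtype("float32", 0.0))

dense = S.values         # materializes the full dense array: OOM risk at 400k series
coo = S.sparse.to_coo()  # stays sparse (scipy COO); memory scales with the nonzeros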

Excluded from this PR:

  • It adds a warning whenever Y_df contains fewer time series than S_df at the end of the aggregate function, since the current behaviour is quite unintuitive and confuses users (a sketch of the check follows this list).
    [EDIT: Ah, OK, it seems this behaviour was introduced by this PR: [FIX] Aggregate unbalanced datasets #190; see also this issue: Broken aggregate function in main branch #211. What was the reason again for moving the line Y_bottom_df = Y_bottom_df.groupby(['unique_id', 'ds'])['y'].sum().reset_index() two lines up?]
    [EDIT 2: Perhaps there should be an additional argument to choose whether or not to prepend the shorter time series with zeros?]
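
For illustration, a hedged sketch of the kind of check behind that warning (the helper name and message are mine, not the library's):

import warnings
import pandas as pd

def warn_if_unbalanced(Y_df: pd.DataFrame, S_df: pd.DataFrame) -> None:
    # Hypothetical helper: S_df has one row per series in the full hierarchy,
    # so fewer ids in Y_df means some series were implicitly zero-filled.
    n_y = Y_df['unique_id'].nunique()
    n_s = len(S_df)
    if n_y < n_s:
        warnings.warn(
            f'Y_df contains {n_y} series but S_df defines {n_s}; '
            'the missing series are padded with zeros.'
        )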


@AzulGarza AzulGarza self-requested a review August 25, 2023 17:40
@jmoralez (Member) commented Sep 5, 2023

Hey. Sorry for taking so long to review, I'll take a look now.

The errors weren't exactly related to Python 3.7 but rather to scikit-learn >= 1.2, which was only released in Dec 2022, too recently for us to drop support for older versions. Can you please use something like the following instead?

try:
    # sklearn >= 1.2 renamed the `sparse` argument to `sparse_output`
    encoder = OneHotEncoder(categories=categories, sparse_output=sparse_s, dtype=np.float32)
except TypeError:  # sklearn < 1.2 only accepts the old `sparse` keyword
    encoder = OneHotEncoder(categories=categories, sparse=sparse_s, dtype=np.float32)
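
As a quick sanity check, here is a self-contained version of that fallback (the categories and sparse_s values are illustrative stand-ins for the real arguments inside aggregate):

import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

categories = [np.array(['a', 'b'])]  # hypothetical category list
sparse_s = True                      # request sparse output

try:
    encoder = OneHotEncoder(categories=categories, sparse_output=sparse_s, dtype=np.float32)
except TypeError:  # sklearn < 1.2
    encoder = OneHotEncoder(categories=categories, sparse=sparse_s, dtype=np.float32)

S = encoder.fit_transform(np.array([['a'], ['b'], ['a']]))
assert sparse.issparse(S)  # the output stays sparse on either code path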

@mcsqr (Contributor, Author) commented Sep 6, 2023

Hi Jose, yeah, thanks for the tip; done.

Note: the biggest open point of discussion is the last bullet in the PR description. The change you (Fede) introduced in PR #190 seems very unintuitive to users. I'm not sure what the right way to handle this is, but there are open issues about it and I've also had complaints from my collaborators. The warning I added is just a small plaster on a big wound. It could also be taken out of this PR if desired.

@jmoralez (Member) left a comment

Just some minor comments. Also, can you please revert the changes to Python 3.7? The library is still compatible with it and we want to support it for a little bit longer.

Three review threads on hierarchicalforecast/utils.py (two now outdated; all resolved).
@mcsqr (Contributor, Author) commented Sep 7, 2023

@jmoralez I think that's all done.

@jmoralez (Member) left a comment

Thanks a lot! LGTM

@jmoralez merged commit af34f0e into Nixtla:main on Sep 7, 2023
14 checks passed