Fixes for large datasets #229

Merged: 10 commits merged into Nixtla:main on Sep 7, 2023

Conversation

@mcsqr (Contributor) commented Aug 23, 2023

Some minor but somewhat helpful changes:

  • It avoids invoking S.values, which creates a dense matrix from S when sparse_s = True in the aggregate function. This prevents OOM errors on very large datasets (after the change I've used the aggregate function successfully on a dataset of 400k time series). A sketch of the memory issue follows this list.
  • It also removes a warning caused by the deprecated sparse argument, while keeping backwards compatibility with older scikit-learn versions.
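
For context, here is a minimal sketch of the memory issue (the pandas setup is illustrative, not the library's actual code):

import numpy as np
import pandas as pd

# A tall, mostly-zero summing matrix stored with a sparse dtype,
# standing in for S when sparse_s = True.
S = pd.DataFrame(np.eye(4, dtype=np.float32)).astype(pd.SparseDtype("float32", 0.0))

dense = S.values         # materializes the full dense array: OOM risk at 400k series
coo = S.sparse.to_coo()  # stays sparse (scipy COO); memory scales with the nonzeros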

Excluded from this PR:

  • It adds a warning whenever Y_df contains fewer time series than S_df at the end of the aggregate function, since the current behaviour is quite unintuitive and confuses users (a sketch of the check follows this list).
    [EDIT: Ah, OK, it seems this behaviour was introduced by this PR: [FIX] Aggregate unbalanced datasets #190; see also this issue: Broken aggregate function in main branch #211. What was the reason again for moving the line Y_bottom_df = Y_bottom_df.groupby(['unique_id', 'ds'])['y'].sum().reset_index() two lines up?]
    [EDIT 2: Perhaps there should be an additional argument to choose whether or not to prepend the shorter time series with zeros?]
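
For illustration, a hedged sketch of the kind of check behind that warning (the helper name and message are mine, not the library's):

import warnings
import pandas as pd

def warn_if_unbalanced(Y_df: pd.DataFrame, S_df: pd.DataFrame) -> None:
    # Hypothetical helper: S_df has one row per series in the full hierarchy,
    # so fewer ids in Y_df means some series were implicitly zero-filled.
    n_y = Y_df['unique_id'].nunique()
    n_s = len(S_df)
    if n_y < n_s:
        warnings.warn(
            f'Y_df contains {n_y} series but S_df defines {n_s}; '
            'the missing series are padded with zeros.'
        )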


@AzulGarza AzulGarza self-requested a review August 25, 2023 17:40
@jmoralez (Member) commented Sep 5, 2023

Hey. Sorry for taking so long to review, I'll take a look now.

The errors weren't exactly related to Python 3.7 but rather to scikit-learn >= 1.2, which was only released in Dec 2022, too recently for us to drop support for older versions. Can you please use something like the following instead?

try:
    # sklearn >= 1.2 renamed the `sparse` argument to `sparse_output`
    encoder = OneHotEncoder(categories=categories, sparse_output=sparse_s, dtype=np.float32)
except TypeError:  # sklearn < 1.2 only accepts the old `sparse` keyword
    encoder = OneHotEncoder(categories=categories, sparse=sparse_s, dtype=np.float32)
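
As a quick sanity check, here is a self-contained version of that fallback (the categories and sparse_s values are illustrative stand-ins for the real arguments inside aggregate):

import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

categories = [np.array(['a', 'b'])]  # hypothetical category list
sparse_s = True                      # request sparse output

try:
    encoder = OneHotEncoder(categories=categories, sparse_output=sparse_s, dtype=np.float32)
except TypeError:  # sklearn < 1.2
    encoder = OneHotEncoder(categories=categories, sparse=sparse_s, dtype=np.float32)

S = encoder.fit_transform(np.array([['a'], ['b'], ['a']]))
assert sparse.issparse(S)  # the output stays sparse on either code path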

@mcsqr (Contributor, Author) commented Sep 6, 2023

Hi Jose, yeah, thanks for the tip; done.

Note: the biggest open point of discussion is the last bullet in the PR description. The change you (Fede) introduced in PR #190 seems very unintuitive to users. I'm not sure what the right way to handle this is, but there are open issues about it and I've also had complaints from my collaborators. The warning I added is just a small plaster on a big wound. It could also be taken out of this PR if desired.

@jmoralez (Member) left a comment

Just some minor comments. Also, can you please revert the changes to Python 3.7? The library is still compatible with it and we want to support it for a little bit longer.

Three review threads on hierarchicalforecast/utils.py (two now outdated; all resolved).
@mcsqr (Contributor, Author) commented Sep 7, 2023

@jmoralez I think that's all done.

@jmoralez (Member) left a comment

Thanks a lot! LGTM

@jmoralez merged commit af34f0e into Nixtla:main on Sep 7, 2023
14 checks passed