Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use dataframe.groupby instead of iterating #61

Merged
merged 17 commits into from
Dec 2, 2019
Merged

Conversation

jvlmdr
Copy link
Collaborator

@jvlmdr jvlmdr commented Nov 30, 2019

I noticed that iteratively selecting rows from the dataframe was a serious bottleneck.

It looks like someone was already investigating this. I have removed the use of the cached analysis and the lines which computed timings.

I isolated the code for extracting counts and added a benchmark (and a dependency on pytest-benchmark).

Before:

--------------------------------------------------------- benchmark: 1 tests ---------------------------------------------------------
Name (time in s)                                  Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_extract_counts_from_df_map     15.4156  16.1166  15.6762  0.3507  15.4331  0.6114       1;0  0.0638       5           1
--------------------------------------------------------------------------------------------------------------------------------------

After (time in ms not s):

------------------------------------------------------------ benchmark: 1 tests ------------------------------------------------------------
Name (time in ms)                                  Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_extract_counts_from_df_map     146.5993  209.5080  175.8946  22.3510  174.9131  17.7610       2;0  5.6852       5           1
--------------------------------------------------------------------------------------------------------------------------------------------

@jvlmdr jvlmdr marked this pull request as ready for review November 30, 2019 16:15
@jvlmdr
Copy link
Collaborator Author

jvlmdr commented Nov 30, 2019

Sorry about the large number of commits. When the pull request was merged, the commits were flattened. I only pulled this collapsed-commit after doing the work. If you like, I can try to rebase it?

@jvlmdr jvlmdr closed this Nov 30, 2019
@jvlmdr
Copy link
Collaborator Author

jvlmdr commented Nov 30, 2019

Opening new pull request with rebased branch

@jvlmdr jvlmdr reopened this Nov 30, 2019
@cheind
Copy link
Owner

cheind commented Dec 2, 2019

Very cool, Jack! You are driving this project :) I guess we should think about making you a maintainer, since the time I can spend on professionally on this project has become quite limited. Interested?

@cheind cheind merged commit 382caea into cheind:develop Dec 2, 2019
@jvlmdr
Copy link
Collaborator Author

jvlmdr commented Dec 2, 2019

Thanks! Yep, I think I would be able to do that. Let's talk via email.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants