Use dataframe.groupby instead of iterating #61

jvlmdr · 2019-11-30T16:15:11Z

I noticed that iteratively selecting rows from the dataframe was a serious bottleneck.
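To illustrate the kind of change involved, here is a minimal sketch (not the code from this pull request) that counts rows per (object, hypothesis) pair in two ways: by selecting rows with a boolean mask for every candidate pair, and with a single `groupby()` call. The column names `OId` and `HId`, the sample data, and both function names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical event table: one row per (object ID, hypothesis ID) match.
events = pd.DataFrame({
    "OId": [1, 1, 2, 2, 2, 3],
    "HId": [10, 10, 20, 21, 20, 30],
})


def count_pairs_iterative(df):
    """Slow variant: one boolean-mask selection (a full scan) per candidate pair."""
    counts = {}
    for oid in df["OId"].unique():
        for hid in df["HId"].unique():
            n = len(df[(df["OId"] == oid) & (df["HId"] == hid)])
            if n > 0:
                counts[(int(oid), int(hid))] = n
    return counts


def count_pairs_groupby(df):
    """Fast variant: a single groupby pass counts every observed pair at once."""
    sizes = df.groupby(["OId", "HId"]).size()
    return {(int(o), int(h)): int(n) for (o, h), n in sizes.items()}
```

Both functions return the same dictionary. The iterative version rescans the whole table once per candidate pair, whereas the `groupby()` version scans it once in total.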

It looks like someone was already investigating this. I have removed the use of the cached analysis and the lines which computed timings.

I isolated the code for extracting counts and added a benchmark (and a dependency on pytest-benchmark).

Before:

--------------------------------------------------------- benchmark: 1 tests ---------------------------------------------------------
Name (time in s)                                  Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_extract_counts_from_df_map     15.4156  16.1166  15.6762  0.3507  15.4331  0.6114       1;0  0.0638       5           1
--------------------------------------------------------------------------------------------------------------------------------------

After (time in ms not s):

------------------------------------------------------------ benchmark: 1 tests ------------------------------------------------------------
Name (time in ms)                                  Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_extract_counts_from_df_map     146.5993  209.5080  175.8946  22.3510  174.9131  17.7610       2;0  5.6852       5           1
--------------------------------------------------------------------------------------------------------------------------------------------

jvlmdr · 2019-11-30T16:17:29Z

Sorry about the large number of commits. When the pull request was merged, the commits were flattened, and I only pulled the resulting collapsed commit after doing the work. If you like, I can try to rebase it?

jvlmdr · 2019-11-30T16:46:46Z

Opening a new pull request with the rebased branch.

cheind · 2019-12-02T05:54:07Z

Very cool, Jack! You are driving this project :) I guess we should think about making you a maintainer, since the time I can spend professionally on this project has become quite limited. Interested?

jvlmdr · 2019-12-02T12:19:53Z

Thanks! Yep, I think I would be able to do that. Let's talk via email.

cheind and others added 17 commits July 10, 2018 06:33

version bump

b56cd33

fix realease readme

ae83b5a

Merge branch 'release/1.1.3'

125304f

Add test for empty predictions

fbe6ed0

Reproduces issue: cheind#49

Add support for empty predictions.

08ab6d3

Select non-empty elements before making hypothesis ID an index. This avoids "non-empty take from an empty axes" error.

Pass list to loc[] to ensure DataFrame is returned

30e88c2

DataFrame.loc[] may return DataFrame or Series.
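The behaviour behind this commit can be sketched as follows. With a scalar label, `DataFrame.loc[]` returns a Series when the label matches one row of a unique index and a DataFrame when the label is duplicated. With a list of labels it returns a DataFrame in both cases. The frames and the column name `D` below are made up for illustration.

```python
import pandas as pd

# Unique index: a scalar label selects exactly one row.
unique = pd.DataFrame({"D": [0.1, 0.2, 0.3]}, index=[1, 2, 3])
# Duplicated index: the same scalar label selects several rows.
duplicated = pd.DataFrame({"D": [0.4, 0.5]}, index=[2, 2])

as_series = unique.loc[1]          # scalar label, one match: Series
as_frame = duplicated.loc[2]       # scalar label, two matches: DataFrame
always_frame = unique.loc[[1]]     # list of labels: DataFrame with one row
```

Passing a list therefore keeps the return type stable regardless of how many rows match.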

Add both-empty test. Re-factor tests.

a866ac4

Fix event dataframe creator for zero events

5fbcf16

Add corresponding test.

Merge branch 'master' into develop

0487b9c

Add _qdiv to produce nans silently

f5876ae

Use warnings.catch_warnings for suppression

a0d89d8
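As a rough illustration of the idea in the two commits above, the sketch below divides with NumPy inside `warnings.catch_warnings()` so that zero denominators yield `nan` or `inf` without a `RuntimeWarning`. The function name `quiet_divide` is hypothetical, and this is not the `_qdiv` implementation from the repository.

```python
import warnings

import numpy as np


def quiet_divide(a, b):
    """Divide a by b, returning nan/inf for zero denominators without warnings."""
    with warnings.catch_warnings():
        # The filter only applies inside this block and is restored on exit.
        warnings.simplefilter("ignore", RuntimeWarning)
        return np.true_divide(a, b)
```

Because the filter is scoped to the `with` block, warnings raised elsewhere in the program are unaffected.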

Use single groupby() call to count events

eb08800

Speeds up construction of linear assignment problem.

Fix mistake in row and column indexing

84487d7

Separate counts from computing metrics

54393a4

Add benchmark using pytest-benchmark.

Merge remote-tracking branch 'cheindl/develop' into develop

bfb3b62

Merge branch 'dataframe' into develop

aebc2c3

Remove commented-out lines

e931a9e

jvlmdr marked this pull request as ready for review November 30, 2019 16:15

jvlmdr closed this Nov 30, 2019

jvlmdr reopened this Nov 30, 2019

cheind merged commit 382caea into cheind:develop Dec 2, 2019
