Skip to content

Conversation

@jaimefrio
Copy link
Contributor

Added logic to rolling_var to detect windows where all non-NaN
values are identical.

Need to assess both correctness and performance impact.

pandas/algos.pyx Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are nobs and nrep doubles?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nobs was a double before my changes (see here). I seem to recall that I tried changing it to Py_ssize_t and there was some performance degradation, so I let it be. But I may very well be making this up, don't remember the exact details...

Since the effect of this new modifications on performance have to be measured, I'll try things both ways and add some information on the commit message if it stays a double.

@jaimefrio
Copy link
Contributor Author

I have refactored the code quite a bit, see what you think of this, @seth-p. The good news is that the performance differences seems negligible, and may even favor the new implementation, which I really find hard to understand. Following are timings over a random array of size double values, using a window of window items. Timed values are the minimum of enough repeats to take 0.1 sec. total time:

                                       master         new
size:     100, window:    10, time: 4.06e-05 sec. 4.06e-05 sec. 
size:     100, window:    30, time: 4.23e-05 sec. 4.19e-05 sec. 
size:     300, window:    10, time: 4.68e-05 sec. 4.56e-05 sec. 
size:     300, window:    30, time: 4.39e-05 sec. 3.98e-05 sec. 
size:     300, window:   100, time: 5.34e-05 sec. 4.23e-05 sec. 
size:    1000, window:    10, time: 9.48e-05 sec. 8.66e-05 sec. 
size:    1000, window:    30, time: 8.01e-05 sec. 9.36e-05 sec. 
size:    1000, window:   100, time: 7.76e-05 sec. 1.22e-04 sec. 
size:    1000, window:   300, time: 7.72e-05 sec. 1.08e-04 sec. 
size:    3000, window:    10, time: 1.34e-04 sec. 1.63e-04 sec. 
size:    3000, window:    30, time: 1.77e-04 sec. 1.38e-04 sec. 
size:    3000, window:   100, time: 1.67e-04 sec. 1.34e-04 sec. 
size:    3000, window:   300, time: 1.63e-04 sec. 1.59e-04 sec. 
size:    3000, window:  1000, time: 1.29e-04 sec. 1.47e-04 sec. 
size:   10000, window:    10, time: 4.34e-04 sec. 4.12e-04 sec. 
size:   10000, window:    30, time: 4.39e-04 sec. 3.18e-04 sec. 
size:   10000, window:   100, time: 3.55e-04 sec. 2.82e-04 sec. 
size:   10000, window:   300, time: 3.59e-04 sec. 2.89e-04 sec. 
size:   10000, window:  1000, time: 4.66e-04 sec. 3.09e-04 sec. 
size:   10000, window:  3000, time: 4.58e-04 sec. 3.64e-04 sec. 
size:   30000, window:    10, time: 1.25e-03 sec. 1.02e-03 sec. 
size:   30000, window:    30, time: 8.98e-04 sec. 7.80e-04 sec. 
size:   30000, window:   100, time: 8.39e-04 sec. 1.09e-03 sec. 
size:   30000, window:   300, time: 8.08e-04 sec. 8.19e-04 sec. 
size:   30000, window:  1000, time: 1.06e-03 sec. 6.99e-04 sec. 
size:   30000, window:  3000, time: 7.99e-04 sec. 6.68e-04 sec. 
size:   30000, window: 10000, time: 6.61e-04 sec. 6.29e-04 sec. 
size:  100000, window:    10, time: 2.46e-03 sec. 2.25e-03 sec. 
size:  100000, window:    30, time: 2.83e-03 sec. 2.29e-03 sec. 
size:  100000, window:   100, time: 2.63e-03 sec. 2.44e-03 sec. 
size:  100000, window:   300, time: 2.50e-03 sec. 2.54e-03 sec. 
size:  100000, window:  1000, time: 2.41e-03 sec. 2.61e-03 sec. 
size:  100000, window:  3000, time: 2.63e-03 sec. 2.34e-03 sec. 
size:  100000, window: 10000, time: 2.38e-03 sec. 2.24e-03 sec. 
size:  300000, window:    10, time: 8.60e-03 sec. 8.85e-03 sec. 
size:  300000, window:    30, time: 8.44e-03 sec. 7.96e-03 sec. 
size:  300000, window:   100, time: 8.74e-03 sec. 8.11e-03 sec. 
size:  300000, window:   300, time: 8.72e-03 sec. 8.40e-03 sec. 
size:  300000, window:  1000, time: 8.45e-03 sec. 8.09e-03 sec. 
size:  300000, window:  3000, time: 8.76e-03 sec. 8.42e-03 sec. 
size:  300000, window: 10000, time: 9.08e-03 sec. 8.51e-03 sec. 
size: 1000000, window:    10, time: 3.17e-02 sec. 3.06e-02 sec. 
size: 1000000, window:    30, time: 3.21e-02 sec. 2.97e-02 sec. 
size: 1000000, window:   100, time: 3.40e-02 sec. 2.82e-02 sec. 
size: 1000000, window:   300, time: 3.17e-02 sec. 3.05e-02 sec. 
size: 1000000, window:  1000, time: 2.93e-02 sec. 3.12e-02 sec. 
size: 1000000, window:  3000, time: 2.94e-02 sec. 2.75e-02 sec. 
size: 1000000, window: 10000, time: 3.34e-02 sec. 2.81e-02 sec.

@jreback
Copy link
Contributor

jreback commented Aug 5, 2014

can you add a test that validates this new behaviour (well not new, but 'correct')?

@jreback jreback added this to the 0.15.0 milestone Aug 5, 2014
@jaimefrio
Copy link
Contributor Author

It may take a couple of days until I get around to it, but if we think this is worth adding I will definitely write some tests for it.

@jreback
Copy link
Contributor

jreback commented Aug 5, 2014

sure no problem

I think definitely worthwhile avoiding pathological cases (because of rounding / precision issues) if possible (as u have shown)

@seth-p
Copy link
Contributor

seth-p commented Aug 5, 2014

Agreed, I think it's worth handling degenerate cases accurately.

pandas/algos.pyx Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps val = 0 if (nobs > ddof) else NaN?

Added logic to `rolling_var` to detect windows where all non-NaN
values are identical.

Need to assess both correctness and performance impact.
@jreback
Copy link
Contributor

jreback commented Sep 10, 2014

how's this coming?

@jreback
Copy link
Contributor

jreback commented Sep 14, 2014

@jaimefrio status?

@jaimefrio
Copy link
Contributor Author

I was having some rebase problems last night, but it is mostly ready, Will get back to it tonight and push a hopefully final version.

@jaimefrio
Copy link
Contributor Author

Couldn't solve my rebase issues, so created a new PR #8271. Closing this

@jaimefrio jaimefrio closed this Sep 15, 2014
@jaimefrio jaimefrio deleted the zero_var_detection branch September 16, 2014 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Numeric Operations Arithmetic, Comparison, and Logical operations

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants