Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: maybe_convert_numeric speedup #16104

Merged
merged 6 commits into from
Apr 25, 2017

Conversation

chris-b1
Copy link
Contributor

@chris-b1 chris-b1 commented Apr 23, 2017

xref #16043

Probably more to be done, but picks up some easy performance on the int path. For some reason I'm having trouble running asv, here are some timings.

setup

ints = np.array(list(range(1000000)), dtype=object)
floats = np.array([float(x) for x in range(1000000)], dtype=object)
ints_with_float = ints.copy()
ints_with_float[-1] = 1.2

Master

In [2]: %timeit pd.to_numeric(floats)
10 loops, best of 3: 132 ms per loop

In [3]: %timeit pd.to_numeric(ints)
1 loop, best of 3: 437 ms per loop

In [4]: %timeit pd.to_numeric(ints_with_float)
1 loop, best of 3: 444 ms per loop

PR

In [5]: %timeit pd.to_numeric(floats)
10 loops, best of 3: 88.9 ms per loop

In [6]: %timeit pd.to_numeric(ints)
10 loops, best of 3: 57.6 ms per loop

In [7]: %timeit pd.to_numeric(ints_with_float)
1 loop, best of 3: 392 ms per loop

@chris-b1 chris-b1 added the Performance Memory or execution speed performance label Apr 23, 2017
uints[i] = as_int
if as_int <= iINT64_MAX:
ints[i] = as_int
if val >= 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might be able to

cdef ival <int> val
if ival >= 0:
     ...
if ival <= iINT64_max:
    ..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't do exactly this because val could be larger than fits into a c int

@jreback
Copy link
Contributor

jreback commented Apr 23, 2017

the other thing is it might be worth it to actually try to cast once you have found an actual int (or a uint),

In [13]: (ints.astype('int64', casting='unsafe') == ints).all()
Out[13]: True

In [14]: %timeit (ints.astype('int64', casting='unsafe') == ints).all()
55.5 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

At this point it would raise if you encounter a float so this is ok (you can do it in a try/except).

not sure if this is totally feasible though.

@chris-b1
Copy link
Contributor Author

OK, I've added some fastpath logic that will try and convert the whole array at once - have to see if anything breaks!

@jorisvandenbossche jorisvandenbossche added this to the 0.21.0 milestone Apr 24, 2017
@TomAugspurger
Copy link
Contributor

This should go in 0.20, correct?

@chris-b1
Copy link
Contributor Author

chris-b1 commented Apr 24, 2017 via email

@TomAugspurger TomAugspurger modified the milestones: 0.20.0, 0.21.0 Apr 24, 2017
@codecov
Copy link

codecov bot commented Apr 24, 2017

Codecov Report

Merging #16104 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16104      +/-   ##
==========================================
- Coverage   90.84%   90.84%   -0.01%     
==========================================
  Files         159      159              
  Lines       50775    50768       -7     
==========================================
- Hits        46126    46118       -8     
- Misses       4649     4650       +1
Flag Coverage Δ
#multiple 88.62% <ø> (-0.01%) ⬇️
#single 40.33% <ø> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/common.py 90.68% <0%> (-0.35%) ⬇️
pandas/util/testing.py 79.07% <0%> (-0.16%) ⬇️
pandas/api/types/__init__.py 100% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e501e1d...9757099. Read the comment docs.

@jreback jreback merged commit 008e9ec into pandas-dev:master Apr 25, 2017
@jreback
Copy link
Contributor

jreback commented Apr 25, 2017

nice! thanks @chris-b1

@jorisvandenbossche
Copy link
Member

Yeah, I put the 0.21 milestone on it as it was about performance, but forgot it was for a perf regressiono in 0.20dev. So certainly fine. Thanks Chris!

pcluo pushed a commit to pcluo/pandas that referenced this pull request May 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants