Modest performance, address #12647 (#12656)
Conversation
@kshedden you probably know this, but all of the perf issues have to do with going back and forth between Cython and Python. The ideal would be to put the entire loop in Cython, rather than having the loop in Python with a call into Cython to parse each single line.
Actually, I didn't know that there was that much function call overhead.
Yeah, because everything has to be checked every time. It's not what you are passing, but the function call itself. You should be able to move the entire loop and will get pretty good speedups.
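A minimal sketch of that pattern (hypothetical names, not this PR's actual code): the row loop lives in Cython, so the per-call Python overhead is paid once per chunk instead of once per line.

    cdef int _parse_line(object parser, int i):
        # C-level call: no Python argument packing or dispatch per line
        # ... parse one line of the file ...
        return 0

    def process_all(object parser, int nrows):
        cdef int i
        for i in range(nrows):        # the loop itself runs in Cython
            _parse_line(parser, i)    # cheap C-level call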
This should resolve #12659, #12654, #12647. Also adds test coverage with two new files. The tests pass on most setups, but there is one coredump on Travis; I'm not sure if that is related to this PR. There are also some performance enhancements here in the 2x-4x range. @gdementen @jreback @ywhuofu @benjello @raderaj
break

def _readline(parser):
make this a cdef
cdef bint _readline
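A sketch of what that conversion might look like (the body is illustrative only, not this PR's actual implementation):

    cdef bint _readline(object parser):
        # a C bint return avoids boxing the flag into a Python bool
        # ... advance the parser by one line ...
        return True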
def process_byte_array_with_data(parser, int offset, int length,
                                 np.ndarray[uint8_t, ndim=2] byte_chunk,
                                 np.ndarray[dtype=object, ndim=2] string_chunk):

def _do_read(parser, int nrows):
You don't have to, but it's good to do: _do_read(object parser, int nrows):
@jreback I'm having trouble working with object-dtype arrays in Cython and would appreciate your help. I have the following code, where I have also tried typing …
The docs are here. I think you want something like:
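(The snippet that followed was not preserved; a plausible sketch of the pattern, with names assumed from the diff above:)

    import numpy as np
    cimport numpy as np
    from numpy cimport uint8_t

    def fill_string(np.ndarray[uint8_t, ndim=2] byte_chunk,
                    np.ndarray[object, ndim=2] string_chunk,
                    int j, int start, int length):
        # store the raw bytes into the object array; decode to str later,
        # outside the hot loop
        string_chunk[j, 0] = byte_chunk[j, start:start + length].tobytes()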
Also note that you are always working with bytes, and NOT unicode/str; you can decode things later. HTH, and lmk.
@jreback I think this is ready now; the known bugs and performance issues should be resolved.
Can you post a perf comparison? (Also, I haven't looked, but do we have an asv for this?)
We don't have an asv; I tried setting one up but failed (I don't have a Continuum distro and couldn't get it to work with virtualenv). A rough timing on a 100K-line file: the released version (v0.18) takes around 11 seconds, while this version takes around 5 seconds. However, CSV reading is about 50x faster, so there is still a ways to go.
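For reference, an asv benchmark for this reader might look roughly like the following (the class name and sample file are made up):

    import pandas as pd

    class SAS7BDATReading:
        def setup(self):
            # any representative sas7bdat sample would do here
            self.fname = "sample.sas7bdat"

        def time_read_sas7bdat(self):
            pd.read_sas(self.fname, format="sas7bdat")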
How is this coming?
I think it is ready to go; the known bugs are fixed and the performance is at least somewhat better.
@@ -191,7 +191,7 @@ Deprecations
Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

- Improved speed of SAS reader (PR 12656)
Write the issue number like the others (it doesn't matter that it's a PR number).
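Presumably the fixed entry would then follow the whatsnew convention, e.g.:

    - Improved speed of SAS reader (:issue:`12656`)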
A PEP check failed.
Did you look at the refactor I posted?
Do you mean this one?
Thanks, it runs clean locally; waiting on Travis now. The only change I made is that I left a few things in the decompressors as int type, due to overflow potential.
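A tiny illustration of the overflow concern (made-up variables, not the decompressor's actual code):

    from numpy cimport uint8_t

    cdef uint8_t a = 200
    cdef uint8_t b = 100
    cdef int total = <int>a + <int>b   # 300, as intended
    cdef uint8_t wrapped = a + b       # C truncation: wraps around to 44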
BTW, I only had a really small benchmark file.
# Loop until a data row is read
while True:
    if self.parser._current_page_type == const.page_meta_type:
        flag = (self.current_row_on_page_index >=
FYI, there is a fair amount more that can be typed here, but it would involve some rewriting.
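For instance (a sketch assuming the reader were rewritten as a cdef class; attribute names taken from the snippet above):

    cdef class _ReaderState:
        # typed attributes let comparisons like the one above run as plain
        # C int operations instead of Python attribute lookups per iteration
        cdef int current_row_on_page_index
        cdef int current_page_type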
For the test file I have been using, I am now getting 3 seconds per 100K lines versus 11 seconds per 100K lines in the 0.18.0 version. Also, I'm not aware of any files that fail under this version, although in some cases the encoding will need to be manually specified.
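Specifying the encoding manually looks something like this (file name and encoding are placeholders):

    import pandas as pd

    # pass encoding explicitly when it cannot be inferred from the file
    df = pd.read_sas("some_file.sas7bdat", encoding="latin-1")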
Oh, that sounds great! So you are able to infer encodings in some instances? Worth mentioning in the docs / docstring? Are there any other updates for the docstring / docs?
A couple of issues on Windows; going to fix up.
The airlines.sas7bdat and csv files seem to be missing; can you push them up in another commit?
Done.
Ty. Not sure why that didn't fail the tests, but oh well.
Thanks @kshedden, great effort!!!!
I may come back around with some more perf improvements in the Cython code when I have some time.
closes #12659
closes #12654
closes #12647
closes #12809
Major performance improvements through use of Cython