Some fixes for multithreaded Context parsing #1073
Conversation
Bump. This fixes a real bug and a race condition.
    function findnextnewline(pos, stop, buf, opts)
        while pos < stop
`stop` here seems to be the last byte of the current chunk; why is it `<` here instead of `<=`? I understand the change below from `len = ranges[i + 1]` to `len = ranges[i + 1] - 1`, but it seems like we still want to check each byte in the chunk?
(For example, on line 372 we're doing `while pos <= len`...)
Thanks for the review!
This function `findnextnewline` only returns the position of the first encountered newline, defaulting to `stop`. So whether there is no newline at all, or the newline happens to be the last byte, it returns `stop` either way; since the function has no side effects, we might as well skip checking that last byte.
The default return value comes from the assumption that each chunk should end with a newline. On line 450 this assumption is guaranteed by the preprocessing of ranges on lines 466-480, while on the call at line 470, `stop` is simply the end of the buffer.
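The contract being discussed can be modeled with a small sketch (in Python for illustration with 0-based indices; the real function is Julia code in CSV.jl, so this only models its behaviour, not the actual implementation):

```python
NEWLINE = ord("\n")

def findnextnewline(pos, stop, buf):
    # Return the position of the first newline at or after `pos`,
    # defaulting to `stop` when none is found before it.
    while pos < stop:
        if buf[pos] == NEWLINE:
            return pos
        pos += 1
    return stop

# If the only newline sits exactly at `stop`, the default return
# value is already the right answer, so inspecting buf[stop] is
# redundant -- `pos < stop` and `pos <= stop` give the same result:
print(findnextnewline(0, 3, b"abc\n"))  # -> 3 (the newline position)
print(findnextnewline(0, 3, b"abcd"))   # -> 3 (default, no newline)
```

Both calls return 3, which is why the loop never needs to look at the last byte.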
quinnj left a comment
This looks pretty good IMO; I left one question though. I'm also not sure why CI is not running at all here? We definitely want to make sure we have a good test run before merging.
Codecov Report

    @@            Coverage Diff             @@
    ##             main    #1073      +/-   ##
    ==========================================
    + Coverage   90.22%   90.40%   +0.18%
    ==========================================
      Files           9        9
      Lines        2270     2293      +23
    ==========================================
    + Hits         2048     2073      +25
    + Misses        222      220       -2

☔ View full report in Codecov by Sentry.
Thanks @Liozou!

You're welcome, thanks for the review and merge! I believe this should also fix #1089; I checked locally that the bug occurs when using e.g.
* Reinitialize column types after failed chunking
* Fix race condition on ranges in findchunkrowstart
* Reinitialize column type after limit row start detection
* Preprocess ranges to avoid cutting values in chunk row start detection
* Fix using too large limit
* Fix off-by-one in chunk row detection for last line
* Add tests
* Make multithreaded parsing failure loud
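The "preprocess ranges" change can be pictured as snapping each candidate chunk boundary forward to just past the next newline, so no value is cut in half between two chunks. A minimal Python sketch of that idea (`snap_to_newlines` is a hypothetical name, not CSV.jl's code, which operates on a Julia byte buffer):

```python
def snap_to_newlines(buf, ranges):
    # Move every interior boundary forward to just past the next
    # newline, so each chunk ends on a complete row.  The first and
    # last entries (start and end of buffer) are left untouched.
    snapped = [ranges[0]]
    for r in ranges[1:-1]:
        nl = buf.find(b"\n", r)
        # Assumes the buffer ends with a newline, so `nl` is found
        # for every interior boundary.
        snapped.append(nl + 1)
    snapped.append(ranges[-1])
    return snapped

buf = b"1,2\n3,4\n5,6\n"
# The naive boundary 5 would split the row "3,4"; it snaps to 8.
print(snap_to_newlines(buf, [0, 5, len(buf)]))  # -> [0, 8, 12]
```

With boundaries aligned on row starts, each thread can parse its chunk without peeking into its neighbour's bytes.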
I had a CSV file with only `Float64` columns that were sometimes parsed as `String`, sometimes as `Float64`, when using multiple threads. Whether it happened, and which columns were affected, changed from one execution to the next, which made the issue difficult to track. This PR fixes this behaviour and a couple of issues encountered in the same area of the code. Most notably, a race condition on `ranges` in `findchunkrowstart`, where thread i reads `ranges[i+1]` and writes to `ranges[i]` (`ntasks` value #1010).

I also made multithreaded parsing failure a loud `@error` to prevent performance pitfalls from going unnoticed (last commit). This can easily be reverted if you think it could lead to too many spurious logs.

Fix #1010, fix #1047, and added tests (which now include one occurrence of the new `@error`).
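The race on the shared `ranges` array, where task i reads `ranges[i+1]` while task i+1 may still be writing it, can be avoided by snapshotting each task's bounds before the workers start. A hedged Python sketch of that pattern (`split_ranges` and `parse_chunks` are hypothetical names; CSV.jl's actual fix is in its Julia source):

```python
import threading

def split_ranges(nbytes, ntasks):
    # Hypothetical helper: ntasks+1 candidate chunk boundaries.
    step = nbytes // ntasks
    return [i * step for i in range(ntasks)] + [nbytes]

def parse_chunks(buf, ntasks, parse_chunk):
    ranges = split_ranges(len(buf), ntasks)
    # Snapshot each task's (start, stop) pair *before* spawning the
    # workers: once the tasks run, a write to the shared `ranges[i]`
    # can no longer race with task i-1 reading its stop bound.
    bounds = [(ranges[i], ranges[i + 1]) for i in range(ntasks)]
    results = [None] * ntasks

    def work(i):
        start, stop = bounds[i]
        results[i] = parse_chunk(buf, start, stop)

    threads = [threading.Thread(target=work, args=(i,)) for i in range(ntasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Each chunk is parsed independently; concatenating the per-chunk
# results reproduces the original buffer.
out = parse_chunks(b"1,2\n3,4\n5,6\n", 3, lambda b, s, e: b[s:e])
print(b"".join(out))  # -> b'1,2\n3,4\n5,6\n'
```

The design point is simply that immutable per-task bounds, captured up front, make the order of later writes to the shared array irrelevant.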