Re-implement CSV record reader to skip unparseable lines #14396
cc @rajagopr @suvodeep-pyne who recently updated the record reader
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #14396 +/- ##
============================================
+ Coverage 61.75% 63.78% +2.03%
- Complexity 207 1556 +1349
============================================
Files 2436 2660 +224
Lines 133233 145949 +12716
Branches 20636 22359 +1723
============================================
+ Hits 82274 93091 +10817
- Misses 44911 45978 +1067
- Partials 6048 6880 +832
Flags with carried forward coverage won't be shown.
When should someone set if set to false, will we still get an exception?
@KKcorps when
- skipUnParseableLines flag and always skip un-parseable lines
- stopOnError flag to stop reading records on error
- commons-csv
- RecordReaderFileConfig to not access record reader when it is already closed

The new record reader is able to handle multi-line values as well as un-parseable lines. When a line cannot be parsed, it resumes from the next line.
More importantly, it doesn't throw an exception in hasNext(). Instead, the exception is cached and thrown when next() is called. This aligns with the high-level abstraction of the ingestion flow.
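
The hasNext()/next() contract described above can be illustrated with a small standalone sketch. This is not the code from this PR: the class name DeferredErrorReader, the parseLine helper, and the fixed field-count check are all hypothetical stand-ins, and the way stopOnError is modelled here (no further records after the first error) is an assumption based only on the flag's one-line description.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical sketch of a reader that defers parse errors from hasNext() to next(). */
class DeferredErrorReader {
  private final Iterator<String> _lines;
  private final int _expectedFields;
  private final boolean _stopOnError;

  private String[] _nextRecord;            // record parsed ahead of time, if any
  private RuntimeException _pendingError;  // parse error cached until next() is called
  private boolean _stopped;                // set after an error when stopOnError is true

  DeferredErrorReader(Iterator<String> lines, int expectedFields, boolean stopOnError) {
    _lines = lines;
    _expectedFields = expectedFields;
    _stopOnError = stopOnError;
  }

  /** Hypothetical parser: splits on commas and rejects lines with the wrong field count. */
  static String[] parseLine(String line, int expectedFields) {
    String[] fields = line.split(",", -1);
    if (fields.length != expectedFields) {
      throw new IllegalArgumentException("Unparseable line: " + line);
    }
    return fields;
  }

  /** Never throws on a bad line; a parse failure is cached and reported as "something to read". */
  public boolean hasNext() {
    if (_nextRecord != null || _pendingError != null) {
      return true;
    }
    if (_stopped || !_lines.hasNext()) {
      return false;
    }
    try {
      _nextRecord = parseLine(_lines.next(), _expectedFields);
    } catch (RuntimeException e) {
      _pendingError = e;
      _stopped = _stopOnError;
    }
    return true;
  }

  /** Returns the next record, or throws the error cached by hasNext(). */
  public String[] next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    if (_pendingError != null) {
      RuntimeException e = _pendingError;
      _pendingError = null;  // cleared so that reading resumes from the following line
      throw e;
    }
    String[] record = _nextRecord;
    _nextRecord = null;
    return record;
  }
}
```

With this shape, a caller can loop on hasNext() without a try/catch around it, catch the exception from next() for a bad line, and keep iterating to pick up the lines that follow.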