Re-implement CSV record reader to skip unparseable lines #14396

Merged
merged 1 commit into from
Nov 8, 2024

Conversation

Jackie-Jiang
Contributor

@Jackie-Jiang Jackie-Jiang commented Nov 6, 2024

  • Remove the skipUnParseableLines flag and always skip unparseable lines
  • Add a stopOnError flag to stop reading records on error
  • Support all formats in commons-csv
  • Fix a bug in RecordReaderFileConfig so that it does not access the record reader after it has been closed

The new record reader handles multi-line values as well as unparseable lines. When a line cannot be parsed, the reader resumes from the next line.
More importantly, hasNext() does not throw an exception; instead, the exception is cached and thrown when next() is called. This aligns with the high-level abstraction of the ingestion flow.
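
To illustrate the deferred-error behavior described above, here is a minimal sketch. It is not the actual CSVRecordReader code: the class `DeferredErrorReaderSketch`, the helper `readAll`, and the rule that a line is unparseable when its column count is wrong are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch (not Pinot code): hasNext() never throws; a parse error
// is cached and thrown by next(), after which reading resumes from the next line.
class DeferredErrorReaderSketch {
  private final Iterator<String> _lines;
  private final int _numColumns;
  private String[] _nextRecord;
  private RuntimeException _pendingError;

  DeferredErrorReaderSketch(List<String> lines, int numColumns) {
    _lines = lines.iterator();
    _numColumns = numColumns;
    advance();
  }

  // Parses one line ahead. On failure the exception is stored, not thrown.
  private void advance() {
    _nextRecord = null;
    _pendingError = null;
    if (!_lines.hasNext()) {
      return;
    }
    String line = _lines.next();
    String[] fields = line.split(",", -1);
    if (fields.length != _numColumns) {
      // Assumed notion of "unparseable" for this sketch: wrong column count.
      _pendingError = new IllegalStateException("Unparseable line: " + line);
    } else {
      _nextRecord = fields;
    }
  }

  boolean hasNext() {
    // A pending error counts as "something to deliver", so the caller sees it in next().
    return _nextRecord != null || _pendingError != null;
  }

  String[] next() {
    if (_pendingError != null) {
      RuntimeException e = _pendingError;
      advance(); // move past the bad line before surfacing the error
      throw e;
    }
    if (_nextRecord == null) {
      throw new NoSuchElementException();
    }
    String[] record = _nextRecord;
    advance();
    return record;
  }

  // Reads everything, recording "ERROR" wherever next() threw.
  static List<String> readAll(List<String> lines, int numColumns) {
    DeferredErrorReaderSketch reader = new DeferredErrorReaderSketch(lines, numColumns);
    List<String> out = new ArrayList<>();
    while (reader.hasNext()) {
      try {
        out.add(String.join("|", reader.next()));
      } catch (IllegalStateException e) {
        out.add("ERROR");
      }
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(readAll(Arrays.asList("a,b", "bad", "c,d"), 2));
  }
}
```

Because the error surfaces from next() rather than hasNext(), the caller's loop condition stays exception-free and the decision about how to handle a bad line sits in one place.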

@Jackie-Jiang
Contributor Author

cc @rajagopr @suvodeep-pyne who recently updated the record reader

@codecov-commenter

codecov-commenter commented Nov 6, 2024

Codecov Report

Attention: Patch coverage is 84.46602% with 16 lines in your changes missing coverage. Please review.

Project coverage is 63.78%. Comparing base (59551e4) to head (eb6bcc0).
Report is 1299 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../pinot/plugin/inputformat/csv/CSVRecordReader.java | 85.26% | 9 Missing and 5 partials ⚠️ |
| ...pinot/spi/data/readers/RecordReaderFileConfig.java | 60.00% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14396      +/-   ##
============================================
+ Coverage     61.75%   63.78%   +2.03%     
- Complexity      207     1556    +1349     
============================================
  Files          2436     2660     +224     
  Lines        133233   145949   +12716     
  Branches      20636    22359    +1723     
============================================
+ Hits          82274    93091   +10817     
- Misses        44911    45978    +1067     
- Partials       6048     6880     +832     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.74% <84.46%> (+2.04%) ⬆️
java-21 63.64% <84.46%> (+2.02%) ⬆️
skip-bytebuffers-false 63.77% <84.46%> (+2.02%) ⬆️
skip-bytebuffers-true 63.62% <84.46%> (+35.90%) ⬆️
temurin 63.78% <84.46%> (+2.03%) ⬆️
unittests 63.77% <84.46%> (+2.03%) ⬆️
unittests1 55.49% <50.48%> (+8.60%) ⬆️
unittests2 34.14% <81.55%> (+6.41%) ⬆️


@Jackie-Jiang Jackie-Jiang force-pushed the fix_csv_record_reader branch 2 times, most recently from 0268a2c to e5405b8 Compare November 6, 2024 19:16
@Jackie-Jiang Jackie-Jiang force-pushed the fix_csv_record_reader branch from e5405b8 to eb6bcc0 Compare November 6, 2024 23:23
@KKcorps
Contributor

KKcorps commented Nov 8, 2024

When should someone set stopOnError to true?

If it is set to false, will we still get an exception?

@KKcorps KKcorps merged commit d03241a into apache:master Nov 8, 2024
21 checks passed
@Jackie-Jiang Jackie-Jiang deleted the fix_csv_record_reader branch November 13, 2024 21:54
@Jackie-Jiang
Contributor Author

@KKcorps When stopOnError is set to true, the record reader stops parsing when it encounters an error and returns false from hasNext(). When it is not set, hasNext() returns true and next() throws the exception, which the index creator can handle by either aborting or, when continueOnError is set, filling in default values and continuing.
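
The two modes can be sketched roughly as follows. This is an illustration only, not Pinot's implementation: `StopOnErrorSketch`, `IntLineReader`, `ingest`, and the choice of "not an integer" as the parse failure are hypothetical, and the `continueOnError` handling stands in for what the index creator does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch (not Pinot code) of how stopOnError on the reader and
// continueOnError on the consumer could interact.
class StopOnErrorSketch {

  // Toy reader: a line is "unparseable" when it is not an integer.
  static class IntLineReader {
    private final Iterator<String> _lines;
    private final boolean _stopOnError;
    private Integer _nextValue;
    private RuntimeException _pendingError;

    IntLineReader(List<String> lines, boolean stopOnError) {
      _lines = lines.iterator();
      _stopOnError = stopOnError;
      advance();
    }

    private void advance() {
      _nextValue = null;
      _pendingError = null;
      if (!_lines.hasNext()) {
        return;
      }
      try {
        _nextValue = Integer.parseInt(_lines.next().trim());
      } catch (NumberFormatException e) {
        // stopOnError: leave nothing pending, so hasNext() reports false.
        // Otherwise: cache the error so that next() throws it.
        if (!_stopOnError) {
          _pendingError = e;
        }
      }
    }

    boolean hasNext() {
      return _nextValue != null || _pendingError != null;
    }

    int next() {
      if (_pendingError != null) {
        RuntimeException e = _pendingError;
        advance();
        throw e;
      }
      if (_nextValue == null) {
        throw new NoSuchElementException();
      }
      int value = _nextValue;
      advance();
      return value;
    }
  }

  // Stand-in for the consumer: abort on error, or substitute a default and continue.
  static List<Integer> ingest(List<String> lines, boolean stopOnError,
      boolean continueOnError, int defaultValue) {
    IntLineReader reader = new IntLineReader(lines, stopOnError);
    List<Integer> out = new ArrayList<>();
    while (reader.hasNext()) {
      try {
        out.add(reader.next());
      } catch (NumberFormatException e) {
        if (!continueOnError) {
          throw e; // abort ingestion
        }
        out.add(defaultValue); // fill a default and keep going
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("1", "x", "3");
    System.out.println(ingest(lines, true, false, 0));
    System.out.println(ingest(lines, false, true, 0));
  }
}
```

In this sketch, stopOnError silently truncates the input at the first bad line, while leaving it unset hands the decision to the caller.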

davecromberge pushed a commit to davecromberge/pinot that referenced this pull request Nov 22, 2024
Labels
bugfix Configuration Config changes (addition/deletion/change in behavior) documentation enhancement ingestion
3 participants