Fix for reading exported parquet #1071
Deploying datachain-documentation
Latest commit: 7a05652
Status: ✅ Deploy successful!
Preview URL: https://1024bc17.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-1066-fix-reading-exp.datachain-documentation.pages.dev
Codecov Report
All modified and coverable lines are covered by tests ✅

@@            Coverage Diff             @@
##             main    #1071      +/-   ##
==========================================
- Coverage   87.94%   87.92%   -0.03%
==========================================
  Files         148      148
  Lines       12752    12755       +3
  Branches     1783     1784       +1
==========================================
  Hits        11215    11215
- Misses       1098     1100       +2
- Partials      439      440       +1
Makes sense, I've now overwritten it with the new source.
if self.output_schema and hasattr(vals[0], "source"):
    # if we are reading parquet file written by datachain it might have
    # source inside of it already, so we should not duplicate it, instead
    # we are re-creating it
Clean up the comment a bit ... (no extra line is needed)
Let's also add to the docs (source=True) that if it is enabled and the file already has a source, it will be rewritten (e.g. when the file was generated by datachain before).
Fixes an issue that occurs when reading a parquet file that was created by datachain itself and already has source fields inside it. Previously, we added duplicate source fields on read; this PR avoids that even if the source=True flag is set.
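
For illustration only, here is a minimal sketch of the "overwrite instead of duplicate" idea described above. It is not datachain's implementation: the helper name `with_source` and the column/value list representation are hypothetical, chosen just to show the behavior in isolation.

```python
from typing import Any


def with_source(
    columns: list[str], values: list[Any], source: Any
) -> tuple[list[str], list[Any]]:
    """Return (columns, values) containing exactly one "source" column.

    Hypothetical helper: if the row already carries a "source" column
    (e.g. the file was exported earlier with source information), its
    value is replaced; otherwise a new "source" column is appended.
    """
    if "source" in columns:
        # Overwrite the existing entry rather than adding a duplicate.
        idx = columns.index("source")
        new_values = list(values)
        new_values[idx] = source
        return list(columns), new_values
    # No existing source column: append one.
    return [*columns, "source"], [*values, source]
```

With this shape, reading a row that already has a `source` column keeps the column count unchanged and only swaps in the new value, while a row without one gains a single `source` column.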