Fix for reading exported parquet #1071
Deploying datachain-documentation
Latest commit: e560b08
Status: ✅ Deploy successful!
Preview URL: https://e36e01c3.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-1066-fix-reading-exp.datachain-documentation.pages.dev
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1071   +/-   ##
=======================================
  Coverage   88.68%   88.69%
=======================================
  Files         152      152
  Lines       13606    13609    +3
  Branches     1893     1894    +1
=======================================
+ Hits        12067    12070    +3
  Misses       1093     1093
  Partials      446      446
Makes sense. I've now overwritten it with the new source.
src/datachain/lib/arrow.py
Outdated
if self.output_schema and hasattr(vals[0], "source"):
    # if we are reading parquet file written by datachain it might have
    # source inside of it already, so we should not duplicate it, instead
    # we are re-creating it
Clean up the comment a bit (no extra line is needed).
Let's also add a note to the docs for source=True: if it is enabled and the file already has source fields, they will be rewritten (e.g. when the file was previously generated by datachain).
Looks good to me 👍
@ilongin what is the status here?
Co-authored-by: Vladimir Rudnykh <dreadatour@gmail.com>
for more information, see https://pre-commit.ci
Fixes an issue that occurs when reading a parquet file that was created by datachain itself and already contains source fields. Previously, duplicate source fields were added on read; this PR avoids that even if the source=True flag is set.
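
To illustrate the behavior described above, here is a minimal, library-free sketch. It is not the code from src/datachain/lib/arrow.py: the helper name `add_source_column` and the flat column/value lists are hypothetical stand-ins, used only to show the idea of re-creating an existing source field instead of adding a second one.

```python
from typing import Any


def add_source_column(
    columns: list[str],
    values: list[Any],
    new_source: Any,
    source: bool = True,
) -> tuple[list[str], list[Any]]:
    """Return columns/values with at most one "source" entry.

    Hypothetical helper for illustration only; not part of datachain.
    """
    if not source:
        # Assumption: without the flag, the row is passed through unchanged.
        return list(columns), list(values)

    # Drop any "source" column already stored in the file, so that attaching
    # the fresh one below re-creates it rather than duplicating it.
    kept = [(c, v) for c, v in zip(columns, values) if c != "source"]

    out_columns = ["source"] + [c for c, _ in kept]
    out_values = [new_source] + [v for _, v in kept]
    return out_columns, out_values


# A row read from a file that already carries a "source" column.
cols, vals = add_source_column(
    ["source", "id"], ["old.parquet", 1], new_source="new.parquet"
)
```

With this approach the stale value stored in the file is replaced by the source of the current read, which matches the reviewer's note that an existing source is rewritten when source=True.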