add incremental processing example #1101
Deploying datachain-documentation
Latest commit: e114989
Status: ✅ Deploy successful!
Preview URL: https://d3b2d34e.datachain-documentation.pages.dev
Branch Preview URL: https://add-delta-example.datachain-documentation.pages.dev
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #1101 +/- ##
=======================================
Coverage 87.97% 87.97%
=======================================
Files 148 148
Lines 12747 12747
Branches 1783 1783
=======================================
Hits 11214 11214
Misses 1094 1094
Partials 439 439
Flags with carried forward coverage won't be shown.
adb5666 to 163c9b5
d91c9e6 to e114989
    This demonstrates incremental processing - only new files are processed.
    """
    chain = (
        dc.read_storage("test/", update=True, delta=True, delta_on="file.path")
Note that without an explicit `delta_compare`, it compares all fields in the schema except `file.path` (which is already in `delta_on`) to decide whether a file has changed. This means two rows must be identical (all fields the same) to not be counted as "modified / changed". If it counts them as changed, there is no performance gain from the delta update. You usually want to set `delta_compare=["file.version", "file.etag"]`.
There is a case, though, with non-versioned sources where `file.version` and `file.etag` are randomly set on every re-index, which causes the same thing to happen regardless, as everything will be caught as modified. In these cases, and in every other case where the user doesn't want to or can't track changed rows, the workaround is to set `delta_compare` to be the same as `delta_on`, but we need a better way.
Options are:
- Make the default `delta_compare=None` disable tracking of changed rows instead of looking at all fields. If we go down this path, then `DataChain.compare()` and `DataChain.diff()` need to be changed as well to be consistent.
- Add an additional flag for this, e.g. `delta_ignore_changed`.

I'm leaning towards the first option, although the user then needs to explicitly set all fields in some cases (we lose the "shortcut" of the default being all fields). I don't have a strong opinion, though.
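To make the comparison semantics above concrete, here is a minimal, self-contained sketch in plain Python. It is a toy model, not DataChain's implementation: `classify_rows` is a hypothetical helper, and the flat field names `path`, `etag`, and `last_modified` are simplified stand-ins for fields such as `file.path` and `file.etag`. It only illustrates how the choice of compare fields decides whether a re-indexed row counts as modified.

```python
def classify_rows(previous, current, on, compare=None):
    """Toy model of delta classification; not DataChain's implementation."""

    def key(row):
        # Rows are matched across runs by the `on` fields.
        return tuple(row[f] for f in on)

    def fingerprint(row):
        # With compare=None, fall back to every field that is not part of `on`.
        fields = compare if compare is not None else sorted(set(row) - set(on))
        return tuple(row.get(f) for f in fields)

    old = {key(r): r for r in previous}
    result = {"added": [], "modified": [], "unchanged": []}
    for row in current:
        k = key(row)
        if k not in old:
            result["added"].append(k)
        elif fingerprint(old[k]) != fingerprint(row):
            result["modified"].append(k)
        else:
            result["unchanged"].append(k)
    return result


previous = [{"path": "a.txt", "etag": "1", "last_modified": 100}]
current = [
    {"path": "a.txt", "etag": "1", "last_modified": 200},  # same content, re-indexed
    {"path": "b.txt", "etag": "7", "last_modified": 200},  # new file
]

# Default: every non-key field is compared, so a.txt counts as modified.
all_fields = classify_rows(previous, current, on=["path"])
# Comparing only the etag keeps a.txt unchanged.
etag_only = classify_rows(previous, current, on=["path"], compare=["etag"])
# Workaround from the comment: compare on the key itself, so nothing is ever "modified".
key_only = classify_rows(previous, current, on=["path"], compare=["path"])
```

In this model only `b.txt` is new in all three runs; the difference is whether `a.txt` is reprocessed, which is exactly the performance concern raised above.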
thanks, I think we are fine in this particular example (?)
> There is the case though with non-versioned sources where file.version and file.etag are randomly set every time on re-index
where did you experience this?
@ilongin please let me know ^^
I thought I saw it in one of our `gs` buckets, but I checked now and it seems to be OK. `version` and `etag` should be set to an empty string if they don't exist.
Regarding your example, yeah, you don't need to set anything, as none of the columns will change since you only append new files. If you re-created the files every time `generate_next_file` is called, it would be a problem: for local files we use `mtime` as the `etag`, which would mean delta would find all files modified every time.
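The `mtime` point can be illustrated with a short standard-library sketch. It assumes, purely for illustration, that a local file's etag is its modification time rendered as a string; `mtime_etag` and `write_file` are hypothetical helpers, not DataChain functions. Recreating a file with identical content still changes the value, so any comparison on it would flag the file as modified.

```python
import os
import tempfile


def mtime_etag(path):
    # Assumption: the etag is just the modification time rendered as a string.
    return str(os.stat(path).st_mtime_ns)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file-0001.txt")

    def write_file(mtime):
        # Recreate the file with identical content but an explicit mtime.
        with open(path, "w") as f:
            f.write("same content")
        os.utime(path, (mtime, mtime))

    write_file(1_700_000_000)
    first = mtime_etag(path)
    write_file(1_700_000_060)  # recreated a minute later
    second = mtime_etag(path)

# Same bytes on disk, different etag: the file would look modified.
etag_changed = first != second
```

Appending new files, as the example in this PR does, never touches the existing files' modification times, which is why it is unaffected.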
Adds an example of delta (incremental) processing.
TODO