add incremental processing example #1101

Merged
merged 2 commits into from
May 20, 2025

Conversation

shcheklein
Member

@shcheklein shcheklein commented May 16, 2025

Adds an example of delta (incremental) processing.

TODO

  • fix tests / make it run as a test

cloudflare-workers-and-pages bot commented May 16, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: e114989
Status: ✅  Deploy successful!
Preview URL: https://d3b2d34e.datachain-documentation.pages.dev
Branch Preview URL: https://add-delta-example.datachain-documentation.pages.dev

@shcheklein shcheklein requested a review from ilongin May 16, 2025 04:49
codecov bot commented May 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.97%. Comparing base (082c4e3) to head (e114989).
Report is 4 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1101   +/-   ##
=======================================
  Coverage   87.97%   87.97%           
=======================================
  Files         148      148           
  Lines       12747    12747           
  Branches     1783     1783           
=======================================
  Hits        11214    11214           
  Misses       1094     1094           
  Partials      439      439           
Flag Coverage Δ
datachain 87.90% <ø> (ø)

@shcheklein shcheklein force-pushed the add-delta-example branch from adb5666 to 163c9b5 Compare May 17, 2025 16:35
@shcheklein shcheklein force-pushed the add-delta-example branch from d91c9e6 to e114989 Compare May 18, 2025 13:49
@shcheklein shcheklein requested a review from a team May 18, 2025 21:18
This demonstrates incremental processing - only new files are processed.
"""
chain = (
    dc.read_storage("test/", update=True, delta=True, delta_on="file.path")
Contributor

Note that without an explicit delta_compare, it compares all fields in the schema except file.path (which is already in delta_on) to decide whether a file has changed. This means two rows must be identical (all fields the same) in order not to be counted as "modified / changed". If it counts them as changed, there is no performance gain from the delta update. You usually want to set `delta_compare=["file.version", "file.etag"]`.
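
To make the difference concrete, here is a minimal pure-Python sketch of the comparison described above. It is a conceptual model only, not DataChain's implementation: the `classify` helper, the flat row dictionaries, and the `last_modified` field are all assumptions made for illustration.

```python
def classify(old_rows, new_rows, on, compare=None):
    """Split new_rows into added / modified / unchanged relative to old_rows.

    on      -- field that identifies a row (plays the role of delta_on)
    compare -- fields used to detect a change (plays the role of delta_compare);
               None is modeled here as "all fields except the key".
    Hypothetical helper, not part of any library.
    """
    old_by_key = {row[on]: row for row in old_rows}
    result = {"added": [], "modified": [], "unchanged": []}
    for row in new_rows:
        previous = old_by_key.get(row[on])
        if previous is None:
            result["added"].append(row[on])
            continue
        fields = compare if compare is not None else [f for f in row if f != on]
        changed = any(row.get(f) != previous.get(f) for f in fields)
        result["modified" if changed else "unchanged"].append(row[on])
    return result


old = [{"path": "a.txt", "version": "1", "etag": "x", "last_modified": 100}]
new = [
    # Same file, same version and etag, but an unrelated field differs.
    {"path": "a.txt", "version": "1", "etag": "x", "last_modified": 200},
    {"path": "b.txt", "version": "1", "etag": "y", "last_modified": 200},
]

# Comparing every field marks a.txt as modified, so it would be reprocessed.
default_result = classify(old, new, on="path")
# Comparing only version and etag leaves a.txt unchanged.
explicit_result = classify(old, new, on="path", compare=["version", "etag"])
print(default_result)
print(explicit_result)
```

In this model only `b.txt` is new in both runs; the choice of comparison fields decides whether `a.txt` is reprocessed.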

There is the case, though, of non-versioned sources where file.version and file.etag are set randomly on every re-index, which causes the same problem regardless, since everything is caught as modified. In these cases, and in any other case where the user doesn't want to or can't track changed rows, the workaround is to set delta_compare to the same value as delta_on, but we need a better way.
Options are:

  • Make the default delta_compare=None disable tracking of changed rows instead of looking at all fields. If we go down this path, DataChain.compare() and DataChain.diff() need to be changed as well to stay consistent.
  • Add an additional flag for this, e.g. delta_ignore_changed.

I'm leaning more toward the first option, although the user then needs to explicitly set all fields in some cases (we lose the "shortcut" of the default being all fields). I don't have a strong opinion, though.
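
A rough sketch of how the current workaround and the first option would behave, again as a pure-Python model rather than DataChain code. The `changed_keys` helper and the sample rows are hypothetical, and `compare=None` follows the proposed semantics, not necessarily the library's current behavior.

```python
def changed_keys(old_rows, new_rows, on, compare):
    """Return keys present in both row sets whose compare fields differ.

    compare=None models the first proposed option: tracking of changed
    rows is switched off entirely. Hypothetical helper for illustration.
    """
    if compare is None:
        return []
    old_by_key = {row[on]: row for row in old_rows}
    return [
        row[on]
        for row in new_rows
        if row[on] in old_by_key
        and any(row[f] != old_by_key[row[on]][f] for f in compare)
    ]


# A non-versioned source where etag is assumed to change on every re-index.
old = [{"path": "a.txt", "etag": "e1"}]
new = [{"path": "a.txt", "etag": "e2"}]

by_etag = changed_keys(old, new, on="path", compare=["etag"])     # flagged
workaround = changed_keys(old, new, on="path", compare=["path"])  # never flagged
proposed = changed_keys(old, new, on="path", compare=None)        # tracking off
print(by_etag, workaround, proposed)
```

The workaround gets the same result as the proposed default only because a key always equals itself, which is why a more explicit way to express it would be easier to read.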

Member Author

thanks, I think we are fine in this particular example (?)

> There is the case though with non-versioned sources where file.version and file.etag are randomly set every time on re-index

where did you experience this?

Member Author

@ilongin please let me know ^^

Contributor

@ilongin ilongin May 20, 2025

I thought I saw it in one of our gs buckets, but I checked now and it seems to be OK. version and etag should be set to an empty string if they don't exist.

Regarding your example: yes, you don't need to set anything, since none of the columns will change when you only append new files. If you re-created the files every time generate_next_file is called, it would be a problem: for local files we use mtime as the etag, so delta would find all files modified every time.
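
The mtime point can be illustrated with the standard library alone. This sketch assumes, as stated above, that the etag of a local file is derived from its mtime; `mtime_etag` is a hypothetical stand-in and not the function DataChain uses. `os.utime` pins the timestamps so the result does not depend on clock resolution.

```python
import hashlib
import os
import tempfile


def mtime_etag(path):
    # Hypothetical stand-in for an mtime-derived etag.
    return str(os.stat(path).st_mtime_ns)


def content_hash(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file-0001.txt")

    with open(path, "w") as f:
        f.write("same content")
    os.utime(path, ns=(1_000_000_000, 1_000_000_000))  # fixed first timestamp
    etag_before, hash_before = mtime_etag(path), content_hash(path)

    # Re-create the file with identical content at a later time.
    with open(path, "w") as f:
        f.write("same content")
    os.utime(path, ns=(2_000_000_000, 2_000_000_000))  # simulated later write
    etag_after, hash_after = mtime_etag(path), content_hash(path)

content_unchanged = hash_before == hash_after
etag_changed = etag_before != etag_after
print(content_unchanged, etag_changed)
```

The content is byte-for-byte identical, yet an mtime-based etag differs, so a comparison on etag would treat the re-created file as modified.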

@shcheklein shcheklein merged commit 8500235 into main May 20, 2025
35 checks passed
@shcheklein shcheklein deleted the add-delta-example branch May 20, 2025 16:10