add incremental processing example #1101

shcheklein · 2025-05-16T04:48:57Z

Adds an example of delta (incremental) processing.

TODO

fix tests / make it run as a test

cloudflare-workers-and-pages · 2025-05-16T04:49:08Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`e114989`
Status:	✅ Deploy successful!
Preview URL:	https://d3b2d34e.datachain-documentation.pages.dev
Branch Preview URL:	https://add-delta-example.datachain-documentation.pages.dev

codecov · 2025-05-16T04:54:23Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.97%. Comparing base (082c4e3) to head (e114989).
Report is 4 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1101   +/-   ##
=======================================
  Coverage   87.97%   87.97%           
=======================================
  Files         148      148           
  Lines       12747    12747           
  Branches     1783     1783           
=======================================
  Hits        11214    11214           
  Misses       1094     1094           
  Partials      439      439

| Flag | Coverage Δ |
|---|---|
| datachain | `87.90% <ø> (ø)` |

ilongin · 2025-05-19T08:37:13Z

examples/incremental_processing/delta.py

+    This demonstrates incremental processing - only new files are processed.
+    """
+    chain = (
+        dc.read_storage("test/", update=True, delta=True, delta_on="file.path")


Note that without an explicit `delta_compare`, it compares all fields in the schema except `file.path` (since that is already in `delta_on`) to decide whether a file has changed. This means two rows must be identical (all fields the same) in order not to be counted as "modified / changed". If rows are counted as changed, there is no performance gain from the delta update. You usually want to set `delta_compare=["file.version", "file.etag"]`.
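
To make the comparison rule concrete, here is a minimal pure-Python sketch of it. It is a toy model, not DataChain's implementation: the `classify` helper and the sample rows (including the `file.last_modified` values) are made up for illustration, and only the idea of a key field plus a list of compare fields is taken from the comment above.

```python
def classify(old, new, on, compare=None):
    """Toy model: split rows of `new` into added and modified keys."""
    old_by_key = {row[on]: row for row in old}
    added, modified = [], []
    for row in new:
        prev = old_by_key.get(row[on])
        if prev is None:
            added.append(row[on])
            continue
        # Without an explicit compare list, every field except the key counts.
        fields = compare if compare is not None else [f for f in row if f != on]
        if any(row[f] != prev.get(f) for f in fields):
            modified.append(row[on])
    return added, modified


# Hypothetical listings: a.txt is unchanged except for an unrelated field.
old = [
    {"file.path": "a.txt", "file.version": "1", "file.etag": "e1", "file.last_modified": 100},
]
new = [
    {"file.path": "a.txt", "file.version": "1", "file.etag": "e1", "file.last_modified": 200},
    {"file.path": "b.txt", "file.version": "1", "file.etag": "e2", "file.last_modified": 200},
]

default_result = classify(old, new, on="file.path")
explicit_result = classify(
    old, new, on="file.path", compare=["file.version", "file.etag"]
)
```

In this sketch the default comparison reports `a.txt` as modified because one unrelated field differs, while the explicit compare list reports only `b.txt` as new.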

There is a case, though, with non-versioned sources where `file.version` and `file.etag` are randomly set on every re-index, which causes the same thing to happen regardless, since everything is detected as modified. In this case, and in any other case where the user doesn't want to or can't track changed rows, the workaround is to set `delta_compare` to the same value as `delta_on`, but we need a better way.
Options are:

1. Make the default `delta_compare=None` disable tracking of changed rows instead of comparing all fields. If we go down this path, then `DataChain.compare()` and `DataChain.diff()` need to be changed as well to be consistent.

2. Add an additional flag for this, e.g. `delta_ignore_changed`.

I'm leaning towards the first option, although the user then needs to explicitly list all fields in some cases (we lose the "shortcut" of the default being all fields). I don't have a strong opinion though.
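
A rough sketch of how the first option could behave, again as a toy model rather than DataChain code. The `delta_keys` and `reindex` helpers are hypothetical, and the random etag only simulates a source whose etag changes on every re-index.

```python
import uuid


def delta_keys(old, new, on, compare=None):
    """Option 1 sketch: compare=None means changed rows are not tracked."""
    old_by_key = {row[on]: row for row in old}
    added, modified = [], []
    for row in new:
        key = row[on]
        if key not in old_by_key:
            added.append(key)
        elif compare and any(row[f] != old_by_key[key][f] for f in compare):
            modified.append(key)
    return added, modified


def reindex(paths):
    # Simulates a listing that regenerates the etag on every run.
    return [{"file.path": p, "file.etag": uuid.uuid4().hex} for p in paths]


first = reindex(["a.txt", "b.txt"])
second = reindex(["a.txt", "b.txt", "c.txt"])

# Comparing on the unstable etag flags every existing file as modified.
tracked = delta_keys(first, second, on="file.path", compare=["file.etag"])
# The workaround: compare on the key itself, so nothing can differ.
workaround = delta_keys(first, second, on="file.path", compare=["file.path"])
# Option 1: no compare list, so only new rows are reported.
untracked = delta_keys(first, second, on="file.path")
```

Under these assumptions the workaround and the proposed `None` default give the same result, which is the argument for making that behaviour available without the detour.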

shcheklein · 2025-05-19

Thanks, I think we are fine in this particular example (?)

> There is the case though with non-versioned sources where file.version and file.etag are randomly set every time on re-index

Where did you experience this?

shcheklein · 2025-05-19

@ilongin please let me know ^^

ilongin · 2025-05-20

I thought I saw it in one of our gs buckets, but I checked now and it seems to be OK. `version` and `etag` should be set to an empty string if they don't exist.

Regarding your example, yes, you don't need to set anything, as none of the columns will change since you only append new files. If you re-created the files on every call to `generate_next_file`, it would be a problem: for local files we use mtime as the etag, which would mean delta would find all files modified every time.
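
The mtime point can be illustrated with the standard library alone. This is a sketch under assumptions: `mtime_etag` is a hypothetical stand-in for an mtime-derived etag (the exact format DataChain uses is not assumed), and `os.utime` pins the timestamps so the result does not depend on filesystem timer resolution.

```python
import os
import tempfile


def mtime_etag(path):
    # Hypothetical stand-in for an etag derived from the modification time.
    return str(os.stat(path).st_mtime_ns)


with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "file_0.txt")
    with open(first, "w") as f:
        f.write("payload")
    # Pin the modification time to a known value (10 s after the epoch).
    os.utime(first, ns=(10_000_000_000, 10_000_000_000))
    etag_before = mtime_etag(first)

    # Append-only update: a new file is added, the first one is untouched.
    second = os.path.join(tmp, "file_1.txt")
    with open(second, "w") as f:
        f.write("payload")
    etag_after_append = mtime_etag(first)

    # Re-creation: identical content is rewritten at a later time
    # (the later mtime is set explicitly to keep the example deterministic).
    with open(first, "w") as f:
        f.write("payload")
    os.utime(first, ns=(20_000_000_000, 20_000_000_000))
    etag_after_recreate = mtime_etag(first)
```

Appending a new file leaves the existing file's mtime-based etag unchanged, while rewriting it with identical content still produces a different etag, so a comparison on that field would treat the file as modified.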

shcheklein requested a review from ilongin May 16, 2025 04:49

shcheklein force-pushed the add-delta-example branch from adb5666 to 163c9b5 Compare May 17, 2025 16:35

shcheklein and others added 2 commits May 18, 2025 06:49

add incremental processing example

94b8bab

[pre-commit.ci] auto fixes from pre-commit.com hooks

e114989

for more information, see https://pre-commit.ci

shcheklein force-pushed the add-delta-example branch from d91c9e6 to e114989 Compare May 18, 2025 13:49

shcheklein requested a review from a team May 18, 2025 21:18

ilongin requested changes May 19, 2025

ilongin approved these changes May 20, 2025

shcheklein merged commit 8500235 into main May 20, 2025
35 checks passed

shcheklein deleted the add-delta-example branch May 20, 2025 16:10
