Delta Table Diff - MVP design #4902

Merged
merged 12 commits into from
Jan 8, 2023

Conversation


@Jonathan-Rosenberg Jonathan-Rosenberg commented Dec 29, 2022

Closes #4832

@Jonathan-Rosenberg Jonathan-Rosenberg added the exclude-changelog PR description should not be included in next release changelog label Dec 29, 2022
@Jonathan-Rosenberg Jonathan-Rosenberg requested review from guy-har, ortz, eden-ohana, itaiad200 and N-o-Z and removed request for ortz and guy-har December 29, 2022 12:51
they cannot send them through the GUI (which is the UI we chose to implement this feature for the MVP).
To overcome this scenario, we'll use special diff credentials as follows:
1. User makes a Delta Diff request.
2. The DiffService checks if the user has "diff credentials" in the DB:
Contributor:

This is essentially a session. Maybe we should implement a session mechanism as part of this feature?

Contributor:

I think that we should do this, but not as part of the MVP. It is not worth blocking this project on that requirement.

That said, we might want to spec out that requirement and get the ball rolling.

@johnnyaug (Contributor) left a comment:

Love the plugin idea!


### Packaging

1. We will package the binary with the lakeFS release.
Contributor:

Consider publishing it separately, to avoid overloading the tar-ball.

@talSofer (Contributor) left a comment:

Great writeup @Jonathan-Rosenberg, thank you!
LGTM, asked some clarification questions.


1. For the MVP, support only Delta table diff.
2. The system should be open for different data comparison implementations.
3. The diff itself will consist of metadata changes only, in the form of the operation histories of the two tables, and the change in the number of rows.
Contributor:

> and the change in the number of rows.

We've agreed that this one is optional

1. For the MVP, support only Delta table diff.
2. The system should be open for different data comparison implementations.
3. The diff itself will consist of metadata changes only, in the form of the operation histories of the two tables, and the change in the number of rows.
4. The diff will be a "three-dots" diff (like `git log branch1...branch2`). Basically showing the log changes that happened in one branch and not in the other.
Contributor:

Suggested change
4. The diff will be a "three-dots" diff (like `git log branch1...branch2`). Basically showing the log changes that happened in one branch and not in the other.
4. The diff will be a "three-dots" diff (like `git log branch1...branch2`). Basically showing the log changes that happened in the topic branch and not in the base branch.

Contributor:

We need to decide the requirement here. AFAICT what @talSofer suggests is what we commonly call "two-dots" diff. The section "Commit ranges" in the Git book is probably the best explanation.

I would like us to pick one without much consideration: I do not attach much importance to this decision and it is not worth wasting time! Mostly because AFAIU if we implement one we can easily implement the other:

  • 3-dots (@Jonathan-Rosenberg's) is "find the common base of A and B and then print everything from that common base to A and from the common base to B"
  • 2-dots (@talSofer 's) is "find the common base of A and B and then print everything from that common base to B".

So 2-dots is a bit easier because 3-dots needs to merge two streams. But that's a tiny fraction of the required code...

Contributor Author:

I actually meant two-dots... Fixing


## Non-Goals and off-scope

1. The Delta diff will be limited to the available Delta Log entries (the JSON files).
Contributor:

  1. I would link to https://docs.databricks.com/delta/history.html#configure-data-retention-for-time-travel, that explains the time travel limitations delta lake defines
  2. Delta tables created as shallow clones are unsupported

Contributor Author:

  1. Adding.
  2. I don't really think that it's a goal or a non-goal as it doesn't affect any implementation of the table diff...

#### Delta Diff Plugin

Implemented using [delta-rs](https://github.com/delta-io/delta-rs) (Rust), this plugin will perform the diff operation using table paths provided by the DiffService through a `gRPC` call.
To query the Delta Table from lakeFS, the plugin will generate an S3 client (this is a constraint imposed by the `delta-rs` package) and send a request to lakeFS's S3 gateway.
Contributor:

Can you please elaborate on the following:

> this is a constraint imposed by the delta-rs package

What constraint are you referring to?

  • Given that we only query the delta table history, what type of data will flow through the lakeFS server? is this only delta lake tables metadata? I'm asking with scale and privacy considerations in mind.

Contributor Author:

The delta-rs package doesn't allow you to configure your own client; rather, it supports a few preconfigured clients for S3, GCP, and Azure. Thus we cannot configure a client like lakeFSFS that communicates directly with the underlying store.

The data that will flow through the lakeFS server will be only delta log files (files under _delta_log), and specifically, only the Delta Log entries, i.e. the Delta log JSON version files.

Would you like me to document the above?
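To make the constraint concrete, here is a rough sketch (in Go, for illustration only) of the kind of storage options the plugin's S3 client would be handed so that it talks to the lakeFS S3 gateway rather than AWS. The option keys mirror common S3-client settings; the exact keys delta-rs accepts may differ, so treat the names as assumptions.

```go
package main

import "fmt"

// s3GatewayOptions sketches the storage options an S3 client inside the
// diff plugin would need to talk to the lakeFS S3 gateway instead of AWS.
// Key names are illustrative, not confirmed delta-rs configuration keys.
func s3GatewayOptions(endpoint, akia, sak string) map[string]string {
	return map[string]string{
		"AWS_ENDPOINT_URL":        endpoint, // the lakeFS S3 gateway, not AWS
		"AWS_ACCESS_KEY_ID":       akia,     // e.g. the generated DIFF- credentials
		"AWS_SECRET_ACCESS_KEY":   sak,
		"AWS_S3_ADDRESSING_STYLE": "path", // gateway-friendly addressing
	}
}

func main() {
	opts := s3GatewayOptions("http://localhost:8000", "DIFF-abc", "secret")
	fmt.Println(opts["AWS_ENDPOINT_URL"])
}
```

With options like these, every `_delta_log` JSON read the plugin performs flows through the gateway, which is what makes the traffic question above relevant.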

Contributor:

Thanks!
IMO it is worth documenting this answer:

> The data that will flow through the lakeFS server will be only delta log files (files under `_delta_log`), and specifically, only the Delta Log entries, i.e. the Delta log JSON version files.

Contributor:

Is this a future advantage to contributing our own client to Delta?

To query the Delta Table from lakeFS, the plugin will generate an S3 client (this is a constraint imposed by the `delta-rs` package) and send a request to lakeFS's S3 gateway.
The diff algorithm:
1. Run the Delta [HISTORY command](https://docs.delta.io/latest/delta-utility.html#history-schema) on both table paths.
2. Traverse through the [returned "commitInfo" entry vector ](https://github.com/delta-io/delta-rs/blob/main/rust/src/delta.rs#L888)
Contributor:

nit: it is unclear what I should expect to see in this link

Contributor Author:

Note to self: don't count on a line number under a branch-namespace (as opposed to a commit namespace)

Thanks!
fixing

2. Compare the hashes of the versions.
3. If they **aren't equal**, add the "left"'s entry to the returned history list, else break and **return the history vector**.
4. Traverse one version back in both vectors.
3. Return an empty history vector.
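The comparison loop in the steps above could be sketched as follows (a rough Go sketch for illustration; the actual plugin is planned in Rust, and `Entry`/`historyDiff` are hypothetical names, not real plugin types):

```go
package main

import "fmt"

// Entry is a hypothetical stand-in for one Delta history ("commitInfo") record.
type Entry struct {
	Version int
	Hash    string // hash used to compare versions across the two tables
}

// historyDiff walks the two history vectors (newest first), collecting the
// "left" entries until it reaches a version whose hash matches on both sides
// (shared history), then returns the collected list. If one side is
// exhausted first, whatever was collected so far is returned.
func historyDiff(left, right []Entry) []Entry {
	var out []Entry
	i, j := 0, 0
	for i < len(left) && j < len(right) {
		if left[i].Hash == right[j].Hash {
			// Identical version: everything older is shared history.
			return out
		}
		out = append(out, left[i])
		i++ // traverse one version back in both vectors
		j++
	}
	return out
}

func main() {
	left := []Entry{{3, "c"}, {2, "b"}, {1, "a"}}
	right := []Entry{{3, "x"}, {2, "b"}, {1, "a"}}
	fmt.Println(len(historyDiff(left, right))) // only version 3 differs
}
```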
Contributor:

Didn't you mean to return the history list?

Contributor Author:

Right.

1. User makes a Delta Diff request.
2. The DiffService checks if the user has "diff credentials" in the DB:
1. If there are such credentials, it will use them.
2. If there aren't such, it will generate the credentials and save them: `{AKIA: DIFF-<>, SAK: <>}`. The `DIFF` prefix will be used to identify "diff credentials".
Contributor:

Do you mean to generate temporary credentials? how?

Contributor Author:

No, these are applicative credentials, just as we have now.
The difference is the `DIFF` prefix.

- The total number of versions returned - this is the number of files that were read from lakeFS. This is basically the size of a Delta Log. We can use it later on to optimize reading.

### Business Statistics
- The number of unique requests - get the number of Delta Lake users.
Contributor:

Did you mean "get the number of Delta Lake diff users."?
What defines a unique request?

Contributor Author:

Both.
I would like to understand how many of our users use Delta Lake. I guess that not all of our Delta users will use the Delta Diff, but it might give us a better idea than the tools we currently have.
A unique request is defined by an installation id.

Contributor:

Thanks, then consider using:

Suggested change
- The number of unique requests - get the number of Delta Lake users.
- The number of unique requests by installation id - get the lower bound of the number of Delta Lake users.

@arielshaqed (Contributor) left a comment:

Extremely neat stuff, thanks!


2. The system should be open for different data comparison implementations.
3. The diff itself will consist of metadata changes only, in the form of the operation histories of the two tables, and the change in the number of rows.
4. The diff will be a "three-dots" diff (like `git log branch1...branch2`). Basically showing the log changes that happened in one branch and not in the other.
5. UI: GUI only.
Contributor:

For development purposes I propose that we still plan an internal milestone for a hidden / experimental / ... lakectl feature. Having such a tool means that we can work on the backend regardless of the GUI frontend. So it increases velocity by reducing dependencies.

Contributor Author:

sounds good

3. The diff itself will consist of metadata changes only, in the form of the operation histories of the two tables, and the change in the number of rows.
4. The diff will be a "three-dots" diff (like `git log branch1...branch2`). Basically showing the log changes that happened in one branch and not in the other.
5. UI: GUI only.
6. Reduce user friction as much as possible.
Contributor:

Not sure that this is an effective goal: when do we see it making a difference to our implementation efforts? E.g., "add a combo to select a Delta table" will clearly reduce user friction, but is out of scope here.

Contributor Author:

I guess that it's in the context of the other goals and non-goals specified.


## Non-Goals and off-scope

1. The Delta diff will be limited to the available Delta Log entries (the JSON files).
Contributor:

Can we support log compaction? (If we can, that's awesome and we should include it as a goal!)

Contributor Author:

No, the support of log compaction is actually out of scope.

##### Diff as an external binary executable

We can trigger a diff binary from lakeFS using the `os.exec` package and get the output generated by that binary.
This is almost where we want to be, except that we'll need to somehow validate that the lakeFS server and the executable are using the same "API version" to communicate, so that the output of the binary would match the expected one from lakeFS. In addition, I would like to decrease the amount of deserialization needed to be implemented to interpret the returned DTO (maintainability and extensibility-wise).
Contributor:

Not sure why e.g. JSON or protobuf are a development overhead: We already use both inside lakeFS.


1. User makes a Delta Diff request.
2. The DiffService checks if the user has "diff credentials" in the DB:
1. If there are such credentials, it will use them.
2. If there aren't such, it will generate the credentials and save them: `{AKIA: DIFF-<>, SAK: <>}`. The `DIFF` prefix will be used to identify "diff credentials".
Contributor:

  1. We should get someone from the auth team to go over this!
  2. I would like us to declare this solution is explicitly temporary. There are at least 2 auth requirements in here:
    • Support a note on credentials
    • Allow temporary credentials.
    • (Possibly...) allow attaching a policy with reduced permissions to such credentials.

Contributor Author:

  1. This solution involved @guy-har. Adding as a reviewer...
  2. Absolutely.

Contributor:

Generating credentials when they are missing is a good short-term solution. It can also be an issue: if two services try to generate at the same time (currently we don't lock this operation), we will end up with two sets of credentials.

Contributor Author:

Ok,
I'll mark the temporary solutions (for the MVP) all around the document.

Comment on lines 137 to 146

```json
[
  {
    "version": 1,
    "timestamp": 1515491537026,
    "operation": "INSERT",
    "operationContent": {
      "operationParameters": {
        "mode": "Append",
        "partitionBy": "[]"
      }
```
Contributor:

Please document this schema. If it's just the Delta schema, can you paste a link to it?


### Applicative Metrics
- Diff runtime
- The total number of versions returned - this is the number of files that were read from lakeFS. This is basically the size of a Delta Log. We can use it later on to optimize reading.
Contributor:

Doesn't compaction change this?

Contributor Author:

compaction == deleted history -> not part of the history command

@itaiad200 (Contributor) left a comment:

Thanks, looks promising and well thought through! Reviewing from my phone, so apologies if I'm reiterating answered questions.


The DiffService will be an internal component in lakeFS which will serve as the Core system.
In order to realize which diff plugins are available, we shall use new configuration values:
1. `LAKEFS_PLUGINS_LOCATION`
Contributor:

It's better if we use the same prefix for all configuration values related to plugins.

![Delta Diff flow](diagrams/delta-diff-flow.png)
[(excalidraw file)](diagrams/delta-diff-flow.excalidraw)

#### DiffService
Contributor:

How does one configure the plugins, i.e. pass configuration values?

Contributor Author:

Through the request to the plugin (think of Terraform providers as an example) and this


Implemented using [delta-rs](https://github.com/delta-io/delta-rs) (Rust), this plugin will perform the diff operation using table paths provided by the DiffService through a `gRPC` call.
To query the Delta Table from lakeFS, the plugin will generate an S3 client (this is a constraint imposed by the `delta-rs` package) and send a request to lakeFS's S3 gateway.
The diff algorithm:
Contributor:

What is the expected amount of calls to the gateway? How much data is to be retrieved from the object store? Two things to consider:

  1. IIUC we have a request waiting in the background, we should be fast..
  2. How we scale.. If users perform several concurrent requests, will we choke?

Contributor Author:

As we don't really know the number of Delta Log files used on average (or median, or max), the amount of data is currently unknown. There will be two calls to the S3 gateway - one history command for each table.
One of the metrics specified below is getting the number of files that Delta users use.
Regarding point 1: for the MVP, this process can be slow (the whole thing is tagged as experimental), but it will help us evaluate the concurrency and resources needed for the diff.
Regarding point 2: scaling the requests from the DiffService to the diff plugin is pretty easy, as it uses an HTTP/2 connection to communicate (we can run multiple streams on a single connection). If a user makes multiple requests for a diff, we can implement a backoff in the plugin itself.

Contributor:

Not sure I understand: is "history" a gateway command or a Delta one? I'm not familiar with that gateway endpoint, so I'm guessing it's a Delta command that executes S3 requests to the gateway. I suggest we have a rough estimate for the number and type of calls now. I'm sure we'll be able to handle it, but it's better to understand the costs sooner rather than later.
By scale I meant the number of calls to the gateway. If a single diff performs 100 calls that fetch data, then the performance of 10 diffs is something to consider. I agree that the service-plugin communication scale is not a concern at this point.

Contributor Author:

Gotcha.
So the Delta history command is quite an expensive command - it will send a request to fetch each Delta log (JSON) file one by one from the gateway. That means that if there are N log entries, then N requests will be sent to the gateway.
This is surely not the way we would want our diff to eventually work, but at this point (for the case of an MVP, and to verify that a diff is even needed) I don't think we should work towards making it efficient.

Contributor:

+1 👍 to measure the number of calls a "simple" history performs.

Contributor Author:

The second applicative metric will basically do that.


### API

- GET `/repositories/repo/{repo}/otf/refs/{left_ref}/diff/{right_ref}?type={diff_type}`
Contributor:

What about pagination/limitations?

Contributor:

  • Thought we need to provide a path to the table - there are two refs here
  • The location of `otf` is after the repository - what does it represent at this level? Is it just a different diff operation?

Contributor Author:

@itaiad200 I'll add a description to your points.

@nopcoder you're right, it should be `otf/table/{left_table_path}/diff/{right_table_path}?type={diff_type}`
The `otf` is to specify that this API is strictly for OTFs

@nopcoder (Contributor) left a comment:

Looks good! Added some comments and concerns.

## User Story

Jenny the data engineer runs a new retention-and-column-remover ETL over multiple Delta Lake tables. To test the ETL before running on production, she uses lakeFS (obviously) and branches out to a dedicated branch `exp1`, and runs the ETL pointed at `exp1`.
The output is not what Jenny planned... The Delta Table is missing multiple columns and the number of rows just got bigger!
Contributor:

Thought we are not addressing data (number of rows just got bigger) - just metadata.

@Jonathan-Rosenberg (Contributor Author) commented Jan 4, 2023:

yes, this is optional (we know we are getting it back from the Delta History command as part of the metadata), and also this is what Jenny wants...


---

## Implementation
Contributor:

  • Think we should have just one numbering from the start of the flow to the end. Note that number one will be the user's request.
  • The request and the diff plugin system should consider different types. For example, if we'd like to implement a CSV diff using the plugin system, the user request and the plugin should identify which formats the plugin supports.

Contributor Author:

  1. Ok
  2. Not sure I understand. What I meant in the diagram is that the request will include the type of diff that the user wishes to see. If it's a diff of CSV data, the user will send type=csv in the request, and the DiffService will realize that the CSV diff plugin is needed to create the diff.

Contributor:

About item 2 - I mean that a single plugin can serve multiple formats, so the mapping of the plugin system may look different: we load a single set of plugins and each plugin publishes the formats it supports, or at the configuration level we map each plugin to one or more formats.

Contributor Author:

I think that this step should be decided by the plugin "port", e.g. "DiffService".
If plugin A can serve multiple formats, then the configurations might look like:

```yaml
diff:
  delta:
    plugin: A
  csv:
    plugin: A
  iceberg:
    plugin: A
merge:
  delta:
    plugin: A
plugins:
  A: ~/.lakefs/plugins/A
```

At this point, every core (port) component can choose according to the specified format of the plugin it needs. And a plugin can be used multiple times for different formats.
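The format-to-plugin lookup this configuration implies could be sketched like so (a hedged Go sketch; `Config` and `resolve` are illustrative names, not the actual lakeFS configuration code):

```go
package main

import "fmt"

// Config sketches the two-level mapping discussed above: a use case maps a
// format (e.g. "delta") to a plugin name, and a separate section maps plugin
// names to binary locations. Field names are illustrative.
type Config struct {
	Diff    map[string]string // format -> plugin name
	Plugins map[string]string // plugin name -> path to the executable
}

// resolve returns the executable path for a diff of the given format,
// so one plugin binary can serve several formats.
func (c Config) resolve(format string) (string, error) {
	name, ok := c.Diff[format]
	if !ok {
		return "", fmt.Errorf("no diff plugin configured for %q", format)
	}
	path, ok := c.Plugins[name]
	if !ok {
		return "", fmt.Errorf("plugin %q has no configured location", name)
	}
	return path, nil
}

func main() {
	cfg := Config{
		Diff:    map[string]string{"delta": "A", "csv": "A"},
		Plugins: map[string]string{"A": "/home/user/.lakefs/plugins/A"},
	}
	p, _ := cfg.resolve("delta")
	fmt.Println(p)
}
```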

Contributor:

A bigger point here is that we will need some way for lakeFS to detect what type it should offer and/or request. But this is in no way blocking for the MVP, and I am perfectly happy to skip it for now! (For instance, we might discover that users want a diff but don't want to initiate it from lakeFS.)

![Delta Diff flow](diagrams/delta-diff-flow.png)
[(excalidraw file)](diagrams/delta-diff-flow.excalidraw)

#### DiffService
Contributor:

DeltaDiffPlugin? or lakeFSDeltaDiffPlugin?
Think that DiffService is misleading, as we are planning to use a plugin system to extend lakeFS diff.

Contributor Author:

The DiffService is not the plugin itself, it's the component within lakeFS that interacts with the plugins.
What about `PluggableDiffService`?

Contributor:

Ok, got it - just didn't want to couple the plugin system to the DiffService. I would like any service in lakeFS to be able to access the plugins. This way we can implement merge and future capabilities as needed without adding another plugin system.

#### DiffService

The DiffService will be an internal component in lakeFS which will serve as the Core system.
In order to realize which diff plugins are available, we shall use new configuration values:
Contributor:

  • There is no manifest / configuration to map which plugins to load? From reading this section we have the location and the specific Delta diff information. Suggest you define the plugin configuration without sharing the location. Example:

    ```yaml
    plugins:
      diff:
        delta: <location to the delta diff plugin - full path to the binary needed to execute>
    ```

  • In case we'd like to map multiple plugins to the same format, we should consider separating the plugins configuration and the use cases. Like:

    ```yaml
    diff:
      otf:
        delta:
          plugin: diff_delta_v2
    plugins:
      diff_delta_v1: <location to the delta diff plugin - full path to the binary needed to execute>
      diff_delta_v2: <location to the delta diff plugin - full path to the binary needed to execute>
    ```

  • Another consideration - if we plan to use the plugin later also to support advanced merge, we can just drop the `diff` from all the above examples.
  • In case we would like to leverage more plugins for different use-cases (e.g. merge), will it be possible to share the same location for different plugin interfaces, or do we need to split / catalog them now?

Contributor Author:

diff:
  otf:
    delta:
      plugin: diff_delta_v2
plugins:
  diff_delta_v1: <location to the delta diff plugin - full path to the binary needed to execute>    
  diff_delta_v2: <location to the delta diff plugin - full path to the binary needed to execute>    

Really liked this one, although I would still want to have default values, so that if the user didn't specify a plugin location in the manifest, lakeFS will try to figure it out using the latest known version of the plugin and locate it in some default path. For example, if the user didn't specify the location for `diff.delta.plugin`, lakeFS will try to locate the binary under `~/.lakefs/plugins`.

  • Another consideration - if we plan to use the plugin later to also support advanced merge - we can just drop the diff from all the above examples.
  • In case we would like to leverage more plugins for different use-cases (ex: merge), will it be possible to share the same location for different plugin interfaces, or do we need to split / catalog them now?

I think that it's a very possible scenario, yet I think that it might be preferable that each plugin would know how to do one thing. Anyway, we can decide upon it when we get there...
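The default-location fallback discussed above could look roughly like this (the function name and the exact default path `~/.lakefs/plugins` are assumptions for illustration, not the final implementation):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// resolvePluginPath returns the configured plugin location if one was set,
// and otherwise falls back to a default per-user plugins directory.
func resolvePluginPath(configured, pluginName string) string {
	if configured != "" {
		return configured
	}
	home, err := os.UserHomeDir()
	if err != nil {
		home = "." // fall back to the working directory if HOME is unknown
	}
	return filepath.Join(home, ".lakefs", "plugins", pluginName)
}

func main() {
	// Explicit configuration wins over the default location.
	fmt.Println(resolvePluginPath("/opt/lakefs/plugins/diff_delta_v2", "diff_delta_v2"))
	// No configuration: fall back to the default plugins directory.
	fmt.Println(resolvePluginPath("", "diff_delta_v2"))
}
```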


#### Delta Diff Plugin

Implemented using [delta-rs](https://github.com/delta-io/delta-rs) (Rust), this plugin will perform the diff operation using table paths provided by the DiffService through a `gRPC` call.
Contributor

  • We should specify the constraint of fetching large objects from delta-rs. A configurable (optional) location to store the temporary data downloaded by the plugin will probably be required in case we don't want to use the default temporary folder, or to control/restrict the size used by the plugin.
  • Cleanup of this storage if it is not managed by the OS temp dir/files API.
  • Calling the plugin with a 'deadline' (timeout for the gRPC call), just to prevent a lock during the call to the plugin.

1. User makes a Delta Diff request.
2. The DiffService checks if the user has "diff credentials" in the DB:
   1. If there are such credentials, it will use them.
   2. If there aren't, it will generate the credentials and save them: `{AKIA: DIFF-<>, SAK: <>}`. The `DIFF` prefix will be used to identify "diff credentials".
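A minimal sketch of this lookup-or-create flow (the names and the in-memory store are assumptions for illustration; the real DiffService would back this with the DB):

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// DiffCredentials holds the generated "diff credentials" pair.
type DiffCredentials struct {
	AccessKeyID     string
	SecretAccessKey string
}

// credentialStore stands in for the DB, keyed by user ID.
type credentialStore map[string]DiffCredentials

func randomHex(n int) string {
	b := make([]byte, n)
	_, _ = rand.Read(b)
	return hex.EncodeToString(b)
}

// getOrCreateDiffCredentials returns existing diff credentials for the user,
// or generates a new pair whose access key carries the DIFF- prefix so it can
// be identified later. Note: without locking, two concurrent callers could
// each generate a pair.
func (s credentialStore) getOrCreateDiffCredentials(userID string) DiffCredentials {
	if creds, ok := s[userID]; ok {
		return creds
	}
	creds := DiffCredentials{
		AccessKeyID:     "DIFF-" + randomHex(8),
		SecretAccessKey: randomHex(20),
	}
	s[userID] = creds
	return creds
}

func main() {
	store := credentialStore{}
	first := store.getOrCreateDiffCredentials("user-1")
	second := store.getOrCreateDiffCredentials("user-1")
	fmt.Println(first.AccessKeyID == second.AccessKeyID) // prints true: credentials are reused
}
```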
Contributor

Generating credentials when they're missing is a good short-term solution. It can also become an issue when two services try to generate at the same time: since we currently don't lock, this means we could end up with two sets of credentials.


### API

- GET `/repositories/repo/{repo}/otf/refs/{left_ref}/diff/{right_ref}?type={diff_type}`
Contributor

  • I thought we need to provide a path to the table - there are two refs here.
  • The location of `otf` is after the repository - what does it represent at this level? Is it just a different diff operation?

Comment on lines +149 to +151
- operationContent:
- type: map
- description: an operation content specific to the table format implemented.
Contributor

Just to verify - what types will the map hold? Do we need to serialize any JSON object?

Contributor Author

The map for Delta will hold data similar to the `commitInfo` structure's `operationParameters` section.
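For illustration only, a diff-log entry for a Delta `WRITE` commit might carry an `operationContent` shaped like Delta's `commitInfo.operationParameters` (the exact keys vary by operation, and this shape is an assumption, not the final API):

```json
{
  "operation": "WRITE",
  "operationContent": {
    "mode": "Overwrite",
    "partitionBy": "[]"
  }
}
```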


- GET `/repositories/repo/{repo}/otf/refs/{left_ref}/diff/{right_ref}?type={diff_type}`
- Tagged as experimental
- **Response**:
Contributor

The response format renders a log of operations - maybe we can scope it at this level, so that if a later change returns a response with a diff, we can just return an empty list/log here.


---

### Packaging
Contributor

Build and packaging - we need to address the build part too.

Contributor

@arielshaqed left a comment

T-h-a-n-k-s-!


The DiffService will be an internal component in lakeFS which will serve as the Core system.
To determine which diff plugins are available, we shall use new configuration values:
1. `LAKEFS_PLUGINS_LOCATION`
Contributor

True. So everything below should also be LAKEFS_PLUGINS_LOCATIONS. That means that everything lives inside the same configuration object.

Also, as specified, this might not really be a plugin: I don't know whether Viper will let us configure an "any object" and be able to pass it usefully to the plugin during load.

If this turns out to be impossible, I would suggest configuring a separate "plugins file"; that one can be a general YAML/JSON object with very few parsed fields, and then each plugin gets its configuration subobject from that file (and presumably parses it using mapstructure in Go, or libwhatever in other languages).

2. `LAKEFS_PLUGIN_DIFF_{TYPE OF DIFF}`
The DiffService will use the name of the binary provided by this env var to perform the diff. For instance, `LAKEFS_PLUGIN_DIFF_DELTA=deltaDiffBinary`.
The DiffService will use the [plugins' location](https://github.com/hashicorp/terraform/blob/main/plugins.go) and the diff binary to load the plugin and request a diff.
The **type** of diff will be sent as part of the request to lakeFS as specified [here](#API).
Contributor

Yes!

#### Delta Diff Plugin

Implemented using [delta-rs](https://github.com/delta-io/delta-rs) (Rust), this plugin will perform the diff operation using table paths provided by the DiffService through a `gRPC` call.
To query the Delta Table from lakeFS, the plugin will generate an S3 client (this is a constraint imposed by the `delta-rs` package) and send a request to lakeFS's S3 gateway.
Contributor

Is this a future advantage of contributing our own client to Delta?


Implemented using [delta-rs](https://github.com/delta-io/delta-rs) (Rust), this plugin will perform the diff operation using table paths provided by the DiffService through a `gRPC` call.
To query the Delta Table from lakeFS, the plugin will generate an S3 client (this is a constraint imposed by the `delta-rs` package) and send a request to lakeFS's S3 gateway.
The diff algorithm:
Contributor

+1 👍 to measure the number of calls a "simple" history performs.

Member

@N-o-Z left a comment

Love the design - looks well thought through.
I don't have much to add to the already existing comments.
One thing though: maybe you can reduce the image sizes? For example, the delta-diff-flow image can't be read from the document, and when opening it in a different window it pops up very big, making it hard to see the entire flow.

Contributor

@nopcoder left a comment

Left a couple of comments. The concern I raised relates to the plugin system we are adding - I think it should serve more than just the diff, if possible. I assume the next service will be called MergeService, and it may use the same or a different set of plugins; I think this is the place to use a single system that can serve different functionality.

@Jonathan-Rosenberg
Contributor Author

@nopcoder your concern is a very valid one.
I think that this design still answers it.
Plugins can be developed and used in a variety of ways with different services (Merge, Diff, Rebase...) by setting the correct configuration file fields:

diff: # service/functionality that uses the plugins
  delta:
    plugin: A
  csv:
    plugin: A
  iceberg:
    plugin: A
merge: # service/functionality that uses the plugins
  delta:
    plugin: A
rebase: # service/functionality that uses the plugins
  parquet:
    plugin: B
plugins:
  A: ~/.lakefs/plugins/A
  B: ~/.lakefs/plugins/B
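Resolving such a two-level configuration (service/format pair names a plugin, and the plugins section maps that name to a binary location) could be sketched like this; the type and method names are assumptions for illustration:

```go
package main

import "fmt"

// PluginConfig mirrors the proposed two-level layout:
// a service maps formats to plugin names, and plugins maps names to locations.
type PluginConfig struct {
	Services map[string]map[string]string // service -> format -> plugin name
	Plugins  map[string]string            // plugin name -> binary location
}

// pluginLocation resolves the binary to launch for a given service and format.
func (c PluginConfig) pluginLocation(service, format string) (string, bool) {
	formats, ok := c.Services[service]
	if !ok {
		return "", false
	}
	name, ok := formats[format]
	if !ok {
		return "", false
	}
	loc, ok := c.Plugins[name]
	return loc, ok
}

func main() {
	cfg := PluginConfig{
		Services: map[string]map[string]string{
			"diff":  {"delta": "A", "csv": "A"},
			"merge": {"delta": "A"},
		},
		Plugins: map[string]string{"A": "~/.lakefs/plugins/A"},
	}
	loc, _ := cfg.pluginLocation("diff", "delta")
	fmt.Println(loc) // prints ~/.lakefs/plugins/A
}
```

Note how two services (diff and merge) can share plugin A without duplicating its location.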

Contributor

@arielshaqed left a comment

Thanks!

The Microkernel/Plugin architecture is composed of two entities: a single "core system" and multiple "plugins".
In our case, the lakeFS server acts as the core system, and the different diff implementations, including the Delta diff implementation, will act as plugins.
We'll use `gRPC` as the transport protocol, which makes the language of choice almost immaterial (due to the use of protobufs as the transferred data format)
as long as it's self-contained (theoretically it could also depend on a system runtime, but then the cost would be an added requirement for running lakeFS: runtime support).
Contributor

Long-term we will need and have a generic external service interface: think of a user who wants to add or modify a diff/merge service. I believe they will prefer a looser coupling.
Short- and medium-term I agree with you that using go-plugin makes sense if we believe that it shortens initial development time.

Counterpoint: If we implement the plugin service as a separate REST microservice, can we drive it directly from the GUI, and then we don't need to add anything to lakeFS (server)? Might skip having to agree with team VE about configuration, adding go-plugin, packaging, etc.


---

## Implementation
Contributor

A bigger point here is that we will need some way for lakeFS to detect what type it should offer and/or request. But this is in no way blocking for the MVP, and I am perfectly happy to skip it for now! (For instance, we might discover that users want a diff but don't want to initiate it from lakeFS.)

Contributor

@nopcoder left a comment

Thanks

@Jonathan-Rosenberg Jonathan-Rosenberg merged commit 2a338f0 into master Jan 8, 2023
@Jonathan-Rosenberg Jonathan-Rosenberg deleted the design/delta-diff branch January 8, 2023 13:51
Successfully merging this pull request may close these issues.

Delta Lake: Diff service design
7 participants