Delta Table Diff - MVP design #4902

Merged
merged 12 commits into from
Jan 8, 2023

Conversation


@Jonathan-Rosenberg Jonathan-Rosenberg commented Dec 29, 2022

Closes #4832

@Jonathan-Rosenberg Jonathan-Rosenberg added the exclude-changelog PR description should not be included in next release changelog label Dec 29, 2022
@Jonathan-Rosenberg Jonathan-Rosenberg requested review from guy-har, ortz, eden-ohana, itaiad200 and N-o-Z and removed request for ortz and guy-har December 29, 2022 12:51
they cannot send them through the GUI (which is the UI we chose to implement this feature for the MVP).
To overcome this scenario, we'll use special diff credentials as follows:
1. User makes a Delta Diff request.
2. The DiffService checks if the user has "diff credentials" in the DB:
Contributor:

This is essentially a session. Maybe we should implement a session mechanism as part of this feature?

Contributor:

I think that we should do this, but not as part of the MVP. It is not worth blocking this project on that requirement.

That said, we might want to spec out that requirement and get the ball rolling.

@johnnyaug (Contributor) left a comment:

Love the plugin idea!


### Packaging

1. We will package the binary with the lakeFS release.
Contributor:

Consider publishing it separately, to avoid overloading the tar-ball.

@talSofer (Contributor) left a comment:

Great writeup @Jonathan-Rosenberg, thank you!
LGTM, asked some clarification questions.


1. For the MVP, support only Delta table diff.
2. The system should be open for different data comparison implementations.
3. The diff itself will consist of metadata changes only, in the form of the operation histories of the two tables, and the change in the number of rows.
Contributor:

> and the change in the number of rows.

We've agreed that this one is optional

1. For the MVP, support only Delta table diff.
2. The system should be open for different data comparison implementations.
3. The diff itself will consist of metadata changes only, in the form of the operation histories of the two tables, and the change in the number of rows.
4. The diff will be a "three-dots" diff (like `git log branch1...branch2`). Basically showing the log changes that happened in one branch and not in the other.
Contributor:

Suggested change
4. The diff will be a "three-dots" diff (like `git log branch1...branch2`). Basically showing the log changes that happened in one branch and not in the other.
4. The diff will be a "three-dots" diff (like `git log branch1...branch2`). Basically showing the log changes that happened in the topic branch and not in the base branch.

Contributor:

We need to decide the requirement here. AFAICT what @talSofer suggests is what we commonly call "two-dots" diff. The section "Commit ranges" in the Git book is probably the best explanation.

I would like us to pick one without much consideration: I do not attach much importance to this decision and it is not worth wasting time! Mostly because AFAIU if we implement one we can easily implement the other:

  • 3-dots (@Jonathan-Rosenberg's) is "find the common base of A and B and then print everything from that common base to A and from the common base to B"
  • 2-dots (@talSofer 's) is "find the common base of A and B and then print everything from that common base to B".

So 2-dots is a bit easier because 3-dots needs to merge two streams. But that's a tiny fraction of the required code...

Contributor Author:

I actually meant two-dots... Fixing


## Non-Goals and off-scope

1. The Delta diff will be limited to the available Delta Log entries (the JSON files).
Contributor:

  1. I would link to https://docs.databricks.com/delta/history.html#configure-data-retention-for-time-travel, that explains the time travel limitations delta lake defines
  2. Delta tables created as shallow clones are unsupported

Contributor Author:

  1. Adding.
  2. I don't really think that it's a goal or a non-goal as it doesn't affect any implementation of the table diff...

#### Delta Diff Plugin

Implemented using [delta-rs](https://github.com/delta-io/delta-rs) (Rust), this plugin will perform the diff operation using table paths provided by the DiffService through a `gRPC` call.
To query the Delta Table from lakeFS, the plugin will generate an S3 client (this is a constraint imposed by the `delta-rs` package) and send a request to lakeFS's S3 gateway.
Contributor:

Can you please elaborate on the following:

> this is a constraint imposed by the delta-rs package

What constraint are you referring to?

  • Given that we only query the delta table history, what type of data will flow through the lakeFS server? is this only delta lake tables metadata? I'm asking with scale and privacy considerations in mind.

Contributor Author:

The delta-rs package doesn't allow you to configure your own client; rather, it supports a few preconfigured clients for S3, GCP, and Azure. Thus we cannot configure a client like lakeFSFS that communicates directly with the underlying store.

The data that will flow through the lakeFS server will be only delta log files (files under _delta_log), and specifically, only the Delta Log entries, i.e. the Delta log JSON version files.

Would you like me to document the above?
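To make the constraint concrete, here is a rough sketch (in Go, for illustration only) of the kind of storage options the plugin's S3 client would be handed so that it talks to the lakeFS S3 gateway rather than AWS. The option keys mirror common S3-client settings; the exact keys delta-rs accepts may differ, so treat the names as assumptions.

```go
package main

import "fmt"

// s3GatewayOptions sketches the storage options an S3 client inside the
// diff plugin would need to talk to the lakeFS S3 gateway instead of AWS.
// Key names are illustrative, not confirmed delta-rs configuration keys.
func s3GatewayOptions(endpoint, akia, sak string) map[string]string {
	return map[string]string{
		"AWS_ENDPOINT_URL":        endpoint, // the lakeFS S3 gateway, not AWS
		"AWS_ACCESS_KEY_ID":       akia,     // e.g. the generated DIFF- credentials
		"AWS_SECRET_ACCESS_KEY":   sak,
		"AWS_S3_ADDRESSING_STYLE": "path", // gateway-friendly addressing
	}
}

func main() {
	opts := s3GatewayOptions("http://localhost:8000", "DIFF-abc", "secret")
	fmt.Println(opts["AWS_ENDPOINT_URL"])
}
```

With options like these, every `_delta_log` JSON read the plugin performs flows through the gateway, which is what makes the traffic question above relevant.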

Contributor:

Thanks!
IMO it is worth documenting this answer:

> The data that will flow through the lakeFS server will be only delta log files (files under `_delta_log`), and specifically, only the Delta Log entries, i.e. the Delta log JSON version files.

Contributor:

Is this a future advantage to contributing our own client to Delta?

To query the Delta Table from lakeFS, the plugin will generate an S3 client (this is a constraint imposed by the `delta-rs` package) and send a request to lakeFS's S3 gateway.
The diff algorithm:
1. Run the Delta [HISTORY command](https://docs.delta.io/latest/delta-utility.html#history-schema) on both table paths.
2. Traverse through the [returned "commitInfo" entry vector ](https://github.com/delta-io/delta-rs/blob/main/rust/src/delta.rs#L888)
Contributor:

nit: it is unclear what I should expect to see in this link

Contributor Author:

Note to self: don't count on a line number under a branch-namespace (as opposed to a commit namespace)

Thanks!
fixing

2. Compare the hashes of the versions.
3. If they **aren't equal**, add the "left"'s entry to the returned history list, else break and **return the history vector**.
4. Traverse one version back in both vectors.
3. Return an empty history vector.
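The comparison loop in the steps above could be sketched as follows (a rough Go sketch for illustration; the actual plugin is planned in Rust, and `Entry`/`historyDiff` are hypothetical names, not real plugin types):

```go
package main

import "fmt"

// Entry is a hypothetical stand-in for one Delta history ("commitInfo") record.
type Entry struct {
	Version int
	Hash    string // hash used to compare versions across the two tables
}

// historyDiff walks the two history vectors (newest first), collecting the
// "left" entries until it reaches a version whose hash matches on both sides
// (shared history), then returns the collected list. If one side is
// exhausted first, whatever was collected so far is returned.
func historyDiff(left, right []Entry) []Entry {
	var out []Entry
	i, j := 0, 0
	for i < len(left) && j < len(right) {
		if left[i].Hash == right[j].Hash {
			// Identical version: everything older is shared history.
			return out
		}
		out = append(out, left[i])
		i++ // traverse one version back in both vectors
		j++
	}
	return out
}

func main() {
	left := []Entry{{3, "c"}, {2, "b"}, {1, "a"}}
	right := []Entry{{3, "x"}, {2, "b"}, {1, "a"}}
	fmt.Println(len(historyDiff(left, right))) // only version 3 differs
}
```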
Contributor:

Didn't you mean to return the history list?

Contributor Author:

Right.

1. User makes a Delta Diff request.
2. The DiffService checks if the user has "diff credentials" in the DB:
1. If there are such credentials, it will use them.
2. If there aren't such, it will generate the credentials and save them: `{AKIA: DIFF-<>, SAK: <>}`. The `DIFF` prefix will be used to identify "diff credentials".
Contributor:

Do you mean to generate temporary credentials? how?

Contributor Author:

No, these are applicative credentials, just as we have now.
The difference is the `DIFF` prefix.

- The total number of versions returned - this is the number of files that were read from lakeFS. This is basically the size of a Delta Log. We can use it later on to optimize reading.

### Business Statistics
- The number of unique requests - get the number of Delta Lake users.
Contributor:

Did you mean "get the number of Delta Lake diff users."?
What defines a unique request?

Contributor Author:

Both.
I would like to understand how many of our users use Delta Lake. I guess that not all of our Delta users will use the Delta Diff, but it might give us a better idea than the tools we currently have.
A unique request is defined by an installation id.

Contributor:

Thanks, then consider using:

Suggested change
- The number of unique requests - get the number of Delta Lake users.
- The number of unique requests by installation id - get the lower bound of the number of Delta Lake users.

@arielshaqed (Contributor) left a comment:

Extremely neat stuff, thanks!


2. The system should be open for different data comparison implementations.
3. The diff itself will consist of metadata changes only, in the form of the operation histories of the two tables, and the change in the number of rows.
4. The diff will be a "three-dots" diff (like `git log branch1...branch2`). Basically showing the log changes that happened in one branch and not in the other.
5. UI: GUI only.
Contributor:

For development purposes I propose that we still plan an internal milestone for a hidden / experimental / ... lakectl feature. Having such a tool means that we can work on the backend regardless of the GUI frontend. So it increases velocity by reducing dependencies.

Contributor Author:

sounds good

3. The diff itself will consist of metadata changes only, in the form of the operation histories of the two tables, and the change in the number of rows.
4. The diff will be a "three-dots" diff (like `git log branch1...branch2`). Basically showing the log changes that happened in one branch and not in the other.
5. UI: GUI only.
6. Reduce user friction as much as possible.
Contributor:

Not sure that this is an effective goal: when do we see it making a difference to our implementation efforts? E.g., "add a combo to select a Delta table" will clearly reduce user friction, but is out of scope here.

Contributor Author:

I guess that it's in the context of the other goals and non-goals specified.


## Non-Goals and off-scope

1. The Delta diff will be limited to the available Delta Log entries (the JSON files).
Contributor:

Can we support log compaction? (If we can, that's awesome and we should include it as a goal!)

Contributor Author:

No, the support of log compaction is actually out of scope.

##### Diff as an external binary executable

We can trigger a diff binary from lakeFS using the `os.exec` package and get the output generated by that binary.
This is almost where we want to be, except that we'll need to somehow validate that the lakeFS server and the executable are using the same "API version" to communicate, so that the output of the binary would match the expected one from lakeFS. In addition, I would like to decrease the amount of deserialization needed to be implemented to interpret the returned DTO (maintainability and extensibility-wise).
Contributor:

Not sure why e.g. JSON or protobuf are a development overhead: We already use both inside lakeFS.


1. User makes a Delta Diff request.
2. The DiffService checks if the user has "diff credentials" in the DB:
1. If there are such credentials, it will use them.
2. If there aren't such, it will generate the credentials and save them: `{AKIA: DIFF-<>, SAK: <>}`. The `DIFF` prefix will be used to identify "diff credentials".
Contributor:

  1. We should get someone from the auth team to go over this!
  2. I would like us to declare this solution is explicitly temporary. There are at least 2 auth requirements in here:
    • Support a note on credentials
    • Allow temporary credentials.
    • (Possibly...) allow attaching a policy with reduced permissions to such credentials.

Contributor Author:

  1. This solution involved @guy-har. Adding as a reviewer...
  2. Absolutely.

Contributor:

Generating credentials when they are missing is a good short-term solution. It can also be an issue: if two services try to generate at the same time (currently we don't lock this operation), we will end up with two sets of credentials.

Contributor Author:

Ok,
I'll mark the temporary solutions (for the MVP) all around the document.

Comment on lines 137 to 146

```json
[
  {
    "version": 1,
    "timestamp": 1515491537026,
    "operation": "INSERT",
    "operationContent": {
      "operationParameters": {
        "mode": "Append",
        "partitionBy": "[]"
      }
```
Contributor:

Please document this schema. If it's just the Delta schema, can you paste a link to it?


### Applicative Metrics
- Diff runtime
- The total number of versions returned - this is the number of files that were read from lakeFS. This is basically the size of a Delta Log. We can use it later on to optimize reading.
Contributor:

Doesn't compaction change this?

Contributor Author:

compaction == deleted history -> not part of the history command

@itaiad200 (Contributor) left a comment:

Thanks, looks promising and well thought through! Reviewing from my phone, so apologies if I'm reiterating answered questions.


The DiffService will be an internal component in lakeFS which will serve as the Core system.
In order to realize which diff plugins are available, we shall use new configuration values:
1. `LAKEFS_PLUGINS_LOCATION`
Contributor:

It's better if we use the same prefix for all configuration values related to plugins.

![Delta Diff flow](diagrams/delta-diff-flow.png)
[(excalidraw file)](diagrams/delta-diff-flow.excalidraw)

#### DiffService
Contributor:

How does one configure the plugins, i.e. pass configuration values?

Contributor Author:

Through the request to the plugin (think of Terraform providers as an example) and this


Implemented using [delta-rs](https://github.com/delta-io/delta-rs) (Rust), this plugin will perform the diff operation using table paths provided by the DiffService through a `gRPC` call.
To query the Delta Table from lakeFS, the plugin will generate an S3 client (this is a constraint imposed by the `delta-rs` package) and send a request to lakeFS's S3 gateway.
The diff algorithm:
Contributor:

What is the expected amount of calls to the gateway? How much data is to be retrieved from the object store? Two things to consider:

  1. IIUC we have a request waiting in the background, we should be fast..
  2. How we scale.. If users perform several concurrent requests, will we choke?

Contributor Author:

As we don't really know the number of Delta Log files used on average (or median, or max), the amount of data is currently unknown. There will be two calls to the S3 gateway - one history command for each table.
One of the metrics specified below is getting the number of files that Delta users use.
Regarding point 1: for the MVP, this process can be slow (the whole thing is tagged as experimental), but it will help us evaluate the concurrency and resources needed for the diff.
Regarding point 2: scaling the requests from the DiffService to the diff plugin is pretty easy, as it uses an HTTP/2 connection to communicate (we can run multiple streams on a single connection). If a user makes multiple requests for a diff, we can implement a backoff in the plugin itself.

Contributor:

Not sure I understand: is "history" a gateway command or a Delta one? I'm not familiar with that gateway endpoint, so I'm guessing it's a Delta command that executes S3 requests to the gateway. I suggest we have a rough estimate for the number and type of calls now. I'm sure we'll be able to handle it, but it's better to understand the costs sooner rather than later.
By scale I meant the number of calls to the gateway. If a single diff performs 100 calls that fetch data, then the performance of 10 diffs is something to consider. I agree that the service-plugin communication scale is not a concern at this point.

Contributor Author:

Gotcha.
So the Delta history command is quite an expensive command - it will send a request to fetch each Delta log (JSON) file one by one from the gateway. That means that if there are N log entries, then N requests will be sent to the gateway.
This is surely not the way we would want our diff to eventually work, but at this point (for the case of an MVP, and to verify that a diff is even needed) I don't think we should work towards making it efficient.

Contributor:

+1 👍 to measure the number of calls a "simple" history performs.

Contributor Author:

The second applicative metric will basically do that.


### API

- GET `/repositories/repo/{repo}/otf/refs/{left_ref}/diff/{right_ref}?type={diff_type}`
Contributor:

What about pagination/limitations?

Contributor:

  • Thought we need to provide a path to the table - there are two refs here
  • The location of `otf` is after the repository - what does it represent at this level? Is it just a different diff operation?

Contributor Author:

@itaiad200 I'll add a description to your points.

@nopcoder you're right, it should be `otf/table/{left_table_path}/diff/{right_table_path}?type={diff_type}`
The `otf` is to specify that this API is strictly for OTFs

@nopcoder (Contributor) left a comment:

Looks good! Added some comments and concerns.

## User Story

Jenny the data engineer runs a new retention-and-column-remover ETL over multiple Delta Lake tables. To test the ETL before running on production, she uses lakeFS (obviously) and branches out to a dedicated branch `exp1`, and runs the ETL pointed at `exp1`.
The output is not what Jenny planned... The Delta Table is missing multiple columns and the number of rows just got bigger!
Contributor:

Thought we are not addressing data (number of rows just got bigger) - just metadata.

@Jonathan-Rosenberg (Contributor Author) commented Jan 4, 2023:

yes, this is optional (we know we are getting it back from the Delta History command as part of the metadata), and also this is what Jenny wants...


---

## Implementation
Contributor:

  • Think we should have just one numbering from the start of the flow to the end. Note that number one will be the user's request.
  • The request and the diff plugin system should consider different types. For example, if we'd like to implement a CSV diff using the plugin system, the user request and the plugin should identify which formats the plugin supports.

Contributor Author:

  1. Ok
  2. Not sure I understand. What I meant in the diagram is that the request will include the type of diff that the user wishes to see. If it's a diff of CSV data, the user will send type=csv in the request, and the DiffService will realize that the CSV diff plugin is needed to create the diff.

Contributor:

About item 2 - I mean that a single plugin can serve multiple formats, so the mapping of the plugin system may look different: we load a single set of plugins and each plugin publishes the formats it supports, or at the configuration level we map each plugin to one or more formats.

Contributor Author:

I think that this step should be decided by the plugin "port", e.g. "DiffService".
If plugin A can serve multiple formats, then the configurations might look like:

```yaml
diff:
  delta:
    plugin: A
  csv:
    plugin: A
  iceberg:
    plugin: A
merge:
  delta:
    plugin: A
plugins:
  A: ~/.lakefs/plugins/A
```

At this point, every core (port) component can choose according to the specified format of the plugin it needs. And a plugin can be used multiple times for different formats.
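The format-to-plugin lookup this configuration implies could be sketched like so (a hedged Go sketch; `Config` and `resolve` are illustrative names, not the actual lakeFS configuration code):

```go
package main

import "fmt"

// Config sketches the two-level mapping discussed above: a use case maps a
// format (e.g. "delta") to a plugin name, and a separate section maps plugin
// names to binary locations. Field names are illustrative.
type Config struct {
	Diff    map[string]string // format -> plugin name
	Plugins map[string]string // plugin name -> path to the executable
}

// resolve returns the executable path for a diff of the given format,
// so one plugin binary can serve several formats.
func (c Config) resolve(format string) (string, error) {
	name, ok := c.Diff[format]
	if !ok {
		return "", fmt.Errorf("no diff plugin configured for %q", format)
	}
	path, ok := c.Plugins[name]
	if !ok {
		return "", fmt.Errorf("plugin %q has no configured location", name)
	}
	return path, nil
}

func main() {
	cfg := Config{
		Diff:    map[string]string{"delta": "A", "csv": "A"},
		Plugins: map[string]string{"A": "/home/user/.lakefs/plugins/A"},
	}
	p, _ := cfg.resolve("delta")
	fmt.Println(p)
}
```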

Contributor:

A bigger point here is that we will need some way for lakeFS to detect what type it should offer and/or request. But this is in no way blocking for the MVP, and I am perfectly happy to skip it for now! (For instance, we might discover that users want a diff but don't want to initiate it from lakeFS.)

![Delta Diff flow](diagrams/delta-diff-flow.png)
[(excalidraw file)](diagrams/delta-diff-flow.excalidraw)

#### DiffService
Contributor:

DeltaDiffPlugin? or lakeFSDeltaDiffPlugin?
Think that DiffService is misleading, as we are planning to use a plugin system to extend lakeFS diff.

Contributor Author:

The DiffService is not the plugin itself, it's the component within lakeFS that interacts with the plugins.
What about `PluggableDiffService`?

Contributor:

Ok, got it - just didn't want to couple the plugin system to the DiffService. I would like any service in lakeFS to be able to access the plugins. This way we can implement merge and future capabilities as needed without adding another plugin system.

#### DiffService

The DiffService will be an internal component in lakeFS which will serve as the Core system.
In order to realize which diff plugins are available, we shall use new configuration values:
Contributor:

  • There is no manifest / configuration to map which plugins to load? From reading this section we have the location and the specific Delta diff information. Suggest you define the plugin configuration without sharing the location. Example:

    ```yaml
    plugins:
      diff:
        delta: <location to the delta diff plugin - full path to the binary needed to execute>
    ```

  • In case we'd like to map multiple plugins to the same format, we should consider separating the plugins configuration and the use cases. Like:

    ```yaml
    diff:
      otf:
        delta:
          plugin: diff_delta_v2
    plugins:
      diff_delta_v1: <location to the delta diff plugin - full path to the binary needed to execute>
      diff_delta_v2: <location to the delta diff plugin - full path to the binary needed to execute>
    ```

  • Another consideration - if we plan to use the plugin later also to support advanced merge, we can just drop the `diff` from all the above examples.
  • In case we would like to leverage more plugins for different use-cases (e.g. merge), will it be possible to share the same location for different plugin interfaces, or do we need to split / catalog them now?

Contributor Author:

diff:
  otf:
    delta:
      plugin: diff_delta_v2
plugins:
  diff_delta_v1: <location to the delta diff plugin - full path to the binary needed to execute>    
  diff_delta_v2: <location to the delta diff plugin - full path to the binary needed to execute>    

Really liked this one, although I would still want to have default values, so that if the user didn't specify a plugin location in the manifest, lakeFS will try to figure it out using the latest known version of the plugin and locate it in some default path. For example, if the user didn't specify the location for `diff.delta.plugin`, lakeFS will try to locate the binary under `~/.lakefs/plugins`.

  • Another consideration - if we plan to use the plugin later to also support advanced merge - we can just drop the diff from all the above examples.
  • In case we would like to leverage more plugins for different use-cases (ex: merge), will it be possible to share the same location for different plugin interfaces, or do we need to split / catalog them now?

I think that it's a very possible scenario, yet I think that it might be preferable that each plugin would know how to do one thing. Anyway, we can decide upon it when we get there...
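The default-location fallback discussed above could look roughly like this (the function name and the exact default path `~/.lakefs/plugins` are assumptions for illustration, not the final implementation):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// resolvePluginPath returns the configured plugin location if one was set,
// and otherwise falls back to a default per-user plugins directory.
func resolvePluginPath(configured, pluginName string) string {
	if configured != "" {
		return configured
	}
	home, err := os.UserHomeDir()
	if err != nil {
		home = "." // fall back to the working directory if HOME is unknown
	}
	return filepath.Join(home, ".lakefs", "plugins", pluginName)
}

func main() {
	// Explicit configuration wins over the default location.
	fmt.Println(resolvePluginPath("/opt/lakefs/plugins/diff_delta_v2", "diff_delta_v2"))
	// No configuration: fall back to the default plugins directory.
	fmt.Println(resolvePluginPath("", "diff_delta_v2"))
}
```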


#### Delta Diff Plugin

Implemented using [delta-rs](https://github.com/delta-io/delta-rs) (Rust), this plugin will perform the diff operation using table paths provided by the DiffService through a `gRPC` call.
Contributor

  • We should specify the constraint of fetching large objects from delta-rs. A configurable (optional) location to store the temporary data downloaded by the plugin will probably be required in case we don't want to use the default temporary folder, or to control/restrict the size used by the plugin.
  • Cleanup of this storage if it is not managed by the OS temp dir/files API.
  • Calling the plugin with a 'deadline' (timeout for the gRPC call), just to prevent a lock during the call to the plugin.

1. User makes a Delta Diff request.
2. The DiffService checks if the user has "diff credentials" in the DB:
   1. If there are such credentials, it will use them.
   2. If there aren't, it will generate the credentials and save them: `{AKIA: DIFF-<>, SAK: <>}`. The `DIFF` prefix will be used to identify "diff credentials".
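A minimal sketch of this lookup-or-create flow (the names and the in-memory store are assumptions for illustration; the real DiffService would back this with the DB):

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// DiffCredentials holds the generated "diff credentials" pair.
type DiffCredentials struct {
	AccessKeyID     string
	SecretAccessKey string
}

// credentialStore stands in for the DB, keyed by user ID.
type credentialStore map[string]DiffCredentials

func randomHex(n int) string {
	b := make([]byte, n)
	_, _ = rand.Read(b)
	return hex.EncodeToString(b)
}

// getOrCreateDiffCredentials returns existing diff credentials for the user,
// or generates a new pair whose access key carries the DIFF- prefix so it can
// be identified later. Note: without locking, two concurrent callers could
// each generate a pair.
func (s credentialStore) getOrCreateDiffCredentials(userID string) DiffCredentials {
	if creds, ok := s[userID]; ok {
		return creds
	}
	creds := DiffCredentials{
		AccessKeyID:     "DIFF-" + randomHex(8),
		SecretAccessKey: randomHex(20),
	}
	s[userID] = creds
	return creds
}

func main() {
	store := credentialStore{}
	first := store.getOrCreateDiffCredentials("user-1")
	second := store.getOrCreateDiffCredentials("user-1")
	fmt.Println(first.AccessKeyID == second.AccessKeyID) // prints true: credentials are reused
}
```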
Contributor

Generating credentials when they're missing is a good short-term solution. It can also become an issue when two services try to generate at the same time: since we currently don't lock, this means we could end up with two sets of credentials.


### API

- GET `/repositories/repo/{repo}/otf/refs/{left_ref}/diff/{right_ref}?type={diff_type}`
Contributor

  • I thought we need to provide a path to the table - there are two refs here.
  • The location of `otf` is after the repository - what does it represent at this level? Is it just a different diff operation?

Comment on lines +149 to +151
- operationContent:
- type: map
- description: an operation content specific to the table format implemented.
Contributor

Just to verify - what types will the map hold? Do we need to serialize any JSON object?

Contributor Author

The map for Delta will hold data similar to the `commitInfo` structure's `operationParameters` section.
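For illustration only, a diff-log entry for a Delta `WRITE` commit might carry an `operationContent` shaped like Delta's `commitInfo.operationParameters` (the exact keys vary by operation, and this shape is an assumption, not the final API):

```json
{
  "operation": "WRITE",
  "operationContent": {
    "mode": "Overwrite",
    "partitionBy": "[]"
  }
}
```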


- GET `/repositories/repo/{repo}/otf/refs/{left_ref}/diff/{right_ref}?type={diff_type}`
- Tagged as experimental
- **Response**:
Contributor

The response format renders a log of operations - maybe we can scope it at this level, so that if a later change returns a response with a diff, we can just return an empty list/log here.


---

### Packaging
Contributor

Build and packaging - we need to address the build part too.

Contributor

@arielshaqed left a comment

T-h-a-n-k-s-!


The DiffService will be an internal component in lakeFS which will serve as the Core system.
To determine which diff plugins are available, we shall use new configuration values:
1. `LAKEFS_PLUGINS_LOCATION`
Contributor

True. So everything below should also be LAKEFS_PLUGINS_LOCATIONS. That means that everything lives inside the same configuration object.

Also, as specified, this might not really be a plugin: I don't know whether Viper will let us configure an "any object" and be able to pass it usefully to the plugin during load.

If this turns out to be impossible, I would suggest configuring a separate "plugins file"; that one can be a general YAML/JSON object with very few parsed fields, and then each plugin gets its configuration subobject from that file (and presumably parses it using mapstructure in Go, or libwhatever in other languages).

2. `LAKEFS_PLUGIN_DIFF_{TYPE OF DIFF}`
The DiffService will use the name of the binary provided by this env var to perform the diff. For instance, `LAKEFS_PLUGIN_DIFF_DELTA=deltaDiffBinary`.
The DiffService will use the [plugins' location](https://github.com/hashicorp/terraform/blob/main/plugins.go) and the diff binary to load the plugin and request a diff.
The **type** of diff will be sent as part of the request to lakeFS as specified [here](#API).
Contributor

Yes!

#### Delta Diff Plugin

Implemented using [delta-rs](https://github.com/delta-io/delta-rs) (Rust), this plugin will perform the diff operation using table paths provided by the DiffService through a `gRPC` call.
To query the Delta Table from lakeFS, the plugin will generate an S3 client (this is a constraint imposed by the `delta-rs` package) and send a request to lakeFS's S3 gateway.
Contributor

Is this a future advantage of contributing our own client to Delta?


Implemented using [delta-rs](https://github.com/delta-io/delta-rs) (Rust), this plugin will perform the diff operation using table paths provided by the DiffService through a `gRPC` call.
To query the Delta Table from lakeFS, the plugin will generate an S3 client (this is a constraint imposed by the `delta-rs` package) and send a request to lakeFS's S3 gateway.
The diff algorithm:
Contributor

+1 👍 to measure the number of calls a "simple" history performs.

Member

@N-o-Z left a comment

Love the design - looks well thought through.
I don't have much to add to the already existing comments.
One thing though: maybe you can reduce the image sizes? For example, the delta-diff-flow image can't be read from the document, and when opening it in a different window it pops up very big, making it hard to see the entire flow.

Contributor

@nopcoder left a comment

Left a couple of comments. The concern I raised relates to the plugin system we are adding - I think it should serve more than just the diff, if possible. I assume the next service will be called MergeService, and it may use the same or a different set of plugins; I think this is the place to use a single system that can serve different functionality.

@Jonathan-Rosenberg
Contributor Author

@nopcoder your concern is a very valid one.
I think that this design still answers it.
Plugins can be developed and used in a variety of ways with different services (Merge, Diff, Rebase...) by setting the correct configuration file fields:

diff: # service/functionality that uses the plugins
  delta:
    plugin: A
  csv:
    plugin: A
  iceberg:
    plugin: A
merge: # service/functionality that uses the plugins
  delta:
    plugin: A
rebase: # service/functionality that uses the plugins
  parquet:
    plugin: B
plugins:
  A: ~/.lakefs/plugins/A
  B: ~/.lakefs/plugins/B
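Resolving such a two-level configuration (service/format pair names a plugin, and the plugins section maps that name to a binary location) could be sketched like this; the type and method names are assumptions for illustration:

```go
package main

import "fmt"

// PluginConfig mirrors the proposed two-level layout:
// a service maps formats to plugin names, and plugins maps names to locations.
type PluginConfig struct {
	Services map[string]map[string]string // service -> format -> plugin name
	Plugins  map[string]string            // plugin name -> binary location
}

// pluginLocation resolves the binary to launch for a given service and format.
func (c PluginConfig) pluginLocation(service, format string) (string, bool) {
	formats, ok := c.Services[service]
	if !ok {
		return "", false
	}
	name, ok := formats[format]
	if !ok {
		return "", false
	}
	loc, ok := c.Plugins[name]
	return loc, ok
}

func main() {
	cfg := PluginConfig{
		Services: map[string]map[string]string{
			"diff":  {"delta": "A", "csv": "A"},
			"merge": {"delta": "A"},
		},
		Plugins: map[string]string{"A": "~/.lakefs/plugins/A"},
	}
	loc, _ := cfg.pluginLocation("diff", "delta")
	fmt.Println(loc) // prints ~/.lakefs/plugins/A
}
```

Note how two services (diff and merge) can share plugin A without duplicating its location.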

Contributor

@arielshaqed left a comment

Thanks!

The Microkernel/Plugin architecture is composed of two entities: a single "core system" and multiple "plugins".
In our case, the lakeFS server acts as the core system, and the different diff implementations, including the Delta diff implementation, will act as plugins.
We'll use `gRPC` as the transport protocol, which makes the language of choice almost immaterial (due to the use of protobufs as the transferred data format)
as long as it's self-contained (theoretically it could also depend on a system runtime, but then the cost would be an added requirement for running lakeFS: runtime support).
Contributor

Long-term we will need and have a generic external service interface: think of a user who wants to add or modify a diff/merge service. I believe they will prefer a looser coupling.
Short- and medium-term I agree with you that using go-plugin makes sense if we believe that it shortens initial development time.

Counterpoint: If we implement the plugin service as a separate REST microservice, can we drive it directly from the GUI, and then we don't need to add anything to lakeFS (server)? Might skip having to agree with team VE about configuration, adding go-plugin, packaging, etc.


---

## Implementation
Contributor

A bigger point here is that we will need some way for lakeFS to detect what type it should offer and/or request. But this is in no way blocking for the MVP, and I am perfectly happy to skip it for now! (For instance, we might discover that users want a diff but don't want to initiate it from lakeFS.)

Contributor

@nopcoder left a comment

Thanks

@Jonathan-Rosenberg Jonathan-Rosenberg merged commit 2a338f0 into master Jan 8, 2023
@Jonathan-Rosenberg Jonathan-Rosenberg deleted the design/delta-diff branch January 8, 2023 13:51
Successfully merging this pull request may close these issues.

Delta Lake: Diff service design
7 participants