Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delta diff: client side implementation of an expanded delta lake diff #5232

Merged
merged 11 commits into from
Feb 16, 2023

Conversation

talSofer
Copy link
Contributor

Linked Issue

Closes #5001


Change Description

New Feature

Add a GUI for showing the diff between two delta lake tables. the delta lake table diff view shows the diff for a single table at each point and colors the operations according to their type.

Testing Details

Tested manually the compare, commits and uncommitted changes views.
see attached screenshots:
Screen Shot 2023-02-14 at 10 19 17
Screen Shot 2023-02-14 at 10 19 09
Screen Shot 2023-02-14 at 10 18 57

Additional info


@talSofer talSofer added the exclude-changelog PR description should not be included in next release changelog label Feb 14, 2023
Copy link
Contributor

@johnnyaug johnnyaug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice and reacty.
Have some comments.

@@ -477,6 +477,20 @@ class Repositories {
throw new Error(await extractError(response));
}
}

async otfDiff(repoId, leftRef, rightRef, table_path = "", type) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
async otfDiff(repoId, leftRef, rightRef, table_path = "", type) {
async otfDiff(repoId, leftRef, rightRef, tablePath = "", type) {

Not a fan of pushing commented out code, but if you're going with it, the argument should be commented out - or used in some fake way - or given as _.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed naming, as for commenting out the code - i'm going to change in literally a few days so I think it is ok to keep it this way for now,


.table-operation-expansion {
color: black;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newline missing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


// The list of available operations is based on: https://docs.databricks.com/delta/history.html#operation-metrics-keys
const deltaLakeOperationToDiffType = new Map([
["WRITE" , DiffType.Added],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like server side logic?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a Map and not a simple JS object?

Copy link
Contributor

@Jonathan-Rosenberg Jonathan-Rosenberg Feb 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we've agreed that the type of diff will be returned from lakeFS within the response.
Created a PR to reflect it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnnyaug good call, thanks! changed the client to use the operation type returned from the response

Comment on lines 96 to 97
const diffType = deltaLakeOperationToDiffType.get(operation);
return diffType !== undefined ? diffType : DiffType.Changed;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const diffType = deltaLakeOperationToDiffType.get(operation);
return diffType !== undefined ? diffType : DiffType.Changed;
return deltaLakeOperationToDiffType.get(operation) || DiffType.Changed;

(and then this function is not really needed)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

]);

export const TableDiff = ({repo, leftRef, rightRef, tablePath}) => {
let { error, loading, response } = useAPI(() => repositories.otfDiff(repo.id, leftRef, rightRef, tablePath, OtfType.Delta), [])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point this element becomes a DeltaDiff rather than a TableDiff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, renamed accordingaly

<td className="pl-lg-10 col-10 table-operation-details">
<strong>Timestamp:</strong> {operationTimestamp}
<br/>
<strong>Commit Info:</strong>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the commit info here? Is it a lakeFS commit? (I hope so)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit info is the terminology used in the delta lake transaction log. see the definition in https://www.databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html

Commit info - Contains information around the commit, which operation was made, from where and at what time.

Comment on lines 109 to 110
const JSONContentStr = JSON.stringify(content, null, 2);
const JSONContent = JSON.parse(JSONContentStr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of these 2 lines? It looks like a no-op.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, this is not needed, because I'm parsing here the following object included in the otfDiff response:

operation_content:
          type: object
          description: free form content describing the returned operation diff

and the object is a map of key-value pairs, so I don't need to convert it into a json first.
@Jonathan-Rosenberg should we clarify it in API doc? in it not relly a free form but a collection of key-value pairs.

parsedContent += `${key}: `
let parsedVal = "";
try {
parsedVal = JSON.parse(val);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will look like [object Object] if val is parsed to a Javascript object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't, it looks like this -
"[\"(spark_catalog.delta.lakefs://meetup-repo/experiment-2-branch/raw/.rating < 4.0D) aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\"]"

parsed into

predicate: (spark_catalog.delta.lakefs://meetup-repo/experiment-2-branch/raw/.rating < 4.0D) aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

did you mean something else?

Copy link
Contributor

@johnnyaug johnnyaug Feb 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code takes the result of JSON.parse and concatenates it with a string, consequently calling toString on the result, which returns [object Object] for a POJO.

Comment on lines 107 to 122
function parseOperationContent(content) {
let parsedContent = "";
const JSONContentStr = JSON.stringify(content, null, 2);
const JSONContent = JSON.parse(JSONContentStr);
for (let key in JSONContent) {
const val = JSONContent[key];
parsedContent += `${key}: `
let parsedVal = "";
try {
parsedVal = JSON.parse(val);
} catch (err) {
parsedVal = val;
}
parsedContent += parsedVal + "\n";
}
return parsedContent;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separation of concerns - convert the object to what you want, handle the printing/formatting elsewhere.

Suggested change
function parseOperationContent(content) {
let parsedContent = "";
const JSONContentStr = JSON.stringify(content, null, 2);
const JSONContent = JSON.parse(JSONContentStr);
for (let key in JSONContent) {
const val = JSONContent[key];
parsedContent += `${key}: `
let parsedVal = "";
try {
parsedVal = JSON.parse(val);
} catch (err) {
parsedVal = val;
}
parsedContent += parsedVal + "\n";
}
return parsedContent;
function safeJsonParse(val) {
try {
return JSON.parse(val);
}
catch (err) {
return val;
}
}
function safeJsonParseValues(obj) {
Object.keys(obj).forEach(k => {
obj[k] = safeJsonParse(obj[k]);
})
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -195,7 +207,9 @@ export const ChangesTreeContainer = ({results, showExperimentalDeltaDiffButton =
</span>
</Card.Header>
<Card.Body>
<Table borderless size="sm">
{tableDiffState.isExpanded
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we went with separate Table/Object diff screens (or am I reading the code wrong?).

Suggested change
{tableDiffState.isExpanded
{tableDiffState.isShown

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no you are good, done thx

Comment on lines 11 to 12
Added = "added",
Removed = "removed",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Added = "added",
Removed = "removed",
Created = "created",
Dropped = "dropped",

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, I changed the enum to represent table operation types to be consistent with the server

// }
// return response.json();
// const mockRes = '{"results": []}'
const mockRes = '{"results": [{"version": "1", "timestamp": 1515491537026, "operation": "INSERT", "operation_content": {"mode": "Append","partitionBy": "[]"}}, {"version": "2", "timestamp": 1515491537346, "operation": "DELETE", "operation_content": {"mode": "Append","partitionBy": "[]"}}, {"version": "14", "timestamp": 1674393286223, "operation": "DELETE", "operation_content": {"predicate": "[\\"(spark_catalog.delta.lakefs://meetup-repo/experiment-2-branch/raw/.`rating` < 4.0D) aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\\"]"}}]}'
Copy link
Contributor

@Jonathan-Rosenberg Jonathan-Rosenberg Feb 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're missing DiffType:

Suggested change
const mockRes = '{"results": [{"version": "1", "timestamp": 1515491537026, "operation": "INSERT", "operation_content": {"mode": "Append","partitionBy": "[]"}}, {"version": "2", "timestamp": 1515491537346, "operation": "DELETE", "operation_content": {"mode": "Append","partitionBy": "[]"}}, {"version": "14", "timestamp": 1674393286223, "operation": "DELETE", "operation_content": {"predicate": "[\\"(spark_catalog.delta.lakefs://meetup-repo/experiment-2-branch/raw/.`rating` < 4.0D) aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\\"]"}}]}'
const mockRes = '{"diff_type": "changed", "results": [{"id": "1", "timestamp": 1515491537026, "operation": "INSERT", "operation_content": {"mode": "Append","partitionBy": "[]"}}, {"version": "2", "timestamp": 1515491537346, "operation": "DELETE", "operation_content": {"mode": "Append","partitionBy": "[]"}}, {"version": "14", "timestamp": 1674393286223, "operation": "DELETE", "operation_content": {"predicate": "[\\"(spark_catalog.delta.lakefs://meetup-repo/experiment-2-branch/raw/.`rating` < 4.0D) aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\\"]"}}]}'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added thx

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
exclude-changelog PR description should not be included in next release changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expanded delta diff view
3 participants