Commit f0f1342

Added TerminusDB blog posts from technical-blogs
1 parent e14eead commit f0f1342

13 files changed

+2699
-264
lines changed

package-lock.json

Lines changed: 14 additions & 264 deletions
Lines changed: 233 additions & 0 deletions
@@ -0,0 +1,233 @@
1+
---
2+
title: "What's the Difference: JSON diff and patch"
3+
nextjs:
4+
metadata:
5+
title: "What's the Difference: JSON diff and patch"
6+
description: "How JSON diff and patch enable distributed data collaboration for Web3, offering multi-master updates without complex protocols or locks."
7+
keywords: JSON diff, JSON patch, distributed data, Web3, multi-master, version control, CRDT, data collaboration
8+
alternates:
9+
canonical: https://terminusdb.org/blog/2022-02-14-json-diff-and-patch/
10+
openGraph:
11+
images: https://assets.terminusdb.com/docs/technical-documentation-terminuscms-og.png
12+
media: []
13+
---
14+
15+
> Author: Gavin Mendel-Gleason
16+
17+
What will the distributed data environment in Web3 look like?
18+
19+
How will we have a distributed network of data stores which allow updates and synchronizations?
20+
21+
What is it that allows git to perform distributed operations on text so effectively?
22+
23+
Is it possible to do the same for structured data?
24+
25+
## Web3
26+
27+
These questions are really at the heart of the distributed part of Web3. Web3 has other parts: immutability, cryptographic security, etc. But these other elements do not answer how to perform updates on distributed data stores.*
28+
29+
In seeking the answer to these questions I was led to see a rather simple tool as foundational: [JSON diff and patch](https://terminusdb.com/products/jsondiff/).
30+
31+
JSON, because JSON is the structured data format for the web. This will continue to be true for Web3. Everyone uses JSON for just about everything in our web architecture. Other formats are going to be increasingly used as mere optimizations of JSON. Associative arrays have the beauty of (reasonable) human readability, combined with widespread native support in modern computer programming languages. Both computers and humans can read it. What’s not to love?
32+
33+
But what about the diff and patch part?
34+
35+
## The *use case* for diff and patch
36+
37+
A fundamental tool in git’s strategy for distributed management of source code is the concept of the diff and the patch. These foundational operations are what make git possible. Diff is used to construct a patch that can be applied to an object such that the final state makes sense, for some value of "makes sense".
38+
39+
The application of patches happens because we want a certain before state to be lifted to a certain after state. The patch doesn’t specify everything, only what it expects to be true of the source and what it expects to be true after the update.
40+
41+
With this, it’s possible to have distributed updates performed on different parts of source text. Collisions result in some remedial action being required, but if there are no collisions everything can be merged to obtain a final state which respects all updates, no matter when or where they came from.
42+
43+
This is what allows git to be fully multi-master, without requiring or forcing synchronization using any complex protocols (like Raft).
44+
45+
## Diff and patch in structured data
46+
47+
Do similar situations arise with structured data?
48+
49+
Definitely.
50+
51+
Let's imagine an object which stores information about items in our
52+
online store.
53+
54+
```javascript
55+
{ "id" : 13234,
56+
"name" : "Retro Encabulator Mark 2",
57+
"description" : "The Retro Encabulator Mark II is the latest
58+
development of the Retro Encabulator used to
59+
generate inverse reactive current for unilateral
60+
phase detractors.",
61+
"category" : "Cardinal Grammeter Synchronisers",
62+
"price" : { "value" : "3430.23", "currency" : "Euro" },
63+
"stock" : 32,
64+
"suppliers" : ["Supplier/123","Supplier/4332"] }
65+
```
66+
67+
If Alice opens the object in an application and changes the name of
68+
the item to "Retro Encabulator Mark II", it should be possible for Bob
69+
to update the suppliers list simultaneously without the two stepping
70+
on each other's toes.
71+
72+
In applications, this sort of curation operation is often achieved with
73+
a *lock* on the object. Which means only one person can win. And locks
74+
are a massive source of pain, not only because you can't achieve
75+
otherwise perfectly reasonable concurrent operations, but because you
76+
risk getting stale locks and having to figure out when to release them.
77+
78+
But what if Alice didn't submit her whole object for update, but only
79+
the part she wanted to be changed? And Bob did the same?
80+
81+
Now we can perform the updates in three different places, locally for
82+
Alice, locally for Bob, and then finally at a shared server resource.
83+
84+
The structured patch could be determined by looking at the object
85+
*before* Alice submitted it, and after, using `diff`. The patch
86+
constructed from Alice's diff might look like this:
87+
88+
```javascript
89+
{ "name" : { "@before" : "Retro Encabulator Mark 2",
90+
"@after" : "Retro Encabulator Mark II"}}
91+
```
92+
93+
And Bob's might look like:
94+
95+
```javascript
96+
{ "suppliers" : { "@before" : ["Supplier/123","Supplier/4332"],
97+
"@after" : ["Supplier/123","Supplier/4332",
98+
"Supplier/385"]}}
99+
```
100+
101+
Now both can apply cleanly to the original document listed above. We
102+
can stack either patch in any order without difficulty. Perhaps we ask
103+
Bob and Alice to agree on the application order (using pull / push as
104+
is done with git). But maybe we just allow them to apply when they
105+
arrive. The answer depends on the workflow.
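
To make the order-independence concrete, here is a minimal sketch of patch application in the `@before` / `@after` notation used above. The names `applyPatch` and `sameValue` are hypothetical, and treating arrays as atomic values is an assumption made for brevity. This is not the TerminusDB implementation.

```javascript
// Sketch only: `applyPatch` and `sameValue` are hypothetical names.
// Scalars and arrays are treated as atomic and compared via JSON serialisation.
function sameValue(a, b) {
  return JSON.stringify(a) === JSON.stringify(b);
}

function applyPatch(doc, patch) {
  const result = { ...doc };
  for (const [key, change] of Object.entries(patch)) {
    if ("@before" in change && "@after" in change) {
      // Check that the read state matches, then substitute the write.
      if (!sameValue(doc[key], change["@before"])) {
        throw new Error(`Conflict at "${key}": read state does not match`);
      }
      result[key] = change["@after"];
    } else {
      // No @before/@after pair here, so descend into the nested object.
      result[key] = applyPatch(doc[key] ?? {}, change);
    }
  }
  return result;
}

const item = {
  name: "Retro Encabulator Mark 2",
  stock: 32,
  suppliers: ["Supplier/123", "Supplier/4332"],
};
const alicePatch = {
  name: { "@before": "Retro Encabulator Mark 2", "@after": "Retro Encabulator Mark II" },
};
const bobPatch = {
  suppliers: {
    "@before": ["Supplier/123", "Supplier/4332"],
    "@after": ["Supplier/123", "Supplier/4332", "Supplier/385"],
  },
};

// Either application order yields the same final document.
const aliceThenBob = applyPatch(applyPatch(item, alicePatch), bobPatch);
const bobThenAlice = applyPatch(applyPatch(item, bobPatch), alicePatch);
console.log(sameValue(aliceThenBob, bobThenAlice));
```

Because each patch only reads and writes the field it mentions, neither application disturbs the read state the other expects.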
106+
107+
## Conflict
108+
109+
But what if Mary comes in before Alice and submits the following
110+
patch:
111+
112+
```javascript
113+
{ "name" : { "@before" : "Retro Encabulator Mark 2",
114+
"@after" : "Retro Encabulator Mark two"}}
115+
```
116+
117+
We have a problem. But we see immediately that the two are in conflict
118+
and, by surfacing it, Alice can be asked to resolve the question. In the
119+
case of data curation, this is a perfectly reasonable workflow. And it
120+
is this problem of data curation that we can solve with the simplest
121+
version of JSON diff.
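
One way such a conflict might be spotted before anything is applied is to compare the two patches directly. The sketch below assumes both patches were made against the same base document and use the `@before` / `@after` notation from this post. `conflictingPaths` is a hypothetical helper, not an existing API.

```javascript
// Sketch only: `conflictingPaths` is a hypothetical helper.
// It lists the dotted paths that both patches try to change in different ways.
function conflictingPaths(patchA, patchB, prefix = []) {
  const conflicts = [];
  for (const key of Object.keys(patchA)) {
    if (!(key in patchB)) continue; // only one patch touches this field
    const a = patchA[key];
    const b = patchB[key];
    const path = [...prefix, key];
    const aIsSwap = "@after" in a;
    const bIsSwap = "@after" in b;
    if (aIsSwap && bIsSwap) {
      // Both replace the same value: a conflict unless they agree on the result.
      if (JSON.stringify(a["@after"]) !== JSON.stringify(b["@after"])) {
        conflicts.push(path.join("."));
      }
    } else if (!aIsSwap && !bIsSwap) {
      // Both descend into a nested object: compare the sub-patches.
      conflicts.push(...conflictingPaths(a, b, path));
    } else {
      // One replaces the whole value while the other edits inside it.
      conflicts.push(path.join("."));
    }
  }
  return conflicts;
}

const maryPatch = {
  name: { "@before": "Retro Encabulator Mark 2", "@after": "Retro Encabulator Mark two" },
};
const alicePatch = {
  name: { "@before": "Retro Encabulator Mark 2", "@after": "Retro Encabulator Mark II" },
};
const bobPatch = {
  suppliers: {
    "@before": ["Supplier/123", "Supplier/4332"],
    "@after": ["Supplier/123", "Supplier/4332", "Supplier/385"],
  },
};

console.log(conflictingPaths(maryPatch, alicePatch)); // the "name" field collides
console.log(conflictingPaths(alicePatch, bobPatch));  // no shared fields
```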
122+
123+
This conflict can be surfaced to Alice, and Bob can be allowed to go
124+
about his business. Could this particular problem be resolved in a
125+
purely automatic way with a CRDT? Definitely, but it probably will not
126+
result in what you want. Last-write-wins will work of course, but then
127+
which is *more right* might need human review, and even worse it might
128+
result in both results being interleaved (a likely outcome!).
129+
130+
We *could* make the before and after, however, be a text-based patch
131+
using a textual diff. Probably git's line-based approach is *not* what
132+
we want here, but rather one that takes words as atoms. It will not
133+
solve this particular conflict, but it could make text fields much
134+
more flexible.
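
As a rough illustration of taking words as atoms, the following sketch computes a word-level diff with a textbook longest-common-subsequence table. `wordDiff` and the `keep` / `remove` / `insert` operation names are assumptions made for this example, not part of any existing patch format.

```javascript
// Sketch only: `wordDiff` is a hypothetical helper using a standard LCS table.
function wordDiff(before, after) {
  const a = before.split(/\s+/).filter(Boolean);
  const b = after.split(/\s+/).filter(Boolean);

  // lcs[i][j] holds the LCS length of a[i..] and b[j..].
  const lcs = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk the table to emit keep / remove / insert operations.
  const ops = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ keep: a[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ remove: a[i] });
      i++;
    } else {
      ops.push({ insert: b[j] });
      j++;
    }
  }
  while (i < a.length) ops.push({ remove: a[i++] });
  while (j < b.length) ops.push({ insert: b[j++] });
  return ops;
}

const ops = wordDiff("Retro Encabulator Mark 2", "Retro Encabulator Mark II");
console.log(ops);
```

With this granularity, edits to different words of the same string no longer touch the same atom, although two different replacements of the same word still collide.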
135+
136+
Which of these you want, however, requires *semantic direction* of the
137+
diff algorithm. While lots of structured diff problems will be solved
138+
by the simplest algorithm, ultimately we need to have a schema that
139+
helps to direct the meaning of our diffs. String fields might be best
140+
line-based, word-based, or perhaps they must always be atomic (as with
141+
identifiers).
142+
143+
## Patch is simpler than Diff
144+
145+
Patch is actually the simpler operation. Patch application basically
146+
just checks that the read state matches, and then substitutes the
147+
writes.
148+
149+
Diff, by contrast, has to calculate, and often in practice *guess*, a
150+
good transition from the read state to the write state. The specific
151+
tuning of the patch provided by a diff is dependent on the needs of
152+
the application. There are *generic* algorithms that can work decently
153+
for a range of applications, but there is no one size fits all. This
154+
is why we will need the *semantic direction* which can be provided by
155+
a schema.
156+
157+
Diff is also computationally *much* more expensive. Finding the
158+
minimal change means finding the maximal similarity. As it turns out,
159+
this is pretty easy for the skeleton of a JSON dictionary, but rather
160+
a pain for lists, and strings. And for lists of lists... Well, I'll
161+
get into that later.
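
The easy part, the dictionary skeleton, can be sketched in a few lines. The version below is a simplification under stated assumptions: lists and strings are treated as atomic values, which sidesteps exactly the expensive cases mentioned above, and a key missing on one side shows up as `undefined`. `diff` here is a hypothetical function, not the TerminusDB one.

```javascript
// Sketch only: a hypothetical `diff` that recurses through objects and treats
// everything else (scalars, strings, arrays) as an atomic value.
function isPlainObject(value) {
  return value !== null && typeof value === "object" && !Array.isArray(value);
}

function diff(before, after) {
  if (isPlainObject(before) && isPlainObject(after)) {
    const patch = {};
    const keys = new Set([...Object.keys(before), ...Object.keys(after)]);
    for (const key of keys) {
      const sub = diff(before[key], after[key]);
      if (sub !== null) patch[key] = sub;
    }
    // null means "no change" for this sub-document.
    return Object.keys(patch).length === 0 ? null : patch;
  }
  if (JSON.stringify(before) === JSON.stringify(after)) return null;
  return { "@before": before, "@after": after };
}

const before = {
  name: "Retro Encabulator Mark 2",
  price: { value: "3430.23", currency: "Euro" },
  stock: 32,
};
const after = {
  name: "Retro Encabulator Mark II",
  price: { value: "3299.00", currency: "Euro" },
  stock: 32,
};

const patch = diff(before, after);
console.log(JSON.stringify(patch, null, 2));
```

Only `name` and `price.value` appear in the resulting patch. The unchanged `stock` and `currency` fields are omitted.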
162+
163+
Let's just say it's no exaggeration that you can easily wander into
164+
the heat-death of the universe. Hence heuristics have to be part of
165+
any fully automatic diff.
166+
167+
## A Complex Patch gives rise to Distributed Transactions
168+
169+
But there are other workflows that might want a slightly more flexible
170+
approach to ensuring data integrity. The *before* state is really
171+
sitting there to specify the *read object model*. It tells us what we
172+
want to be true when we apply the patch.
173+
174+
With git, this might be lines of text. For instance, to change a very
175+
simple `README.txt` which initially says `hello world` to one that
176+
says `hello squirrels`, git will produce a patch that looks something
177+
like the following:
178+
179+
```diff
180+
index 3b18e51..3a9ea5d 100644
181+
--- a/README.txt
182+
+++ b/README.txt
183+
@@ -1 +1 @@
184+
-hello world
185+
+hello squirrels
186+
--
187+
2.32.0
188+
```
189+
190+
This isn't the most compact patch, and it will conflict if `hello` were
191+
changed to some other word, for instance `greetings` perhaps. The
192+
reason that it works well for git is that lines of text are a somewhat
193+
reasonable granularity for programming languages.
194+
195+
But the before and after don't have to be lines or words. The before
196+
could be any specification of the read state. For a bank account
197+
withdrawal, we might ask for the before state to be larger than, or
198+
equal to the after state. This would be a nice little transaction for
199+
ensuring we don't overdraw.
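
A minimal sketch of such a transaction follows, reading the condition as "the balance must be at least the amount withdrawn" (an assumption about the intended check). The `@guard` and `@update` keys and the helper names are hypothetical, invented for this illustration.

```javascript
// Sketch only: "@guard" tests the read state and "@update" computes the new
// value. Both keys are hypothetical, not part of any existing patch format.
function applyGuardedPatch(doc, patch) {
  const result = { ...doc };
  for (const [key, { "@guard": guard, "@update": update }] of Object.entries(patch)) {
    if (!guard(doc[key])) {
      throw new Error(`Guard failed at "${key}"`);
    }
    result[key] = update(doc[key]);
  }
  return result;
}

// A withdrawal only requires that the balance covers the amount.
function withdrawal(amount) {
  return {
    balance: {
      "@guard": (balance) => balance >= amount,
      "@update": (balance) => balance - amount,
    },
  };
}

const account = { id: "Account/1", balance: 100 };
const afterFirst = applyGuardedPatch(account, withdrawal(60));

let overdrawn = false;
try {
  applyGuardedPatch(afterFirst, withdrawal(60)); // only 40 left
} catch (e) {
  overdrawn = true;
}
console.log(afterFirst.balance, overdrawn);
```

Unlike an exact `@before` value, the guard lets two small concurrent withdrawals both succeed in either order, and rejects only the one that would actually overdraw.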
200+
201+
Or perhaps we want the before state to be specified with a regex? Or
202+
maybe we read a *lot* of values in order to calculate a further value
203+
in the object, in which case we want to know that *none* of these
204+
values change.
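
Both ideas can be sketched with a patch that carries an explicit read set next to its writes. The `reads` / `writes` shape and the `applyWithReads` name are assumptions made for this example. A regular expression in the read set is matched against the current string, and any other value must be equal.

```javascript
// Sketch only: a hypothetical patch shape with a read set and a write set.
function applyWithReads(doc, { reads, writes }) {
  for (const [key, expected] of Object.entries(reads)) {
    const actual = doc[key];
    const ok = expected instanceof RegExp
      ? typeof actual === "string" && expected.test(actual)
      : JSON.stringify(actual) === JSON.stringify(expected);
    if (!ok) {
      throw new Error(`Read condition failed at "${key}"`);
    }
  }
  // Every value we depended on is unchanged, so the writes are safe.
  return { ...doc, ...writes };
}

const orderLine = { sku: "ENC-0002", unitPrice: 10, quantity: 4, total: 30 };

// The new total was calculated from unitPrice and quantity, so both go in the read set.
const retotal = {
  reads: { sku: /^ENC-\d+$/, unitPrice: 10, quantity: 4 },
  writes: { total: 40 },
};

console.log(applyWithReads(orderLine, retotal).total);

// If someone changed the price in the meantime, the patch no longer applies.
let rejected = false;
try {
  applyWithReads({ ...orderLine, unitPrice: 12 }, retotal);
} catch (e) {
  rejected = true;
}
console.log(rejected);
```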
205+
206+
This approach gives us a kind of read isolation that is *tuned* to
207+
the use-case we're actually working with. Making patch the unit of
208+
update gives us just the right granularity for our application, which
209+
really can't be known in advance.
210+
211+
This is an advancement beyond the sort of isolation options usually
212+
provided by a database, and one that extends naturally to objects or
213+
graphs of interconnected objects (as exists in TerminusDB).
214+
215+
## What we have and where we are going
216+
217+
I've implemented a simple JSON diff and patch in TerminusX. But we're
218+
also working on extending this to diffs directed by a
219+
schema. It's also easy to implement and very interesting to imagine a
220+
full space of patches, many of which could never be determined by a
221+
diff, but which would be extremely handy to have for distributed
222+
transactions over document stores. We will be adding these various
223+
operations as we run into use-cases in practice, but we're also very
224+
keen to hear about use cases that people have already encountered in
225+
the wild. Do let me know!
226+
227+
<a name="crdt">*</a> CRDTs answer this question for certain types of
228+
data structures - but not for all. Only certain *types* of
229+
data structures can be updated with these approaches. In addition, many
230+
updates require human-aided review and will never require a
231+
CRDT. Still others will have *object read model* conditions that can
232+
not be specified in a CRDT. Ultimately our databases should support a
233+
range of distributed datatypes including CRDT.
