dfrnt-labs
diff --git a/‎package-lock.json‎
Lines changed: 14 additions & 264 deletions b/‎package-lock.json‎
Lines changed: 14 additions & 264 deletions
diff --git a/‎src/app/blog/2022-02-14-json-diff-and-patch/page.md‎
Lines changed: 233 additions & 0 deletions b/‎src/app/blog/2022-02-14-json-diff-and-patch/page.md‎
Lines changed: 233 additions & 0 deletions
@@ -0,0 +1,233 @@
+---
+title: "What's the Difference: JSON diff and patch"
+nextjs:
+  metadata:
+    title: "What's the Difference: JSON diff and patch"
+    description: "How JSON diff and patch enable distributed data collaboration for Web3, offering multi-master updates without complex protocols or locks."
+    keywords: JSON diff, JSON patch, distributed data, Web3, multi-master, version control, CRDT, data collaboration
+    alternates:
+      canonical: https://terminusdb.org/blog/2022-02-14-json-diff-and-patch/
+    openGraph:
+      images: https://assets.terminusdb.com/docs/technical-documentation-terminuscms-og.png
+media: []
+---
+
+> Author: Gavin Mendel Gleason
+
+What will the distributed data environment in Web3 look like?
+
+How will we have a distributed network of data stores which allow updates and synchronizations?
+
+What is it that allows git to perform distributed operations on text so effectively?
+
+Is it possible to do the same for structured data?
+
+## Web3
+
+These questions are really at the heart of the distributed part of web3. Web3 has other parts: immutability, cryptographic security, etc. But these other elements do not answer how to perform updates on distributed data stores.*
+
+In seeking the answer to these questions I was led to see a rather simple tool as foundational: [JSON diff and patch](https://terminusdb.com/products/jsondiff/).
+
+JSON, because JSON is the structured data format for the web. This will continue to be true for Web3. Everyone uses JSON for just about everything in our web architecture. Other formats are going to be increasingly used as mere optimizations of JSON. Associative arrays have the beauty of (reasonable) human readability, combined with widespread native support in modern computer programming languages. Both computers and humans can read it, what’s not to love!
+
+But what about the diff and patch part?
+
+## The *use case* for diff and patch
+
+A fundamental tool in git’s strategy for distributed management of source code is the concept of the diff and the patch. These foundational operations are what make git possible. Diff is used to construct a patch that can be applied to an object such that the final state makes sense for some value of makes sense.
+
+The application of patches happens because we want a certain before state to be lifted to a certain after state. The patch doesn’t specify everything. Only what it expects to be true of the source, and what it expects to be true after the update.
+
+With this, it’s possible to have distributed updates performed on different parts of source text. Collisions result in some remedial action being required, but if there are no collisions everything can be merged to obtain a final state which respects all updates, no matter when or where they came from.
+
+This is what allows git to be fully multi-master, without requiring or forcing synchronization using any complex protocols (like RAFT).
+
+## Diff and patch in structured data
+
+Do similar situations arise with structured data?
+
+Definitely.
+
+Let's imagine an object which stores information about items in our
+online store.
+
+```javascript
+{ "id" : 13234,
+  "name" : "Retro Encabulator Mark 2",
+  "description" : "The Retro Encabulator Mark II is the lastest 
+                   development of the Retro Encabulator used to 
+                   generate inverse reactive current for unilateral 
+                   phase detractors."
+  "category" : "Cardinal Grammeter Synchronisers",
+  "price" : { "value" : "3430.23", "currency" : "Euro" }},
+  "stock" : 32,
+  "suppliers" : ["Supplier/123","Supplier/4332"] }
+```
+
+If Alice opens the object in an application and changes the name of
+the item to "Retro Encabulator Mark II", it should be possible for Bob
+to update the suppliers list simultaneously without either stepping
+on each other's toes.
+
+In applications, this sort of curation operation is often achieved with
+a *lock* on the object. Which means only one person can win. And locks
+are a massive source of pain, not only because you can't achieve
+otherwise perfectly reasonable concurrent operations, but because you
+risk getting stale locks and having to figure out when to release them.
+
+But what if Sally didn't submit her whole object for update, but only
+the part she wanted to be changed? And Bob did the same?
+
+Now we can perform the updates in three different places, locally for
+Alice, locally for Bob, and then finally at a shared server resource.
+
+The structured patch could be determined by looking at the object
+*before* Alice submitted it, and after, using `diff`. The patch
+constructed from Alice's diff might look like this:
+
+```javascript
+{ "name" : { "@before" : "Retro Encabulator Mark 2",
+             "@after" :  "Retro Encabulator Mark II"}}
+```
+
+And Bob's might look like:
+
+```javascript
+{ "suppliers" : { "@before" : ["Supplier/123","Supplier/4332"],
+                  "@after" :  ["Supplier/123","Supplier/4332",
+                               "Supplier/385"]}}
+```
+
+Now both can apply cleanly to the original document listed above. We
+can stack either patch in any order without difficulty. Perhaps we ask
+Bob and Alice to agree on the application order (using pull / push as
+is done with git). But maybe we just allow them to apply when they
+arrive. The answer depends on the workflow.
+
+## Conflict
+
+But what if Mary comes in before Alice and submits the following
+patch:
+
+```javascript
+{ "name" : { "@before" : "Retro Encabulator Mark 2",
+             "@after" :  "Retro Encabulator Mark two"}}
+```
+
+We have a problem. But we see immediately that the two are in conflict
+and Alice can be asked to resolve the question by surfacing it. In the
+case of data curation, this is a perfectly reasonable workflow. And it
+is this problem of data curation that we can solve with the simplest
+version of JSON diff.
+
+This conflict can be surfaced to Alice, and Bob can be allowed to go
+about his business. Could this particular problem be resolved in a
+purely automatic way with a CRDT? Definitely, but it probably will not
+result in what you want. Last first will work of course, but then
+which is *more right* might need human review, and even worse it might
+result in both results being interleaved (a likely outcome!).
+
+We *could* make the before and after, however, be a text-based patch
+using a textual diff. Probably gits line-based approach is *not* what
+we want here, but rather one that takes words as atoms. It will not
+solve this particular conflict, but it could make text fields much
+more flexible.
+
+Which of these you want, however, requires *semantic direction* of the
+diff algorithm. While lots of structured diff problems will be solved
+by the simplest algorithm, ultimately we need to have a schema that
+helps to direct the meaning of our diffs. String fields might be best
+line-based, word-based, or perhaps they must always be atomic (as with
+identifiers).
+
+## Patch is simpler than Diff
+
+Patch is actually the simpler operation. Patch application basically
+just checks that the read state matches, and then substitutes the
+writes.
+
+Diff, by contrast, has to calculate, and often in practice *guess* a
+good transition from the read state to the write state. The specific
+tuning of the patch provided by a diff is dependent on the needs of
+the application. There are *generic* algorithms that can work decently
+for a range of applications, but there is no one size fits all. This
+is why we will need the *semantic direction* which can be provided by
+a schema.
+
+Diff is also computationally *much* more expensive. Finding the
+minimal change means finding the maximal similarity. As it turns out,
+this is pretty easy for the skeleton of a JSON dictionary, but rather
+a pain for lists, and strings. And for lists of lists... Well, I'll
+get into that later.
+
+Let's just say it's no exaggeration that you can easily wander into
+the heat-death of the universe. Hence heuristics have to be part of
+any fully automatic diff.
+
+## A Complex Patch gives rise to Distributed Transactions
+
+But there are other workflows that might want a slightly more flexible
+approach to ensuring data integrity. The *before* state is really
+sitting there to specify the *read object model*. It tells us what we
+want to be true when we apply the patch.
+
+With git, this might be lines of text. For instance, to change a very
+simple `README.txt` which initially says `hello world` to one that
+says `hello squirrels`, git will produce a patch that looks something
+like the following:
+
+```diff
+index 3b18e51..3a9ea5d 100644
+--- a/README.txt
++++ b/README.txt
+@@ -1 +1 @@
+-hello world
++hello squirrels
+-- 
+2.32.0
+```
+
+This isn't the most compact patch, and it will conflict if hello were
+changed to some other word, for instance `greetings` perhaps. The
+reason that it works well for git is that lines of text are a somewhat
+reasonable granularity for programming languages.
+
+But the before and after don't have to be lines or words. The before
+could be any specification of the read state. For a bank account
+withdrawal, we might ask for the before state to be larger than, or
+equal to the after state. This would be a nice little transaction for
+ensuring we don't overdraw.
+
+Or perhaps we want the before state to be specified with a regex? Or
+maybe we read a *lot* of values in order to calculate a further value
+in the object, in which case we want to know that *none* of these
+values change.
+
+This approach gives us a kind of read isolation that is *tuned* to
+the use-case we're actually working with. Making patch the unit of
+update gives us just the right granularity for our application, which
+really can't be known in advance.
+
+This is an advancement beyond the sort of isolation options usually
+provided by a database, and one that extends naturally to objects or
+graphs of interconnected objects (as exists in TerminusDB).
+
+## What we have and where we are going
+
+I've implemented a simple JSON diff and patch in TerminusX. But we're
+also working on the extensions of this to those specified by a
+schema. It's also easy to implement and very interesting to imagine a
+full space of patches, many of which could never be determined by a
+diff, but which would be extremely handy to have for distributed
+transactions over document stores. We will be adding these various
+operations as we run into use-cases in practice, but we're also very
+keen to hear about use cases that people have already encountered in
+the wild. Do let me know!
+
+<a name="crdt">*</a> CRDTs answer this question for certain types of
+data structures - but not for all. Only certain *types* of
+data structures can be updated with these approaches. In addition, many
+updates require human aided review and will never require a
+CRDT. Still others will have *object read model* conditions that can
+not be specified in a CRDT. Ultimately our databases should support a
+range of distributed datatypes including CRDT.