GENERALIZATION: The core idea of SMT that you need to understand is, how we can avoid recomputation of process if something didn't change the latent-meaning? Particularly, in the code example, we use a merkle tree to hash a system-folder tree, and apply semantic integrity. But this similar concept could be applied to many ideas, so please, don't limit to this use case and share what you can create with this first-principle idea of semantic integrity!
Mathematical Explanation
For a more technical perspective, here's how semantic integrity generalizes the classic Merkle tree:
Traditional Merkle trees use hash functions that are discrete and ideally injectiveβmeaning each unique input produces a unique output. In contrast, a Semantic Merkle Tree (SMT) introduces approximate or fuzzy hashing: different inputs with similar meaning can map to the same output. Thus, the SMT hash function is not strictly injective; multiple ( X ) can map to the same ( Y ) if their meanings are close enough.
Let
- If
$\varepsilon = 1$ , we recover the classic Merkle tree: any change is significant. - If
$\varepsilon < 1$ , we allow for semantic equivalence: only changes that alter the meaning beyond the threshold are considered significant.
Formally:
Given two versions of content,
where
The update rule is:
Note: GitHub now supports native math rendering in Markdown using
$...$
for inline and$$...$$
for block math, so the above will render correctly on GitHub (see announcement).
This approach allows the Merkle tree to ignore changes that do not alter the latent meaning of the content, thus reducing unnecessary recomputation and propagation.
Aspect | In One Sentence |
---|---|
What it is | A hierarchical map of embeddings (latent-meaning vectors) arranged like a Merkle tree, where each node also stores a conventional hash |
What it does | After an edit, it decides whether the node's meaning moved beyond a threshold Ξ΄; if not, it leaves the original hash and cached documentation untouched, preventing needless upstream work |
What problem it solves | Eliminates expensive or time-wasting reactions to purely cosmetic changes (typos, whitespace, re-ordering) while still catching genuine semantic drift |
Traditional Merkle trees and file monitoring systems treat any change as significant:
(In this following example, just we add ''.'' in the end.)
- The quick brown fox jumps over the lazy dog
+ The quick brown fox jumps over the lazy dog.
βοΈ One period added β Entire CI/CD pipeline triggered β Thousands of dollars in compute costs
Semantic Merkle Tree says: "Wait, the meaning didn't change!"
π‘ How It Works (This particular case is SMT applied to file-system tree, but could be generalized!)
- π Content Analysis: Each file is converted into a semantic embedding vector using AI models
- π Similarity Check: When files change, SMT compares the meaning using cosine similarity
- β‘ Smart Updates: Only propagates changes when semantic similarity drops below threshold
ΞΈ
(default: 0.70) - π² Tree Structure: Maintains traditional Merkle tree benefits while adding semantic awareness
- Python 3.12+
- Poetry (recommended) or pip
# Clone the repository
git clone https://github.com/octaviusp/semantic-merkle-tree.git
cd semantic-merkle-tree
# Install dependencies
poetry install
sentence-transformers
- AI embeddingsnumpy
- Vector operationstqdm
- Progress bars
python main.py build example_folder/
Output:
building: 100%|ββββββββββββ| 2/2 [00:01<00:00, 1.23it/s]
Build complete. root_hash = a1b2c3d4e5f6...
Edit a file in example_folder/
:
echo "I was running down the street" > example_folder/1.txt
python main.py verify example_folder/
Output:
verifying: 100%|ββββββββββββ| 2/2 [00:01<00:00, 1.45it/s]
Cosine similarity for 'example_folder/1.txt': 0.892156
No semantic changes.
root_hash = a1b2c3d4e5f6...
π Notice: Despite the text change ("in" β "down"), the similarity (0.89) is above our threshold (0.70), so SMT considers it semantically equivalent!
π° Cost Savings: Skip expensive operations when only formatting changes!
π Results: 70-90% reduction in LLM API costs for documentation workflows!
π Policy Document Tracking
Scenario | Traditional System | With SMT |
---|---|---|
Typo fix in policy | π¨ Alert sent to legal team | β No alert (semantic similarity: 0.98) |
Formatting change | π¨ Full review process triggered | β Ignored (cosmetic change) |
Actual policy change | π¨ Alert sent | π¨ Alert sent (semantic similarity: 0.45) |
β±οΈ Time Saved: Legal teams focus on substance, not formatting!
The semantic threshold ΞΈ
(theta) controls sensitivity:
# In main.py
THETA = 0.70 # Default: 70% similarity required
# More sensitive (catches smaller changes)
THETA = 0.85 # 85% similarity required
# Less sensitive (ignores more changes)
THETA = 0.50 # 50% similarity required
SMT uses all-MiniLM-L6-v2
by default, but you can swap models:
# Faster, smaller model
MODEL = SentenceTransformer("all-MiniLM-L12-v2")
# More accurate, larger model
MODEL = SentenceTransformer("all-mpnet-base-v2")
# Multilingual support
MODEL = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
-
False detections: Regardless of being a good technique for some use-cases, there are open challenge problems, like, what would happen if you have false positives, you need to use a extremely well trained semantic criteria for critic use-cases, we recommend to use state-of-the-art embedding models to lowest probability of errors.
-
Epsilon search: Is very hard to search a perfect epsilon threshold for every application, so for your use case, you need to try-&-error how to finetune this hyperparameter to seek the best trade-off.
This project is licensed under the MIT License - see the LICENSE file for details.
- Sentence Transformers team for simple and fast embedding models
- Merkle Tree inventors for the foundational data structure
- Open source community for inspiration and feedback
- π§ Email: octaviopavon7@gmail.com
- π Issues: GitHub Issues
- π¬ Discussions: GitHub Discussions
β Star this repo if SMT helped you save compute costs! β
Made with β€οΈ by octaviopavon