diff --git a/articles/unicode.md b/articles/unicode.md
index 99ba8b6..f51c3ca 100644
--- a/articles/unicode.md
+++ b/articles/unicode.md
@@ -4,84 +4,50 @@
> `tokun` took tokens to t-can
-In machine learning 3 worlds / visions are at odds: the computer, math and human sides.
+In machine learning, 3 worlds are at odds: the computer, math and human sides.
Tokenization bridges the gap from machine encoding to tensor embeddings using human intuition, with algorithms like BPE.
In my [previous article][huggingface-tokenization-1], I proposed to train a model to translate / compress the encoding bytes into embeddings.
-Actually, none of this is necessary: Unicode can be used as the basis for LLM embeddings.
+Actually, none of this is necessary: Unicode can be directly used as the basis for LLM embeddings.
-## TLDR
+## TL;DR
-90%
+This article proposes to get rid of the tokenization step and to mirror the decomposition of words into characters in the embeddings.
-With only 26 letters and 10 digits, it is possible to compose 3,760,620,109,779,061 "words"
+It can be achieved with small changes to the transformer architecture, limited to the input and output layers.
-The expressive power of a basis is exponentially greater than a collection of elements.
+First, on the input side, instead of building monolithic and unrelated token embeddings:
+- the text is encoded using UTF-32-BE into a sequence of bytes (values in $\left[ 0 .. 256 \right[$)
+- each byte is embedded independently using a `(256, E)` kernel
+- the byte embeddings are merged by groups of size `T`
-similarly = generative set of base elements to compose all others
-
-This article proposes to rethink the first and last layers of LLMs in an effort to rationalize
-
-the unit of encoding is the byte,
-
-=> composite embeddings in the last section
-
-instead of building monolithic and unrelated embeddings:
-- split
-- embed the bytes
-- merge the byte embeddings into a token embedding
-
-embeddings built on the Unicode structure
+Starting from the UTF-32-BE bytes, the process is illustrated below:
-It's extremely surprising to me that this isn't the standard, considering:
-
-out of the 3 input embeddings, composite embeddings achieve:
-- sequence compression by arbitrary factor
-- numeric proximity <=> semantic similarity
-
-process = simple embeddings + reshape => cf implementation
-cf comparison for detailed analysis of perf
-
-you will find a detailed explanation for each of these points [below](#comparison-with-tokenization).
-
-- standard: Unicode is shared worldwide, while vocabularies are specific to regions / models
-- international: all languages are covered
-- native: no training required
-- compression: smallest tensor size possible
-- consistent: all tokens have the same dimension, chosen freely
-- structured: Unicode has
-- numbers: the encoding is correlated to actual number values
-- composition: embeddings now
-- timeless: the Unicode standard has little variations over time
+With `B = 128`, `S = 8192`, `T = 32`, and `E = 64`, the embeddings of 32 bytes (equivalent to 8 characters) are merged into vectors of dimension 2048.
+A sequence of 8192 bytes (2048 characters) is turned into a tensor of shape `(S / T, T * E) = (256, 2048)`, and a whole batch has shape `(128, 256, 2048)`.
-- arbitrary token length: hyper-parameter
+The bytes can be given independent meaning thanks to the embedding table.
+And the overall combination pattern of these vectors holds the information on composition.
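+
+Below is a minimal sketch of this input pipeline in TensorFlow / Keras, using the dimensions above; the function and variable names are illustrative, not the actual `tokun` code:
+
+```python
+import tensorflow as tf
+import keras
+
+# hyper-parameters from the example above
+B, S, T, E = 128, 8192, 32, 64  # batch size, sequence length in bytes, token length in bytes, byte embedding dimension
+
+def preprocess(text: tf.Tensor) -> tf.Tensor:
+    # transcode the UTF-8 input strings into UTF-32-BE
+    __utf32 = tf.strings.unicode_transcode(text, input_encoding='UTF-8', output_encoding='UTF-32-BE')
+    # split each string into byte values in [0 .. 256[, padded / truncated to S bytes
+    __bytes = tf.io.decode_raw(__utf32, out_type=tf.uint8, fixed_length=S)
+    # index values for the embedding lookup
+    return tf.cast(__bytes, tf.int32)                            # (B, S)
+
+# embed each byte independently with a (256, E) kernel
+embed = keras.layers.Embedding(input_dim=256, output_dim=E)      # (B, S) => (B, S, E)
+# merge the byte embeddings by groups of T, without any extra weights
+merge = keras.layers.Reshape((S // T, T * E))                    # (B, S, E) => (B, S // T, T * E)
+
+__embeddings = merge(embed(preprocess(tf.constant(B * ['hello world']))))  # (128, 256, 2048)
+```
+
+The merge is a plain reshape: the only trainable weights are the `(256, E)` byte embeddings.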
-- reduce the sequence length: faster processing, less resources
-- give "meaning"
-- avoid "meaningless" predictions and constrain to
-
-desired properties:
-
-- compression
-- proximity
-- composition
-- timeless: concepts and dates appear more / less frequently depending on the period
-
-OUTPUT = binary predictions leverage the numeric locality != categorical (softmax) predictions
-
-instead of predicting an index out of 200k, predict the base 2 representation of the index
+The output layer could be a standard softmax of depth 256 for each byte prediction.
+But instead of evaluating each of the 256 options, it is more efficient to predict the byte value itself, as a vector of 8 bits.
-more specifically, this scheme is especially suited to predict byte values: each can be represented in base 2 by a vector of **dimension 8**.
+To this end, the head activation is replaced with a sigmoid, which returns an independent probability for each bit.
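+
+As a rough sketch of such a head, under the same naming assumptions as above (not the actual `tokun` implementation):
+
+```python
+import keras
+
+T = 32  # token length in bytes, as above
+
+# map each token embedding to 8 * T independent bit probabilities, i.e. T bytes of 8 bits each
+head = keras.layers.Dense(units=8 * T, activation='sigmoid')   # (B, S // T, latent dim) => (B, S // T, 8 * T)
+# trained with a binary cross-entropy against the base-2 decomposition of the target bytes
+loss = keras.losses.BinaryCrossentropy(from_logits=False)
+```
+
+The base-2 decomposition of the targets is detailed in the [binary predictions](#binary-predictions) section below.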
-suppose the token dimension is set to 16 characters, that's 64 bytes to predict per token or a vector of dimension 512.
+Just [like the previous iteration of tokun](tokun.md), this scheme addresses most tokenization shortcomings.
+Plus:
-All in all,
+- the token length is now a hyper-parameter and can be freely chosen
+- there is no need for extra preprocessing and / or training
+- it brings some minor model optimization, with smaller kernels
+
+You'll find more details in the [comparison section](#comparison-with-tokenization).
## TOC
@@ -106,14 +72,9 @@ All in all,
- [Binary Predictions](#binary-predictions)
- [Next](#next)
-## Notice
-
-western language
-interested on perspective other culture / continent
-
## Tokenization And Ancient Languages
-Essentially, tokenization merges individual characters (bytes) into monolithic chunks.
+Essentially, tokenization merges individual characters (bytes) into **monolithic chunks**.
Here, 56 cyrillic characters are grouped into 20 tokens:
@@ -189,7 +150,7 @@ binary error => close prediction
but, close tokens are unrelated
=> other input repr
-## Language Basis
+## Language Bases
- computer: sequence => codepoint => byte => bits
- math: tensors => axes => dimensions
@@ -491,7 +452,8 @@ def _equation(self, inputs: keras.Input) -> str:
### Binary Predictions
-The targets for the binary predictions are calculated by decomposing the inputs in base 2:
+The targets for the binary predictions are calculated by decomposing the inputs in base 2.
+For example, in TensorFlow:
```python
def expand_base(data: tf.Tensor, base: int, depth: int) -> tf.Tensor:
@@ -513,7 +475,7 @@ def expand_base(data: tf.Tensor, base: int, depth: int) -> tf.Tensor:
During inference, the predictions can be interpreted by doing the reverse operation:
```python
-def _reduce_base(data: tf.Tensor, base: int, axis: int=-1, keepdims: bool=False, bigendian: bool=True) -> tf.Tensor:
+def reduce_base(data: tf.Tensor, base: int, axis: int=-1, keepdims: bool=False, bigendian: bool=True) -> tf.Tensor:
# select the dimension of the given axis
__shape = mlable.shaping.filter_shape(shape=data.shape, axes=[axis])
# exponents
@@ -543,6 +505,9 @@ weaker form of semantic similarity => improved?
can these embedding and prediction techniques be further improved?
+Obviously this research is Western-centric, because of my limited knowledge.
+I'd be interested in other points of view, so don't hesitate to reach out :)
+
[huggingface-tokenization-1]: https://huggingface.co/blog/apehex/tokenization-is-a-dead-weight
[image-pca-bytes]: .images/projector/bytes.pca.gif
[image-umap-bytes]: .images/projector/bytes.umap.gif