Add pytorch implementations + tidy
apehex committed Sep 14, 2024
1 parent 13cd6d7 commit 5d31b6f
103 changes: 77 additions & 26 deletions articles/unicode.md
Then an input tensor with a batch dimension `B` of 128 and a sequence dimension `S` would:

- enter the composite embedding layer as a tensor of shape `(B, S)`
- and exit the composite embedding layer as a tensor of shape `(B, S / T, T * C) = (128, 256, 4608)`

The LLM would process the input as a sequence of 256 embeddings, each representing 16 characters.

And each of these embeddings is actually made from the concatenation of 64 byte embeddings.
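
As a rough sketch of those shapes in PyTorch, assuming 64 bytes per token, a byte embedding dimension of 72 (so that 64 * 72 = 4608) and a sequence of 256 * 64 = 16384 bytes:

```python
import torch

__bytes = torch.randint(0, 256, (128, 16384))        # (B, S): 128 sequences of 16384 byte IDs
__grouped = __bytes.reshape(128, 256, 64)            # (B, S / T, T): 256 groups of 64 bytes each
__embedded = torch.nn.Embedding(256, 72)(__grouped)  # (B, S / T, T, C): one embedding per byte
__merged = __embedded.flatten(start_dim=-2)          # (B, S / T, T * C) = (128, 256, 4608)
```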

This layer can then be trained and the embeddings for each byte can be adjusted by the model.
It allows the model to assign an independent meaning to each byte, unlike the two schemes in the sections above.

Now the LLM knows the composition of each token.
It can natively perform calculations, create and understand neologisms, etc.

## Comparison With Tokenization

To sum up:

So there are pros and cons to both approaches.

## Next

With this, LLMs can parse inputs down to the byte level, allowing:

- number calculations
- handling rare words

LLMs are still in the stone age

compiler + llm using tokun embeddings

composition <=> embeddings
weaker form of semantic similarity => improved?

can these embedding and prediction techniques be further improved?

Obviously this research is Western-centric, because of my limited knowledge.
I'd be interested in other points of view, so don't hesitate to reach out :)

## Implementations

### Composite Embeddings
For example in Keras:

```python
class TokunEmbedding(keras.layers.Embedding):
    def call(self, inputs):
        # embed each element separately
        __outputs = super(TokunEmbedding, self).call(inputs)
        # concatenate the embeddings
        return keras.ops.einsum('bste -> bs(te)', __outputs)
```

Or in PyTorch:

```python
class TokunEmbedding(torch.nn.Embedding):
    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # embed each element separately
        __outputs = super(TokunEmbedding, self).forward(inputs)
        # concatenate the embeddings of each token by merging the last two axes
        return torch.flatten(__outputs, start_dim=-2, end_dim=-1)
```
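
For instance, with assumed dimensions (256 possible byte values, 72 dimensions per byte embedding, 64 bytes per token):

```python
__layer = TokunEmbedding(num_embeddings=256, embedding_dim=72)
# (B, S / T, T) = (128, 256, 64) byte IDs in, (B, S / T, T * C) = (128, 256, 4608) embeddings out
print(__layer(torch.randint(0, 256, (128, 256, 64))).shape)  # torch.Size([128, 256, 4608])
```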

The `einsum` operation could be replaced with a more generic "merge" operation independent of the rank of its input.
For example, the `einsum` equation could be generated according to the rank of the input:

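A possible sketch, with made-up helper names, using the `einops` library (its `rearrange` accepts this kind of grouping pattern):

```python
import einops
import torch

def merge_equation(rank: int) -> str:
    # one letter per axis, e.g. rank 4 -> 'a b c d -> a b (c d)'
    __axes = [chr(ord('a') + __i) for __i in range(rank)]
    return '{} -> {} ({})'.format(' '.join(__axes), ' '.join(__axes[:-2]), ' '.join(__axes[-2:]))

def merge_last_axes(data: torch.Tensor) -> torch.Tensor:
    # group the last two axes, whatever the rank of the input
    return einops.rearrange(data, merge_equation(data.dim()))
```
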
### Binary Predictions

The targets for the binary predictions are calculated by decomposing the inputs in base 2.
For example in Tensorflow:

```python
def expand_base(data: tf.Tensor, base: int, depth: int, bigendian: bool=True) -> tf.Tensor:
__shape = len(list(data.shape)) * [1] + [depth]
# base indexes
__idx = range(depth)[::-1] if bigendian else range(depth)
# base divisor and moduli
__div = tf.convert_to_tensor([base ** __e for __e in __idx], dtype=data.dtype)
__mod = tf.convert_to_tensor([base ** (__e + 1) for __e in __idx], dtype=data.dtype)
# match the input shape
    __div = tf.reshape(__div, shape=__shape)
    __mod = tf.reshape(__mod, shape=__shape)
    # Euclidean division
    __digits = tf.math.floordiv(tf.math.floormod(tf.expand_dims(data, axis=-1), __mod), __div)
    # format
return tf.cast(__digits, dtype=data.dtype)
```

In PyTorch:

```python
def expand_base(data: torch.Tensor, base: int, depth: int, bigendian: bool=True) -> torch.Tensor:
__shape = data.dim() * [1] + [depth]
# base indexes
__idx = range(depth)[::-1] if bigendian else range(depth)
# base divisors and moduli
__div = torch.tensor([base ** __e for __e in __idx], dtype=data.dtype, device=data.device)
__mod = torch.tensor([base ** (__e + 1) for __e in __idx], dtype=data.dtype, device=data.device)
# match the input shape
__div = __div.view(__shape)
__mod = __mod.view(__shape)
    # Euclidean division
__digits = torch.div(torch.remainder(data.unsqueeze(-1), __mod), __div, rounding_mode='floor')
# format
return __digits.to(data.dtype)
```
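
For instance, a quick check of the PyTorch version:

```python
# decompose 5 into 8 binary digits, big-endian
print(expand_base(torch.tensor([5]), base=2, depth=8))  # tensor([[0, 0, 0, 0, 0, 1, 0, 1]])
```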

During inference, the predictions can be interpreted by doing the reverse operation:

```python
def reduce_base(data: tf.Tensor, base: int, axis: int=-1, keepdims: bool=False, bigendian: bool=True) -> tf.Tensor:
__rank = len(data.shape)
# select the dimension of the given axis
__shape = [__d if (__i - axis) % __rank == 0 else 1 for __i, __d in enumerate(list(data.shape))]
# exponents
__exp = range(__shape[axis])[::-1] if bigendian else range(__shape[axis])
# base multipliers
__base = tf.convert_to_tensor([base ** __e for __e in __exp], dtype=data.dtype)
# match the input shape
__base = tf.reshape(__base, shape=__shape)
# recompose the number
return tf.reduce_sum(data * __base, axis=axis, keepdims=keepdims)
```

Or, in PyTorch:

```python
def reduce_base(data: torch.Tensor, base: int, axis: int=-1, keepdims: bool=False, bigendian: bool=True) -> torch.Tensor:
__rank = len(data.shape)
# select the dimension of the given axis
__shape = [__d if (__i - axis) % __rank == 0 else 1 for __i, __d in enumerate(list(data.shape))]
# exponents
__exp = torch.arange(__shape[axis] - 1, -1, -1) if bigendian else torch.arange(__shape[axis])
# base multipliers
__base = torch.pow(base, __exp).to(data.device).to(data.dtype)
# match the input shape
__base = __base.view(__shape)
    # recompose the number
return torch.sum(data * __base, dim=axis, keepdim=keepdims)
```
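
And the round trip recovers the original values:

```python
__digits = expand_base(torch.tensor([5]), base=2, depth=8)  # tensor([[0, 0, 0, 0, 0, 1, 0, 1]])
print(reduce_base(__digits, base=2, axis=-1))               # tensor([5])
```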

## Language Bases
