# Attention is all I need

My attempt at staying relevant in 2025.

Based on the 2017 paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762), this project implements the Transformer architecture in TypeScript/JavaScript using TensorFlow.js. The goal is to gain a deeper understanding of this foundational technology by building it from scratch.
- Overview
- Features
- Architecture
- Installation
- Usage
- Project Structure
- How It Works
- Configuration
- Examples
- Resources
- License
## Overview

The Transformer is a revolutionary deep learning architecture introduced in 2017 that has become the foundation for modern AI models like GPT, BERT, and countless others. This project provides a fully functional, educational implementation of the complete architecture.

**What makes Transformers special?**
- **Parallelizable**: Unlike RNNs/LSTMs, processes all positions simultaneously
- **Long-range dependencies**: Captures relationships across entire sequences
- **Attention mechanism**: Learns what to focus on automatically
- **Scalable**: Can be trained on massive datasets efficiently
## Features

- ✅ **Complete Transformer Architecture**: Full encoder-decoder implementation
- ✅ **Multi-Head Attention**: Parallel attention mechanisms for learning diverse patterns
- ✅ **Positional Encoding**: Sine/cosine position embeddings
- ✅ **Layer Normalization & Residual Connections**: For stable deep network training
- ✅ **Configurable Hyperparameters**: Easily customize model size and capacity
- ✅ **Masking Support**: Padding masks and look-ahead masks for proper training
- ✅ **TypeScript**: Fully typed for a better development experience
- ✅ **TensorFlow.js**: Runs in Node.js (or the browser with minor modifications)
- ✅ **Extensive Documentation**: Every component thoroughly explained with comments
## Architecture

The Transformer follows the encoder-decoder architecture:
```
 Input Sequence                       Target Sequence (shifted right)
       ↓                                       ↓
[Embedding + Positional Encoding]    [Embedding + Positional Encoding]
       ↓                                       ↓
┌──────────────┐                      ┌──────────────┐
│   ENCODER    │                      │   DECODER    │
│  (N layers)  │─────────────────────▶│  (N layers)  │
│              │    Cross-Attention   │              │
│ - Self-      │                      │ - Masked     │
│   Attention  │                      │   Self-Attn  │
│ - FFN        │                      │ - Cross-     │
│              │                      │   Attention  │
│              │                      │ - FFN        │
└──────────────┘                      └──────────────┘
                                               ↓
                                        [Linear Layer]
                                               ↓
                                       [Output Logits]
```
1. **Encoder**: Processes the input sequence and builds contextualized representations
   - Multi-head self-attention
   - Position-wise feed-forward networks
   - Layer normalization and residual connections

2. **Decoder**: Generates the output sequence one token at a time
   - Masked multi-head self-attention (can't look ahead)
   - Multi-head cross-attention (attends to encoder output)
   - Position-wise feed-forward networks
   - Layer normalization and residual connections

3. **Attention Mechanism**: The core innovation
   - Scaled dot-product attention
   - Multi-head attention for parallel pattern learning

4. **Positional Encoding**: Adds position information to embeddings
   - Uses sine and cosine functions at different frequencies
## Installation

```bash
# Clone the repository
git clone https://github.com/nunsie/transformers.git
cd transformers

# Install dependencies
npm install
```

## Usage

### Basic Example

```typescript
import { Transformer, TransformerConfig } from './src/transformer';
import * as tf from '@tensorflow/tfjs-node';

// Configure the model
const config: TransformerConfig = {
  numLayers: 2,              // Number of encoder/decoder layers
  dModel: 128,               // Model dimension
  numHeads: 8,               // Number of attention heads
  dff: 512,                  // Feed-forward dimension
  inputVocabSize: 5000,      // Input vocabulary size
  targetVocabSize: 5000,     // Target vocabulary size
  maxPositionEncoding: 1000, // Maximum sequence length
  dropoutRate: 0.1,          // Dropout rate
};

// Create the transformer
const transformer = new Transformer(config);

// Prepare input data (token IDs)
const input = tf.tensor2d([[1, 45, 234, 12, 89, 0, 0]], [1, 7], 'int32');
const target = tf.tensor2d([[2, 56, 123, 78, 0, 0, 0]], [1, 7], 'int32');

// Forward pass
const output = transformer.call(input, target, false);
console.log('Output shape:', output.shape);
// Expected: [batch_size, target_seq_len, target_vocab_size]

// Get predictions
const predictions = tf.argMax(output, -1);
console.log('Predictions:', await predictions.array());

// Cleanup
transformer.dispose();
```

### Build and Run

```bash
# Build the project
npm run build

# Run the example
npm run dev
```

## Project Structure

```
transformers/
├── src/
│   ├── attention.ts             # Multi-head attention implementation
│   ├── decoder.ts               # Decoder layer and stack
│   ├── encoder.ts               # Encoder layer and stack
│   ├── feedforward.ts           # Position-wise feed-forward network
│   ├── positional-encoding.ts   # Positional encoding utilities
│   ├── transformer.ts           # Main Transformer model
│   ├── example.ts               # Example usage
│   └── index.ts                 # Public API exports
├── package.json
├── tsconfig.json
└── README.md
```
## How It Works

### Embedding

Converts token IDs to dense vectors:

```
Token ID: 45 → Embedding: [0.1, 0.3, -0.5, ..., 0.2]  (dModel dimensions)
```
### Positional Encoding

Adds position information, since Transformers process all positions in parallel:

```
PE(pos, 2i)   = sin(pos / 10000^(2i/dModel))
PE(pos, 2i+1) = cos(pos / 10000^(2i/dModel))
```
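A minimal sketch of how these formulas translate to code (illustrative, not the repo's positional-encoding.ts):

```typescript
// Sinusoidal positional encoding: returns a [maxLen, dModel] matrix.
function positionalEncoding(maxLen: number, dModel: number): number[][] {
  const pe: number[][] = [];
  for (let pos = 0; pos < maxLen; pos++) {
    const row: number[] = [];
    for (let d = 0; d < dModel; d++) {
      // Dimensions are paired: even indices use sin, odd indices use cos,
      // and each pair (2i, 2i+1) shares the frequency 1 / 10000^(2i/dModel).
      const angle = pos / Math.pow(10000, (2 * Math.floor(d / 2)) / dModel);
      row.push(d % 2 === 0 ? Math.sin(angle) : Math.cos(angle));
    }
    pe.push(row);
  }
  return pe;
}
```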
### Scaled Dot-Product Attention

The core attention mechanism:

```
Attention(Q, K, V) = softmax(QK^T / √d_k) V
```

- **Q (Query)**: What am I looking for?
- **K (Key)**: What information is available?
- **V (Value)**: The actual information
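The formula maps almost directly onto TensorFlow.js ops. A minimal sketch (mask handling and dropout omitted; not the repo's attention.ts):

```typescript
import * as tf from '@tensorflow/tfjs-node';

// Scaled dot-product attention for tensors shaped [batch, seqLen, dK].
function scaledDotProductAttention(
  q: tf.Tensor3D,
  k: tf.Tensor3D,
  v: tf.Tensor3D
): tf.Tensor3D {
  const dK = k.shape[2];
  // QK^T -> [batch, seqLenQ, seqLenK], scaled by sqrt(d_k)
  const scores = tf.matMul(q, k, false, true).div(tf.sqrt(tf.scalar(dK)));
  // Softmax over the key axis yields the attention weights
  const weights = tf.softmax(scores);
  // Weighted sum of the values
  return tf.matMul(weights, v) as tf.Tensor3D;
}
```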
### Multi-Head Attention

Runs multiple attention mechanisms in parallel:

- Different heads learn different types of relationships
- Outputs are concatenated and projected
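The essential shape trick is splitting dModel into numHeads smaller slices. A sketch (illustrative helper, not necessarily how attention.ts organizes it):

```typescript
import * as tf from '@tensorflow/tfjs-node';

// Reshape [batch, seqLen, dModel] -> [batch, numHeads, seqLen, depth]
// so each head attends over its own depth-sized slice in parallel.
function splitHeads(x: tf.Tensor3D, numHeads: number): tf.Tensor4D {
  const [batch, seqLen, dModel] = x.shape;
  const depth = dModel / numHeads; // requires dModel % numHeads === 0
  const reshaped = tf.reshape(x, [batch, seqLen, numHeads, depth]);
  return tf.transpose(reshaped, [0, 2, 1, 3]) as tf.Tensor4D;
}

// After attention runs per head, the inverse transpose + reshape
// concatenates the heads back to [batch, seqLen, dModel], which is
// then passed through a final linear projection.
```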
### Feed-Forward Network

Two linear transformations with a ReLU activation in between:

```
FFN(x) = max(0, xW1 + b1)W2 + b2
```
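As a sketch, this is just two dense layers applied at every position; the dimensions below are illustrative, matching the config example above:

```typescript
import * as tf from '@tensorflow/tfjs-node';

const dModel = 128; // model dimension
const dff = 512;    // hidden dimension of the FFN

// Position-wise FFN: the same two layers are applied to each position's vector.
const ffn = tf.sequential({
  layers: [
    tf.layers.dense({ units: dff, activation: 'relu', inputShape: [dModel] }), // max(0, xW1 + b1)
    tf.layers.dense({ units: dModel }),                                        // (...)W2 + b2
  ],
});

// One position's vector in, one vector out: [1, dModel] -> [1, dModel]
const y = ffn.predict(tf.ones([1, dModel])) as tf.Tensor;
console.log(y.shape); // [1, 128]
```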
### Residual Connections & Layer Normalization

Every sub-layer is wrapped with a residual connection followed by layer normalization, which keeps deep networks stable during training:

```
output = LayerNorm(x + Sublayer(x))
```
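A minimal sketch of this "Add & Norm" wrapper using TF.js layers (illustrative, not the repo's exact code):

```typescript
import * as tf from '@tensorflow/tfjs-node';

// Normalizes over the feature (last) axis.
const layerNorm = tf.layers.layerNormalization({ axis: -1 });

// Residual connection: add the sub-layer's output back to its input,
// then normalize the sum.
function addAndNorm(x: tf.Tensor, sublayerOutput: tf.Tensor): tf.Tensor {
  return layerNorm.apply(tf.add(x, sublayerOutput)) as tf.Tensor;
}
```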
## Configuration

The TransformerConfig interface lets you customize the model:

| Parameter | Description | Typical Value |
|---|---|---|
| `numLayers` | Number of encoder/decoder layers | 6 |
| `dModel` | Model dimension (embedding size) | 512 |
| `numHeads` | Number of attention heads | 8 |
| `dff` | Feed-forward hidden dimension | 2048 |
| `inputVocabSize` | Size of input vocabulary | 10000 |
| `targetVocabSize` | Size of target vocabulary | 10000 |
| `maxPositionEncoding` | Maximum sequence length | 5000 |
| `dropoutRate` | Dropout rate for regularization | 0.1 |

**Note:** `dModel` must be divisible by `numHeads`, since each attention head works with a slice of size `dModel / numHeads`.
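A quick sanity check for this constraint (illustrative values):

```typescript
// Each attention head gets an equal slice of the model dimension.
const dModel = 512;
const numHeads = 8;

if (dModel % numHeads !== 0) {
  throw new Error(`dModel (${dModel}) must be divisible by numHeads (${numHeads})`);
}

const depthPerHead = dModel / numHeads; // 512 / 8 = 64 dimensions per head
```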
## Examples

### Translation (Forward Pass)

```typescript
// English to French translation
const englishTokens = tokenize("The cat sat on the mat");
const frenchTokens = tokenize("<start> Le chat");

const output = transformer.call(
  tf.tensor2d([englishTokens], [1, englishTokens.length], 'int32'),
  tf.tensor2d([frenchTokens], [1, frenchTokens.length], 'int32'),
  false
);

// Get next-token prediction from the last decoder position
const lastPos = output.shape[1] - 1;
const nextTokenProbs = tf.softmax(output.slice([0, lastPos, 0], [1, 1, -1]));
```

### Masking

```typescript
import { createPaddingMask, createLookAheadMask } from './src/transformer';

// Ignore padding tokens
const paddingMask = createPaddingMask(inputSequence);

// Prevent looking at future tokens
const lookAheadMask = createLookAheadMask(sequenceLength);
```
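For intuition, the look-ahead mask is typically a strictly upper-triangular matrix marking the future positions each token must ignore. A minimal sketch of building one, following the common TensorFlow convention of 1 = masked (the repo's createLookAheadMask may use a different convention):

```typescript
import * as tf from '@tensorflow/tfjs-node';

// 1 marks positions a token is NOT allowed to attend to (its future).
const seqLen = 4;
const lookAhead = tf.sub(
  tf.ones([seqLen, seqLen]),
  tf.linalg.bandPart(tf.ones([seqLen, seqLen]), -1, 0) // keep lower triangle + diagonal
);
lookAhead.print();
// [[0, 1, 1, 1],
//  [0, 0, 1, 1],
//  [0, 0, 0, 1],
//  [0, 0, 0, 0]]
```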
## Resources

- Original Paper: [Attention Is All You Need](https://arxiv.org/abs/1706.03762) (Vaswani et al., 2017)
- TensorFlow.js: [Official Documentation](https://www.tensorflow.org/js)
- The Illustrated Transformer: [Visual Guide](https://jalammar.github.io/illustrated-transformer/)
- The Annotated Transformer: [Harvard NLP](https://nlp.seas.harvard.edu/annotated-transformer/)
This is a personal learning project, but suggestions and improvements are welcome! Feel free to open issues or submit pull requests.
## License

ISC License - see `package.json` for details.
**Nusrath Khan**

- GitHub: [@nunsie](https://github.com/nunsie)
*Built with ❤️ to understand the technology that's changing the world*