Transformers

Attention is all I need

My attempt at staying relevant in 2025.

Based on the 2017 paper Attention Is All You Need, this project implements the popular Transformer architecture in TypeScript using TensorFlow.js. The goal is to gain a deeper understanding of this foundational technology by building it from scratch.

🎯 Overview

The Transformer is a revolutionary deep learning architecture introduced in 2017 that has become the foundation for modern AI models like GPT, BERT, and countless others. This project provides a fully functional, educational implementation of the complete architecture.

What makes Transformers special?

  • Parallelizable: Unlike RNNs/LSTMs, processes all positions simultaneously
  • Long-range dependencies: Captures relationships across entire sequences
  • Attention mechanism: Learns what to focus on automatically
  • Scalable: Can be trained on massive datasets efficiently

✨ Features

  • βœ… Complete Transformer Architecture: Full encoder-decoder implementation
  • βœ… Multi-Head Attention: Parallel attention mechanisms for learning diverse patterns
  • βœ… Positional Encoding: Sine/cosine position embeddings
  • βœ… Layer Normalization & Residual Connections: For stable deep network training
  • βœ… Configurable Hyperparameters: Easily customize model size and capacity
  • βœ… Masking Support: Padding masks and look-ahead masks for proper training
  • βœ… TypeScript: Fully typed for better development experience
  • βœ… TensorFlow.js: Runs in Node.js (or browser with minor modifications)
  • βœ… Extensive Documentation: Every component thoroughly explained with comments

πŸ—οΈ Architecture

The Transformer follows the encoder-decoder architecture:

Input Sequence                     Target Sequence (shifted right)
     ↓                                       ↓
[Embedding + Positional Encoding]  [Embedding + Positional Encoding]
     ↓                                       ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   ENCODER   β”‚                    β”‚   DECODER   β”‚
β”‚  (N layers) │───────────────────→│  (N layers) β”‚
β”‚             β”‚   Cross-Attention  β”‚             β”‚
β”‚  - Self     β”‚                    β”‚  - Masked   β”‚
β”‚    Attentionβ”‚                    β”‚    Self-Attnβ”‚
β”‚  - FFN      β”‚                    β”‚  - Cross-   β”‚
β”‚             β”‚                    β”‚    Attentionβ”‚
β”‚             β”‚                    β”‚  - FFN      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                          ↓
                                   [Linear Layer]
                                          ↓
                                   [Output Logits]

Key Components

  1. Encoder: Processes the input sequence and builds contextualized representations

    • Multi-head self-attention
    • Position-wise feed-forward networks
    • Layer normalization and residual connections
  2. Decoder: Generates output sequence one token at a time

    • Masked multi-head self-attention (can't look ahead)
    • Multi-head cross-attention (attends to encoder output)
    • Position-wise feed-forward networks
    • Layer normalization and residual connections
  3. Attention Mechanism: The core innovation

    • Scaled dot-product attention
    • Multi-head attention for parallel pattern learning
  4. Positional Encoding: Adds position information to embeddings

    • Uses sine and cosine functions at different frequencies

πŸš€ Installation

# Clone the repository
git clone https://github.com/nunsie/transformers.git
cd transformers

# Install dependencies
npm install

πŸ’» Usage

Basic Example

import { Transformer, TransformerConfig } from './src/transformer';
import * as tf from '@tensorflow/tfjs-node';

// Configure the model
const config: TransformerConfig = {
    numLayers: 2,              // Number of encoder/decoder layers
    dModel: 128,               // Model dimension
    numHeads: 8,               // Number of attention heads
    dff: 512,                  // Feed-forward dimension
    inputVocabSize: 5000,      // Input vocabulary size
    targetVocabSize: 5000,     // Target vocabulary size
    maxPositionEncoding: 1000, // Maximum sequence length
    dropoutRate: 0.1,          // Dropout rate
};

// Create the transformer
const transformer = new Transformer(config);

// Prepare input data (token IDs)
const input = tf.tensor2d([[1, 45, 234, 12, 89, 0, 0]], 'int32');
const target = tf.tensor2d([[2, 56, 123, 78, 0, 0, 0]], 'int32');

// Forward pass
const output = transformer.call(input, target, false);

console.log('Output shape:', output.shape);
// Expected: [batch_size, target_seq_len, target_vocab_size]

// Get predictions
const predictions = tf.argMax(output, -1);
console.log('Predictions:', await predictions.array());

// Cleanup
transformer.dispose();

Running the Example

# Build the project
npm run build

# Run the example
npm run dev

πŸ“ Project Structure

transformers/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ attention.ts           # Multi-head attention implementation
β”‚   β”œβ”€β”€ decoder.ts             # Decoder layer and stack
β”‚   β”œβ”€β”€ encoder.ts             # Encoder layer and stack
β”‚   β”œβ”€β”€ feedforward.ts         # Position-wise feed-forward network
β”‚   β”œβ”€β”€ positional-encoding.ts # Positional encoding utilities
β”‚   β”œβ”€β”€ transformer.ts         # Main Transformer model
β”‚   β”œβ”€β”€ example.ts             # Example usage
β”‚   └── index.ts               # Public API exports
β”œβ”€β”€ package.json
β”œβ”€β”€ tsconfig.json
└── README.md

πŸ” How It Works

1. Embedding Layer

Converts token IDs to dense vectors:

Token ID: 45 β†’ Embedding: [0.1, 0.3, -0.5, ..., 0.2] (dModel dimensions)
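
The lookup itself is just row indexing into a trainable matrix. A minimal sketch in plain TypeScript with a hypothetical, hard-coded table (the actual project uses a trainable TensorFlow.js embedding layer):

```typescript
// Toy embedding table: vocabSize x dModel. In a real model these
// values are learned; here they are arbitrary placeholders.
const vocabSize = 100;
const dModel = 4;
const embeddingTable: number[][] = Array.from({ length: vocabSize }, (_, id) =>
  Array.from({ length: dModel }, (_, j) => Math.sin(id * (j + 1)))
);

// Embedding lookup: each token ID selects one row of the table.
function embed(tokenIds: number[]): number[][] {
  return tokenIds.map((id) => embeddingTable[id]);
}

const vectors = embed([1, 45, 12]); // 3 vectors of dModel dimensions
```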

2. Positional Encoding

Adds position information since Transformers process all positions in parallel:

PE(pos, 2i)   = sin(pos / 10000^(2i/dModel))
PE(pos, 2i+1) = cos(pos / 10000^(2i/dModel))
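
These formulas translate directly into code. A plain-TypeScript sketch, separate from the project's TensorFlow.js implementation:

```typescript
// Sinusoidal positional encoding, computed straight from the formulas:
// even dimensions get sin, odd dimensions get cos, at frequencies
// that decrease geometrically across the embedding dimension.
function positionalEncoding(maxPos: number, dModel: number): number[][] {
  const pe: number[][] = [];
  for (let pos = 0; pos < maxPos; pos++) {
    const row: number[] = [];
    for (let j = 0; j < dModel; j++) {
      const i = Math.floor(j / 2); // pair index: dims 2i and 2i+1 share a frequency
      const angle = pos / Math.pow(10000, (2 * i) / dModel);
      row.push(j % 2 === 0 ? Math.sin(angle) : Math.cos(angle));
    }
    pe.push(row);
  }
  return pe;
}

const pe = positionalEncoding(50, 8);
// At position 0, all sin terms are 0 and all cos terms are 1.
```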

3. Scaled Dot-Product Attention

Core attention mechanism:

Attention(Q, K, V) = softmax(QK^T / √d_k) V
  • Q (Query): What am I looking for?
  • K (Key): What information is available?
  • V (Value): The actual information

4. Multi-Head Attention

Runs multiple attention mechanisms in parallel:

  • Different heads learn different types of relationships
  • Outputs are concatenated and projected

5. Feed-Forward Network

Two linear transformations with ReLU activation:

FFN(x) = max(0, xW1 + b1)W2 + b2
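
In plain TypeScript, for a single position vector (the weights here would normally be learned; the shapes and the ReLU are the only point):

```typescript
// W is stored row-major as [outDim][inDim], so matVec computes one
// output entry per row. This matches xW1 from the formula up to transpose.
function matVec(W: number[][], x: number[]): number[] {
  return W.map((row) => row.reduce((s, w, i) => s + w * x[i], 0));
}

// FFN(x) = max(0, xW1 + b1) W2 + b2
function ffn(
  x: number[],
  W1: number[][], b1: number[],
  W2: number[][], b2: number[]
): number[] {
  const hidden = matVec(W1, x).map((h, i) => Math.max(0, h + b1[i])); // ReLU
  return matVec(W2, hidden).map((o, i) => o + b2[i]);
}

const out = ffn([2, -3], [[1, 0], [0, 1]], [0, 0], [[1, 1]], [0]);
// hidden = relu([2, -3]) = [2, 0]; output = [2]
```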

6. Residual Connections & Layer Normalization

For stable deep network training:

output = LayerNorm(x + Sublayer(x))
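
A per-vector sketch of this pattern with the trainable scale and shift (gamma, beta) omitted for brevity:

```typescript
// Normalize one vector to zero mean and unit variance.
function layerNorm(x: number[], eps = 1e-6): number[] {
  const mean = x.reduce((a, b) => a + b, 0) / x.length;
  const variance = x.reduce((a, b) => a + (b - mean) ** 2, 0) / x.length;
  return x.map((v) => (v - mean) / Math.sqrt(variance + eps));
}

// output = LayerNorm(x + sublayer(x)): the residual path lets gradients
// flow around the sublayer, and the norm keeps activations well-scaled.
function residualBlock(x: number[], sublayer: (v: number[]) => number[]): number[] {
  const y = sublayer(x);
  return layerNorm(x.map((xi, i) => xi + y[i]));
}

const out = residualBlock([1, 2, 3, 4], (v) => v.map(() => 0)); // identity sublayer output
```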

βš™οΈ Configuration

The TransformerConfig interface allows you to customize the model:

| Parameter | Description | Typical Value |
|---|---|---|
| numLayers | Number of encoder/decoder layers | 6 |
| dModel | Model dimension (embedding size) | 512 |
| numHeads | Number of attention heads | 8 |
| dff | Feed-forward hidden dimension | 2048 |
| inputVocabSize | Size of input vocabulary | 10000 |
| targetVocabSize | Size of target vocabulary | 10000 |
| maxPositionEncoding | Maximum sequence length | 5000 |
| dropoutRate | Dropout rate for regularization | 0.1 |

Note: dModel must be divisible by numHeads.
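
The constraint exists because each head operates on a dModel / numHeads slice of the embedding, so the division must be exact. A sketch of the check, with a hypothetical headDim helper that is not part of the project's API:

```typescript
// Each attention head works on dModel / numHeads dimensions.
function headDim(dModel: number, numHeads: number): number {
  if (dModel % numHeads !== 0) {
    throw new Error(`dModel (${dModel}) must be divisible by numHeads (${numHeads})`);
  }
  return dModel / numHeads;
}

headDim(512, 8); // 64 dimensions per head
```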

πŸ“ Examples

Machine Translation Example

// English to French translation
const englishTokens = tokenize("The cat sat on the mat");
const frenchTokens = tokenize("<start> Le chat");

const output = transformer.call(
    tf.tensor2d([englishTokens], 'int32'),
    tf.tensor2d([frenchTokens], 'int32'),
    false
);

// Get next-token probabilities from the last decoder position
const lastPos = frenchTokens.length - 1;
const nextTokenProbs = tf.softmax(output.slice([0, lastPos, 0], [1, 1, -1]), -1);

Custom Masking

import { createPaddingMask, createLookAheadMask } from './src/transformer';

// Ignore padding tokens
const paddingMask = createPaddingMask(inputSequence);

// Prevent looking at future tokens
const lookAheadMask = createLookAheadMask(sequenceLength);
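
What these masks compute can be illustrated without TensorFlow.js. The sketch below uses the common convention 1 = masked (a large negative number is added to masked scores before the softmax); the library helpers' exact output format may differ:

```typescript
// Look-ahead mask: 1 above the diagonal (future positions, blocked),
// 0 on and below it (positions the decoder may attend to).
function lookAheadMask(n: number): number[][] {
  return Array.from({ length: n }, (_, i) =>
    Array.from({ length: n }, (_, j) => (j > i ? 1 : 0))
  );
}

// Padding mask: 1 where the token ID is 0 (padding), 0 elsewhere.
function paddingMask(tokenIds: number[]): number[] {
  return tokenIds.map((id) => (id === 0 ? 1 : 0));
}
```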

🀝 Contributing

This is a personal learning project, but suggestions and improvements are welcome! Feel free to open issues or submit pull requests.

πŸ“„ License

ISC License - see package.json for details.

πŸ‘€ Author

Nusrath Khan


Built with ❀️ to understand the technology that's changing the world
