pg_gpt2 is a complete implementation of the GPT-2 architecture entirely inside PostgreSQL. It extends the database with tensor algebra, automatic differentiation, AdamW optimization, checkpointing, and a Byte-Pair Encoding tokenizer — allowing end-to-end training and text generation purely through SQL and C extensions.
PostgreSQL is used as both the storage and execution environment for a large-scale transformer model.
Each layer, weight, and intermediate activation lives in relational tables; tensor operations are implemented as C functions returning BYTEA buffers.
Every forward pass, gradient computation, and parameter update is a deterministic SQL transaction.
The project demonstrates that a relational database can serve as a full numerical engine, state store, and model runtime — no Python, PyTorch, or external ML stack required.
Building the extension requires the PostgreSQL server development headers and build tooling so that `pg_config --pgxs` resolves to the `pgxs.mk` makefile. On Debian/Ubuntu systems install the package:

```bash
sudo apt-get install postgresql-server-dev-16
```

If PostgreSQL is installed somewhere custom, set the `PG_CONFIG` environment variable to point at the desired `pg_config` binary before running `make`.
- Compile and install the extension. From the repository root run `make install`. The build uses PGXS and will copy the `pg_llm` artifacts into PostgreSQL's extension directory reported by `pg_config --pkglibdir`.
- Load the extension in a database. Connect with `psql` and execute `CREATE EXTENSION pg_llm;` in the target database. This initializes all required tables, functions, and SQL entry points.
- Verify availability. Confirm the extension is active with either `\dx pg_llm` in `psql` or a query such as `SELECT * FROM pg_extension WHERE extname = 'pg_llm';`. Successful output indicates the extension is ready for the workflow described below.
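For convenience, the load-and-verify steps can be run as a single snippet against the standard `pg_extension` catalog:

```sql
-- Create the extension and confirm it is registered in the current database
CREATE EXTENSION IF NOT EXISTS pg_llm;

SELECT extname, extversion
FROM pg_extension
WHERE extname = 'pg_llm';
```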
Follow the reproduction playbook for a step-by-step guide that mirrors the original GPT-2 training, evaluation, and sampling pipeline entirely within PostgreSQL.
To simplify evaluation and demos you can run PostgreSQL with the pg_gpt2 extension pre-installed using the provided Dockerfile.
```bash
# Build the image locally
docker build -t pg-gpt2-demo .

# Start PostgreSQL with pg_llm already installed in the default database
docker run --rm -e POSTGRES_PASSWORD=secret -p 5432:5432 --name pg-gpt2 pg-gpt2-demo
```

The container reuses the official `postgres:16` entrypoint. On first start it creates the default database and automatically enables the `pg_llm` extension so that `psql` connections can immediately run the SQL workflows described below.
- Postgres as OS — All computation and persistence live in SQL schemas and C extensions.
- Full Reproducibility — Every step, gradient, and checkpoint is a logged transaction.
- Numerical Fidelity — Bit-level parity with PyTorch’s GPT-2 (`float32`, row-major, GELU, LayerNorm, AdamW).
- Composability — Every tensor op is an SQL function; model architectures are relational graphs.
- Auditable Learning — Because gradients and weights are rows, the entire training process is queryable and replayable.
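As a concrete illustration of that auditability, here is a minimal sketch of the kind of query this design enables; the column names on `llm_param` (`model`, `name`, `data`, `grad`) are assumptions for illustration and should be checked against the installed schema.

```sql
-- Sketch: audit stored parameters and their gradient buffers for one model.
-- Column names (model, name, data, grad) are assumptions; adjust to the actual schema.
SELECT name,
       octet_length(data) AS weight_bytes,
       octet_length(grad) AS grad_bytes
FROM llm_param
WHERE model = 'gpt2-small'
ORDER BY weight_bytes DESC
LIMIT 10;
```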
| Component | Description |
|---|---|
| Tensor Engine | C implementations of matmul, add, gelu, softmax, layernorm, cross_entropy over contiguous float32 blobs (BYTEA). |
| Autodiff Engine | Reverse-mode differentiation recorded in a relational tape (llm_tape, llm_tensor_rt), supporting backpropagation of all GPT-2 ops. |
| Optimizer | AdamW with bias correction, decoupled weight decay, gradient clipping, and cosine learning-rate schedule. |
| Checkpointing | Import/export weights as .npz or .safetensors archives. Every snapshot is versioned in llm_checkpoint. |
| Tokenizer | Native Byte-Pair Encoding (BPE) tokenizer/decoder built from vocab.json + merges.txt. |
| Sampling Engine | Temperature, top-k, and top-p (nucleus) sampling for autoregressive generation. |
| Training Loop | SQL functions (llm_train, llm_train_step, llm_loss) orchestrate forward, backward, optimizer updates, and logging. |
| Inference | llm_generate(prompt) runs encoding → forward → sampling → decoding, returning coherent text completions. |
| Table | Purpose |
|---|---|
| `llm_model_config` | Registered model dimensions (layers, heads, embedding size, positions, vocab). |
| `llm_param` | Model parameters, gradients, optimizer state. |
| `llm_dataset` | Tokenized training sequences. |
| `llm_tape` / `llm_tensor_rt` | Computational graph and runtime tensors for autograd. |
| `llm_autograd_mode` | Single-row toggle that signals when forward passes should record autograd tape entries. |
| `llm_checkpoint` | Versioned checkpoint metadata and file paths. |
| `llm_bpe_vocab` / `llm_bpe_merges` | GPT-2 tokenizer vocabulary and merge ranks. |
| `llm_train_log` | Per-step learning rate and loss history. |
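A hedged example of inspecting this schema after an import is shown below; the table names come from the list above, but the shared `model` join column is an assumption.

```sql
-- Sketch: list registered models and count their stored parameter tensors
-- (the "model" join column is assumed; table names come from the schema above)
SELECT c.*,
       (SELECT count(*) FROM llm_param p WHERE p.model = c.model) AS param_rows
FROM llm_model_config c;
```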
See docs/roadmap.md for the upcoming feature roadmap, including GPT-3 style architecture support, mixed-precision execution, and hardware acceleration milestones.
End-to-end training relies on a thin runtime that records every forward op in SQL so that gradients can be replayed later. The key moving pieces are:
- Parameter materialization. `llm_materialize_params` copies each row in `llm_param` into the temporary `llm_tensor` cache and creates a matching row in `llm_tensor_rt`. During that copy the helper `pg_llm_autograd_map_param` (or its SQL-equivalent `INSERT` in the function) must be invoked so the runtime tensor id is associated with the original `(model, name, token_id)` tuple. Any new C routine that constructs parameter views needs to perform the same mapping or gradients will not flow back into `llm_param` (see `sql/pg_llm--0.1.0.sql` and `src/pg_llm_autograd.c`).
- Forward tape recording. Every C kernel checks `pg_llm_autograd_enabled()`; when the flag is set, the inputs and outputs are registered with `pg_llm_autograd_track_tensor` and the op is appended to `llm_tape` with any metadata (shape, constants, etc.). This produces an ordered tape of all ops in the forward pass (see `src/pg_llm.c`).
- Reverse traversal. `llm_backprop` walks the tape from the newest node back to the seed, dispatching gradients based on the recorded `name` field and writing results into `llm_tensor_rt.grad`. Once complete, `llm_accumulate_grads` copies those buffers back into `llm_param.grad` using the mapping created in step 1 (see `sql/llm_backprop.sql` and `sql/pg_llm--0.1.0.sql`).
- Tied embeddings. GPT-2 reuses the token embedding (`wte`) for the final logits projection. After flattening the embedding table into a single matrix for `pg_llm_matmul`, ensure that buffer is still mapped to the original embedding rows (via `pg_llm_autograd_map_param`) so the logits gradient is accumulated back into `wte` rather than a detached copy (see `sql/pg_llm--0.1.0.sql` and `src/pg_llm_autograd.c`).
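Because the tape is just rows, it can be audited after a recorded forward pass. The sketch below assumes `llm_tape` exposes an ordinal `id` alongside the documented `name` column; adjust to the actual schema.

```sql
-- Sketch: walk the recorded forward tape in execution order
-- ("name" is documented above; the "id" ordering column is an assumption)
SELECT id, name
FROM llm_tape
ORDER BY id;

-- Runtime tensors and their gradients live alongside the tape
SELECT count(*) AS live_tensors,
       count(*) FILTER (WHERE grad IS NOT NULL) AS tensors_with_grads
FROM llm_tensor_rt;
```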
```sql
SELECT pg_llm_import_npz('/mnt/models/gpt2-small.npz', 'gpt2-small');
```

Imports all pretrained GPT-2 weights into the `llm_param` table. `llm_model_config` tracks the expected architecture dimensions for each model and is consulted during import; `gpt2-small` is pre-registered, but custom models should insert their configuration before calling `pg_llm_import_npz`.
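A sketch of registering a hypothetical custom model is shown below; the model name, file path, and the `llm_model_config` column names (`n_layer`, `n_head`, `d_model`, `n_positions`, `vocab_size`) are illustrative assumptions drawn from the dimensions the table is described as holding.

```sql
-- Register a hypothetical custom model before importing its weights
-- (column names are assumptions; adapt them to the actual llm_model_config schema)
INSERT INTO llm_model_config (model, n_layer, n_head, d_model, n_positions, vocab_size)
VALUES ('gpt2-custom', 6, 8, 512, 1024, 50257);

SELECT pg_llm_import_npz('/mnt/models/gpt2-custom.npz', 'gpt2-custom');
```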
```sql
-- Generate text directly in SQL
SELECT llm_generate('Once upon a time', 80, 0.9, 40, 0.92);

-- Stream tokens as they are produced (step, token_id, token, text, is_complete)
SELECT * FROM llm_generate_stream('Once upon a time', 40, 0.8, 40, 0.95);
```

```sql
-- Train for 10,000 steps on tokenized text dataset
SELECT llm_train(
    'gpt2-small',
    10000,
    grad_workers => 4,
    prune_workers => 4
);
```

Every step performs:
- Forward pass → loss (`llm_loss`)
- Reverse pass (`llm_backprop`)
- Gradient accumulation
- AdamW parameter updates
- Logging to `llm_train_log`
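Since every step logs to `llm_train_log`, the loss curve can be pulled straight out of SQL. The query below is a sketch that assumes the per-step columns are named `step`, `lr`, and `loss`, as suggested by the schema table above.

```sql
-- Sketch: smooth the logged loss over a 100-step window to watch convergence
-- (column names step/lr/loss are assumptions based on the schema description)
SELECT step,
       lr,
       avg(loss) OVER (ORDER BY step ROWS BETWEEN 99 PRECEDING AND CURRENT ROW) AS loss_ma100
FROM llm_train_log
ORDER BY step DESC
LIMIT 20;
```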
`llm_train` will automatically read the layer count, attention heads, hidden size, and vocabulary size from `llm_model_config`. Provide overrides for custom experiments by passing explicit values for `n_layer`, `n_head`, `D`, or `vocab` when invoking the function.
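An override call might look like the following sketch; the parameter names come from the paragraph above and the named-argument style mirrors the `grad_workers`/`prune_workers` example, but the exact `llm_train` signature (including the casing of `D`) should be checked against the installed extension.

```sql
-- Sketch: fine-tune with explicit architecture overrides instead of the
-- registered llm_model_config values (parameter names taken from the prose above)
SELECT llm_train(
    'gpt2-small',
    2000,
    n_layer => 12,
    n_head  => 12,
    D       => 768,
    vocab   => 50257
);
```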
The training helpers expose knobs for multi-core cleanup work:
- `grad_workers` sets the desired parallel worker count for `llm_accumulate_grads`, allowing gradient materialisation from `llm_tensor_rt` into `llm_param` to leverage PostgreSQL's parallel query engine.
- `prune_workers` applies the same hinting to `llm_prune_autograd_state`, which clears the autograd tape and runtime tensors between steps. Autograd tape pruning is safe to parallelise because every runtime tensor row is independent, so this option simply tunes planner settings before issuing the deletes.
Both parameters default to 1 (no parallel workers) to preserve existing behaviour.
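Because these knobs only hint the planner, a similar effect can be encouraged with stock PostgreSQL settings for the current session; this is a sketch using ordinary parallel-query GUCs, not a pg_llm-specific API.

```sql
-- Sketch: encourage parallel plans for the cleanup queries in this session
-- using standard PostgreSQL planner settings
SET max_parallel_workers_per_gather = 4;
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;

SELECT llm_train('gpt2-small', 1000, grad_workers => 4, prune_workers => 4);
```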
```sql
-- Save a new checkpoint
SELECT llm_checkpoint_save('gpt2-small','after warmup 2k');

-- Restore a checkpoint
SELECT llm_checkpoint_load('gpt2-small',1);
```

```sql
-- Load GPT-2 BPE vocab and merges
SELECT pg_llm_load_bpe_vocab('/mnt/gpt2/vocab.json','gpt2-small');
SELECT pg_llm_load_bpe_merges('/mnt/gpt2/merges.txt','gpt2-small');

-- Encode and decode text
SELECT llm_encode('Hello world!','gpt2-small');
SELECT llm_decode(ARRAY[15496,2159,0],'gpt2-small');
```

The repository includes Python helpers for preparing external assets before calling the SQL functions above. All scripts live under `scripts/`.
| Script | Purpose |
|---|---|
| `convert_gpt2_checkpoint.py` | Download/convert a HuggingFace GPT-2 checkpoint into the gzip-based `.npz` container expected by `pg_llm_import_npz`. |
| `ingest_tokenizer.py` | Load `vocab.json` and `merges.txt` tokenizer assets into `llm_bpe_vocab`/`llm_bpe_merges` using a PostgreSQL connection. |
| `prepare_dataset.py` | Tokenize raw text files with the GPT-2 tokenizer and populate `llm_dataset` with fixed-length (tokens, target) arrays. |
Install the optional Python dependencies with:
```bash
pip install transformers torch psycopg[binary]
```
Examples:
```bash
# 1. Convert HuggingFace weights to /mnt/models/gpt2-small.npz
python scripts/convert_gpt2_checkpoint.py --source gpt2 --output /mnt/models/gpt2-small.npz

# 2. Load tokenizer assets into PostgreSQL
python scripts/ingest_tokenizer.py \
    --dsn postgresql://postgres@localhost:5432/postgres \
    --model gpt2-small \
    --vocab /mnt/gpt2/vocab.json \
    --merges /mnt/gpt2/merges.txt --truncate

# 3. Tokenize a corpus and fill llm_dataset
python scripts/prepare_dataset.py \
    --dsn postgresql://postgres@localhost:5432/postgres \
    --tokenizer gpt2 \
    --input /mnt/corpus/*.txt \
    --block-size 1024 --truncate
```
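After the dataset script finishes, the result can be sanity-checked from SQL; the `tokens` column name in the second query is an assumption based on the `(tokens, target)` description in the scripts table above.

```sql
-- How many fixed-length training sequences were ingested?
SELECT count(*) AS sequences FROM llm_dataset;

-- Spot-check the block length of a few rows
-- (the "tokens" column name is assumed from the scripts table above)
SELECT array_length(tokens, 1) AS block_len
FROM llm_dataset
LIMIT 5;
```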
An end-to-end walkthrough that stitches the helper scripts together is available in docs/python_workflow.md, and a fully annotated Jupyter notebook showing the SQL fine-tuning loop from data ingestion through generation lives at docs/fine_tuning_workflow.ipynb.
All core operations follow the official GPT-2 equations:
Attention
\[
\mathrm{Attn}(x) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V
\]
with causal masking and learned positional embeddings.

Feed-Forward
\[
\mathrm{FFN}(x) = \mathrm{GELU}(xW_1 + b_1)W_2 + b_2
\]

LayerNorm
\[
y = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}\,\gamma + \beta
\]

Loss
\[
L = -\log \frac{e^{z_t}}{\sum_j e^{z_j}}
\]

Optimizer (AdamW)
\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \\
\theta_t &= \theta_{t-1} - \eta \left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} + \lambda\theta_{t-1}\right)
\end{aligned}
\]
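For reference, GPT-2 uses the tanh approximation of GELU rather than the exact erf form, so this is the variant a bit-exact kernel would be expected to match:

\[
\mathrm{GELU}(x) \approx \tfrac{1}{2}\, x \left(1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\left(x + 0.044715\, x^{3}\right)\right)\right)
\]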
```sql
-- 1. Load model + tokenizer
SELECT pg_llm_import_npz('/mnt/models/gpt2-small.npz','gpt2-small');
SELECT pg_llm_load_bpe_vocab('/mnt/gpt2/vocab.json','gpt2-small');
SELECT pg_llm_load_bpe_merges('/mnt/gpt2/merges.txt','gpt2-small');

-- 2. Encode text
SELECT llm_encode('The database that dreamed of language.','gpt2-small');

-- 3. Generate continuation
SELECT llm_generate('The database that dreamed of language', 40, 0.8, 40, 0.95);

-- 4. Train or fine-tune
SELECT llm_train('gpt2-small', 5000);

-- 5. Save checkpoint
SELECT llm_checkpoint_save('gpt2-small','finetuned on corpus X');
```

Client applications can connect to PostgreSQL using psycopg and drive the text-generation workflow directly from Python. The `pg_llm_client` package offers a high-level helper:
```python
import psycopg

from pg_llm_client import PGLLMClient

with psycopg.connect("postgresql://postgres@localhost:5432/postgres") as conn:
    client = PGLLMClient(conn)

    # Single completion with tuned sampling parameters
    print(client.generate("The database that dreamed of language", temperature=0.7))

    # Stream tokens as they arrive
    for event in client.stream("Streaming from SQL", max_tokens=8):
        print(event.text)

    # Retrieve the top beam search candidates
    beams = client.beam_search("Once upon a", beam_width=3, max_tokens=5)
    for beam in beams:
        print(beam.score, beam.text)
```

The helper wraps the SQL API so sampling temperature, beam width, and other parameters can be adjusted per request without hand-writing SQL in every client.
- All tensors are stored as raw `BYTEA` blobs and processed in-memory.
- Core kernels (`pg_llm_matmul`, attention) use a tiled AVX2-aware micro-kernel that falls back to scalar math when SIMD is unavailable, delivering BLAS-class throughput without external dependencies.
- Attention is evaluated in configurable row chunks (default 64 tokens) so that context matrices never exceed a manageable working set, enabling GPT-2 scale sequence lengths inside Postgres.
- For large models, raise `work_mem`/`maintenance_work_mem` and consider chunking your training data via windowed queries so each step fits inside the executor's memory context (see the sketch after this list).
- Store activations and optimizer scratch data in `UNLOGGED` tables (e.g., `CREATE UNLOGGED TABLE llm_activations (...)`) to avoid WAL amplification when materializing large tensors.
- Autograd tape pruning and gradient accumulation can be parallelized safely within a transaction.
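A minimal sketch of those two tuning suggestions, using standard PostgreSQL syntax; the `llm_activations` column list is illustrative, since the README elides the real definition.

```sql
-- Give large tensor materialisations more memory for the current session
SET work_mem = '512MB';
SET maintenance_work_mem = '1GB';

-- Keep large intermediate activations out of the WAL
-- (the column list below is illustrative only)
CREATE UNLOGGED TABLE IF NOT EXISTS llm_activations (
    model text    NOT NULL,
    step  integer NOT NULL,
    name  text    NOT NULL,
    data  bytea   NOT NULL
);
```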
- Proof of Concept: show that gradient-based learning can be expressed purely as relational algebra and transaction semantics.
- Determinism: every computation is replayable and version-controlled.
- Integration: unifies data, model, and training loop under a single ACID engine.
- Pedagogy: transparent view into transformer internals, queryable step-by-step.