- Python 3.10.6
- Nvidia GPU
- Create venv
py -3.10 -m venv cuda
- Activate venv
cuda\Scripts\activate
- Install libs
pip install matplotlib numpy pylzma ipykernel tiktoken jupyter
pip install torch --index-url https://download.pytorch.org/whl/cu118
- Install a new kernel for Jupyter Notebook
python -m ipykernel install --user --name=cuda --display-name "cuda-gpt"
- Start Jupyter Notebook
jupyter notebook
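Once the kernel is installed, a quick sanity check inside the `cuda-gpt` kernel confirms the installs worked. This snippet only probes package availability and GPU visibility; the package list mirrors the pip commands above.

```python
# Sanity check for the environment set up above: verify the key packages
# resolve and (if torch is present) that PyTorch can see the CUDA GPU.
import importlib.util

required = ["matplotlib", "numpy", "tiktoken", "torch"]
missing = [pkg for pkg in required if importlib.util.find_spec(pkg) is None]
print("missing packages:", missing or "none")

if "torch" not in missing:
    import torch
    print("CUDA available:", torch.cuda.is_available())
```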
- Download & Prepare Dataset
  - Download OpenWebText2 (~27 GB) from https://openwebtext2.readthedocs.io/en/latest/ and unpack all `.jsonl.zst` files into `./openwebtext2/`.
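A minimal reader sketch for the unpacked archives. It assumes one JSON object per line (the OpenWebText2 layout) and the `zstandard` package for `.zst` decompression; both are assumptions about how the extraction notebook reads the files, which may differ in detail.

```python
# Stream JSON records out of .jsonl.zst archives without fully decompressing
# them to disk. `raw_dir` defaults to the unpack location suggested above.
import io
import json
from pathlib import Path

def iter_documents(raw_dir="openwebtext2"):
    """Yield parsed JSON records from every .jsonl.zst file under raw_dir."""
    paths = sorted(Path(raw_dir).glob("*.jsonl.zst"))
    if not paths:
        return
    import zstandard  # pip install zstandard (assumed; not in the setup list)
    for path in paths:
        with open(path, "rb") as fh:
            reader = zstandard.ZstdDecompressor().stream_reader(fh)
            for line in io.TextIOWrapper(reader, encoding="utf-8"):
                if line.strip():
                    yield json.loads(line)
```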
- Extract & Tokenize Data
  - Open and run `data-extract-v10.ipynb`, which:
    - Streams and decompresses the `.zst` files.
    - Filters for English-language texts.
    - Tokenizes using the tiktoken tokenizer.
    - Outputs `output_v10/encoded_data/encoded_output_v10_accuracy.npy` (~107 GB).
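The tokenize-and-save step above can be sketched as: accumulate token IDs across documents, then write one NumPy array. The `uint16` dtype is an assumption that the vocabulary fits in 16 bits (true for tiktoken's 50257-token `gpt2` encoding); the notebook's actual dtype and encoding name may differ.

```python
# Encode a corpus of strings into one flat array of token IDs.
import numpy as np

def encode_corpus(texts, encode_fn):
    """encode_fn maps str -> list[int], e.g. tiktoken.get_encoding("gpt2").encode."""
    ids = []
    for text in texts:
        ids.extend(encode_fn(text))
    return np.asarray(ids, dtype=np.uint16)  # 50257 < 65536, so uint16 halves disk use

# Usage (tiktoken assumed installed):
#   import tiktoken
#   enc = tiktoken.get_encoding("gpt2")
#   arr = encode_corpus(english_texts, enc.encode)
#   np.save("output_v10/encoded_data/encoded_output_v10_accuracy.npy", arr)
```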
- Train Base GPT Model
  - Run `gpt-v14.ipynb` end-to-end to:
    - Configure hyperparameters (depth, heads, learning rate schedule, etc.).
    - Execute the training loop, logging train/validation losses.
    - Save the checkpoint (e.g., `output_v14/pre_training/run_<unix_timestamp>/gpt_v14_model.pt`).
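The "learning rate schedule" hyperparameter mentioned above is commonly linear warmup followed by cosine decay in GPT training; whether `gpt-v14.ipynb` uses exactly this shape is an assumption, and every constant below is illustrative rather than the notebook's actual value.

```python
# Linear warmup then cosine decay -- a common GPT learning-rate schedule.
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=2000, total_steps=100_000):
    """Return the learning rate for a given optimizer step."""
    if step < warmup:
        # ramp linearly from ~0 up to max_lr over the warmup steps
        return max_lr * (step + 1) / warmup
    # cosine-decay from max_lr down to min_lr over the remaining steps
    t = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * min(t, 1.0)))
```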
- Fine-Tune for Classification
  - Use `finetuning-classification-v1.ipynb` to adapt the pre-trained checkpoint for a binary classification task (e.g., spam vs. ham).
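One common way to adapt a pre-trained GPT checkpoint for binary classification is to keep the transformer body and replace the vocabulary-projection head with a fresh 2-way linear layer. This is a sketch of that idea; the class and attribute names are illustrative, and it assumes the notebook's backbone returns `(B, T, n_embd)` hidden states.

```python
# Wrap a pre-trained GPT body with a new 2-class head (e.g., spam vs. ham).
import torch
import torch.nn as nn

class GPTClassifier(nn.Module):
    def __init__(self, backbone, n_embd, n_classes=2):
        super().__init__()
        self.backbone = backbone                  # pre-trained GPT body
        self.head = nn.Linear(n_embd, n_classes)  # new, randomly initialized

    def forward(self, idx):
        h = self.backbone(idx)         # (B, T, n_embd) hidden states
        return self.head(h[:, -1, :])  # classify from the last token's state
```

Fine-tuning then minimizes cross-entropy between these logits and the spam/ham labels.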
- Fine-Tune for Instruction-Following
  - Use `finetuning-instruction-answer-v4.ipynb` to train on instruction–response pairs and improve the model’s conversational ability.
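Instruction–response pairs are typically rendered into a single prompt string the model learns to complete. The Alpaca-style template below is a common choice; the exact wording the notebook uses is an assumption.

```python
# Render one instruction-response training pair into a single prompt string.
def format_example(instruction, response=""):
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Response:\n{response}"
    )
```

At inference time the same template is used with `response` left empty, and the model generates the continuation after `### Response:`.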
- Evaluate Fine-Tuned Models
  - Open `evaluate-finetuned-llm.ipynb` to:
    - Compute performance metrics (accuracy, loss) on held-out data.
    - Compare your fine-tuned outputs against the Ollama LLaMA 3.2 3B reference baseline.
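The two metrics named above reduce to simple formulas; these are pure-Python stand-ins for what the notebook presumably computes with torch over batches.

```python
# Minimal held-out metrics: accuracy and average negative log-likelihood.
import math

def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def mean_nll(correct_class_probs):
    """Average negative log-likelihood -- the 'loss' on held-out data."""
    return -sum(math.log(p) for p in correct_class_probs) / len(correct_class_probs)
```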
Sample generation from `gpt-v14.ipynb` after 17 epochs:
Prompt: "I like apple juice - I drink it"
→ "I like apple juice - I drink it for about 30 minutes or even 1/20 minutes. In fact it was so common, so if it would be melted and the calories for me. And for me it was a pretty cool product"
🎛️ Wrap notebooks into CLI scripts (train.py, generate.py).
🌐 Build a small Gradio/Streamlit demo for live inference.
📊 Integrate Weights & Biases or TensorBoard for experiment tracking.
🇵🇱 Experiment with Polish-language fine-tuning on local corpora.