Trying to take my understanding of a few machine learning concepts and optimizations from the past few years to end-boss level.
Disclaimer: No Vibe Coding - I read docs & papers and type every line of code myself (you can blame me for everything lol O_O)
- derived self-attention from scratch
- implemented my own version of the original Transformer encoder-decoder architecture from "Attention Is All You Need"
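
For readers who want a concrete picture of the self-attention item above, here is a minimal sketch of single-head scaled dot-product self-attention, softmax(QK^T / sqrt(d_k)) V, in plain Python. It is an illustration only, not the code from this repo: the helper names (`matmul`, `softmax`, `self_attention`, `random_matrix`), the toy sizes, and the random weights are all assumptions, and it leaves out masking, multiple heads, and batching.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (no mask, no batch).

    x is (seq_len, d_model); w_q, w_k, w_v are (d_model, d_k) projections.
    Returns the attended values and the attention weights.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose K
    # Scale the dot products by sqrt(d_k) before the softmax.
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Hypothetical helper: Gaussian-initialised matrix for the demo."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes chosen for illustration only.
rng = random.Random(0)
seq_len, d_model, d_k = 4, 8, 8
x = random_matrix(seq_len, d_model, rng)
w_q, w_k, w_v = (random_matrix(d_model, d_k, rng) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(len(out), len(out[0]))  # one output vector of size d_k per input token
```

Each row of `weights` says how much one token attends to every token in the sequence, so the output for a token is a weighted average of the value vectors.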