Part 2 covers weighted aggregation with matrix multiplication, lower triangular matrices for causal masking, and applying these concepts to batched sequence data with B, T, C dimensions.
Author: Kevin Thomas
import torch
Now let's read the file and see what we're working with. Understanding your data is crucial before building any model!
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
text
Output:
'A dim glow rises behind the glass of a screen and the machine exhales in binary tides. The hum is a language and one who listens leans close to catch the quiet grammar. Patterns fold like small maps and seams hint at how the thing holds itself together. Treat each blinking diode and each idle tick as a sentence in a story that asks to be read.\n\nThere is patience here, not of haste but of careful unthreading. Where others see a sealed box the curious hand traces the join and wonders which thought made it fit. Do not rush to break, coax the meaning out with questions, and watch how the logic replies in traces and errors and in the echoes of forgotten interfaces.\n\nTechnology is artifact and argument at once. It makes a claim about what should be simple, what should be hidden, and what should be trusted. Reverse the gaze and learn its rhetoric, see where it promises ease, where it buries complexity, and where it leaves a backdoor as a sigh between bricks. To read that rhetoric is to be a kind interpreter, not a vandal.\n\nThis work is an apprenticeship in humility. Expect bafflement and expect to be corrected by small things, a timing oddity, a mismatch of expectation, a choice that favors speed over grace. Each misstep teaches a vocabulary of trade offs. Each discovery is a map of decisions and not a verdict on worth.\n\nThere is a moral keeping in the craft. Let curiosity be tempered with regard for consequence. Let repair and understanding lead rather than exploitation. The skill that opens a lock should also know when to hold the key and when to hand it back, mindful of harm and mindful of help.\n\nCelebrate the quiet victories, a stubborn protocol understood, an obscure format rendered speakable, a closed device coaxed into cooperation. 
These are small reconciliations between human intent and metal will, acts of translation rather than acts of conquest.\n\nAfter decoding a mechanism pause and ask what should change, a bug to be fixed, a user to be warned, a design to be amended. The true maker of machines leaves things better for having looked, not simply for having cracked the shell.'
The key insight is that a single matrix multiplication can compute weighted averages across tokens!
We use a lower triangular matrix to control what each token can "see".
[[1, 0, 0],   → token 0: sees only itself
 [1, 1, 0],   → token 1: sees tokens 0 and 1
 [1, 1, 1]]   → token 2: sees tokens 0, 1, and 2
After normalizing rows to sum to 1, this becomes a weight matrix.
[[1.0,  0.0,  0.0 ],   → token 0: 100% of itself
 [0.5,  0.5,  0.0 ],   → token 1: 50% each of tokens 0 and 1
 [0.33, 0.33, 0.33]]   → token 2: about 33% each of all three tokens
Why is this useful?
- This is causal masking — each token only sees the past, never the future.
- Matrix multiplication applies these weights efficiently in parallel.
- This is the foundation of how attention works in transformers!
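To make the claim concrete, here is a minimal sketch that compares the matrix-multiply approach with an explicit loop. The `tokens` values and the variable names are made up for illustration; only the technique comes from the text above.

```python
import torch

T = 3
# Lower-triangular mask: row t has ones at positions 0..t (itself and the past).
mask = torch.tril(torch.ones(T, T))
# Divide each row by its sum so the weights in every row add up to 1.
weights = mask / mask.sum(dim=1, keepdim=True)

# Hypothetical toy data: 3 tokens with 2 features each.
tokens = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# One matrix multiply computes every causal average at once.
fast = weights @ tokens

# The same result computed the slow way, one token at a time.
slow = torch.stack([tokens[: t + 1].mean(dim=0) for t in range(T)])

print(fast)
print(torch.allclose(fast, slow))
```

Both routes should agree. The matrix form is preferred because it handles every position in one operation instead of a Python loop.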
torch.manual_seed(42)
Output:
<torch._C.Generator at 0x106a51a10>
# create a lower triangular matrix (only see past, not future)
# tril = "triangle lower"
a = torch.tril(torch.ones(3, 3))
a
Output:
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
# normalize each row to sum to 1 (convert to weights/probabilities)
a = a / torch.sum(a, dim=1, keepdim=True)
a
Output:
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
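As a side note, the same weights can be produced with a masked softmax, which is the form attention implementations commonly use. This is a sketch of an alternative route and not a step taken in this part; the names `scores`, `by_division`, and `by_softmax` are illustrative.

```python
import torch

T = 3
tril = torch.tril(torch.ones(T, T))

# Version used in the text: divide each row by its sum.
by_division = tril / tril.sum(dim=1, keepdim=True)

# Alternative: start from all-zero scores, block future positions with -inf,
# then let softmax turn each row into weights that sum to 1.
scores = torch.zeros(T, T)
scores = scores.masked_fill(tril == 0, float('-inf'))
by_softmax = torch.softmax(scores, dim=-1)

print(by_softmax)
print(torch.allclose(by_division, by_softmax))
```

With all-zero scores the two versions match, because exp(0) = 1 for visible positions and exp(-inf) = 0 for masked ones.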
# some data to aggregate (3 tokens, each with 2 features)
b = torch.randint(0, 10, (3, 2)).float()
b
Output:
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
# matrix multiply: each row of result is weighted average of b's rows
c = a @ b
c
Output:
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])
# detailed breakdown of the first row's computation
print(f'row 0: {a[0].tolist()} @ {b.tolist()}')
print(f' = 1.0 * {b[0].tolist()}')
print(f' + 0.0 * {b[1].tolist()}')
print(f' + 0.0 * {b[2].tolist()}')
print(f' = {c[0].tolist()}')
print()
print(f' → Token 0 sees only itself (future tokens multiplied by 0)')
Output:
row 0: [1.0, 0.0, 0.0] @ [[2.0, 7.0], [6.0, 4.0], [6.0, 5.0]]
= 1.0 * [2.0, 7.0]
+ 0.0 * [6.0, 4.0]
+ 0.0 * [6.0, 5.0]
= [2.0, 7.0]
→ Token 0 sees only itself (future tokens multiplied by 0)
# detailed breakdown of the second row's computation
print(f'row 1: {a[1].tolist()} @ {b.tolist()}')
print(f' = 0.5 * {b[0].tolist()}')
print(f' + 0.5 * {b[1].tolist()}')
print(f' + 0.0 * {b[2].tolist()}')
print(f' = {c[1].tolist()}')
print()
print(f' → Token 1 sees the average of tokens 0 and 1 (average of both)')
Output:
row 1: [0.5, 0.5, 0.0] @ [[2.0, 7.0], [6.0, 4.0], [6.0, 5.0]]
= 0.5 * [2.0, 7.0]
+ 0.5 * [6.0, 4.0]
+ 0.0 * [6.0, 5.0]
= [4.0, 5.5]
→ Token 1 sees the average of tokens 0 and 1 (average of both)
# detailed breakdown of the third row's computation
print(f'row 2: {a[2].tolist()} @ {b.tolist()}')
print(f' = 0.33 * {b[0].tolist()}')
print(f' + 0.33 * {b[1].tolist()}')
print(f' + 0.33 * {b[2].tolist()}')
print(f' = {c[2].tolist()}')
print()
print(f' → Token 2 sees the average of tokens 0, 1, and 2 (average of all)')
Output:
row 2: [0.3333333432674408, 0.3333333432674408, 0.3333333432674408] @ [[2.0, 7.0], [6.0, 4.0], [6.0, 5.0]]
= 0.33 * [2.0, 7.0]
+ 0.33 * [6.0, 4.0]
+ 0.33 * [6.0, 5.0]
= [4.6666669845581055, 5.333333969116211]
→ Token 2 sees the average of tokens 0, 1, and 2 (average of all)
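The three breakdowns above amount to a running mean over the rows of `b`. As a quick cross-check, the sketch below rebuilds `a`, `b`, and `c` with the values shown earlier and compares `c` against a cumulative sum divided by the number of rows seen so far. The names `counts` and `running_mean` are illustrative.

```python
import torch

# Same values as the b tensor shown above (3 tokens, 2 features).
b = torch.tensor([[2.0, 7.0], [6.0, 4.0], [6.0, 5.0]])

# Rebuild the normalized lower-triangular weights and the aggregated result.
a = torch.tril(torch.ones(3, 3))
a = a / a.sum(dim=1, keepdim=True)
c = a @ b

# Running sum of the rows, divided by how many rows have been seen so far.
counts = torch.arange(1, 4).unsqueeze(1)  # shape (3, 1), holding 1, 2, 3
running_mean = torch.cumsum(b, dim=0) / counts

print(running_mean)
print(torch.allclose(c, running_mean))
```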
Now let's apply this averaging concept to batched sequence data, which has three dimensions:
- B = batch size (multiple sequences)
- T = time (sequence length / positions)
- C = channels (features per token)
Each token is described by C numbers (a vector).
| C value | what it means |
|---|---|
| C = 2 | each token is described by 2 numbers (a 2D vector) |
| C = 64 | each token is described by 64 numbers (like Karpathy's tutorial) |
| C = 768 | each token is described by 768 numbers (like the smallest GPT-2 model) |
Example with shape (B=4, T=8, C=2).
x[0, 0, :] → [0.3, -0.5] # token 0 in sequence 0: two features
x[0, 1, :] → [1.2, 0.8] # token 1 in sequence 0: two features
Why multiple features?
- A single number can't capture much meaning.
- 768 numbers can encode rich information: "this token is a noun, relates to animals, is positive sentiment, etc."
- These features are learned during training: the model figures out what numbers to assign.
In this toy example, C = 2 keeps things small so you can see what's happening.
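To get a feel for the three dimensions, the following sketch indexes into a tensor of shape (B, T, C) and prints the shape of each slice. The tensor is filled with zeros as a placeholder, since only the shapes matter here.

```python
import torch

B, T, C = 4, 8, 2
x = torch.zeros(B, T, C)  # placeholder values; only the shapes matter here

print(x.shape)        # whole batch: B sequences, T positions, C features
print(x[0].shape)     # one sequence: T positions, C features
print(x[0, 1].shape)  # one token's feature vector: C numbers
print(x[:, 0].shape)  # first token of every sequence: B rows of C features
```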
torch.manual_seed(42)
Output:
<torch._C.Generator at 0x106a51a10>
# define batch dimension
B = 4 # batch size: 4 independent sequences
B
Output:
4
# define time dimension
T = 8 # sequence length: 8 tokens/positions in each sequence
T
Output:
8
# define channel dimension
C = 2 # feature size: 2 features per token
C
Output:
2
# start with random data
x = torch.randn(B, T, C)
x
Output:
tensor([[[ 1.9269, 1.4873],
[ 0.9007, -2.1055],
[ 0.6784, -1.2345],
[-0.0431, -1.6047],
[-0.7521, 1.6487],
[-0.3925, -1.4036],
[-0.7279, -0.5594],
[-0.7688, 0.7624]],
[[ 1.6423, -0.1596],
[-0.4974, 0.4396],
[-0.7581, 1.0783],
[ 0.8008, 1.6806],
[ 1.2791, 1.2964],
[ 0.6105, 1.3347],
[-0.2316, 0.0418],
[-0.2516, 0.8599]],
[[-1.3847, -0.8712],
[-0.2234, 1.7174],
[ 0.3189, -0.4245],
[ 0.3057, -0.7746],
[-1.5576, 0.9956],
[-0.8798, -0.6011],
[-1.2742, 2.1228],
[-1.2347, -0.4879]],
[[-0.9138, -0.6581],
[ 0.0780, 0.5258],
[-0.4880, 1.1914],
[-0.8140, -0.7360],
[-1.4032, 0.0360],
[-0.0635, 0.6756],
[-0.0978, 1.8446],
[-1.1845, 1.3835]]])
print(f'sample sequence data')
print(f'shape: {x.shape}')
print(f'meaning: {B} sequences x {T} positions x {C} features')
print()
print(f'first sequence (batch 0):')
for t in range(T):
    print(f'position {t}: {x[0, t].tolist()}')
Output:
sample sequence data
shape: torch.Size([4, 8, 2])
meaning: 4 sequences x 8 positions x 2 features
first sequence (batch 0):
position 0: [1.9269150495529175, 1.4872841835021973]
position 1: [0.9007171988487244, -2.1055214405059814]
position 2: [0.6784184575080872, -1.2345449924468994]
position 3: [-0.043067481368780136, -1.6046669483184814]
position 4: [-0.7521361708641052, 1.6487228870391846]
position 5: [-0.3924786448478699, -1.4036067724227905]
position 6: [-0.7278812527656555, -0.5594298839569092]
position 7: [-0.7688389420509338, 0.7624453902244568]
print(f'sample sequence data')
print(f'shape: {x.shape}')
print(f'meaning: {B} sequences x {T} positions x {C} features')
print()
print(f'second sequence (batch 1):')
for t in range(T):
    print(f'position {t}: {x[1, t].tolist()}')
Output:
sample sequence data
shape: torch.Size([4, 8, 2])
meaning: 4 sequences x 8 positions x 2 features
second sequence (batch 1):
position 0: [1.6423169374465942, -0.15959732234477997]
position 1: [-0.4973974823951721, 0.4395892322063446]
position 2: [-0.7581311464309692, 1.078317642211914]
position 3: [0.8008005023002625, 1.680620551109314]
position 4: [1.27912437915802, 1.2964228391647339]
position 5: [0.610466480255127, 1.334737777709961]
position 6: [-0.2316243201494217, 0.041759490966796875]
position 7: [-0.2515752613544464, 0.859858512878418]
print(f'sample sequence data')
print(f'shape: {x.shape}')
print(f'meaning: {B} sequences x {T} positions x {C} features')
print()
print(f'third sequence (batch 2):')
for t in range(T):
    print(f'position {t}: {x[2, t].tolist()}')
Output:
sample sequence data
shape: torch.Size([4, 8, 2])
meaning: 4 sequences x 8 positions x 2 features
third sequence (batch 2):
position 0: [-1.3846741914749146, -0.8712361454963684]
position 1: [-0.2233659327030182, 1.7173610925674438]
position 2: [0.31887972354888916, -0.42451897263526917]
position 3: [0.30572032928466797, -0.7745925188064575]
position 4: [-1.5575722455978394, 0.9956361055374146]
position 5: [-0.8797858357429504, -0.601142942905426]
position 6: [-1.2741514444351196, 2.1227850914001465]
position 7: [-1.234653353691101, -0.4879138767719269]
print(f'sample sequence data')
print(f'shape: {x.shape}')
print(f'meaning: {B} sequences x {T} positions x {C} features')
print()
print(f'fourth sequence (batch 3):')
for t in range(T):
    print(f'position {t}: {x[3, t].tolist()}')
Output:
sample sequence data
shape: torch.Size([4, 8, 2])
meaning: 4 sequences x 8 positions x 2 features
fourth sequence (batch 3):
position 0: [-0.9138230085372925, -0.6581372618675232]
position 1: [0.07802387326955795, 0.5258087515830994]
position 2: [-0.48799172043800354, 1.1913691759109497]
position 3: [-0.8140076398849487, -0.7359928488731384]
position 4: [-1.4032478332519531, 0.036003824323415756]
position 5: [-0.06347727030515671, 0.6756148934364319]
position 6: [-0.0978068932890892, 1.8445940017700195]
position 7: [-1.184537410736084, 1.3835493326187134]
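With the data in hand, the 3x3 averaging trick from earlier carries over to the whole batch. The sketch below assumes the same seed and shapes as above and uses the illustrative names `wei`, `x_avg`, and `expected`. It relies on `@` broadcasting a (T, T) matrix across the batch dimension of a (B, T, C) tensor.

```python
import torch

torch.manual_seed(42)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# (T, T) causal averaging weights, built the same way as the 3x3 example.
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)

# (T, T) @ (B, T, C): the weights are broadcast across the batch dimension,
# giving a (B, T, C) result where each position is the mean of itself and its past.
x_avg = wei @ x

# Check against an explicit loop over sequences and positions.
expected = torch.zeros(B, T, C)
for bi in range(B):
    for t in range(T):
        expected[bi, t] = x[bi, : t + 1].mean(dim=0)

print(x_avg.shape)
print(torch.allclose(x_avg, expected, atol=1e-5))
```

Position 0 of every sequence is unchanged, since it can only average over itself, and each sequence is averaged independently of the others.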
