mytechnotalent/HackingGPT-2

FREE Reverse Engineering Self-Study Course HERE


HackingGPT

Part 2

Part 2 covers weighted aggregation with matrix multiplication, lower triangular matrices for causal masking, and applying these concepts to batched sequence data with B, T, C dimensions.

Author: Kevin Thomas


Part 1 HERE

Part 3 HERE



import torch

Step 1: Load and Inspect the Data

Now let's read the file and see what we're working with. Understanding your data is crucial before building any model!

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
text

Output:

'A dim glow rises behind the glass of a screen and the machine exhales in binary tides. The hum is a language and one who listens leans close to catch the quiet grammar. Patterns fold like small maps and seams hint at how the thing holds itself together. Treat each blinking diode and each idle tick as a sentence in a story that asks to be read.\n\nThere is patience here, not of haste but of careful unthreading. Where others see a sealed box the curious hand traces the join and wonders which thought made it fit. Do not rush to break, coax the meaning out with questions, and watch how the logic replies in traces and errors and in the echoes of forgotten interfaces.\n\nTechnology is artifact and argument at once. It makes a claim about what should be simple, what should be hidden, and what should be trusted. Reverse the gaze and learn its rhetoric, see where it promises ease, where it buries complexity, and where it leaves a backdoor as a sigh between bricks. To read that rhetoric is to be a kind interpreter, not a vandal.\n\nThis work is an apprenticeship in humility. Expect bafflement and expect to be corrected by small things, a timing oddity, a mismatch of expectation, a choice that favors speed over grace. Each misstep teaches a vocabulary of trade offs. Each discovery is a map of decisions and not a verdict on worth.\n\nThere is a moral keeping in the craft. Let curiosity be tempered with regard for consequence. Let repair and understanding lead rather than exploitation. The skill that opens a lock should also know when to hold the key and when to hand it back, mindful of harm and mindful of help.\n\nCelebrate the quiet victories, a stubborn protocol understood, an obscure format rendered speakable, a closed device coaxed into cooperation. 
These are small reconciliations between human intent and metal will, acts of translation rather than acts of conquest.\n\nAfter decoding a mechanism pause and ask what should change, a bug to be fixed, a user to be warned, a design to be amended. The true maker of machines leaves things better for having looked, not simply for having cracked the shell.'
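
A few quick measurements make the inspection step concrete. The sketch below is only an illustration: `sample` is a hypothetical stand-in built from two sentences of the text shown above, so that it runs without `input.txt`. With the real file you would use `text` instead.

```python
# stand-in for the contents of input.txt (assumption: two sentences copied
# from the output above, joined by the blank line that separates paragraphs)
sample = (
    'A dim glow rises behind the glass of a screen and the machine exhales in binary tides.'
    '\n\n'
    'There is patience here, not of haste but of careful unthreading.'
)

paragraphs = sample.split('\n\n')  # paragraphs are separated by blank lines
words = sample.split()             # whitespace-separated words

print(f'characters: {len(sample)}')
print(f'paragraphs: {len(paragraphs)}')
print(f'words:      {len(words)}')
print(f'preview:    {sample[:40]!r}')
```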

Step 2: Weighted Aggregation with Matrix Multiplication

The key insight is that one matrix multiplication can compute a weighted average for every token at once!

How does this work?

We use a lower triangular matrix to control what each token can "see".

[[1, 0, 0],      token 0: sees only itself
 [1, 1, 0],  →   token 1: sees tokens 0 and 1
 [1, 1, 1]]      token 2: sees tokens 0, 1, and 2

After dividing each row by its sum, so that every row sums to 1, this mask becomes a matrix of averaging weights.

[[1.0,  0.0,  0.0],      token 0: 100% of itself
 [0.5,  0.5,  0.0],  →   token 1: 50% each of tokens 0 and 1
 [0.33, 0.33, 0.33]]     token 2: 33% each of all three tokens

Why is this useful?

  • This is causal masking: each token sees only itself and the past, never the future.
  • Matrix multiplication applies these weights efficiently in parallel.
  • This is the foundation of how attention works in transformers!
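
One way to convince yourself of the causal property is to change a future token and confirm that earlier outputs do not move. The sketch below is a minimal check under assumptions: the names `weights`, `tokens`, and `changed` are hypothetical, and the feature values are illustrative.

```python
import torch

# lower triangular matrix with rows normalized to sum to 1
weights = torch.tril(torch.ones(3, 3))
weights = weights / weights.sum(dim=1, keepdim=True)

# hypothetical token features (3 tokens, 2 features each)
tokens = torch.tensor([[2., 7.], [6., 4.], [6., 5.]])
out = weights @ tokens

# change only the last token and aggregate again
changed = tokens.clone()
changed[2] = torch.tensor([100., -100.])
out_changed = weights @ changed

# rows 0 and 1 put zero weight on token 2, so they should be unaffected
print(torch.allclose(out[:2], out_changed[:2]))
# row 2 does include token 2, so it should differ
print(torch.allclose(out[2], out_changed[2]))
```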
torch.manual_seed(42)

Output:

<torch._C.Generator at 0x106a51a10>
# create a lower triangular matrix (only see past, not future)
# tril = "triangle lower"
a = torch.tril(torch.ones(3, 3))
a

Output:

tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
# normalize each row to sum to 1 (convert to weights/probabilities)
a = a / torch.sum(a, dim=1, keepdim=True)
a

Output:

tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
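
The same two steps work for any sequence length, not just 3. As a hedged sketch, the helper below wraps them in a function. The name `causal_average_weights` is hypothetical and not part of PyTorch or this repository.

```python
import torch

def causal_average_weights(T):
    """Return a (T, T) matrix whose row t averages positions 0..t."""
    mask = torch.tril(torch.ones(T, T))          # 1 where column <= row, else 0
    return mask / mask.sum(dim=1, keepdim=True)  # row t is divided by t + 1

w = causal_average_weights(5)
print(w)
print(w.sum(dim=1))                           # sum of each row
print(torch.triu(w, diagonal=1).abs().sum())  # total weight on future positions
```

Each row should sum to 1, and the part of the matrix above the diagonal should carry no weight at all.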
# some data to aggregate (3 tokens, each with 2 features)
b = torch.randint(0, 10, (3, 2)).float()
b

Output:

tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
# matrix multiply: each row of result is weighted average of b's rows
c = a @ b
c

Output:

tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])
# detailed breakdown of the first row's computation
print(f'row 0: {a[0].tolist()} @ {b.tolist()}')
print(f'     =  1.0 * {b[0].tolist()}') 
print(f'     +  0.0 * {b[1].tolist()}')
print(f'     +  0.0 * {b[2].tolist()}')
print(f'     = {c[0].tolist()}')
print()
print(f'     →  Token 0 sees only itself (future tokens multiplied by 0)')

Output:

row 0: [1.0, 0.0, 0.0] @ [[2.0, 7.0], [6.0, 4.0], [6.0, 5.0]]
     =  1.0 * [2.0, 7.0]
     +  0.0 * [6.0, 4.0]
     +  0.0 * [6.0, 5.0]
     = [2.0, 7.0]

     →  Token 0 sees only itself (future tokens multiplied by 0)

# detailed breakdown of the second row's computation
print(f'row 1: {a[1].tolist()} @ {b.tolist()}')
print(f'     =  0.5 * {b[0].tolist()}')
print(f'     +  0.5 * {b[1].tolist()}')
print(f'     +  0.0 * {b[2].tolist()}')
print(f'     = {c[1].tolist()}')
print()
print(f'     →  Token 1 becomes the average of tokens 0 and 1')

Output:

row 1: [0.5, 0.5, 0.0] @ [[2.0, 7.0], [6.0, 4.0], [6.0, 5.0]]
     =  0.5 * [2.0, 7.0]
     +  0.5 * [6.0, 4.0]
     +  0.0 * [6.0, 5.0]
     = [4.0, 5.5]

     →  Token 1 becomes the average of tokens 0 and 1

# detailed breakdown of the third row's computation
print(f'row 2: {a[2].tolist()} @ {b.tolist()}')
print(f'     =  0.33 * {b[0].tolist()}')
print(f'     +  0.33 * {b[1].tolist()}')
print(f'     +  0.33 * {b[2].tolist()}')
print(f'     = {c[2].tolist()}')
print()
print(f'     →  Token 2 becomes the average of tokens 0, 1, and 2')

Output:

row 2: [0.3333333432674408, 0.3333333432674408, 0.3333333432674408] @ [[2.0, 7.0], [6.0, 4.0], [6.0, 5.0]]
     =  0.33 * [2.0, 7.0]
     +  0.33 * [6.0, 4.0]
     +  0.33 * [6.0, 5.0]
     = [4.6666669845581055, 5.333333969116211]

     →  Token 2 becomes the average of tokens 0, 1, and 2
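
The row-by-row breakdowns above can also be checked against a plain loop. The sketch below rebuilds the same `a` and `b` by hand (so it does not depend on the random seed) and compares the loop result with the matrix product. The names `c_loop` and `c_matmul` are introduced here for illustration.

```python
import torch

# the same 3x3 averaging weights and 3x2 data used in this step
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, dim=1, keepdim=True)
b = torch.tensor([[2., 7.], [6., 4.], [6., 5.]])

# explicit version: for each token t, average rows 0..t of b
c_loop = torch.zeros(3, 2)
for t in range(3):
    c_loop[t] = b[: t + 1].mean(dim=0)

# matrix version: one multiplication computes all three averages
c_matmul = a @ b

print(c_loop)
print(c_matmul)
print(torch.allclose(c_loop, c_matmul))
```

The loop handles one token at a time, while the matrix product handles every token in a single call, which is why the matrix form is preferred.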

Step 3: Apply to Sequence Data

Now let's apply this averaging concept to batched sequence data, which has the following three dimensions.

  • B = batch size (multiple sequences)
  • T = time (sequence length / positions)
  • C = channels (features per token)

What are "features per token"?

Each token is described by C numbers (a vector).

C value    What it means
C = 2      each token is described by 2 numbers (a 2D vector)
C = 64     each token is described by 64 numbers (like Karpathy's tutorial)
C = 768    each token is described by 768 numbers (like GPT-2)

Example with shape (B=4, T=8, C=2).

x[0, 0, :]  →  [0.3, -0.5]  # token 0 in sequence 0: two features
x[0, 1, :]  →  [1.2, 0.8]   # token 1 in sequence 0: two features
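
The indexing pattern above can be tried on a real tensor. In this sketch `x_demo` is a hypothetical stand-in filled with random values, so the numbers will not match the illustrative ones shown above. Only the shapes matter here.

```python
import torch

B, T, C = 4, 8, 2               # batch, time, channels
x_demo = torch.randn(B, T, C)   # hypothetical stand-in tensor

print(x_demo[0, 0, :].shape)    # one token: C features
print(x_demo[0].shape)          # one sequence: T tokens, C features each
print(x_demo[:, 0, :].shape)    # position 0 of every sequence: B tokens, C features each
```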

Why multiple features?

  • A single number can't capture much meaning.
  • 768 numbers can encode rich information: "this token is a noun, relates to animals, is positive sentiment, etc."
  • These features are learned during training: the model figures out what numbers to assign.

In this toy example, C = 2 keeps things small so you can see what's happening.

torch.manual_seed(42)

Output:

<torch._C.Generator at 0x106a51a10>
# define batch dimension
B = 4  # batch size: 4 independent sequences
B

Output:

4
# define time dimension
T = 8  # sequence length: 8 tokens/positions in each sequence
T

Output:

8
# define channel dimension
C = 2  # feature size: 2 features per token
C

Output:

2
# start with random data
x = torch.randn(B, T, C)
x

Output:

tensor([[[ 1.9269,  1.4873],
         [ 0.9007, -2.1055],
         [ 0.6784, -1.2345],
         [-0.0431, -1.6047],
         [-0.7521,  1.6487],
         [-0.3925, -1.4036],
         [-0.7279, -0.5594],
         [-0.7688,  0.7624]],

        [[ 1.6423, -0.1596],
         [-0.4974,  0.4396],
         [-0.7581,  1.0783],
         [ 0.8008,  1.6806],
         [ 1.2791,  1.2964],
         [ 0.6105,  1.3347],
         [-0.2316,  0.0418],
         [-0.2516,  0.8599]],

        [[-1.3847, -0.8712],
         [-0.2234,  1.7174],
         [ 0.3189, -0.4245],
         [ 0.3057, -0.7746],
         [-1.5576,  0.9956],
         [-0.8798, -0.6011],
         [-1.2742,  2.1228],
         [-1.2347, -0.4879]],

        [[-0.9138, -0.6581],
         [ 0.0780,  0.5258],
         [-0.4880,  1.1914],
         [-0.8140, -0.7360],
         [-1.4032,  0.0360],
         [-0.0635,  0.6756],
         [-0.0978,  1.8446],
         [-1.1845,  1.3835]]])
print(f'sample sequence data')
print(f'shape: {x.shape}')
print(f'meaning: {B} sequences x {T} positions x {C} features')
print()
print(f'first sequence (batch 0):')
for t in range(T):
    print(f'position {t}: {x[0, t].tolist()}')

Output:

sample sequence data
shape: torch.Size([4, 8, 2])
meaning: 4 sequences x 8 positions x 2 features

first sequence (batch 0):
position 0: [1.9269150495529175, 1.4872841835021973]
position 1: [0.9007171988487244, -2.1055214405059814]
position 2: [0.6784184575080872, -1.2345449924468994]
position 3: [-0.043067481368780136, -1.6046669483184814]
position 4: [-0.7521361708641052, 1.6487228870391846]
position 5: [-0.3924786448478699, -1.4036067724227905]
position 6: [-0.7278812527656555, -0.5594298839569092]
position 7: [-0.7688389420509338, 0.7624453902244568]

print(f'sample sequence data')
print(f'shape: {x.shape}')
print(f'meaning: {B} sequences x {T} positions x {C} features')
print()
print(f'second sequence (batch 1):')
for t in range(T):
    print(f'position {t}: {x[1, t].tolist()}')

Output:

sample sequence data
shape: torch.Size([4, 8, 2])
meaning: 4 sequences x 8 positions x 2 features

second sequence (batch 1):
position 0: [1.6423169374465942, -0.15959732234477997]
position 1: [-0.4973974823951721, 0.4395892322063446]
position 2: [-0.7581311464309692, 1.078317642211914]
position 3: [0.8008005023002625, 1.680620551109314]
position 4: [1.27912437915802, 1.2964228391647339]
position 5: [0.610466480255127, 1.334737777709961]
position 6: [-0.2316243201494217, 0.041759490966796875]
position 7: [-0.2515752613544464, 0.859858512878418]

print(f'sample sequence data')
print(f'shape: {x.shape}')
print(f'meaning: {B} sequences x {T} positions x {C} features')
print()
print(f'third sequence (batch 2):')
for t in range(T):
    print(f'position {t}: {x[2, t].tolist()}')

Output:

sample sequence data
shape: torch.Size([4, 8, 2])
meaning: 4 sequences x 8 positions x 2 features

third sequence (batch 2):
position 0: [-1.3846741914749146, -0.8712361454963684]
position 1: [-0.2233659327030182, 1.7173610925674438]
position 2: [0.31887972354888916, -0.42451897263526917]
position 3: [0.30572032928466797, -0.7745925188064575]
position 4: [-1.5575722455978394, 0.9956361055374146]
position 5: [-0.8797858357429504, -0.601142942905426]
position 6: [-1.2741514444351196, 2.1227850914001465]
position 7: [-1.234653353691101, -0.4879138767719269]

print(f'sample sequence data')
print(f'shape: {x.shape}')
print(f'meaning: {B} sequences x {T} positions x {C} features')
print()
print(f'fourth sequence (batch 3):')
for t in range(T):
    print(f'position {t}: {x[3, t].tolist()}')

Output:

sample sequence data
shape: torch.Size([4, 8, 2])
meaning: 4 sequences x 8 positions x 2 features

fourth sequence (batch 3):
position 0: [-0.9138230085372925, -0.6581372618675232]
position 1: [0.07802387326955795, 0.5258087515830994]
position 2: [-0.48799172043800354, 1.1913691759109497]
position 3: [-0.8140076398849487, -0.7359928488731384]
position 4: [-1.4032478332519531, 0.036003824323415756]
position 5: [-0.06347727030515671, 0.6756148934364319]
position 6: [-0.0978068932890892, 1.8445940017700195]
position 7: [-1.184537410736084, 1.3835493326187134]
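
With the data in place, the averaging matrix from Step 2 can be applied to the whole batch in one multiplication. The following is a minimal sketch, not necessarily the author's exact code: the names `wei` and `xbow` are assumptions chosen for illustration. It relies on `@` broadcasting a (T, T) matrix across the batch dimension of a (B, T, C) tensor.

```python
import torch

torch.manual_seed(42)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# (T, T) causal averaging weights, built as in Step 2
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)

# (T, T) @ (B, T, C) broadcasts over the batch dimension -> (B, T, C)
xbow = wei @ x

print(xbow.shape)
# position 0 of each sequence only averages itself, so it is unchanged
print(torch.allclose(xbow[:, 0], x[:, 0]))
# position 3 of sequence 0 is the mean of positions 0..3 of that sequence
print(torch.allclose(xbow[0, 3], x[0, :4].mean(dim=0), atol=1e-6))
```

Each sequence in the batch is averaged independently. No information moves between sequences, only backward in time within one sequence.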

MIT License

About

HackingGPT part 2 where we learn the foundations to create a custom GPT from absolute scratch.
