
Fix numpy.int32 overflow when printing token counts#822

Open
ffuuugor wants to merge 1 commit into karpathy:master from ffuuugor:bugfix_token_count_overflow

Conversation


@ffuuugor ffuuugor commented Jul 3, 2025

Summary

The PyTorch implementation of GPT-2 training prints incorrect token counts for large datasets (>2B tokens).

This PR fixes an integer overflow in DistributedDataLoader when accumulating token counts across multiple data shards. The issue occurs because _peek_data_shard() returns numpy.int32 values, which overflow past 2^31 - 1 (~2.1B).

To be clear, the issue only affects the debug prints; training proceeds fine on the full dataset.

Problem

When loading a large dataset like the 10B-token FineWeb, the token count is accumulated in an np.int32 counter (the type returned by _peek_data_shard).

Running the script produces a warning and prints an incorrect number of tokens:

/workspace/igors/llm.c/train_gpt2.py:345: RuntimeWarning: overflow encountered in scalar add
  ntok_total += shard_ntok

DataLoader: total number of tokens: 1,661,249,667 across 103 files
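The wraparound is easy to reproduce in isolation. This is a minimal sketch, not the actual loader code: the per-shard count and shard count below are illustrative stand-ins for what _peek_data_shard() returns and for the 103 files in the logs above.

```python
import warnings

import numpy as np

# Hypothetical per-shard token count, standing in for the np.int32
# value returned by _peek_data_shard().
shard_ntok = np.int32(100_000_000)
num_shards = 103  # illustrative; echoes the 103 files in the log above

# Buggy accumulation: the running total stays np.int32, so it wraps
# around once it exceeds 2**31 - 1. Each overflowing add emits
# "RuntimeWarning: overflow encountered in scalar add".
ntok_total = np.int32(0)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)  # silence the demo's warnings
    for _ in range(num_shards):
        ntok_total += shard_ntok
print(f"buggy total: {int(ntok_total):,}")  # wrapped value, far below the true count

# Fixed accumulation: casting each shard count to a Python int
# (arbitrary precision) before adding keeps the total exact.
ntok_total = 0
for _ in range(num_shards):
    ntok_total += int(shard_ntok)
print(f"fixed total: {ntok_total:,}")  # 10,300,000,000
```

The same one-line cast is all the PR changes; Python ints never overflow, so the accumulator is exact regardless of dataset size.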

Solution

Cast shard_ntok to a Python int before accumulation:

ntok_total += int(shard_ntok)

Training then proceeds with no warnings and the correct token count:

DataLoader: total number of tokens: 10,251,184,259 across 103 files
