
Fix numpy.int32 overflow when printing token counts#822

Open
ffuuugor wants to merge 1 commit into karpathy:master from ffuuugor:bugfix_token_count_overflow

Conversation


@ffuuugor ffuuugor commented Jul 3, 2025

Summary

The PyTorch implementation of GPT-2 training prints incorrect token counts for large datasets (>2B tokens).

This PR fixes an integer overflow in DistributedDataLoader when accumulating token counts across multiple data shards. The issue occurs because _peek_data_shard() returns numpy.int32 values, which overflow past 2^31 - 1 (~2.1B).

To be clear, the issue only affects the debug prints; training proceeds fine on the full dataset.

Problem

When loading a large dataset like the 10B-token FineWeb, the token count is accumulated in an np.int32 counter (the type returned by _peek_data_shard).

Running the script produces a warning and prints an incorrect number of tokens:

/workspace/igors/llm.c/train_gpt2.py:345: RuntimeWarning: overflow encountered in scalar add
  ntok_total += shard_ntok

DataLoader: total number of tokens: 1,661,249,667 across 103 files
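The wraparound is easy to reproduce in isolation. This is a minimal sketch, not the actual loader code: the per-shard count and shard count below are illustrative stand-ins for what _peek_data_shard() returns and for the 103 files in the logs above.

```python
import warnings

import numpy as np

# Hypothetical per-shard token count, standing in for the np.int32
# value returned by _peek_data_shard().
shard_ntok = np.int32(100_000_000)
num_shards = 103  # illustrative; echoes the 103 files in the log above

# Buggy accumulation: the running total stays np.int32, so it wraps
# around once it exceeds 2**31 - 1. Each overflowing add emits
# "RuntimeWarning: overflow encountered in scalar add".
ntok_total = np.int32(0)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)  # silence the demo's warnings
    for _ in range(num_shards):
        ntok_total += shard_ntok
print(f"buggy total: {int(ntok_total):,}")  # wrapped value, far below the true count

# Fixed accumulation: casting each shard count to a Python int
# (arbitrary precision) before adding keeps the total exact.
ntok_total = 0
for _ in range(num_shards):
    ntok_total += int(shard_ntok)
print(f"fixed total: {ntok_total:,}")  # 10,300,000,000
```

The same one-line cast is all the PR changes; Python ints never overflow, so the accumulator is exact regardless of dataset size.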

Solution

Cast shard_ntok to a Python int before accumulation:

ntok_total += int(shard_ntok)

Training then proceeds with no warnings and the correct token count:

DataLoader: total number of tokens: 10,251,184,259 across 103 files
