
Errors in pretraining with TinyStories data #2096

@lihux25

Description


Bug description

I'm trying to test pretraining with litgpt. The command I'm running is:
litgpt pretrain EleutherAI/pythia-160m --data TinyStories --tokenizer_dir checkpoints/EleutherAI/pythia-160m

P.S. I've followed the instructions in https://github.com/Lightning-AI/litgpt/blob/main/tutorials/pretrain.md ("Pretrain on custom data") and they worked for me.

But when I ran it with the TinyStories data, I got the following errors:

{'data': {'batch_size': 1,
'data_path': PosixPath('data/tinystories'),
'max_seq_length': -1,
'num_workers': 8,
'seed': 42,
'tokenizer': None},
'devices': 'auto',
'eval': {'evaluate_example': 'first',
'final_validation': True,
'initial_validation': False,
'interval': 1000,
'max_iters': 100,
'max_new_tokens': None},
'initial_checkpoint_dir': None,
'log': {'group': None, 'project': None, 'run': None},
'logger_name': 'tensorboard',
'model_config': {'attention_logit_softcapping': None,
'attention_scores_scalar': None,
'attn_bias': False,
'bias': True,
'block_size': 2048,
'final_logit_softcapping': None,
'gelu_approximate': 'none',
'head_size': 64,
'hf_config': {'name': 'pythia-160m', 'org': 'EleutherAI'},
'intermediate_size': 3072,
'lm_head_bias': False,
'mlp_class_name': 'GptNeoxMLP',
'moe_intermediate_size': None,
'n_embd': 768,
'n_expert': 0,
'n_expert_per_token': 0,
'n_head': 12,
'n_layer': 12,
'n_query_groups': 12,
'name': 'pythia-160m',
'norm_1': True,
'norm_2': True,
'norm_class_name': 'LayerNorm',
'norm_eps': 1e-05,
'norm_qk': False,
'norm_qk_type': 'default',
'padded_vocab_size': 50304,
'padding_multiple': 128,
'parallel_residual': True,
'post_attention_norm': False,
'post_mlp_norm': False,
'rope_adjustments': None,
'rope_base': 10000,
'rope_condense_ratio': 1,
'rope_indices': None,
'rope_local_base_freq': None,
'rotary_percentage': 0.25,
'scale_embeddings': False,
'shared_attention_norm': False,
'sliding_window_indices': None,
'sliding_window_size': None,
'vocab_size': 50254},
'model_name': 'EleutherAI/pythia-160m',
'num_nodes': 1,
'optimizer': 'AdamW',
'out_dir': PosixPath('out/pretrain'),
'precision': None,
'resume': False,
'seed': 42,
'tokenizer_dir': PosixPath('checkpoints/EleutherAI/pythia-160m'),
'train': {'epochs': None,
'global_batch_size': 512,
'log_interval': 1,
'lr_warmup_fraction': None,
'lr_warmup_steps': 2000,
'max_norm': 1.0,
'max_seq_length': None,
'max_steps': None,
'max_tokens': 3000000000000,
'micro_batch_size': 4,
'min_lr': 4e-05,
'save_interval': 1000,
'tie_embeddings': False}}
[rank: 0] Seed set to 42
Time to instantiate model: 0.02 seconds.
Total parameters: 162,322,944
data/tinystories/TinyStories_all_data already exists, skipping unpacking...
data/tinystories/TinyStories_all_data already exists, skipping unpacking...
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/jovyan/conda_envs/clm_seq_slm_rf/bin/litgpt", line 10, in <module>
[rank0]: sys.exit(main())
[rank0]: ^^^^^^
[rank0]: File "/home/jovyan/conda_envs/clm_seq_slm_rf/lib/python3.12/site-packages/litgpt/main.py", line 69, in main
[rank0]: CLI(parser_data)
[rank0]: File "/home/jovyan/conda_envs/clm_seq_slm_rf/lib/python3.12/site-packages/jsonargparse/_cli.py", line 27, in CLI
[rank0]: return auto_cli(*args, _stacklevel=3, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/conda_envs/clm_seq_slm_rf/lib/python3.12/site-packages/jsonargparse/_cli.py", line 129, in auto_cli
[rank0]: return _run_component(component, init.get(subcommand))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/conda_envs/clm_seq_slm_rf/lib/python3.12/site-packages/jsonargparse/_cli.py", line 227, in _run_component
[rank0]: return component(**cfg)
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/conda_envs/clm_seq_slm_rf/lib/python3.12/site-packages/litgpt/pretrain.py", line 156, in setup
[rank0]: main(
[rank0]: File "/home/jovyan/conda_envs/clm_seq_slm_rf/lib/python3.12/site-packages/litgpt/pretrain.py", line 218, in main
[rank0]: train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/conda_envs/clm_seq_slm_rf/lib/python3.12/site-packages/litgpt/pretrain.py", line 455, in get_dataloaders
[rank0]: train_dataloader = data.train_dataloader()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/conda_envs/clm_seq_slm_rf/lib/python3.12/site-packages/litgpt/data/tinystories.py", line 82, in train_dataloader
[rank0]: train_dataset = StreamingDataset(
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/conda_envs/clm_seq_slm_rf/lib/python3.12/site-packages/litdata/streaming/dataset.py", line 127, in __init__
[rank0]: self.subsampled_files, self.region_of_interest = subsample_streaming_dataset(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/conda_envs/clm_seq_slm_rf/lib/python3.12/site-packages/litdata/utilities/dataset_utilities.py", line 92, in subsample_streaming_dataset
[rank0]: roi = generate_roi(original_chunks, item_loader)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/conda_envs/clm_seq_slm_rf/lib/python3.12/site-packages/litdata/utilities/dataset_utilities.py", line 248, in generate_roi
[rank0]: roi.append((0, chunk["dim"] // item_loader._block_size))
[rank0]: ~~~~~~~~~~~~~^^~~~~~~~~~~~~~~~~~~~~~~~~
[rank0]: TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'
[rank1]: (traceback identical to rank 0, omitted)
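For context, the failure reduces to the last line of the traceback: generate_roi computes chunk["dim"] // item_loader._block_size, and chunk["dim"] comes back as None for this dataset. A minimal sketch of the failing operation (values here are illustrative, not taken from the real index):

```python
def roi_upper_bound(dim, block_size):
    # Mirrors the expression in litdata's generate_roi:
    # roi.append((0, chunk["dim"] // item_loader._block_size))
    return (0, dim // block_size)

# With a well-formed chunk entry this works:
print(roi_upper_bound(4096, 2048))  # (0, 2)

# But when the chunk's "dim" field is None, floor division raises
# exactly the TypeError reported above:
try:
    roi_upper_bound(None, 2048)
except TypeError as exc:
    print(exc)  # unsupported operand type(s) for //: 'NoneType' and 'int'
```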

FYI, the packages and versions I'm using are as follows:

uv pip freeze
Using Python 3.12.11 environment at: conda_envs/clm_seq_slm_rf
absl-py==2.3.1
accelerate==1.9.0
aiohappyeyeballs==2.6.1
aiohttp==3.12.14
aiosignal==1.4.0
annotated-types==0.7.0
anyio==4.9.0
attrs==25.3.0
bitsandbytes==0.45.4
boto3==1.39.10
botocore==1.39.10
certifi==2025.7.14
chardet==5.2.0
charset-normalizer==3.4.2
click==8.2.1
colorama @ file:///home/conda/feedstock_root/build_artifacts/colorama_1733218098505/work
dataproperty==1.1.0
datasets==3.6.0
dill==0.3.8
docstring-parser==0.17.0
einops==0.8.1
evaluate==0.4.5
exceptiongroup @ file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_1746947292760/work
fastapi==0.116.1
filelock==3.13.1
flash-attn==2.7.4.post1
frozenlist==1.7.0
fsspec==2024.6.1
gitdb==4.0.12
gitpython==3.1.44
grpcio==1.73.1
h11==0.16.0
hf-transfer==0.1.9
hf-xet==1.1.5
httptools==0.6.4
huggingface-hub==0.33.4
idna==3.10
importlib-resources==6.5.2
iniconfig @ file:///home/conda/feedstock_root/build_artifacts/iniconfig_1733223141826/work
jinja2==3.1.4
jmespath==1.0.1
joblib==1.5.1
jsonargparse==4.40.0
jsonlines==4.0.0
lightning==2.5.2
lightning-utilities==0.14.3
litdata==0.2.50
litgpt==0.5.9
litserve==0.2.13
lm-eval==0.4.9
lxml==6.0.0
markdown==3.8.2
markupsafe==2.1.5
mbstrdecoder==1.1.4
more-itertools==10.7.0
mpmath==1.3.0
multidict==6.6.3
multiprocess==0.70.16
networkx==3.3
nltk==3.9.1
numexpr==2.11.0
numpy==1.26.4
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-cusparselt-cu12==0.6.2
nvidia-ml-py @ file:///home/conda/feedstock_root/build_artifacts/nvidia-ml-py_1746576379096/work
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
nvitop @ file:///home/conda/feedstock_root/build_artifacts/nvitop_1748285796336/work
packaging @ file:///home/conda/feedstock_root/build_artifacts/bld/rattler-build_packaging_1745345660/work
pandas==2.3.1
pathvalidate==3.3.1
peft==0.16.0
pillow==11.0.0
pip @ file:///home/conda/feedstock_root/build_artifacts/pip_1746249878903/work
platformdirs==4.3.8
pluggy @ file:///home/conda/feedstock_root/build_artifacts/pluggy_1747339660894/work
portalocker==3.2.0
propcache==0.3.2
protobuf==6.31.1
psutil @ file:///home/conda/feedstock_root/build_artifacts/psutil_1740663123172/work
pyarrow==20.0.0
pybind11==3.0.0
pydantic==2.11.7
pydantic-core==2.33.2
pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1750615794071/work
pytablewriter==1.2.1
pytest @ file:///home/conda/feedstock_root/build_artifacts/pytest_1750239416491/work
python-dateutil==2.9.0.post0
python-dotenv==1.1.1
pytorch-lightning==2.5.2
pytz==2025.2
pyyaml==6.0.2
pyzmq==27.0.0
regex==2024.11.6
requests==2.32.4
rouge-score==0.1.2
s3transfer==0.13.1
sacrebleu==2.5.1
safetensors==0.5.3
scikit-learn==1.7.1
scipy==1.16.0
sentencepiece==0.2.0
sentry-sdk==2.33.1
setuptools==80.9.0
six==1.17.0
smmap==5.0.2
sniffio==1.3.1
sqlitedict==2.1.0
starlette==0.47.2
sympy==1.13.1
tabledata==1.3.4
tabulate==0.9.0
tcolorpy==0.1.7
tensorboard==2.20.0
tensorboard-data-server==0.7.2
threadpoolctl==3.6.0
tifffile==2025.6.11
tiktoken==0.9.0
tokenizers==0.21.2
tomli @ file:///home/conda/feedstock_root/build_artifacts/tomli_1733256695513/work
torch==2.6.0+cu124
torchaudio==2.6.0+cu124
torchmetrics==1.7.4
torchvision==0.21.0+cu124
tqdm==4.67.1
tqdm-multiprocess==0.0.11
transformers==4.51.3
triton==3.2.0
typepy==1.3.4
typeshed-client==2.8.2
typing-extensions==4.14.1
typing-inspection==0.4.1
tzdata==2025.2
urllib3==2.5.0
uvicorn==0.35.0
uvloop==0.21.0
wandb==0.21.0
watchfiles==1.1.0
websockets==15.0.1
werkzeug==3.1.3
wheel==0.45.1
word2number==1.1
xxhash==3.5.0
yarl==1.20.1
zstandard==0.23.0

I'm also using Python 3.12.11 on Linux; the output from uname -a is:

Linux 5.10.238-234.956.amzn2.x86_64 #1 SMP Tue Jul 1 20:20:57 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
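As a diagnostic sketch (not verified against the litdata source): the traceback suggests the chunk metadata litdata reads for the preprocessed TinyStories data carries a null "dim" field. Assuming litdata stores this metadata in an index.json with a top-level "chunks" list of dicts (an assumption inferred from the traceback, with the path taken from the log above), something like this could confirm which entries are affected:

```python
import json
from pathlib import Path

def chunks_missing_dim(index: dict) -> list:
    """Return chunk entries whose 'dim' field is absent or null."""
    return [c for c in index.get("chunks", []) if c.get("dim") is None]

# Hypothetical usage; the file location is a guess based on the
# 'data/tinystories/TinyStories_all_data' path in the log above:
# index = json.loads(Path("data/tinystories/TinyStories_all_data/index.json").read_text())
# print(len(chunks_missing_dim(index)), "chunk(s) with dim=None")
```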

Reproduced in studio

No response

What operating system are you using?

Linux

LitGPT Version

Version: 0.5.9

Metadata

Labels: bug (Something isn't working)