A language model trained from scratch exclusively on data from certain places and time periods to reduce modern bias and emulate the voice, vocabulary, and worldview of the era.
Imagine if an AI model didn't just pretend to be historical but actually was.
v0 and v0.5 are built on nanoGPT by Andrej Karpathy; the core training scripts and model architecture are his work.
v1 is built on Phi-1.5 by Microsoft.
Early prompts show the model responding with 1800s language and behavior. For example, prompted with "Who art Henry?", it replied "I know that man, I have did not a black, the storm."
- No mention of modern concepts
- Mostly era-accurate vocabulary
- Sentences are mostly incoherent (expected, given only ~187 MB of training data)
A significant improvement over v0.
- Victorian writing style, proper punctuation, mostly grammatical sentences
- Still a high factual hallucination rate
- OCR noise (“Digitized by Google”) still present in outputs
First model to recall and connect a real historical event with an actual figure from the dataset.
Example: Prompt: "It was the year of our Lord 1834"
The output: "It was the year of our Lord 1834 and the streets of London were filled with protest and petition. The cause, as many re counted, was not bound in the way of private, but having taken up the same day in the day of Lord Palmerston, the public will receive a short statement of the difficulties under which the day of law has reached us. It is a matter of deep regret, that the present events in the history of the world are clear, and consequently will be'known. It is not true that the very men who first settled in the Gospel at Jerusalem should have so extensive and so interesting a record of the prosperity and prosperity"
At first I assumed that a protest might have coincidentally taken place that same year, but take a look at this:
This is the first example of one of my models connecting a year to both a real historical event and a real person tied to that event (Lord Palmerston). Earlier models (v0 and v0.5) could mimic 19th-century writing styles but would always hallucinate events, people, and facts. This shows the model is beginning to remember things from the dataset.
- There are nearly 175,000 texts published in London between 1800 and 1875 available on the Internet Archive
- I plan on expanding the corpus and cleaning it more for better reasoning abilities
- Expanding to different regions and time periods for more historical models
This project focuses mostly on curating historical data, preparing it for training, and building a tokenizer. I won't cover the full LLM training process here; for that, refer to nanoGPT by Andrej Karpathy.
- Collect .txt files of public-domain books, documents, etc. from your chosen time period (e.g., London, 1800-1850)
- Keep them within your chosen time/place window
- Clean the text files, either with a script or manually: remove Project Gutenberg headers/footers, modern annotations, and OCR errors (a minimal cleaning sketch is shown below).
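As an illustration, here is a minimal cleaning sketch in Python. The `raw/` and `cleaned/` folder names and the exact noise patterns are assumptions, not part of this repo, so adapt them to your own files:

```python
# clean_corpus.py -- minimal cleaning sketch (folder names and patterns are assumptions)
import re
from pathlib import Path

# Standard Project Gutenberg boilerplate markers
PG_START = re.compile(r"\*\*\* ?START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.IGNORECASE)
PG_END = re.compile(r"\*\*\* ?END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.IGNORECASE)

def clean_text(text: str) -> str:
    # Keep only the body between the Gutenberg start/end markers, if present
    start, end = PG_START.search(text), PG_END.search(text)
    if start and end:
        text = text[start.end():end.start()]
    # Drop common OCR/scanner noise lines
    text = re.sub(r"^.*Digitized by Google.*$", "", text, flags=re.MULTILINE)
    # Collapse runs of blank lines left behind by the removals
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

if __name__ == "__main__":
    out_dir = Path("cleaned")
    out_dir.mkdir(exist_ok=True)
    for path in Path("raw").glob("*.txt"):
        cleaned = clean_text(path.read_text(encoding="utf-8", errors="ignore"))
        (out_dir / path.name).write_text(cleaned, encoding="utf-8")
```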
- Run train_tokenizer.py or train_tokenizer_hf.py on the cleaned data.
- This will give you vocab.json and merges.txt
- These files define the vocabulary and merge rules for your model (a rough sketch of what such a script does follows below).
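The repo's tokenizer scripts aren't reproduced here, but a byte-level BPE trainer such as the one in the Hugging Face `tokenizers` library produces exactly these two files. A sketch, assuming a `cleaned/` folder of text files; the vocab size, special tokens, and output folder are illustrative choices, not the repo's settings:

```python
# Sketch of a byte-level BPE tokenizer trainer (not the repo's train_tokenizer_hf.py)
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

files = [str(p) for p in Path("cleaned").glob("*.txt")]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=files,
    vocab_size=32000,          # assumption: pick a size that fits your corpus
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)

# Writes vocab.json and merges.txt into the target directory
Path("tokenizer_out").mkdir(exist_ok=True)
tokenizer.save_model("tokenizer_out")
```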
- Refer to nanoGPT by Andrej Karpathy, or your chosen architecture's docs, for the training process; a sketch of the data-preparation step nanoGPT expects is shown below.
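nanoGPT trains from flat binary files of token IDs (train.bin / val.bin). A rough sketch of that preparation step using the tokenizer files from above; the 90/10 split and file locations are assumptions:

```python
# prepare.py sketch: encode the cleaned corpus into nanoGPT-style train.bin / val.bin
from pathlib import Path
import numpy as np
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer(
    "tokenizer_out/vocab.json",
    "tokenizer_out/merges.txt",
)

# Concatenate the whole cleaned corpus into one string
text = "\n\n".join(
    p.read_text(encoding="utf-8", errors="ignore") for p in Path("cleaned").glob("*.txt")
)

ids = tokenizer.encode(text).ids
split = int(0.9 * len(ids))  # assumption: 90% train / 10% validation

# nanoGPT reads flat uint16 arrays (fine as long as vocab_size < 65536)
np.array(ids[:split], dtype=np.uint16).tofile("train.bin")
np.array(ids[split:], dtype=np.uint16).tofile("val.bin")
```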
Selective Temporal Training (STT) is a machine learning methodology in which all training data is curated to fall within a specific historical time period. The goal is to model the language and knowledge of that era without influence from modern concepts. For example, the current model (v0.5) is trained on data exclusively from 1800-1875; it is not fine-tuned but trained from scratch, resulting in output that reflects the linguistic style and historical context of that period.
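In practice the "selective" part is just a hard filter on publication metadata before anything enters the training set. A toy sketch; the document records and fields here are made up for illustration and are not the project's actual pipeline:

```python
# Sketch: STT boils down to filtering the corpus by date and place before training starts
documents = [
    {"title": "The Pickwick Papers", "year": 1837, "place": "London", "path": "pickwick.txt"},
    {"title": "A Modern Novel",      "year": 1999, "place": "London", "path": "modern.txt"},
]

ERA = (1800, 1875)
PLACE = "London"

corpus = [
    d for d in documents
    if ERA[0] <= d["year"] <= ERA[1] and d["place"] == PLACE
]
# Only era-appropriate documents survive; everything else never reaches training.
```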
For this project I'm trying to create a language model that is unclouded by modern bias. If I fine-tune something like GPT-2, it's already pre-trained and that information won't go away. If I train from scratch, the language model won't pretend to be old, it just will be. The goal for this project right now is to create something that can reason exclusively using knowledge from London books published between 1800 and 1875.
I'm using books, legal documents, newspapers, and other writings from 1800–1875 London. The list I linked (for v0) has about 200 documents, but for the first training run I only used 50 files, roughly 187 MB in total. You can view the list of documents here: https://github.com/haykgrigo3/TimeCapsuleLLM/blob/main/Copy%20of%20London%20Documents%20for%20Time%20Capsule%20LLM.txt
Dataset sizes:
- v0: ~187 MB
- v0.5: ~435 MB
- v1: ~6.25 GB
Model sizes:
- v0: 16M parameters
- v0.5: 123M parameters
- v1: 700M parameters
GPU: GeForce RTX 4060, CPU: i5-13400F, RAM: 16 GB DDR5.
GPU: A100 (rented).