- Introduction
- Project Motivation
- The Challenge: Fairy Tale Generation
- Models Under Investigation
- Dataset and Preprocessing
- Methodology
- Getting Started & Running the Models
- Results and Evaluation
- Model Comparison
- Repository Structure
- Contributing
- License
Natural Language Generation (NLG) is a rapidly evolving field in AI and NLP, focusing on creating understandable and contextually relevant human language text. At the core of modern NLG lies Language Modelling, which learns the probability distribution of token sequences (words, characters, etc.). This project delves into creative text generation by training language models to produce fairy tales.
This project is driven by the desire to explore the capabilities of modern neural network architectures (LSTM and Transformer) in the creative domain of fairy tale generation. The goal is to generate text that is not only grammatically correct and coherent but also creative, stylistically consistent, and engaging.
Fairy tales were chosen for their:
- Distinct Style: Unique vocabulary, sentence structures, and narrative voice.
- Narrative Structure: Recognizable plot patterns, archetypal characters, and moral lessons.
- World Knowledge: Implicit understanding of common tropes.
- Coherence: Maintaining plot, character, and setting consistency.
This project implements, trains, and compares two leading neural network architectures for sequence modelling:
- Long Short-Term Memory (LSTM): A type of RNN adept at handling sequential data and capturing long-range dependencies.
- Transformer: A newer architecture relying entirely on self-attention mechanisms, known for state-of-the-art performance in NLP.
The comparison aims to shed light on their respective strengths and weaknesses in generating creative, structured text using character-level inputs.
- Data Source: A corpus composed primarily of text from classic fairy tales.
- Tokenization Strategy: Character-Level (detailed below).
- Every character (letters, digits, punctuation, whitespace) is treated as an individual token.
- Results in a small vocabulary but requires models to handle much longer sequences.
- Advantages: No Out-of-Vocabulary (OOV) issues, implicit morphology learning.
- Disadvantages: Longer sequences, increased context requirement, potentially weaker semantic units.
- Vocabulary Construction:
- Unique characters mapped to integer IDs.
- Special tokens added: `<|pad|>`, `<|unk|>`, `<|endoftext|>`.
- Vocabulary saved (e.g., `vocab.json`) for consistent use (see the sketch after this list).
- Data Cleaning: lowercasing, whitespace normalization, and similar steps.
- Data Splitting: Training and Validation sets (e.g., 80-90% train, 10-20% validation).
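A minimal sketch of the vocabulary construction step is shown below. The special tokens and the `vocab.json` filename come from the list above; the function names and default output path are illustrative assumptions, not the repository's exact code.

```python
import json

# Special tokens follow the README; function names and paths are assumptions.
SPECIAL_TOKENS = ["<|pad|>", "<|unk|>", "<|endoftext|>"]

def build_vocab(text: str, vocab_path: str = "data/vocab.json") -> dict:
    # Every unique character becomes a token; special tokens get the first IDs.
    chars = sorted(set(text))
    char2id = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + chars)}
    with open(vocab_path, "w", encoding="utf-8") as f:
        json.dump(char2id, f, ensure_ascii=False, indent=2)
    return char2id

def encode(text: str, char2id: dict) -> list[int]:
    # Map unseen characters to <|unk|>; at the character level this is rare.
    unk = char2id["<|unk|>"]
    return [char2id.get(ch, unk) for ch in text]
```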
The project employs character-level tokenization. This means the models learn to predict the next character in a sequence based on the preceding characters.
- Embeddings: Character IDs are converted into dense vector representations (Embedding Layer).
- Autoregressive Prediction: The model predicts the next token based on previous tokens.
- LSTM Architecture: Processes sequences step-by-step, using gates (forget, input, output) to control information flow through a cell state; multiple LSTM layers can be stacked.
- Output: Final hidden states are passed through a linear layer to predict the next character.
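A minimal PyTorch sketch of such a character-level LSTM is shown below. The layer sizes mirror the example hyperparameters listed later (embedding 256, hidden 512, 2 layers, dropout 0.2), but the class itself is illustrative rather than the repository's exact implementation.

```python
import torch.nn as nn

class CharLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512,
                 num_layers=2, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                            batch_first=True, dropout=dropout)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden=None):
        # x: (batch, seq_len) of character IDs
        emb = self.embedding(x)               # (batch, seq_len, embed_dim)
        out, hidden = self.lstm(emb, hidden)  # (batch, seq_len, hidden_dim)
        logits = self.fc(out)                 # (batch, seq_len, vocab_size)
        return logits, hidden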
- Transformer Architecture: Relies entirely on self-attention, processing the whole sequence in parallel. Key components include:
- Multi-Head Self-Attention (Queries, Keys, Values)
- Positional Encodings
- Causal Masking (for autoregressive generation)
- Position-wise Feedforward Networks
- Residual Connections & Layer Normalization
- Multiple such encoder blocks can be stacked.
- Output: Final representations are passed through a linear layer for character prediction.
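The Transformer variant can be sketched with PyTorch's built-in encoder layers plus a causal mask, as below. The embedding and feedforward dimensions follow the example hyperparameters later in this README; the number of attention heads (4) and the maximum sequence length are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CharTransformer(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, ff_dim=512, num_layers=2,
                 num_heads=4, dropout=0.1, max_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(max_len, embed_dim)   # learned positional encodings
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=ff_dim,
            dropout=dropout, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers)
        self.fc = nn.Linear(embed_dim, vocab_size)

    def forward(self, x):
        # x: (batch, seq_len) of character IDs
        seq_len = x.size(1)
        pos = torch.arange(seq_len, device=x.device)
        h = self.token_emb(x) + self.pos_emb(pos)
        # Causal mask: each position attends only to earlier characters.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                     device=x.device), diagonal=1)
        h = self.blocks(h, mask=mask)
        return self.fc(h)                     # (batch, seq_len, vocab_size)
```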
- Initialization: Load model, tokenizer, datasets, optimizer (Adam), loss function (CrossEntropyLoss).
- Epoch Iteration: Loop through the training dataset multiple times.
- Batch Processing:
- Forward pass to get logits.
- Calculate loss.
- Backpropagation: Compute gradients.
- Optimization: Update model parameters.
- Gradient Clipping: Prevent exploding gradients (max_norm: 1.0).
- Validation: Evaluate on the validation set after each epoch.
- Checkpointing: Save the best model based on validation loss (e.g., `best_model.pt`).
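A condensed sketch of this training loop is shown below. It assumes the model returns logits of shape `(batch, seq_len, vocab_size)`; dataloader construction, device handling, and the checkpoint path are simplified assumptions rather than the repository's exact code.

```python
import os
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=10, lr=1e-3, device="cpu"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # Adam optimizer
    criterion = nn.CrossEntropyLoss()                         # cross-entropy loss
    best_val = float("inf")
    os.makedirs("checkpoints", exist_ok=True)

    for epoch in range(epochs):
        model.train()
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            logits = model(inputs)                            # forward pass
            loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
            optimizer.zero_grad()
            loss.backward()                                   # backpropagation
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clipping
            optimizer.step()                                  # parameter update

        # Validation after each epoch; checkpoint on best validation loss.
        model.eval()
        val_loss, batches = 0.0, 0
        with torch.no_grad():
            for inputs, targets in val_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                logits = model(inputs)
                val_loss += criterion(logits.view(-1, logits.size(-1)),
                                      targets.view(-1)).item()
                batches += 1
        val_loss /= max(batches, 1)
        if val_loss < best_val:
            best_val = val_loss
            torch.save(model.state_dict(), "checkpoints/best_model.pt")
```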
- Learning Rate (e.g., LSTM: 0.001, Transformer: 0.0001)
- Embedding Dimension (e.g., 256)
- Hidden/Feedforward Dimensions (e.g., LSTM Hidden: 512, Transformer FF: 512)
- Number of Layers (e.g., 2)
- Dropout Rate (e.g., LSTM: 0.2, Transformer: 0.1)
- Sequence Length (e.g., 50, 60)
- Batch Size (e.g., 6, 64)
This section guides you through setting up the environment and running the models.
- Python 3.10
- PyTorch 1.12.1
- NumPy 1.22.4
- Matplotlib 3.5.2
- TQDM 4.67.1
- Other dependencies (pinned in `requirements.txt`): contourpy==1.3.0, cycler==0.12.1, fonttools==4.57.0, importlib-resources==6.5.2, kiwisolver==1.4.7, packaging==24.2, pillow==11.1.0, pyparsing==3.2.3, python-dateutil==2.9.0.post0, six==1.17.0, typing-extensions==4.13.1, zipp==3.21.0
- Clone the repository:
```bash
git clone https://github.com/dhirendrachoudhary/LanguageModelling.git
cd LanguageModelling
```
- Set up a virtual environment (recommended):
```bash
python3 -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate
```
- Install dependencies:
```bash
pip install -r requirements.txt
# Or: pip install torch torchvision torchaudio numpy matplotlib tqdm ...
```
- Download/Prepare Dataset:
```bash
# Example:
# python preprocess_data.py --input_file path/to/raw_fairy_tales.txt --output_dir data/
```
```bash
# Example for training LSTM:
# python train.py --model_type lstm --config_path config/lstm_config.json --data_path data/processed_text.txt --vocab_path data/vocab.json
# Example for training Transformer:
# python train.py --model_type transformer --config_path config/transformer_config.json --data_path data/processed_text.txt --vocab_path data/vocab.json
```
```bash
# Example for text generation:
# python generate.py --model_path checkpoints/best_lstm_model.pt --vocab_path data/vocab.json --seed_text "Once upon a time" --max_length 200 --sampling_strategy top-k --k 10
```
- Sampling Strategies Available:
  - Greedy Search
  - Top-k Sampling (specify `k`)
  - Nucleus (Top-p) Sampling (specify `p`)
  - Temperature Scaling (specify `temperature`)
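These strategies can be sketched for a single generation step as follows. Here `logits` is assumed to be a 1-D tensor of scores for the last position, and the default values of `k`, `p`, and `temperature` are illustrative, not the repository's defaults.

```python
import torch

def sample_next_id(logits, strategy="top-k", k=10, p=0.9, temperature=1.0):
    logits = logits / temperature                       # temperature scaling
    probs = torch.softmax(logits, dim=-1)

    if strategy == "greedy":                            # greedy search
        return int(torch.argmax(probs))
    if strategy == "top-k":                             # keep the k most likely characters
        values, indices = torch.topk(probs, k)
        values = values / values.sum()
        return int(indices[torch.multinomial(values, 1)])
    if strategy == "top-p":                             # nucleus sampling
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        keep = cumulative <= p
        keep[0] = True                                  # always keep the top character
        kept = sorted_probs[keep] / sorted_probs[keep].sum()
        return int(sorted_idx[keep][torch.multinomial(kept, 1)])
    return int(torch.multinomial(probs, 1))             # plain sampling fallback
```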
Training and validation loss curves were monitored to track learning dynamics.
- The LSTM model showed a steady decline in training loss, with validation loss tracking it closely.
- The Transformer model converged faster initially but showed a slight gap between training and validation loss later on, hinting at mild overfitting.
- LSTM: Produced readable character sequences, some fairy tale elements. Struggled with global coherence and showed repetition.
- Transformer: Consistently produced more fluent and longer coherent passages. Exhibited higher lexical diversity and better captured fairy tale style, including imaginative elements.
| Feature | LSTM Model | Transformer Model | Notes |
|---|---|---|---|
| Training Time | ~2 hrs total (~15 min/epoch) | ~1.5 hrs total (~11 min/epoch) | Tested on 1× V100 GPU (report: RTX 2080) |
| Best Valid Loss | 1.45 | 1.38 | Cross-entropy loss |
| Best Valid PPL | 4.26 | 3.97 | exp(loss); lower is better |
| Distinct-1 (Gen.) | 0.08 | 0.10 | Based on ~1K sampled tokens |
| Distinct-2 (Gen.) | 0.25 | 0.35 | Transformer: more diverse phrases |
| BLEU Score (Gen.) | 10.5 | 12.0 | Reference: curated fairy tale corpus |
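For reference, the two automatic metrics above can be computed as shown below: perplexity is `exp(loss)`, and Distinct-n is the ratio of unique n-grams to total n-grams in the generated text. The function names are illustrative.

```python
import math

def perplexity(cross_entropy_loss: float) -> float:
    # Perplexity is the exponential of the average cross-entropy loss.
    return math.exp(cross_entropy_loss)

def distinct_n(tokens: list[str], n: int) -> float:
    # Ratio of unique n-grams to total n-grams in the generated sequence.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

For example, `perplexity(1.38)` evaluates to approximately 3.97, matching the Transformer row in the table.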