NGram language model in Python with visual explanation with Streamlit

goldenglorys/ngram-lm-in-python

N-gram Language Model Name Generator

Overview

This project implements an N-gram language model for generating unique names based on patterns learned from a dataset of existing names. It includes a Streamlit web application for interactive exploration of the model's capabilities and visualization of its internal workings.

Features

  • N-gram language model implementation
  • Hyperparameter tuning for optimal model performance
  • Interactive web interface for name generation and model exploration
  • Visualization of model probabilities using heatmaps
  • Step-by-step name generation process breakdown

Project Structure

.
├── app.py
├── data
│   ├── names.txt
│   ├── preprocess.py
│   ├── test.txt
│   ├── train.txt
│   └── val.txt
├── ngram.py
├── poetry.lock
└── pyproject.toml
  • app.py: Streamlit web application for interacting with the model
  • ngram.py: Core implementation of the N-gram language model. A C implementation is also available.
  • data/: Directory containing dataset and preprocessing script
  • poetry.lock & pyproject.toml: Poetry dependency management files

Requirements

  • Python 3.7+
  • Poetry (for dependency management)

Setup

  1. Clone the repository:

    git clone https://github.com/goldenglorys/ngram-lm-in-python.git
    cd ngram-lm-in-python
    
  2. Install dependencies using Poetry:

    poetry install
    
  3. Activate the virtual environment:

    poetry shell
    
  4. Preprocess the data (if needed):

    python data/preprocess.py
    
  5. Run the Streamlit app:

    streamlit run app.py
    
  6. Open your web browser and navigate to the URL provided by Streamlit (usually http://localhost:8501).

Usage

  1. In the Streamlit interface, adjust the hyperparameters:

    • Sequence Lengths: List of N-gram lengths to evaluate
    • Smoothings: List of smoothing values to try
    • Random Seed: Seed for reproducibility
  2. Click "Train Model and Generate Names" to start the process.

  3. Explore the results:

    • View the best hyperparameters found
    • Read generated names
    • Analyze the model's performance metrics
    • Examine the probability heatmap
    • Watch the step-by-step name generation process
  4. Experiment with different hyperparameters and observe how they affect the model's behavior and output.
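
As a rough illustration of what a probability heatmap can display, the sketch below builds a bigram table in which each row is a conditional distribution over the next character. It does not reproduce the code in app.py; the `bigram_matrix` function, the `END` marker, and the sample names are all hypothetical, and only the standard library is used.

```python
from collections import Counter

END = "."  # hypothetical marker for the start and end of a name


def bigram_matrix(names, smoothing=1.0):
    """Return (symbols, rows) where rows[i][j] = P(symbols[j] | symbols[i])."""
    symbols = sorted(set("".join(names)) | {END})
    pair_counts = Counter()
    for name in names:
        padded = END + name + END
        # Count every adjacent character pair, including start and end markers.
        pair_counts.update(zip(padded, padded[1:]))
    rows = []
    for prev in symbols:
        # Additive smoothing keeps unseen pairs at a small non-zero probability.
        total = sum(pair_counts[(prev, nxt)] for nxt in symbols)
        total += smoothing * len(symbols)
        rows.append([(pair_counts[(prev, nxt)] + smoothing) / total
                     for nxt in symbols])
    return symbols, rows


symbols, rows = bigram_matrix(["ada", "dan", "ann"])

# Text rendering: rows are the previous character, columns the next character.
print("     " + "  ".join(f"{s:>4}" for s in symbols))
for prev, row in zip(symbols, rows):
    print(f"{prev:>4} " + "  ".join(f"{p:4.2f}" for p in row))
```

The nested list returned here could be handed to a plotting library to draw an actual heatmap, with one cell per (previous character, next character) pair.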

How It Works

  1. Data Preprocessing: The preprocess.py script prepares the name dataset, splitting it into training, validation, and test sets.

  2. Model Training: The N-gram model (ngram.py) is trained on the preprocessed data, learning the statistical patterns of character sequences in names.

  3. Hyperparameter Tuning: The app performs a grid search over specified sequence lengths and smoothing values to find the optimal configuration.

  4. Name Generation: Using the trained model, new names are generated by sampling from the learned probability distributions.

  5. Visualization: The app renders a probability heatmap and a step-by-step breakdown of name generation, so users can see how the model arrives at each character.
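
The training, tuning, and generation steps above can be sketched as follows. This is a minimal illustration, not the code in ngram.py or app.py: it assumes additive smoothing, assumes the grid search picks the configuration with the lowest average negative log-likelihood on the validation set, and uses hypothetical names throughout (`train`, `probs`, `avg_nll`, `grid_search`, `sample_name`, the `EOT` token, and the toy data).

```python
import math
import random
from collections import defaultdict
from itertools import product

EOT = "\n"  # hypothetical end-of-name token, also used as left padding


def train(names, seq_len):
    """Count how often each character follows each (seq_len - 1)-char context."""
    counts = defaultdict(lambda: defaultdict(int))
    for name in names:
        padded = EOT * (seq_len - 1) + name + EOT
        for i in range(seq_len - 1, len(padded)):
            counts[padded[i - seq_len + 1:i]][padded[i]] += 1
    return counts


def probs(counts, context, vocab, smoothing):
    """Next-character distribution with additive smoothing (assumed scheme)."""
    ctx = counts.get(context, {})
    total = sum(ctx.values()) + smoothing * len(vocab)
    return {ch: (ctx.get(ch, 0) + smoothing) / total for ch in vocab}


def avg_nll(counts, names, seq_len, vocab, smoothing):
    """Average negative log-likelihood per character; requires smoothing > 0."""
    total, n = 0.0, 0
    for name in names:
        padded = EOT * (seq_len - 1) + name + EOT
        for i in range(seq_len - 1, len(padded)):
            context = padded[i - seq_len + 1:i]
            total -= math.log(probs(counts, context, vocab, smoothing)[padded[i]])
            n += 1
    return total / n


def grid_search(train_names, val_names, vocab, seq_lens, smoothings):
    """Try every (seq_len, smoothing) pair and keep the lowest validation loss."""
    best = None
    for seq_len, smoothing in product(seq_lens, smoothings):
        counts = train(train_names, seq_len)
        loss = avg_nll(counts, val_names, seq_len, vocab, smoothing)
        if best is None or loss < best[0]:
            best = (loss, seq_len, smoothing, counts)
    return best


def sample_name(counts, seq_len, vocab, smoothing, rng, max_len=20):
    """Sample one character at a time until EOT is drawn or max_len is reached."""
    out, context = "", EOT * (seq_len - 1)
    while len(out) < max_len:
        p = probs(counts, context, vocab, smoothing)
        ch = rng.choices(list(p), weights=list(p.values()))[0]
        if ch == EOT:
            break
        out += ch
        # Slide the context window; a unigram model has an empty context.
        context = (context + ch)[-(seq_len - 1):] if seq_len > 1 else ""
    return out


train_names = ["anna", "hanna", "joanna", "liam", "lina", "mila"]
val_names = ["nina", "alan"]
vocab = sorted(set("".join(train_names + val_names)) | {EOT})

loss, seq_len, smoothing, counts = grid_search(
    train_names, val_names, vocab, seq_lens=[1, 2, 3], smoothings=[0.1, 1.0]
)
rng = random.Random(42)
generated = [sample_name(counts, seq_len, vocab, smoothing, rng) for _ in range(5)]
print(seq_len, smoothing, round(loss, 3), generated)
```

Passing a seeded `random.Random` instance keeps sampling reproducible, which mirrors the role of the Random Seed setting in the interface. The smoothing value must be positive here, since a zero probability would make the log-likelihood undefined.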

Customization

  • To use your own dataset, replace the contents of data/names.txt with your desired names (one per line) and run the preprocessing script.
  • Modify the hyperparameter ranges in app.py to explore different model configurations.
  • Extend the ngram.py file to implement additional language model features or alternative algorithms.
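
For a custom dataset, the train/validation/test split could look roughly like the sketch below. It is not the logic of data/preprocess.py: the `split_names` function, the 80/10/10 proportions, and the seeded shuffle are assumptions made for illustration.

```python
import random


def split_names(names, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle names deterministically and split into train, val, and test."""
    # Drop blank lines and surrounding whitespace before splitting.
    cleaned = [n.strip() for n in names if n.strip()]
    rng = random.Random(seed)
    rng.shuffle(cleaned)
    n_test = int(len(cleaned) * test_frac)
    n_val = int(len(cleaned) * val_frac)
    test = cleaned[:n_test]
    val = cleaned[n_test:n_test + n_val]
    train = cleaned[n_test + n_val:]
    return train, val, test


# In practice the list would be read from data/names.txt (one name per line).
example = [f"name{i}" for i in range(20)] + ["", "  "]
train_set, val_set, test_set = split_names(example)
print(len(train_set), len(val_set), len(test_set))
```

Fixing the seed means the same names land in the same split on every run, so validation results stay comparable across experiments.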

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Inspired by Andrej Karpathy's work on N-gram models
  • Built with Streamlit for an interactive web experience
  • Visualization techniques adapted from various data science and machine learning resources

Happy name generating!
