This project implements an N-gram language model for generating unique names based on patterns learned from a dataset of existing names. It includes a Streamlit web application for interactive exploration of the model's capabilities and visualization of its internal workings.
## Features

- N-gram language model implementation
- Hyperparameter tuning for optimal model performance
- Interactive web interface for name generation and model exploration
- Visualization of model probabilities using heatmaps
- Step-by-step name generation process breakdown
## Project Structure

```
.
├── app.py
├── data
│   ├── names.txt
│   ├── preprocess.py
│   ├── test.txt
│   ├── train.txt
│   └── val.txt
├── ngram.py
├── poetry.lock
└── pyproject.toml
```
- `app.py`: Streamlit web application for interacting with the model
- `ngram.py`: Core implementation of the N-gram language model (check out the C implementation here)
- `data/`: Directory containing the dataset and preprocessing script
- `poetry.lock` & `pyproject.toml`: Poetry dependency management files
## Requirements

- Python 3.7+
- Poetry (for dependency management)
## Installation

1. Clone the repository:

   ```shell
   git clone https://github.com/goldenglorys/ngram-lm-in-python.git
   cd ngram-lm-in-python
   ```

2. Install dependencies using Poetry:

   ```shell
   poetry install
   ```

3. Activate the virtual environment:

   ```shell
   poetry shell
   ```

4. Preprocess the data (if needed):

   ```shell
   python data/preprocess.py
   ```
## Usage

1. Run the Streamlit app:

   ```shell
   streamlit run app.py
   ```

2. Open your web browser and navigate to the URL provided by Streamlit (usually http://localhost:8501).

3. In the Streamlit interface, adjust the hyperparameters:
   - Sequence Lengths: list of N-gram lengths to evaluate
   - Smoothings: list of smoothing values to try
   - Random Seed: seed for reproducibility

4. Click "Train Model and Generate Names" to start the process.

5. Explore the results:
   - View the best hyperparameters found
   - Read generated names
   - Analyze the model's performance metrics
   - Examine the probability heatmap
   - Watch the step-by-step name generation process

6. Experiment with different hyperparameters and observe how they affect the model's behavior and output.
## How It Works

- Data Preprocessing: The `preprocess.py` script prepares the name dataset, splitting it into training, validation, and test sets.
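A minimal sketch of such a split (the shuffle seed and 90/5/5 ratios here are assumptions; `preprocess.py` may use different ones):

```python
import random

def split_names(names, seed=42, val_frac=0.05, test_frac=0.05):
    """Shuffle the names and split them into train/val/test lists."""
    names = list(names)
    random.Random(seed).shuffle(names)  # deterministic shuffle for reproducibility
    n_val = int(len(names) * val_frac)
    n_test = int(len(names) * test_frac)
    val = names[:n_val]
    test = names[n_val:n_val + n_test]
    train = names[n_val + n_test:]
    return train, val, test

train, val, test = split_names(["emma", "olivia", "ava", "isabella", "sophia",
                                "mia", "amelia", "harper", "evelyn", "liam",
                                "noah", "oliver", "elijah", "william", "james",
                                "benjamin", "lucas", "henry", "alex", "zoe"])
```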
- Model Training: The N-gram model (`ngram.py`) is trained on the preprocessed data, learning the statistical patterns of character sequences in names.
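The training step amounts to character n-gram counting with add-k smoothing. A minimal illustration (the actual `ngram.py` API and details may differ):

```python
from collections import Counter, defaultdict

def train_ngram(names, n=3, smoothing=0.1):
    """Count character n-grams; '.' marks name boundaries."""
    counts = defaultdict(Counter)
    vocab = set(".")
    for name in names:
        chars = "." * (n - 1) + name + "."  # pad with start markers, append end marker
        vocab.update(chars)
        for i in range(len(chars) - n + 1):
            context, nxt = chars[i:i + n - 1], chars[i + n - 1]
            counts[context][nxt] += 1
    vocab = sorted(vocab)

    def prob(context, ch):
        """Add-k smoothed P(ch | context)."""
        c = counts[context]
        return (c[ch] + smoothing) / (sum(c.values()) + smoothing * len(vocab))

    return prob, vocab

prob, vocab = train_ngram(["anna", "ava", "amy"], n=2, smoothing=0.0)
```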
- Hyperparameter Tuning: The app performs a grid search over the specified sequence lengths and smoothing values to find the optimal configuration.
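The grid search can be sketched as minimizing average negative log-likelihood on the validation set (the helper names and toy data below are illustrative, not the app's actual code):

```python
import math
from collections import Counter, defaultdict
from itertools import product

def fit(names, n, k):
    """Train a character n-gram model with add-k smoothing; return a probability function."""
    counts, vocab = defaultdict(Counter), set(".")
    for name in names:
        s = "." * (n - 1) + name + "."
        vocab.update(s)
        for i in range(len(s) - n + 1):
            counts[s[i:i + n - 1]][s[i + n - 1]] += 1
    V = len(vocab)
    return lambda ctx, ch: (counts[ctx][ch] + k) / (sum(counts[ctx].values()) + k * V)

def avg_nll(prob, names, n):
    """Average negative log-likelihood per character (lower is better)."""
    total, count = 0.0, 0
    for name in names:
        s = "." * (n - 1) + name + "."
        for i in range(len(s) - n + 1):
            total -= math.log(prob(s[i:i + n - 1], s[i + n - 1]))
            count += 1
    return total / count

train, val = ["anna", "ava", "amy", "mia"], ["ada", "ana"]
# Evaluate every (sequence length, smoothing) pair and keep the best one
best = min(product([2, 3], [0.01, 0.1, 1.0]),
           key=lambda hp: avg_nll(fit(train, hp[0], hp[1]), val, hp[0]))
```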
- Name Generation: Using the trained model, new names are generated by sampling from the learned probability distributions.
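Sampling can be sketched as drawing one character at a time until the end marker appears (a toy bigram version; the real model generalizes this to longer contexts):

```python
import random
from collections import Counter, defaultdict

def generate(counts, n, rng, max_len=12):
    """Sample characters from the counts until the end marker '.' appears."""
    out, ctx = [], "." * (n - 1)
    while len(out) < max_len:
        choices = counts[ctx]
        if not choices:
            break
        chars, weights = zip(*choices.items())
        ch = rng.choices(chars, weights=weights)[0]  # sample proportionally to counts
        if ch == ".":
            break
        out.append(ch)
        ctx = (ctx + ch)[1:]  # slide the context window forward
    return "".join(out)

# Toy bigram counts built from a few names
counts = defaultdict(Counter)
for name in ["anna", "ava", "amy"]:
    s = "." + name + "."
    for a, b in zip(s, s[1:]):
        counts[a][b] += 1

name = generate(counts, n=2, rng=random.Random(0))
```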
- Visualization: The app creates various visualizations to help users understand the model's internal workings and decision-making process.
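A probability heatmap like the one in the app can be sketched with matplotlib (the toy counts and output file name here are illustrative, not the app's actual code):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
from collections import Counter, defaultdict

# Toy bigram counts from a few names; '.' marks name boundaries
names = ["anna", "ava", "amy"]
counts = defaultdict(Counter)
for name in names:
    s = "." + name + "."
    for a, b in zip(s, s[1:]):
        counts[a][b] += 1

vocab = sorted({c for name in names for c in "." + name})
# Row-normalize counts into P(next char | context char)
matrix = [[counts[r][c] / max(sum(counts[r].values()), 1) for c in vocab]
          for r in vocab]

fig, ax = plt.subplots()
im = ax.imshow(matrix, cmap="Blues")
ax.set_xticks(range(len(vocab)))
ax.set_xticklabels(vocab)
ax.set_yticks(range(len(vocab)))
ax.set_yticklabels(vocab)
ax.set_xlabel("next character")
ax.set_ylabel("context character")
fig.colorbar(im)
fig.savefig("bigram_heatmap.png")
```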
## Customization

- To use your own dataset, replace the contents of `data/names.txt` with your desired names (one per line) and run the preprocessing script.
- Modify the hyperparameter ranges in `app.py` to explore different model configurations.
- Extend the `ngram.py` file to implement additional language model features or alternative algorithms.
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Inspired by Andrej Karpathy's work on N-gram models
- Built with Streamlit for an interactive web experience
- Visualization techniques adapted from various data science and machine learning resources
Happy name generating!