Generates permutations of lists of letters that sound like real proper nouns.
- Try it at chalier.fr/pseudo/
- Read about how it works in this blog article
This project is licensed under the GPL-3.0 license.
Contributions are welcome. Feel free to create pull requests with your changes!
You will need a working installation of Python 3.
The current corpus contains only the full text of Les Misérables by Victor Hugo. That is more than enough to train a basic model for French. However, you may want to use more recent datasets or add support for other languages. In that case, start by gathering a few megabytes of text data.
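To check whether a corpus folder has reached the "few megabytes" mark, a small helper like the one below can total the file sizes. This is only a sketch: the function name `corpus_size_megabytes` is hypothetical and not part of this repository, and it assumes the corpus is stored as plain files in a single folder such as `corpus/`.

```python
from pathlib import Path


def corpus_size_megabytes(corpus_dir: str) -> float:
    """Return the total size of all regular files under corpus_dir, in megabytes.

    Hypothetical helper, not part of the project. Uses 1 MB = 1,000,000 bytes.
    """
    total_bytes = sum(
        path.stat().st_size
        for path in Path(corpus_dir).rglob("*")  # walk subfolders as well
        if path.is_file()
    )
    return total_bytes / 1_000_000
```

For example, calling `corpus_size_megabytes("corpus")` from the repository root reports the size of the folder that is passed to train.py.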
Execute the train.py script, and pass your corpus as an argument. For instance, here is how the default French model was trained:
python train.py --max-token-length 5 --output-path data/tokens.tsv corpus/*
Then, put the generated TSV file in the model.zip archive. The archive.ps1 and archive.sh scripts can do that for you.
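If neither script suits your environment, the archive can also be built with Python's standard `zipfile` module. The sketch below is not a port of archive.ps1 or archive.sh. It assumes that every file in the data folder belongs at the root of the archive under its bare filename, which may not match what those scripts do. The function name `build_archive` is hypothetical.

```python
import zipfile
from pathlib import Path


def build_archive(data_dir: str, archive_path: str) -> list[str]:
    """Rebuild archive_path from every regular file found directly in data_dir.

    Assumption: files are stored flat at the archive root (e.g. "tokens.tsv").
    Keep archive_path outside data_dir so the archive does not include itself.
    Returns the names of the files that were written.
    """
    files = sorted(p for p in Path(data_dir).iterdir() if p.is_file())
    # Mode "w" overwrites any existing archive, so stale entries are dropped.
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            archive.write(path, arcname=path.name)
    return [p.name for p in files]
```

Calling `build_archive("data", "model.zip")` would then package data/tokens.tsv together with any other files in the data folder. Rewriting the whole archive avoids duplicate entries, which `zipfile` would otherwise create when appending a file that already exists in the archive.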
The model.zip archive contains text files that serve as prefix lists:
- firstnames.txt (mostly drawn from Wikipedia)
- streets.txt
You may add your own list to the archive. It should contain one entry per line. Normalization is performed on the fly, so you do not have to worry about it. Again, if you put the list inside the data folder, the archive.ps1 and archive.sh scripts can add it to the archive for you.
Then, make sure to add the filename of this list as an option for the prefix select tag in index.html (the option's value should be the filename, including the extension).
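A custom prefix list in the one-entry-per-line layout described above can be produced with a few lines of Python. This is a minimal sketch: the helper name `write_prefix_list` and the example filename are hypothetical, and UTF-8 encoding is an assumption, since the expected encoding is not stated here. Only blank entries and surrounding whitespace are removed; other normalization is left to the project.

```python
from pathlib import Path
from typing import Iterable


def write_prefix_list(entries: Iterable[str], output_path: str) -> int:
    """Write entries to output_path, one per line, skipping blank entries.

    Hypothetical helper. Assumes UTF-8; returns the number of entries written.
    """
    cleaned = [entry.strip() for entry in entries if entry.strip()]
    # One entry per line, with a trailing newline after the last entry.
    text = "".join(entry + "\n" for entry in cleaned)
    Path(output_path).write_text(text, encoding="utf-8")
    return len(cleaned)
```

For instance, `write_prefix_list(["Fantine", "Cosette"], "data/characters.txt")` would create a two-line file in the data folder; `characters.txt` is only an illustrative name.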