This library is currently being cleaned up for public consumption; at the moment it is not very usable.
A small Python module for generating nonsense texts from a source text.
It tokenises the input, builds a Markov chain from the tokens, then picks a random path through the chain and concatenates the visited words.
- A corpus is tokenised, either with the provided tools or with custom ones. The Markov chain abstraction itself is type agnostic, so groups of words can be entered as well (this often makes the text more plausible).
- Using a `WordSaladMatrixBuilder` and `count_follower`, we note what word follows what.
- When all words have been entered, we get a `WordSaladMatrix` using `build_matrix`.
- One or more sentences are generated by picking a "start word", choosing a random number and then picking a follower based on their weights.
- The above step is repeated until some stop condition occurs (stopping on `.` usually works well).
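
The steps above can be sketched in plain Python. This is only an illustration of the idea, not the library's API: the helper names `count_followers` and `generate`, the whitespace tokeniser and the toy corpus are all assumptions made for the example.

```python
import random
from collections import defaultdict


def count_followers(words):
    """Map each word to a dict of {follower: count} (hypothetical helper)."""
    counts = defaultdict(lambda: defaultdict(int))
    for current, follower in zip(words, words[1:]):
        counts[current][follower] += 1
    return counts


def generate(counts, start, stop=".", max_words=50, rng=random):
    """Walk the chain from `start` until `stop`, a dead end, or `max_words`."""
    sentence = [start]
    while sentence[-1] != stop and len(sentence) < max_words:
        followers = counts.get(sentence[-1])
        if not followers:  # dead end: nothing ever followed this word
            break
        candidates = list(followers)
        weights = [followers[w] for w in candidates]
        # Pick one follower with probability proportional to its count.
        sentence.append(rng.choices(candidates, weights=weights, k=1)[0])
    return sentence


corpus = "the cat sat . the dog sat . the cat ran ."
words = corpus.split()  # naive whitespace tokeniser, for illustration only
counts = count_followers(words)
print(" ".join(generate(counts, "the")))
```

Because "cat" follows "the" twice and "dog" only once in this toy corpus, sentences starting with "the" continue with "cat" about two times out of three.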
The terminology used in the library follows:
Term | Explanation |
---|---|
corpus | The text material we "train" the Markov chains on. |
word | A word is simply a unit found in the input corpus; it can be a single character, a group of characters, or whatever. |
follower | A word (see above) that follows another word. |
The `WordSaladMatrix` class uses a sparse numpy matrix to encode the Markov chains.
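
To show what encoding a chain as a matrix can look like, here is a minimal sketch in which each word gets an index and entry `[i, j]` counts how often word `j` follows word `i`. It is not the `WordSaladMatrix` implementation: it uses a dense array for simplicity (the class is described as using a sparse one), and `vocab`, `index` and `pick_follower` are hypothetical names.

```python
import numpy as np

words = "the cat sat . the dog sat . the cat ran .".split()

# Assign each distinct word a row/column index.
vocab = sorted(set(words))
index = {w: i for i, w in enumerate(vocab)}

# matrix[i, j] counts how often word j follows word i.
matrix = np.zeros((len(vocab), len(vocab)), dtype=np.int64)
for current, follower in zip(words, words[1:]):
    matrix[index[current], index[follower]] += 1


def pick_follower(word, rng):
    """Pick a follower of `word` with probability proportional to its count.

    Assumes `word` has at least one follower (its row is not all zeros).
    """
    row = matrix[index[word]]
    return vocab[rng.choice(len(vocab), p=row / row.sum())]


rng = np.random.default_rng(0)
print(pick_follower("the", rng))
```

Most entries of such a matrix are zero for any realistic corpus, since a given word is followed by only a small part of the vocabulary, which is why a sparse representation is the natural choice.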
- numpy (used for the nice sparse matrices it provides)
- flask (for a planned standalone web interface)
TBD