This library is currently being cleaned up for public consumption. At the moment it is not very usable.

wordsalad

A small Python module for generating nonsense texts from a source text.

After tokenising the input, it builds a Markov chain, then picks a path through the chain at random and concatenates the visited words.

General use case

A corpus is tokenised, either with the provided tools or with custom ones. The Markov chain abstraction is largely type agnostic, so groups of words can be entered as well (this often makes the text more plausible).
Using a WordSaladMatrixBuilder and count_follower, we record which word follows which.
When all words have been entered, we get a WordSaladMatrix using build_matrix.
One or more sentences are generated by picking a "start word", then drawing a random number and using it to pick one of that word's followers according to their weights.
The above step is repeated until some stop condition occurs (stopping on "." usually works well).
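The steps above can be sketched in plain Python. This is only an illustration of the technique, not the library's own code: the names tokenise, tally_followers and generate are hypothetical and do not correspond to WordSaladMatrixBuilder, count_follower or build_matrix, whose exact signatures are not shown here. The sketch assumes a simple regex tokeniser and a nested dictionary of follower counts.

```python
import random
import re
from collections import defaultdict


def tokenise(text):
    """Split text into words and standalone punctuation marks (hypothetical helper)."""
    return re.findall(r"\w+|[^\w\s]", text)


def tally_followers(words):
    """Return {word: {follower: count}} for every adjacent pair of words."""
    counts = defaultdict(lambda: defaultdict(int))
    for word, follower in zip(words, words[1:]):
        counts[word][follower] += 1
    return counts


def generate(counts, start_word, stop_word=".", max_words=50, rng=None):
    """Walk the chain from start_word until stop_word, a dead end, or max_words."""
    rng = rng or random.Random()
    sentence = [start_word]
    current = start_word
    while current != stop_word and len(sentence) < max_words:
        followers = counts.get(current)
        if not followers:
            break  # this word was never seen with a follower
        candidates = list(followers)
        weights = [followers[c] for c in candidates]
        # Weighted random choice: frequent followers are picked more often.
        current = rng.choices(candidates, weights=weights, k=1)[0]
        sentence.append(current)
    return sentence


corpus = "the cat sat on the mat. the dog sat on the cat."
counts = tally_followers(tokenise(corpus))
print(" ".join(generate(counts, "the", rng=random.Random(0))))
```

Because the counting step only needs hashable items, the same sketch works if the list passed to tally_followers contains tuples of words instead of single words.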

The library uses the following terminology:

Term	Explanation
corpus	The text material we "train" the Markov chain on.
word	A unit found in the input corpus; it can be a single character, a group of characters, or anything else.
follower	A word (see above) that follows another word.

The WordSaladMatrix class uses a sparse numpy matrix to encode the Markov chains.
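How WordSaladMatrix lays out its data is not described here, so the following is only a rough sketch of the general idea of storing follower counts sparsely. It assumes each distinct word gets an integer index and that only non-zero (row, column) cells are kept in a dictionary, as a pure-Python stand-in for a real sparse matrix type. The names build_index, build_sparse_counts and row are hypothetical.

```python
def build_index(words):
    """Give each distinct word a row/column number, in order of first appearance."""
    index = {}
    for word in words:
        index.setdefault(word, len(index))
    return index


def build_sparse_counts(words):
    """Return (index, cells) where cells[(i, j)] counts how often word j follows word i."""
    index = build_index(words)
    cells = {}
    for word, follower in zip(words, words[1:]):
        key = (index[word], index[follower])
        cells[key] = cells.get(key, 0) + 1
    return index, cells


def row(cells, i):
    """Collect the non-zero entries of row i as {column: count}."""
    return {j: count for (r, j), count in cells.items() if r == i}


words = "a b a c a b".split()
index, cells = build_sparse_counts(words)
print(index)
print(row(cells, index["a"]))
```

Only word pairs that actually occur take up space, which matters because most words in a corpus are never followed by most other words.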

TBD