This library is currently being cleaned up for public consumption; at the moment it is not very usable.

wordsalad

A small Python module for generating nonsense texts from a source text.

It tokenises the input, builds a Markov chain from the tokens, and then generates text by picking a random path through the chain and concatenating the visited words.

General use case

  • A corpus is tokenised, either with the provided tools or with custom ones. The Markov chain abstraction itself is type-agnostic, so groups of words can be entered as well (this often makes the text more plausible).
  • Using a WordSaladMatrixBuilder and count_follower, we record which word follows which.
  • When all words have been entered, we get a WordSaladMatrix using build_matrix.
  • One or more sentences are generated by picking a "start word", drawing a random number, and then picking one of that word's followers according to the followers' weights.
  • The above step is repeated until some stop condition occurs (stopping on "." usually works well).
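The steps above can be sketched in plain Python. This is only an illustration of the idea, not the library's API: `count_followers` and `generate` are hypothetical helpers written for this example, and the whitespace tokeniser is the simplest one that works for the toy corpus.

```python
import random
from collections import Counter, defaultdict


def count_followers(tokens):
    """Count, for every token, how often each other token follows it."""
    followers = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        followers[current][nxt] += 1
    return followers


def generate(followers, start, stop=".", max_len=50, rng=random):
    """Walk the chain from `start`, picking each follower by its weight."""
    out = [start]
    while out[-1] != stop and len(out) < max_len:
        options = followers.get(out[-1])
        if not options:  # dead end: nothing ever followed this token
            break
        words = list(options)
        weights = [options[w] for w in words]
        out.append(rng.choices(words, weights=weights, k=1)[0])
    return out


corpus = "the cat sat . the dog sat . the cat ran ."
tokens = corpus.split()  # naive whitespace tokeniser (an assumption)
table = count_followers(tokens)
sentence = generate(table, "the", rng=random.Random(0))
print(" ".join(sentence))
```

Here the weights are raw follower counts, so "cat" is twice as likely as "dog" after "the". The `max_len` guard is an extra stop condition added for the sketch so that a corpus without "." cannot loop forever.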

Terminology

The library uses the following terminology:

| Term | Explanation |
|------|-------------|
| corpus | The text material we "train" the Markov chains on. |
| word | A unit found in the input corpus. It can be a single character, a group of characters, or any other unit the tokeniser produces. |
| follower | A word (see above) that follows another word. |
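To make the broad meaning of "word" concrete, the sketch below counts followers for three different choices of unit taken from the same text. It assumes only that units are hashable; `count_followers` is a hypothetical helper for this example, not a function from the library.

```python
from collections import Counter, defaultdict


def count_followers(units):
    """Count how often each unit is directly followed by each other unit."""
    followers = defaultdict(Counter)
    for current, nxt in zip(units, units[1:]):
        followers[current][nxt] += 1
    return followers


text = "to be or not to be"

# "Words" as single characters.
chars = list(text)

# "Words" as ordinary whitespace-separated words.
words = text.split()

# "Words" as overlapping pairs of words; tuples are hashable, so they work as units.
pairs = list(zip(words, words[1:]))

char_table = count_followers(chars)
word_table = count_followers(words)
pair_table = count_followers(pairs)

print(word_table["to"])
print(pair_table[("to", "be")])
```

The counting code is identical in all three cases; only the tokenisation step changes.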

Internal details

The WordSaladMatrix class uses a sparse numpy matrix to encode the Markov chains.
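As a rough illustration of how a sparse matrix can encode a Markov chain, the sketch below maps each word to a row/column index and stores only the non-zero follower counts. It uses a plain dictionary keyed by `(row, column)` as a stand-in for a sparse matrix; the actual layout inside WordSaladMatrix is not documented here, and `encode_counts` and `transition_probabilities` are hypothetical names.

```python
def encode_counts(tokens):
    """Return (index, counts): a word -> index map and sparse follower counts."""
    index = {}
    for tok in tokens:
        index.setdefault(tok, len(index))  # assign indices in first-seen order
    counts = {}
    for a, b in zip(tokens, tokens[1:]):
        key = (index[a], index[b])  # row = current word, column = follower
        counts[key] = counts.get(key, 0) + 1
    return index, counts


def transition_probabilities(counts, row):
    """Normalise one row of counts into follower probabilities."""
    entries = {j: c for (i, j), c in counts.items() if i == row}
    total = sum(entries.values())
    return {j: c / total for j, c in entries.items()}


tokens = "a b a c a b".split()
index, counts = encode_counts(tokens)
probs = transition_probabilities(counts, index["a"])
print(index)
print(counts)
print(probs)
```

Only word pairs that actually occur in the corpus take up space, which is the reason a sparse representation suits this kind of data: most words are never followed by most other words.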

Dependencies

  • numpy (used for the nice sparse matrices it provides)
  • flask (for a planned standalone web interface)

Standalone?

TBD