This project provides tools for exploratory phonological comparison between texts in unknown languages and entries in one or more lexicons.
You may see the results of a recent test run of the software for the Voynich Manuscript here.
The initial goal is to investigate whether a particular theory of a possible phonological interpretation of the script in the Voynich manuscript can be used to find possible lexical matches in various machine-readable lexicons.
Stephen Bax in 2014 proposed some phonological values for various Voynich characters, based on identifications of plant and star names in some of the illustrated pages. Derek Vogt has elaborated on this work and proposed a more extensive phonological scheme. In addition, he has analyzed the phonological inventory of the scheme and proposed that the language of the Voynich manuscript is based on some variety of Romani.
At present, the Enochian software tool can take arbitrary lines from the Reed-Landini-Stolfi Interlinear transcription of the Voynich manuscript and encode each word as a sequence of vectors in phonological feature space. It then searches the RomLex lexicon of Romani and the Shabda-Sagara Sanskrit dictionary, using dynamic time warping to look for the closest phonological sequence matches.
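To make the matching step concrete, here is a minimal Python sketch of dynamic time warping over sequences of feature vectors. It is an illustration only and not the Enochian implementation: the function names (`dtw_distance`, `closest_matches`), the Euclidean segment distance, and the three-feature toy lexicon are all assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b, dist=euclidean):
    """Classic dynamic time warping cost between two vector sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i and first j segments
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat segment of seq_b
                                 cost[i][j - 1],      # repeat segment of seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def closest_matches(query, lexicon, limit=3):
    """Rank lexicon entries (word -> vector sequence) by DTW cost."""
    scored = sorted((dtw_distance(query, seq), word)
                    for word, seq in lexicon.items())
    return [(word, cost) for cost, word in scored[:limit]]

# Hypothetical toy data: two segment types in a 3-feature space.
A = (1, -1, 0)
B = (-1, 1, 0)
TOY_LEXICON = {"aba": [A, B, A], "ab": [A, B], "bb": [B, B]}

query = [A, A, B]  # a "defective" spelling of "ab" with a doubled segment
matches = closest_matches(query, TOY_LEXICON)
```

In this toy case the entry "ab" ranks first with a cost of 0, because warping lets the doubled first segment of the query align with the single first segment of the entry.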
You can see a sample of this kind of flow in the voynich.json flow configuration. This flow reads the RomLex lexicon and the specified lines of the Voynich transcription and produces an HTML file containing a report on the possible phonological matches.
Current results are inconclusive. Possible matches for words meaning "sun", "moon", "house", and "sky" appear on the first page of the Voynich manuscript. These are suggestive of astrological content, but much more work needs to be done.
The RomLex lexicon has fewer than 30,000 entries, many of which are duplicates because the lexicon combines data from multiple Romani dialects. On its own, it therefore does not provide very conclusive results.
The Shabda-Sagara dictionary also has fewer than 30,000 entries.
At the most general level, the Enochian library provides a system for configuring and running "flows" of arbitrary data transformations. This is implemented by the Flow class, which contains a FlowContainer holding any number of FlowStep objects (which can themselves be containers).
When you iterate over the enumerable returned by FlowStep.GetOutputs(), each step will grab an output from its previous sibling and call its Process() method on it, returning the resulting output. If you implement only FlowStep.Process(), or if you implement FlowStep.GetOutputs() using yield return, the flow will run lazily (on demand) rather than all at once: it will only process as many items as are needed to return one output.
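The pull-based behaviour described above can be sketched with Python generators, which play the same role here as yield return. This is a rough analogy, not the library's API: the class names `Source` and `Upper`, the `read_count` counter, and the sample strings are hypothetical and exist only to show that one requested output triggers exactly one upstream read.

```python
class FlowStep:
    """Hypothetical base step: pulls items from the previous step on demand."""

    def __init__(self, previous=None):
        self.previous = previous

    def process(self, item):
        # Default transformation: pass the item through unchanged.
        return item

    def get_outputs(self):
        # Generator: nothing runs until the caller asks for the next output.
        for item in self.previous.get_outputs():
            yield self.process(item)


class Source(FlowStep):
    """Hypothetical first step that counts how many items were actually read."""

    def __init__(self, items):
        super().__init__()
        self.items = items
        self.read_count = 0

    def get_outputs(self):
        for item in self.items:
            self.read_count += 1
            yield item


class Upper(FlowStep):
    """Hypothetical step that only overrides process()."""

    def process(self, item):
        return item.upper()


source = Source(["alpha", "beta", "gamma"])
flow = Upper(source)

# Ask for a single output; only one input item is consumed upstream.
first = next(flow.get_outputs())
```

After this runs, `first` is "ALPHA" and `source.read_count` is 1: the remaining two items are never touched unless the caller keeps iterating.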
In order to do phonological analysis, the Enochian library provides a way to specify a phonological feature set (see features.json for an example using a pretty standard set of phonological features). The FeatureSet class is used to load and use these feature sets.
You can also define text "encodings". These take input strings in Unicode and produce sequences of vectors in the multi-dimensional space defined by the phonological feature set. A single phonological segment consists of an N-dimensional vector, where N is the number of features in your feature set. If a particular feature has a + value for that segment, its corresponding vector element will be 1; if it has a - value, its vector element will be -1. If the feature is unspecified, its vector element will be 0.
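As a small illustration of this mapping, the Python sketch below turns a string into a sequence of feature vectors. The three feature names, the symbol table, and the function names are invented for the example and do not reflect the contents of features.json or the FeatureSet class.

```python
# Hypothetical 3-feature set; a real feature set would be much larger.
FEATURES = ["consonantal", "voiced", "nasal"]

# Hypothetical symbol table: "+" or "-" per feature, omitted = unspecified.
SYMBOLS = {
    "a": {"consonantal": "-", "voiced": "+"},
    "m": {"consonantal": "+", "voiced": "+", "nasal": "+"},
    "t": {"consonantal": "+", "voiced": "-", "nasal": "-"},
}

VALUE = {"+": 1, "-": -1}


def encode_segment(symbol):
    """Return an N-dimensional vector, one element per feature in FEATURES."""
    spec = SYMBOLS[symbol]
    # Unspecified features are missing from spec and fall back to 0.
    return [VALUE.get(spec.get(name), 0) for name in FEATURES]


def encode_word(word):
    """Encode each character of the word as one segment vector."""
    return [encode_segment(ch) for ch in word]


vectors = encode_word("mat")
```

Here `vectors` is `[[1, 1, 1], [-1, 1, 0], [1, -1, -1]]`. The 0 in the second vector comes from "a" leaving the nasal feature unspecified in this toy table.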
The system includes several lexicons:
The CMU dictionary is used to test the underlying assumption behind the project: that we can find slightly dissimilar phonological sequences in a lexicon by means of dynamic time warping. The english_test.json file contains a sample flow that compares a defective encoding of English text with the CMU dictionary to produce matches for English words. Running this flow demonstrates that the process is capable of finding many such valid matches.
RomLex is a dictionary of words in various Romani dialects. The database is only available via the web, so there is a project, RomlexScraper, that scrapes the web interface to assemble a complete version of the lexicon.
The Shabda-Sagara is a 19th-century dictionary of classical Sanskrit.