syllabification

A GRU-based neural network with Inception modules and an optional Linear Chain CRF, used to split words into syllables.

Model architecture

Tokenized data is passed into an Embedding layer and then into two 'stems'. The first stem is a stack of 3 bidirectional GRU layers (2 x 256 = 512 units each). The second stem is a 1D implementation of the Inception v2 module architecture (1D because it is applied to sequence data, not images). The stem outputs are concatenated and passed through two TimeDistributed layers, after which GlobalMaxPool1D is applied. Finally, a Dense layer with 15 units outputs a binary string that predicts the syllable breaks in the input data.

Tanh is the activation function used in the Inception module layers, and ReLU is applied to the TimeDistributed layers. To combat overfitting, L2 regularisation is used throughout the GRU and Inception stems, along with dropout of 0.1 in the GRU layers. Work is ongoing on tuning the hyperparameters and experimenting with novel architectures that may lessen the need for this regularisation.
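The description above can be sketched in Keras roughly as follows. This is a minimal illustration, not the code in model.py: the vocabulary size, embedding width, filter counts, TimeDistributed widths, L2 strength, padded length of 15, sigmoid output and the exact branch layout of the Inception block are all assumptions. Only the 3 bidirectional GRU layers with 256 units per direction, the dropout of 0.1, the tanh and ReLU activations and the 15 output units come from the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

MAX_LEN = 15      # assumption: words are padded to the 15 output positions
VOCAB_SIZE = 40   # assumption: number of character tokens
EMBED_DIM = 64    # assumption
L2 = regularizers.l2(1e-4)  # assumption: regularisation strength


def conv(x, filters, width):
    """1D convolution with tanh activation and L2 regularisation."""
    return layers.Conv1D(filters, width, padding="same", activation="tanh",
                         kernel_regularizer=L2)(x)


def inception_1d(x, filters=32):
    """Simplified 1D Inception-v2-style block (assumed layout)."""
    b1 = conv(x, filters, 1)
    b3 = conv(conv(x, filters, 1), filters, 3)
    # Two stacked width-3 convolutions stand in for one width-5 convolution.
    b5 = conv(conv(conv(x, filters, 1), filters, 3), filters, 3)
    pooled = layers.MaxPooling1D(3, strides=1, padding="same")(x)
    bp = conv(pooled, filters, 1)
    return layers.Concatenate()([b1, b3, b5, bp])


def build_model():
    inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
    embedded = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)

    # Stem 1: three bidirectional GRU layers, 2 x 256 = 512 features each.
    gru = embedded
    for _ in range(3):
        gru = layers.Bidirectional(
            layers.GRU(256, return_sequences=True, dropout=0.1,
                       kernel_regularizer=L2))(gru)

    # Stem 2: 1D Inception-style block.
    inception = inception_1d(embedded)

    merged = layers.Concatenate()([gru, inception])
    hidden = layers.TimeDistributed(layers.Dense(128, activation="relu"))(merged)
    hidden = layers.TimeDistributed(layers.Dense(64, activation="relu"))(hidden)
    pooled = layers.GlobalMaxPool1D()(hidden)
    # Assumption: sigmoid, so each of the 15 outputs is an independent 0/1 score.
    outputs = layers.Dense(MAX_LEN, activation="sigmoid")(pooled)
    return tf.keras.Model(inputs, outputs)
```

Calling `build_model().summary()` prints the layer shapes, which is a quick way to check that both stems keep the sequence length before they are concatenated.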

The form of the model containing the Linear Chain CRF can be found in /notebooks/CRF_Syllable_Experimentation.ipynb.
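For readers unfamiliar with what a linear-chain CRF adds, the sketch below shows Viterbi decoding, the usual way such a layer picks its output sequence: per-position scores are combined with tag-to-tag transition scores, so the best sequence is chosen jointly rather than position by position. It is a generic pure-Python illustration with made-up scores, not code taken from the notebook.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence.

    emissions[t][k]   score of tag k at position t
    transitions[j][k] score of moving from tag j to tag k
    """
    n_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for k in range(n_tags):
            # Best previous tag for ending in tag k at this position.
            best_j = max(range(n_tags),
                         key=lambda j: scores[j] + transitions[j][k])
            new_scores.append(scores[best_j] + transitions[best_j][k] + emit[k])
            pointers.append(best_j)
        scores = new_scores
        backpointers.append(pointers)

    # Follow the back-pointers from the best final tag.
    tag = max(range(n_tags), key=lambda k: scores[k])
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


# Hypothetical scores for a 4-character word; tag 1 means "break after here".
emissions = [[1.0, 0.0], [0.0, 2.0], [0.0, 1.5], [1.0, 0.0]]
# Hypothetical transitions that penalise two breaks in a row.
transitions = [[0.0, 0.0], [0.0, -5.0]]

greedy = [max(range(2), key=lambda k: row[k]) for row in emissions]
decoded = viterbi(emissions, transitions)
print(greedy)   # [0, 1, 1, 0]
print(decoded)  # [0, 1, 0, 0]
```

With these example scores, picking each position independently yields two adjacent breaks, while the jointly decoded sequence keeps only the stronger one.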

Data format

Data is stored in a text file (/dataset/preprocessed.txt) with each line in the form word,binary:

python,010000

This is a compact representation of the syllable breaks in the word: a 1 marks a character that is followed by a syllable break. It allows the problem of syllabification to be framed as a multi-label classification task.

p y - t h o n
0 1   0 0 0 0
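The format above can be read and written with a few small helpers. The following is a sketch and not the code in dataloader.py: the function names are hypothetical, and padding the labels with zeros to 15 positions is an assumption made to match the 15-unit output layer.

```python
def parse_line(line):
    """Split a 'word,binary' line into the word and a list of 0/1 labels."""
    word, bits = line.strip().rsplit(",", 1)
    if len(word) != len(bits) or set(bits) - {"0", "1"}:
        raise ValueError(f"malformed line: {line!r}")
    return word, [int(b) for b in bits]


def to_hyphenated(word, labels, sep="-"):
    """Insert sep after every character whose label is 1."""
    pieces = []
    for char, label in zip(word, labels):
        pieces.append(char)
        if label == 1:
            pieces.append(sep)
    return "".join(pieces)


def from_hyphenated(hyphenated, sep="-"):
    """Inverse of to_hyphenated: recover the word and its labels."""
    syllables = [s for s in hyphenated.split(sep) if s]
    word = "".join(syllables)
    labels = [0] * len(word)
    position = 0
    for syllable in syllables[:-1]:
        position += len(syllable)
        labels[position - 1] = 1  # break after the syllable's last character
    return word, labels


def pad_labels(labels, length=15):
    """Assumption: labels are right-padded with zeros to a fixed length."""
    if len(labels) > length:
        raise ValueError("word is longer than the fixed output length")
    return labels + [0] * (length - len(labels))


word, labels = parse_line("python,010000")
print(to_hyphenated(word, labels))   # py-thon
print(from_hyphenated("py-thon"))    # ('python', [0, 1, 0, 0, 0, 0])
```

Keeping one label per character is what makes the target a fixed-size 0/1 vector, which in turn is why the task can be treated as multi-label classification.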

Statistics

• Peak validation binary accuracy of 98.55% on the Moby Hyphenator II dataset.
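Binary accuracy is usually computed per label position, not per word: each predicted score is thresholded and compared with the corresponding 0/1 label, and the matches are averaged. The sketch below assumes a threshold of 0.5 and uses made-up predictions. It also computes a stricter whole-word accuracy for comparison, a metric the figure above does not report.

```python
def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of individual label positions predicted correctly."""
    correct = total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for truth, score in zip(true_row, pred_row):
            correct += int((score > threshold) == bool(truth))
            total += 1
    return correct / total


def word_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of words whose every position is predicted correctly."""
    exact = 0
    for true_row, pred_row in zip(y_true, y_pred):
        predicted = [int(score > threshold) for score in pred_row]
        exact += int(predicted == list(true_row))
    return exact / len(y_true)


# Hypothetical labels and model scores for two six-letter words.
y_true = [[0, 1, 0, 0, 0, 0],
          [0, 0, 1, 0, 0, 0]]
y_pred = [[0.1, 0.9, 0.2, 0.1, 0.0, 0.3],
          [0.2, 0.6, 0.7, 0.1, 0.1, 0.0]]

print(binary_accuracy(y_true, y_pred))  # 11 of 12 positions correct
print(word_accuracy(y_true, y_pred))    # 1 of 2 words fully correct
```

Because most positions in a word are 0, per-position accuracy tends to be higher than whole-word accuracy, which is worth keeping in mind when comparing results.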

josephjojoe/syllabification
