# A Vietnamese Word Segmentation program using Neural Network

## Usage

- Download the source code
- Change to the `src` folder
- Put a text file to segment into the `src` folder; here I use `input.txt`
- Run `python3 vietseg.py input.txt output.txt` (yeah, this program uses Python 3, and Python 2 won't work with it; you can fix this, of course)
- Now you've got `output.txt`, which is the segmented text (hypothetical example below)
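For example, assuming the common convention of joining the syllables of a multi-syllable word with underscores (this is a hypothetical sample, not actual program output; check your own `output.txt` for the exact convention):

```
input.txt:  Tôi là sinh viên đại học
output.txt: Tôi là sinh_viên đại_học
```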
## Performance

Precision, Recall, and F1-measure on the same data as described in this paper:
    RESULT:
    ===================
    Run 0: P = 0.9156, R = 0.9294, F = 0.9225
    Run 1: P = 0.9015, R = 0.9183, F = 0.9099
    Run 2: P = 0.9189, R = 0.9327, F = 0.9258
    Run 3: P = 0.9208, R = 0.9339, F = 0.9273
    Run 4: P = 0.9166, R = 0.9295, F = 0.9230
    ===================
    Avg. P = 0.9147, R = 0.9288, F = 0.9217
And here is the best performance reported in the paper (in percent):

    P = 94.00, R = 94.45, F = 94.23

The program uses some random shuffling, so your results may not be exactly the same as mine.
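For reference, the F1 score here is the standard harmonic mean of precision and recall. A quick sanity check of the numbers above (the last digit may differ slightly because the printed P and R are already rounded):

```python
# Sanity-check the scores above: F1 is the harmonic mean of P and R,
# F = 2 * P * R / (P + R). Small last-digit differences are expected
# because the P and R shown are already rounded to four decimals.
runs = [
    (0.9156, 0.9294),
    (0.9015, 0.9183),
    (0.9189, 0.9327),
    (0.9208, 0.9339),
    (0.9166, 0.9295),
]

for i, (p, r) in enumerate(runs):
    print(f"Run {i}: F = {2 * p * r / (p + r):.4f}")

# Averages over the five runs
avg_p = sum(p for p, _ in runs) / len(runs)
avg_r = sum(r for _, r in runs) / len(runs)
print(f"Avg. P = {avg_p:.4f}, R = {avg_r:.4f}")
```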
## Training

- Get the data (see links below) and put it in the `dat` folder
- Change the working directory to the `src` folder
- Run `python3 word2vec.py` to get vectorized words for our segmenting model using the Word2Vec library (Word2Vec itself is a neural network); see the sketch after this list
- Run `python3 learn.py` to actually train the segmenting model
- Run `python3 performace.py` to examine the performance of the model
- Now you can use `python3 vietseg.py <input file> <output file>` as described above
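As a rough illustration of the `word2vec.py` step, here is a minimal sketch of training word vectors on the corpus. The use of gensim, the file paths, and every parameter below are assumptions for illustration; the real script may differ:

```python
# A minimal sketch, NOT the actual word2vec.py: learn a dense vector for
# each word in the raw corpus, so the segmenter can consume words as
# fixed-size numeric inputs. gensim (4.x API), all paths, and all
# hyperparameters here are assumptions for illustration.
from gensim.models import Word2Vec

# VNESEcorpus.txt is plain text; treat each line as a whitespace-tokenized sentence
with open("../dat/VNESEcorpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the word vectors (assumed)
    window=5,         # context window size
    min_count=5,      # drop very rare words
    workers=4,
)
model.save("../dat/words.w2v")  # hypothetical output path

print(model.wv["sinh"][:5])  # each word now maps to a 100-d vector
```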
## Data

Vietnamese corpus:

- File: `VNESEcorpus.txt`
- Move the file to the `dat` folder

Vietnamese IOB training data (see the format sketch after this list):

- File: `trainingdata.tar.gz`
- Untar it and put the 10 files (`test1.iob2` -> `test5.iob2` and `train1.iob2` -> `train5.iob2`) into the `dat` folder, along with `VNESEcorpus.txt`
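For orientation, `.iob2` files follow the IOB2 tagging scheme commonly used for word segmentation: one syllable per line with a tag marking whether it begins (`B_W`) or continues (`I_W`) a word, and blank lines between sentences. The exact tag names and separator here are assumptions based on that convention; check the files themselves. A minimal reader could look like:

```python
# A minimal sketch of loading an IOB2 file into (syllable, tag) sentences.
# The B_W / I_W tag names and whitespace separator are assumptions based on
# the usual IOB2 convention for word segmentation; inspect the real files.
def read_iob2(path):
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # blank line marks a sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            syllable, tag = line.split()[:2]
            current.append((syllable, tag))
    if current:  # flush the last sentence if the file lacks a trailing blank line
        sentences.append(current)
    return sentences

sents = read_iob2("../dat/train1.iob2")
print(sents[0])  # e.g. [('Tôi', 'B_W'), ('là', 'B_W'), ('sinh', 'B_W'), ('viên', 'I_W')]
```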
## TODO

- Speed up the network
- Use a professional deep learning package (Theano, Caffe, etc.)
- Train the model with a bigger corpus and training data files, like these
- Deal with uppercase characters
- Build a web app
## Credits

This program uses some code from wendykan and mnielsen. View the source code for details.
## Other Vietnamese word segmentation programs

- JVnSegmenter: Java
- vnTokenizer: Java
- Dongdu: C++
- Roy_VnTokenizer: Python
- VLSP: PHP?
> sophisticated algorithm ≤ simple learning algorithm + good training data