Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer #2155
Conversation
Hi @trungtv, thanks for your pull request! 👍 It looks like you haven't filled in the spaCy Contributor Agreement (SCA) yet. The agreement ensures that we can use your contribution across the project. Once you've filled in the template, put it in the `.github/contributors` folder. If you've already included the Contributor Agreement in your pull request above, you can ignore this message.
Thanks! Looks great! Actually, in v2.1 we should be able to supply nice Vietnamese support in spaCy itself as well. If you check out the develop branch and download the Universal Dependencies corpora, you should be able to do:

`spacy ud-train /path/to/ud-treebanks-conll2017 /tmp/parses /path/to/config.json UD_Vietnamese`

An example config.json is:
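(The original JSON block didn't survive the page extraction; below is a minimal sketch of what such a config might contain. Only `batch_size` is confirmed by the note that follows; `nr_epoch` and `dropout` are assumptions for illustration.)

```json
{
    "nr_epoch": 30,
    "batch_size": 1000,
    "dropout": 0.2
}
```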
(Note that batch size is in number of words, not documents or sentences!) I've attached an example accuracy log below. The parser on the develop branch (which will be in v2.1) learns to assign the label […]. I'm really pleased with how accuracy is coming along for Vietnamese, so I'm very glad you submitted this: with the pyvi option for segmentation and the language data, we should be able to offer much better support for Vietnamese than other tools. I'm also hoping that the joint segmentation and parsing strategy can allow some semi-supervised learning. A word segmenter like pyvi may make different errors than the parser; if so, we may be able to find a way to partially train the parser on its output.

```
Train and evaluate UD_Vietnamese using lang vi
Epoch   Loss    LAS   UAS   TAG   SENT  WORD
    0  3189.4   7.7  11.3  46.5  85.1  60.5
    1  2242.0  18.5  22.7  60.2  90.4  69.2
    2  1969.0  28.4  33.6  68.7  92.9  77.8
    3  1491.5  34.1  39.9  72.2  94.2  81.8
    4  1293.8  37.8  43.8  74.7  94.0  84.3
    5  1102.9  39.5  45.2  75.2  94.6  85.1
    6  1068.9  41.6  47.1  75.4  95.0  85.5
    7   930.8  43.1  48.8  76.0  94.6  86.2
    8   858.3  43.9  49.6  76.4  95.6  86.7
    9   751.6  44.8  50.5  76.8  95.4  86.8
   10   700.6  43.7  49.2  76.4  96.1  86.6
   11   688.9  44.7  50.3  76.6  96.1  87.0
   12   543.6  44.8  50.4  76.5  95.7  86.9
   13   471.9  45.1  50.6  76.7  96.1  87.1
   14   420.6  45.6  51.2  76.6  95.9  87.4
   15   398.9  46.2  51.5  76.5  95.9  87.2
   16   374.8  46.0  51.3  76.5  95.6  87.2
   17   338.5  46.0  51.5  76.5  95.7  87.3
   18   307.7  45.7  51.1  76.4  95.9  87.0
   19   260.2  46.0  51.4  76.6  95.7  87.2
   20   275.7  46.1  51.5  76.4  95.9  87.0
   21   308.0  46.1  51.5  76.6  95.6  87.4
   22   231.5  46.4  52.0  76.8  95.3  87.7
   23   216.4  46.3  52.0  76.8  95.1  87.6
   24   183.5  46.1  51.9  76.8  95.6  87.7
   25   157.0  46.2  52.1  76.9  95.7  87.7
   26   182.6  46.1  52.0  76.8  95.2  87.7
```
Wow, the future looks bright for our Vietnamese NLP community. Until now, we haven't had a robust, industrial-strength NLP toolkit for Vietnamese. Please don't hesitate to guide us toward this meaningful milestone. Talking about manpower, other people from our (research/applied) Vietnamese NLP community, and also from underthesea, have already agreed to lend a hand.
What's your PyVI model trained on? Is the token F1 score the same metric as the UD word segmentation evaluation? At first I wondered whether the difference was that the CoNLL 2017 scoring script was evaluating whole words, while you're evaluating the surface tokens. But the CoNLL script outputs the same score for "word" and "token" F1, so I'm a bit confused. The CoNLL 2017 evaluation showed quite low word F1 for all participants. Our current score of 88% would put us at the top of the pack, but it's much below the F1 you're reporting. Is it a different data set, or is it just that none of the participants took Vietnamese-specific measures that easily fix the accuracy problems? There was a team from Facebook who submitted a CRF-based system in CoNLL 2017. Their score on Vietnamese was very low, while they scored first in Chinese. So it does seem possible they made one more mistake on Vietnamese than everyone else, and you've done one fewer :). Or it could just be the datasets...
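For reference, here is a hedged sketch of how a span-based word segmentation F1 can be computed, to make the metric concrete. This is an illustration only, not the official CoNLL 2017 `conll17_ud_eval.py` script; the example words are made up.

```python
def spans(tokens):
    """Map a token list to a set of (start, end) character spans,
    stripping internal spaces so that segmentations which group
    syllables differently remain comparable."""
    out, start = set(), 0
    for tok in tokens:
        tok = tok.replace(" ", "")
        out.add((start, start + len(tok)))
        start += len(tok)
    return out

def segmentation_f1(gold_tokens, pred_tokens):
    """F1 over exactly-matching token spans."""
    gold, pred = spans(gold_tokens), spans(pred_tokens)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Gold keeps the compound as one word; the prediction splits it.
print(segmentation_f1(["Hà Nội", "đẹp"], ["Hà", "Nội", "đẹp"]))  # 0.4
```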
Merging this as I'm keen to start playing with the model!
My model was trained on a different dataset (the Vietnamese treebank). Also, it's common practice in our Vietnamese NLP community to join compound words with '_' after tokenization. For example, "Bách Khoa Hà Nội" (en: Hanoi University of Science and Technology) would become "Bách_Khoa Hà_Nội". We can then use the tokenized text, with words separated by spaces as in other languages, directly for word2vec training in gensim, as sketched below.
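A minimal sketch of that workflow, assuming pyvi's `ViTokenizer.tokenize` and a recent gensim; the sample sentence and its exact segmentation are illustrative.

```python
# pyvi joins the syllables of compound words with underscores, so each
# whitespace-separated item becomes a single vocabulary entry for word2vec.
from pyvi import ViTokenizer
from gensim.models import Word2Vec

raw = "Bách Khoa Hà Nội là trường đại học"      # toy example sentence
tokenized = ViTokenizer.tokenize(raw)            # e.g. "Bách_Khoa Hà_Nội là trường đại_học"

# One sentence per list entry; compounds like "Hà_Nội" are now single tokens.
sentences = [tokenized.split()]

# The parameter is `vector_size` in gensim >= 4 (`size` in older versions).
model = Word2Vec(sentences, vector_size=100, min_count=1)
print(model.wv.index_to_key)                     # each compound is one vocabulary entry
```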
In spaCy the token text itself can contain spaces. Either way, it will be easy to output a Gensim-friendly version by replacing the spaces within each token, as in the sketch below. Would it be possible to license the Vietnamese treebank for commercial purposes? I know this is often tricky when multiple institutions are involved.
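A hedged sketch of that conversion, assuming the Vietnamese pipeline keeps multi-syllable words as single tokens with internal spaces; the pipeline construction and sample text are illustrative.

```python
from spacy.lang.vi import Vietnamese

nlp = Vietnamese()  # requires the pyvi package for word segmentation
doc = nlp("Bách Khoa Hà Nội")

# Replace internal spaces with underscores to get one vocabulary unit per word.
gensim_friendly = " ".join(token.text.replace(" ", "_") for token in doc)
print(gensim_friendly)  # e.g. "Bách_Khoa Hà_Nội"
```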
You're right. It'd be better to have the underscored version in the token text.
Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer
Description
Please take a look at my compiled model: vi_core_news_md (a usage sketch follows).
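A hedged usage sketch, assuming the model has been installed as a pip package under that name:

```python
import spacy

nlp = spacy.load("vi_core_news_md")
doc = nlp("Bách Khoa Hà Nội là trường đại học")
for token in doc:
    print(token.text, token.pos_, token.dep_)
```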
Types of change
This is a new feature: adding support for Vietnamese to spaCy.
Checklist