Trainable Tokenizer #2220
mosynaq started this conversation in New Features & Project Ideas
Hi there! I am trying to train Persian models for spaCy and have made a lot of progress so far, but the tokenization part has some deficiencies: it cannot recognize clitics and split them off, and since those clitics attach to a large number of words, the phenomenon cannot be described with regular expressions. CoNLL-U treebanks provide a good deal of information about tokens and their boundaries. See below for a good example.
The above sample as a picture:
The sentence literally means "his/her-face object-marker I-saw", i.e. "I saw his/her face".
In the first column you can see "1-2" with the surface token, and right after it you can see that token properly split into its parts. I believe this treebank and others like it are a great source of information for a trainable tokenizer.
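To illustrate the idea, here is a minimal sketch (plain Python, not part of spaCy) of how the multi-word token lines in a CoNLL-U file could be turned into (surface form, sub-tokens) pairs. The embedded sample is my reconstruction of the glossed example above (صورتش را دیدم); every column except ID and FORM is a placeholder rather than real treebank annotation.

```python
# Minimal sketch: extract (surface_form, sub_tokens) pairs from the
# multi-word token lines (ID ranges like "1-2") of a CoNLL-U file.
# The sample approximates the screenshot; all columns except ID and FORM
# are placeholders, not real treebank annotations.
from typing import List, Tuple

CONLLU_SAMPLE = "\n".join([
    "1-2\tصورتش\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tصورت\t_\tNOUN\t_\t_\t_\t_\t_\t_",
    "2\tش\t_\tPRON\t_\t_\t_\t_\t_\t_",
    "3\tرا\t_\tADP\t_\t_\t_\t_\t_\t_",
    "4\tدیدم\t_\tVERB\t_\t_\t_\t_\t_\t_",
])

def mwt_pairs(conllu: str) -> List[Tuple[str, List[str]]]:
    """Return (surface_form, sub_tokens) for every multi-word token range."""
    rows = [line.split("\t") for line in conllu.splitlines()
            if line and not line.startswith("#")]
    pairs = []
    for row in rows:
        if "-" in row[0]:  # e.g. "1-2" marks the untokenized surface form
            start, end = (int(i) for i in row[0].split("-"))
            subs = [r[1] for r in rows
                    if r[0].isdigit() and start <= int(r[0]) <= end]
            pairs.append((row[1], subs))
    return pairs

print(mwt_pairs(CONLLU_SAMPLE))
# [('صورتش', ['صورت', 'ش'])]
```

Pairs like these could supply gold segmentations for whatever statistical tokenizer ends up being trained, instead of trying to enumerate clitic attachments with hand-written rules.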
Questions
p.s.
Your Environment