Labels: enhancement, help wanted, lang / de, lang / fr
Description
Feature request: include tokenization into syntactic words in the default models for non-English languages.
This is necessary for parsing Universal Dependencies: http://universaldependencies.org/format.html#words-tokens-and-empty-nodes, http://universaldependencies.org/conll17/evaluation.html
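In the CoNLL-U format, such contractions are represented as multi-word tokens: a range ID line carries the surface form, followed by one line per syntactic word. A minimal hand-written illustration for "zur Schule" (the annotation values are only indicative, not copied from an actual treebank):

```
1-2	zur	_	_	_	_	_	_	_	_
1	zu	zu	ADP	_	_	3	case	_	_
2	der	der	DET	_	_	3	det	_	_
3	Schule	Schule	NOUN	_	_	0	root	_	_
```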
As expected, spaCy correctly splits English multi-word tokens (which are admittedly already pre-split in the UD data...):
nlp = spacy.load("en")
list(nlp("don't know")) # -> [do, n't, know]
list(nlp("cannot")) # -> [can, not]
However, for German and French, for example, the default models do not split multi-word tokens:
nlp = spacy.load("de_core_news_md")
list(nlp("zur Schule")) # -> [zur, Schule]; expected: [zu, der, Schule]
list(nlp("Jugendtrainer beim Münchner TSV")) # -> [Jugendtrainer, beim, Münchner, TSV]; expected: [Jugendtrainer, bei, dem, Münchner, TSV]
nlp = spacy.load("fr_depvec_web_lg")
list(fr("Gâteau au chocolat")) # -> [Gâteau, au, chocolat]; expected: [Gâteau, à, le, chocolat]
Info about spaCy
- Python version: 3.5.2+
- spaCy version: 1.9.0
- Installed models: en_core_web_md, en, de_core_news_md, de, fr_depvec_web_lg
- Platform: Linux-4.8.4-aufs-1-x86_64-with-debian-stretch-sid