
Multi-word tokens in non-English languages #1460

@danielhers

Description

Feature request: have the default models for non-English languages split multi-word tokens into syntactic words.
This is required for parsing Universal Dependencies: http://universaldependencies.org/format.html#words-tokens-and-empty-nodes, http://universaldependencies.org/conll17/evaluation.html
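For context, the CoNLL-U format linked above represents a contracted surface token with a range ID ("1-2") followed by its syntactic words on separate lines. A schematic example for German "zur Schule" (only the ID, FORM, LEMMA, and UPOS columns are shown; the remaining columns are omitted here):

1-2  zur     _       _
1    zu      zu      ADP
2    der     der     DET
3    Schule  Schule  NOUN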

As expected, spaCy correctly splits English multi-word tokens (which are admittedly already pre-split in the UD data...):

nlp = spacy.load("en")
list(nlp("don't know"))  # -> [do, n't, know]
list(nlp("cannot"))  # -> [can, not]

However, for German and French, for example, the default models do not split multi-word tokens:

nlp = spacy.load("de_core_news_md")
list(nlp("zur Schule"))  # -> [zur, Schule]; expected: [zu, der, Schule]
list(nlp("Jugendtrainer beim Münchner TSV"))  # -> [Jugendtrainer, beim, Münchner, TSV]; expected: [Jugendtrainer, bei, dem, Münchner, TSV]

nlp = spacy.load("fr_depvec_web_lg")
list(fr("Gâteau au chocolat"))  # -> [Gâteau, au, chocolat]; expected: [Gâteau, à, le, chocolat]

Info about spaCy

  • Python version: 3.5.2+
  • spaCy version: 1.9.0
  • Installed models: en_core_web_md, en, de_core_news_md, de, fr_depvec_web_lg
  • Platform: Linux-4.8.4-aufs-1-x86_64-with-debian-stretch-sid

Labels

enhancement (Feature requests and improvements), help wanted (Contributions welcome!), lang / de (German language data and models), lang / fr (French language data and models)
