
Multi-word tokens in non-English languages #1460

@danielhers

Description

Feature request: have the default models for non-English languages split multi-word tokens into syntactic words.
This is required for parsing Universal Dependencies: http://universaldependencies.org/format.html#words-tokens-and-empty-nodes, http://universaldependencies.org/conll17/evaluation.html
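For context, the CoNLL-U format linked above represents a contracted surface token with a range ID ("1-2") followed by its syntactic words on separate lines. A schematic example for German "zur Schule" (only the ID, FORM, LEMMA, and UPOS columns are shown; the remaining columns are omitted here):

1-2  zur     _       _
1    zu      zu      ADP
2    der     der     DET
3    Schule  Schule  NOUN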

As expected, spaCy correctly splits English multi-word tokens (which are admittedly already pre-split in the UD data...):

nlp = spacy.load("en")
list(nlp("don't know"))  # -> [do, n't, know]
list(nlp("cannot"))  # -> [can, not]

However, for German and French, for example, the default models do not split multi-word tokens:

nlp = spacy.load("de_core_news_md")
list(nlp("zur Schule"))  # -> [zur, Schule]; expected: [zu, der, Schule]
list(nlp("Jugendtrainer beim Münchner TSV"))  # -> [Jugendtrainer, beim, Münchner, TSV]; expected: [Jugendtrainer, bei, dem, Münchner, TSV]

nlp = spacy.load("fr_depvec_web_lg")
list(fr("Gâteau au chocolat"))  # -> [Gâteau, au, chocolat]; expected: [Gâteau, à, le, chocolat]

Info about spaCy

  • Python version: 3.5.2+
  • spaCy version: 1.9.0
  • Installed models: en_core_web_md, en, de_core_news_md, de, fr_depvec_web_lg
  • Platform: Linux-4.8.4-aufs-1-x86_64-with-debian-stretch-sid

Labels

enhancement (Feature requests and improvements), help wanted (Contributions welcome!), lang / de (German language data and models), lang / fr (French language data and models)
