---
layout: base
title: 'Tokenization'
udver: '2'
---

Tokenization

The French tokenization follows the universal guidelines: contractions are undone (e.g., au becomes two tokens, à + le). Otherwise, tokenization is based on whitespace and punctuation, except that the symbols - and ' are not split when they occur inside a named entity or a single word (Etats-Unis, sous-marin and aujourd'hui are not split).
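As a minimal sketch of how contractions might be undone, the function below replaces a contracted token with its parts. The names `CONTRACTIONS` and `expand_contractions` are hypothetical, and only the entry au → à + le comes from the text above; the other entries are assumptions added for illustration.

```python
# Hypothetical lookup table. Only "au" -> "à" + "le" is stated in the
# guidelines text; "aux" and "du" are illustrative assumptions.
CONTRACTIONS = {
    "au": ["à", "le"],
    "aux": ["à", "les"],
    "du": ["de", "le"],
}


def expand_contractions(tokens):
    """Return a new token list in which each contraction is split in two.

    Matching is case-insensitive, and the expanded parts are emitted in
    lowercase, so the original capitalization of a contraction is lost.
    """
    expanded = []
    for token in tokens:
        # Tokens that are not in the table are passed through unchanged.
        expanded.extend(CONTRACTIONS.get(token.lower(), [token]))
    return expanded


print(expand_contractions(["Il", "va", "au", "marché"]))
```

A plain lookup table is enough here because each contraction in this sketch maps to a fixed pair of tokens; forms whose analysis depends on context would need more than a dictionary.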

When the symbol - is used between two different syntactic units, the - is kept with the second part (usually a pronoun). Ex: vient-il → vient + -il. The apostrophe (') is kept with the preceding part. Ex: l'école → l' + école and j'arrive → j' + arrive.
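The two splitting rules can be sketched as a small function that works on one whitespace-delimited word at a time. This is an illustration under stated assumptions, not the tokenizer used for the treebank: the pronoun list `CLITICS`, the exception set `UNSPLIT`, and the function name `split_word` are all hypothetical, and forms such as va-t-il or the typographic apostrophe (’) are not handled.

```python
import re

# Hypothetical list of pronouns that can follow a hyphen; the guidelines
# only say that the second part is "usually a pronoun".
CLITICS = ("il", "elle", "on", "ils", "elles", "je", "tu", "nous", "vous", "ce")

# Hypothetical exception list: single words that contain an apostrophe.
UNSPLIT = {"aujourd'hui"}

# A word ending in "-<pronoun>": the hyphen stays with the pronoun.
_HYPHEN_CLITIC = re.compile(r"^(.+?)(-(?:%s))$" % "|".join(CLITICS), re.IGNORECASE)
# An elided form: the apostrophe stays with the preceding part.
_ELISION = re.compile(r"^([^']+')(.+)$")


def split_word(word):
    """Split one word on hyphens and apostrophes as described above."""
    if word.lower() in UNSPLIT:
        return [word]
    match = _HYPHEN_CLITIC.match(word)
    if match:
        # "vient-il" -> "vient" + "-il"; the left part may need more splitting.
        return split_word(match.group(1)) + [match.group(2)]
    match = _ELISION.match(word)
    if match:
        # "l'école" -> "l'" + "école"; the right part may need more splitting.
        return [match.group(1)] + split_word(match.group(2))
    # Hyphenated words such as "sous-marin" fall through and stay whole.
    return [word]


for example in ["vient-il", "l'école", "j'arrive", "sous-marin", "aujourd'hui"]:
    print(example, "->", split_word(example))
```

Restricting the hyphen rule to a closed pronoun list is one way to keep compounds and named entities such as sous-marin and Etats-Unis intact without a full lexicon; words with an internal apostrophe still need an explicit exception list.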