layout	title	udver
base	Tokenization	2

Tokenization

The French tokenization follows the universal guidelines: contractions are undone (e.g., au becomes two tokens à + le). Otherwise, tokenization is based on white space and punctuation. The exceptions are the symbols - and ', which are not split when they occur inside a named entity or a single word (Etats-Unis, sous-marin and aujourd'hui are not split).

When the symbol - is used between two different syntactic units, the - is kept with the second part (usually a pronoun). Ex: vient-il → vient + -il. The apostrophe (') is kept with the preceding part. Ex: l'école → l' + école and j'arrive → j' + arrive.
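
The rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the tokenizer used to build any treebank: the names `CONTRACTIONS`, `UNSPLIT`, `CLITICS`, `split_word` and `tokenize` are hypothetical, the lookup tables are deliberately tiny, and the code assumes the plain ASCII apostrophe. Hyphens are split only before a listed pronoun, which is a simplifying assumption that leaves words such as sous-marin and Etats-Unis intact.

```python
import re

# Hypothetical, incomplete lookup tables (assumptions for illustration only).
CONTRACTIONS = {"au": ["à", "le"], "aux": ["à", "les"]}
UNSPLIT = {"aujourd'hui"}  # single words that keep their apostrophe
CLITICS = {"je", "tu", "il", "elle", "on", "nous", "vous", "ils", "elles", "ce"}


def split_word(word):
    """Split one whitespace-delimited word into tokens."""
    lower = word.lower()
    if lower in CONTRACTIONS:
        # Contractions are undone; the original casing is not preserved here.
        return list(CONTRACTIONS[lower])
    if lower in UNSPLIT:
        return [word]
    if "'" in word:
        head, tail = word.split("'", 1)
        if head and tail:
            # The apostrophe stays with the preceding part: l'école -> l' + école
            return [head + "'"] + split_word(tail)
    if "-" in word:
        head, tail = word.rsplit("-", 1)
        if head and tail.lower() in CLITICS:
            # The hyphen stays with the following pronoun: vient-il -> vient + -il
            return split_word(head) + ["-" + tail]
    return [word]


def tokenize(text):
    """Tokenize on white space and punctuation, then apply the word rules."""
    tokens = []
    # Words may contain ' and -; any other non-space symbol is its own token.
    for chunk in re.findall(r"[\w'-]+|[^\w\s]", text):
        tokens.extend(split_word(chunk))
    return tokens
```

For example, `tokenize("vient-il")` gives `["vient", "-il"]` and `tokenize("l'école")` gives `["l'", "école"]`, while `tokenize("sous-marin")` leaves the word whole. A fuller implementation would need a named-entity and lexicon lookup in place of the small tables used here.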
