Spelled-out numbers #198

Closed
@coltekin

It may be a trivial question, but I could not find a direct answer or example.

In Turkish corpora, I see quite a few numbers that are completely spelled out, for example otuz üç 'thirty three'. It seems natural to relate the parts with mwe. However, I saw (for example in the English UD treebank) that numbers like "three million" are marked using compound. That is not a good option for the example above, since otuz üç does not have a clear internal structure. But then the choice between the two relations becomes arbitrary, since we also see examples like iki yüz 'two hundred', and it gets difficult with iki yüz otuz üç 'two hundred thirty three'.

As I understand it, in the METU-Sabancı treebank these were joined together during tokenization.

I am inclined to mark all of these with mwe in a flat, head-final structure, but I am afraid of losing the parallel with other languages. (The motivation for the head-final structure is the same as the one expressed in #189.)
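
To make the flat, head-final option concrete, here is a minimal Python sketch of the attachments it would produce for iki yüz otuz üç. The helper name `flat_head_final` is hypothetical, the rows are a simplified subset of CoNLL-U columns (ID, FORM, UPOS, HEAD, DEPREL), and attaching the last token to the root is only an assumption for illustration; in a real sentence it would attach to whatever the number modifies.

```python
def flat_head_final(tokens, deprel="mwe", upos="NUM"):
    """Build simplified CoNLL-U-style rows (ID, FORM, UPOS, HEAD, DEPREL).

    Every token except the last is attached directly to the last token
    with `deprel`, which gives a flat, head-final structure.
    """
    head_id = len(tokens)  # the last token is the head of the expression
    rows = []
    for i, form in enumerate(tokens, start=1):
        if i == head_id:
            # Assumption for illustration only: the head attaches to the root.
            head, rel = 0, "root"
        else:
            head, rel = head_id, deprel
        rows.append((i, form, upos, head, rel))
    return rows


rows = flat_head_final("iki yüz otuz üç".split())
for row in rows:
    print("\t".join(str(field) for field in row))
```

With this scheme, otuz üç, iki yüz, and iki yüz otuz üç all get the same shape, so no decision about internal structure is needed. The relation label is left as a parameter, so the same sketch can be used to compare against a compound-based analysis.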
