Description
It may be a trivial question, but I could not find a direct answer or example.
In Turkish corpora, I see quite a few numbers that are completely spelled out, for example otuz üç 'thirty three'. It looks natural to relate the parts with mwe. However, I saw (for example in the English UD treebank) that numbers like "three million" are marked using compound. This is not a good option for the example above, since it does not have a clear structure. But the decision then becomes arbitrary, since we also see examples like iki yüz 'two hundred', and it gets difficult with iki yüz otuz üç 'two hundred thirty three'.
As I understand it, in the METU-Sabancı treebank these were joined together during tokenization.
I am inclined to mark all of them with mwe, using a flat, head-final structure, but I am afraid of losing the parallel with other languages. (The motivation for the head-final structure is the same as the one expressed in #189.)
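To make the proposal concrete, here is a minimal Python sketch of what a flat, head-final analysis would produce for a spelled-out number: every token except the last attaches directly to the last token with the same relation. The function name flat_head_final and the tuple layout are made up for illustration and are not part of any UD tool; the head of the final token is left open, since it depends on the rest of the sentence.

```python
def flat_head_final(tokens, relation="mwe"):
    """Return (id, form, head, deprel) tuples for a flat, head-final analysis.

    Hypothetical helper for illustration only: all tokens but the last
    attach to the last token; the last token's own head is left as None
    because it lies outside the number expression.
    """
    head_id = len(tokens)  # 1-based index of the final token
    arcs = []
    for i, form in enumerate(tokens, start=1):
        if i == head_id:
            # Final token: head of the whole number, attachment decided elsewhere.
            arcs.append((i, form, None, None))
        else:
            # Every other token is a direct dependent of the final token.
            arcs.append((i, form, head_id, relation))
    return arcs


arcs = flat_head_final("iki yüz otuz üç".split())
for token_id, form, head, deprel in arcs:
    # Print CoNLL-U-like columns, with "_" for the unspecified values.
    print(token_id, form, head if head is not None else "_", deprel or "_", sep="\t")
```

With this scheme, otuz üç, iki yüz, and iki yüz otuz üç all receive the same shape of analysis, so no decision about the internal structure of the number is needed.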