What would be the best way to extract tokens that have undergone deletion, epenthesis, or some kind of alternation?
This may be easier for languages with more transparent writing systems (e.g., Korean) than for others. I was wondering whether one can compare a word's underlying representation (which is sometimes extractable from the orthographic transcription) with the phone intervals of that word, and query the following cases:
- epenthesis: phones in the phone-level annotation that are not in the underlying representation at that syllable index (nth syllable), or more simply in word-initial/final position
- deletion: phones that are in the underlying representation at that syllable index but are not in the phone-level annotation
- alternations: phones that are X in the underlying representation at that syllable index but are Y in the phone-level annotation at the same index
e.g., word label: "probably", phone labels: "p-r-o-b-l-i"
Here we want to be able to say that "ba" is deleted.
We also want to query all words in which an unstressed syllable (or perhaps a syllable following a stressed one) is deleted in the pronunciation.
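To make the comparison above concrete, here is a minimal sketch that aligns an underlying phone sequence with the annotated phone sequence using Python's standard `difflib.SequenceMatcher` and labels each mismatch. The function name `compare_phones` and the underlying form `p-r-o-b-a-b-l-i` are assumptions made for illustration only; nothing here is part of the library's API. It compares flat phone positions rather than syllable indices.

```python
from difflib import SequenceMatcher

# Map difflib opcode tags onto the three cases described above.
TAG_TO_PROCESS = {
    "delete": "deletion",      # in the underlying form, missing on the surface
    "insert": "epenthesis",    # on the surface, missing in the underlying form
    "replace": "alternation",  # X underlyingly, Y on the surface
}

def compare_phones(underlying, surface):
    """Return (process, underlying_index, underlying_phones, surface_phones) tuples.

    Hypothetical helper: both arguments are lists of phone labels.
    """
    matcher = SequenceMatcher(a=underlying, b=surface, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        changes.append((TAG_TO_PROCESS[tag], i1, underlying[i1:i2], surface[j1:j2]))
    return changes

# Assumed underlying form for "probably", written in the issue's own notation.
underlying = "p-r-o-b-a-b-l-i".split("-")
surface = "p-r-o-b-l-i".split("-")
print(compare_phones(underlying, surface))
```

Note that the alignment is ambiguous when the deleted material sits next to an identical phone: for this pair the matcher reports the deleted span as `a, b` rather than `b, a`, although both describe the same surface string. Mapping the reported index back to a syllable (and its stress) would need syllable boundaries for the underlying form, which this sketch does not attempt.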
Currently, we would be able to do this word-by-word: for all word tokens "probably", return words that have fewer than 3 syllables.
Maybe if there were a way to encode "predicted syllables from orthography" and compare that with the actual number of syllables derived from the phone labels (using the maximum onset algorithm), we could at least know that something has been deleted or epenthesized?
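A rough sketch of that syllable-count comparison follows. Everything in it is an assumption for illustration: the orthographic prediction is a crude vowel-letter-group count for English, the vowel inventory `VOWEL_PHONES` is a placeholder for the corpus's real phone set, and none of the function names exist in the library. Since only the number of syllables matters here, the sketch counts syllable nuclei, which is the same number that maximum-onset syllabification would yield.

```python
import re

# Placeholder vowel inventory; replace with the corpus's actual vowel phones.
VOWEL_PHONES = {"a", "e", "i", "o", "u"}

def predicted_syllables_english(word):
    """Crude orthographic estimate: count runs of vowel letters (y included)."""
    return len(re.findall(r"[aeiouy]+", word.lower()))

def predicted_syllables_hangul(word):
    """Each precomposed Hangul syllable block (U+AC00..U+D7A3) is one syllable."""
    return sum(1 for ch in word if "\uac00" <= ch <= "\ud7a3")

def surface_syllables(phones):
    """Count nuclei: one syllable per vowel phone in the annotation."""
    return sum(1 for phone in phones if phone in VOWEL_PHONES)

def syllable_mismatch(word, phones, predict=predicted_syllables_english):
    """Negative: fewer syllables than predicted (possible deletion).
    Positive: more syllables than predicted (possible epenthesis)."""
    return surface_syllables(phones) - predict(word)

print(syllable_mismatch("probably", ["p", "r", "o", "b", "l", "i"]))  # -1
```

A nonzero result only signals that the syllable count changed; it does not say which syllable was affected, and the English heuristic will miscount words with silent vowels, so any hits would need manual checking.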