Added missing subtoken information #22

Open
wants to merge 43 commits into
base: dev
from


amir-zeldes

  • Some compound tokens, such as אח in אחיו, were represented by `__`
  • Added word form, lemma, and gender information where relevant
  • Note: I am not sure this follows the guidelines 100%. The singular אוזן for אזניהם seems a little odd to me, but I used this form by analogy to רגל in רגליהם.

  * compound:smixut -> compound
  * nsubj:cop -> nsubj
  * flat:name -> flat
  * case:.* -> case
  * lemma of זו|זאת|אלו|הללו should be זה
  * leaving זהו/זוהי as a separate lemma זהו
  * copula lemmas הוא, זה get pos=PRON
  * remove verbal features VerbType and VerbForm
  * Handling also fixes some deprel cc/det cases
  * Attempted some heuristics for fixing true CCONJ attachment
  * Some broken inverted cc's remain from original HTB conversion (look for cc pointing forward)
  * also catch אלה
  * separate lemma for הללו
  * כדי is SCONJ+mark except עד כדי
  * number lemmas should have the unmarked feminine forms
  * fix lemma מאות
  * Cases like הדקה ה-90
  * Manually corrected corrupt number text
    * Inverted numbers in source text (years like 9891 = 1989)
    * Repeated prepositions instead of numbers in text (ככ = כ20)
  * amod ordinals are ADJ not NUM
  * lemma of שנייה in the ordinal sense is שני
  * lemma of רבבות, אחדים changed to רבבה, אחד
  * add Number feature to alphabetic numerals (Plur for non-one, Sing for one, unspecified for zero)
  * left Arabic numerals without Number, since they are often codes, dates and other underspecified cases
  * By analogy to regular article, which carries definiteness (not the NOUN)
  * by analogy to Definite=Def on definite prepositions
  * תל אביב
  * בני ברק
  * ים המלח
  * הארץ (עיתון)
  * הפועל (ספורט)
  * names in 'emek'
  * no gender/number for PROPN matching existing HTB convention
  * most participles used as VERB get Tense=Pres
  * if they govern aux/cop with lemma היה:
    * check that cop shouldn't be aux for conditionals
    * if they have cop, no Tense value (tense on copula)
    * if they have aux, use Aspect=Prog and tense of aux (Past/Fut)
    * if they have a conditional mark (אם, אילו, לו), remove Tense and add Mood=Irr
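
The participle rules above can be read as a small decision procedure. The following is a minimal sketch of one possible reading, not the script used in this pull request. The `Child` class and the `participle_feats` function are hypothetical names introduced for illustration, and the sketch assumes that the conditional rule is checked last and that `Aspect=Prog` is left in place in the conditional case, which the commit messages do not state.

```python
from dataclasses import dataclass, field

# Conditional markers named in the commit message
CONDITIONAL_MARKS = {"אם", "אילו", "לו"}


@dataclass
class Child:
    """Hypothetical stand-in for a dependent of the participle."""
    deprel: str
    lemma: str
    feats: dict = field(default_factory=dict)


def participle_feats(feats, children):
    """Return updated features for a participle tagged VERB (illustrative only)."""
    out = dict(feats)
    # Dependents attached as aux or cop with lemma היה
    haya = [c for c in children if c.deprel in ("aux", "cop") and c.lemma == "היה"]
    if not haya:
        # Default case: most participles used as VERB get Tense=Pres
        out["Tense"] = "Pres"
        return out
    # With a copula, tense lives on the copula, so the participle has none
    out.pop("Tense", None)
    aux = next((c for c in haya if c.deprel == "aux"), None)
    if aux is not None:
        # With an auxiliary: progressive aspect plus the auxiliary's tense
        out["Aspect"] = "Prog"
        if aux.feats.get("Tense") in ("Past", "Fut"):
            out["Tense"] = aux.feats["Tense"]
    # Assumption: the conditional check runs last and overrides Tense
    if any(c.deprel == "mark" and c.lemma in CONDITIONAL_MARKS for c in children):
        out.pop("Tense", None)
        out["Mood"] = "Irr"
    return out
```

The manual step from the list (checking whether a `cop` should really be an `aux` in conditionals) is deliberately left out, since it is a judgment call rather than a mechanical rule.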
amir-zeldes and others added 15 commits June 26, 2021 08:22
  * nmod:tmod used for year modifiers of dates
  * reattach year to day, not month, in full dates
  * obl:tmod used for adverbially used NPs without a preposition (next week, five minutes...)
  * affixes like תת, טרום, דו, מולטי
  * POS is ADJ if modifying NOUN
  * POS is ADV if modifying ADJ/ADV
  * May contain some errors
  * Should still be better than previous state
  * Note that impersonal modals are tagged VERB, not AUX
  * Recommend manual review @yifatbm @ivrit
  * No Number for numerals except ones that can pluralize
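
The affix rule in the list above (ADJ when modifying a NOUN, ADV when modifying an ADJ or ADV) could be expressed roughly as follows. This is a hedged sketch only: `AFFIXES` contains just the four examples named in the commit message, which says "affixes like", so the real set is probably larger, and `affix_upos` is a hypothetical helper, not a function from this repository.

```python
# Only the examples named in the commit message; the actual list is likely longer
AFFIXES = {"תת", "טרום", "דו", "מולטי"}


def affix_upos(form, head_upos, current_upos):
    """Pick a POS tag for an affix token from the POS of the word it modifies."""
    if form not in AFFIXES:
        return current_upos  # not an affix covered by this sketch
    if head_upos == "NOUN":
        return "ADJ"
    if head_upos in ("ADJ", "ADV"):
        return "ADV"
    # Assumption: other head types are left unchanged for manual review
    return current_upos
```

For example, `affix_upos("תת", "NOUN", "X")` returns `"ADJ"`, while a head tagged `VERB` leaves the existing tag untouched.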