Added missing subtoken information #22
Open: amir-zeldes wants to merge 43 commits into UniversalDependencies:dev from amir-zeldes:dev
* Some compound tokens, such as אח in אחיו, were represented by `__`
* Added word form, lemma, and gender information where relevant

* compound:smixut -> compound
* nsubj:cop -> nsubj
* flat:name -> flat
* case:.* -> case
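
The four relation rewrites above could be applied to a CoNLL-U file roughly as follows. This is a minimal sketch, not the script used in this PR: the names `DEPREL_RULES`, `simplify_deprel`, and `simplify_conllu_line` are hypothetical, and it assumes standard 10-column, tab-separated token lines with DEPREL in the 8th column.

```python
import re

# Hypothetical rule table mirroring the four rewrites listed in the commit note.
DEPREL_RULES = [
    (re.compile(r"^compound:smixut$"), "compound"),
    (re.compile(r"^nsubj:cop$"), "nsubj"),
    (re.compile(r"^flat:name$"), "flat"),
    (re.compile(r"^case:.*$"), "case"),
]


def simplify_deprel(deprel: str) -> str:
    """Return the simplified relation, or the input unchanged if no rule matches."""
    for pattern, replacement in DEPREL_RULES:
        if pattern.match(deprel):
            return replacement
    return deprel


def simplify_conllu_line(line: str) -> str:
    """Rewrite the DEPREL column of one CoNLL-U token line.

    Comment lines, blank lines, and lines without exactly 10 columns
    are returned untouched.
    """
    if not line.strip() or line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        return line
    fields[7] = simplify_deprel(fields[7])
    return "\t".join(fields) + newline
```

Relations not covered by the table (for example `nmod:poss`) pass through unchanged, so the sketch only touches the subtypes named in the commit.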

* lemma of זו|זאת|אלו|הללו should be זה
* leaving זהו/זוהי as a separate lemma זהו

* copula lemmas הוא, זה get pos=PRON
* remove verbal features VerbType and VerbForm

* Handling also fixes some deprel cc/det cases
* Attempted some heuristics for fixing true CCONJ attachment
* Some broken inverted cc's remain from original HTB conversion (look for cc pointing forward)

* also catch אלה
* separate lemma for הללו
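
Read together with the earlier demonstrative commit, the lemma rule after this change might look like the lookup below. This is only a sketch: `DEMONSTRATIVE_LEMMAS` and `demonstrative_lemma` are hypothetical names, and it assumes that "separate lemma for הללו" means הללו keeps its own form as lemma.

```python
# Assumed form-to-lemma table, read off the commit notes.
DEMONSTRATIVE_LEMMAS = {
    "זו": "זה",
    "זאת": "זה",
    "אלו": "זה",
    "אלה": "זה",      # added by the "also catch" note
    "זהו": "זהו",     # left as a separate lemma
    "זוהי": "זהו",
    "הללו": "הללו",   # assumption: the separate lemma is the form itself
}


def demonstrative_lemma(form: str, current_lemma: str) -> str:
    """Return the normalized lemma for a demonstrative, else the existing lemma."""
    return DEMONSTRATIVE_LEMMAS.get(form, current_lemma)
```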

* number lemmas should have the unmarked feminine forms
* fix lemma מאות

* Cases like הדקה ה-90

* Manually corrected corrupt number text
* Inverted numbers in source text (years like 9891 = 1989)
* Repeated prepositions instead of numbers in text (ככ = כ20)
* amod ordinals are ADJ not NUM
* lemma of שנייה in the ordinal sense is שני
* lemma of רבבות, אחדים changed to רבבה, אחד
* add Number feature to alphabetic numerals (Plur for non-one, Sing for one, unspecified for zero)
* left Arabic numerals without Number, since they are often codes, dates and other underspecified cases

* By analogy to regular article, which carries definiteness (not the NOUN)

* by analogy to Definite=Def on definite prepositions

* תל אביב
* בני ברק
* ים המלח
* הארץ (עיתון)
* הפועל (ספורט)

* names in 'emek'
* no gender/number for PROPN matching existing HTB convention

* most participles used as VERB get Tense=Pres
* if they govern aux/cop with lemma היה:
* check that cop shouldn't be aux for conditionals
* if they have cop, no Tense value (tense on copula)
* if they have aux, use Aspect=Prog and tense of aux (Past/Fut)
* if they have a conditional mark (אם, אילו, לו), remove Tense and add Mood=Irr
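
The participle rules above can be sketched as a small decision function. This is an illustration under stated assumptions, not the code from this PR: `Dependent` and `participle_feats` are hypothetical names; the conditional-mark rule is assumed to apply last and regardless of whether a היה dependent is present; and an aux without a Past/Fut tense is assumed to leave the default in place. The manual check "cop shouldn't be aux for conditionals" is not modeled.

```python
from dataclasses import dataclass, field

CONDITIONAL_MARKS = {"אם", "אילו", "לו"}


@dataclass
class Dependent:
    """A simplified view of one dependent of the participle (hypothetical type)."""
    deprel: str
    lemma: str
    feats: dict = field(default_factory=dict)


def participle_feats(dependents: list) -> dict:
    """Return Tense/Aspect/Mood features for a participle tagged VERB."""
    feats = {"Tense": "Pres"}  # default for participles used as VERB

    for dep in dependents:
        if dep.lemma != "היה":
            continue
        if dep.deprel == "cop":
            # Tense is carried by the copula, not the participle.
            feats.pop("Tense", None)
        elif dep.deprel == "aux":
            feats["Aspect"] = "Prog"
            aux_tense = dep.feats.get("Tense")
            if aux_tense in ("Past", "Fut"):
                feats["Tense"] = aux_tense

    # Assumption: the conditional rule is applied after the aux/cop rules.
    if any(d.deprel == "mark" and d.lemma in CONDITIONAL_MARKS for d in dependents):
        feats.pop("Tense", None)
        feats["Mood"] = "Irr"

    return feats
```

For example, a participle with a past-tense היה aux and no conditional mark comes out as `Aspect=Prog|Tense=Past` under these assumptions.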

* nmod:tmod used for year modifiers of dates
* reattach year to day, not month, in full dates
* obl:tmod used for adverbially used NPs without a preposition (next week, five minutes...)

* affixes like תת, טרום, דו, מולטי
* POS is ADJ if modifying NOUN
* POS is ADV if modifying ADJ/ADV
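
The affix rule above amounts to picking the POS from the head's category. A minimal sketch, assuming the hypothetical names `PREFIX_FORMS` and `prefix_upos`, and listing only the four affixes named in the commit note (the actual list is presumably longer):

```python
# Only the examples named in the commit note; treat this set as incomplete.
PREFIX_FORMS = {"תת", "טרום", "דו", "מולטי"}


def prefix_upos(form: str, head_upos: str, current_upos: str) -> str:
    """Choose the UPOS of a prefix-like token from the UPOS of its head."""
    if form not in PREFIX_FORMS:
        return current_upos
    if head_upos == "NOUN":
        return "ADJ"
    if head_upos in ("ADJ", "ADV"):
        return "ADV"
    # Heads of any other category are not covered by the note; leave as-is.
    return current_upos
```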
* No Number for numerals except ones that can pluralize
אוזן for אזניהם is a little odd, but this form was used by analogy to רגל in רגליהם.