layout: base
title: Korean UD
udver: 2

UD for Korean

Tokenization and Word Segmentation

  • The tokenization of the Korean UD treebanks follows the tokenization of the Korean data distributed by the SPMRL 2013 shared task, which is a straightforward whitespace-based tokenization with conventional separation of punctuation.
  • There are no words with spaces.
  • There are currently no multi-word tokens. This may change in the future, as some words have no space between them, and instead of indicating this by SpaceAfter=No in MISC, multi-word tokens may be preferable.
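The tokenization scheme described above can be illustrated with a minimal sketch. It is not the SPMRL 2013 tokenizer itself; the function name `tokenize` and the regular expression are assumptions made for illustration. The sketch splits on whitespace, separates each punctuation mark into its own token, and records `SpaceAfter=No` in a MISC-style value when a token is directly followed by another token.

```python
import re

# Hypothetical helper, not part of any UD tool: a run of word characters
# (Hangul, Latin letters, digits) or a single punctuation character.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Return (form, misc) pairs; misc is "SpaceAfter=No" or "_"."""
    tokens = []
    for match in _TOKEN_RE.finditer(text):
        end = match.end()
        # A token at the end of the text is treated as if a space followed it.
        space_after = end >= len(text) or text[end].isspace()
        tokens.append((match.group(), "_" if space_after else "SpaceAfter=No"))
    return tokens


for form, misc in tokenize("나는 학교에 갔다."):
    print(form, misc, sep="\t")
```

In this sketch the final word and the period become two tokens, and the word carries `SpaceAfter=No`, which is the situation the last bullet suggests might later be handled with multi-word tokens instead.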

Morphology

Lemmas

  • At present, the lemma column in the GSD and Kaist treebanks violates the UD guidelines. Instead of showing a selected surface form as the citation form for the lexeme, it shows the morphemes delimited by the plus (+) character. This should be fixed in a future version, which should provide a real lemma.
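Until real lemmas are provided, consumers of these treebanks may need to split the lemma column into morphemes themselves. The following is a minimal sketch; the function name `split_morphemes` and the sample value `학교+에` are illustrative assumptions, not taken from the treebank data.

```python
def split_morphemes(lemma_field):
    """Split a plus-delimited lemma field into its morphemes."""
    # A token that is itself a plus sign would otherwise split into nothing.
    if lemma_field == "+":
        return ["+"]
    # Drop empty pieces that could arise from a leading or trailing "+".
    return [piece for piece in lemma_field.split("+") if piece]


print(split_morphemes("학교+에"))
```

Note that this only recovers the morpheme sequence. Choosing a citation form from it, as the UD guidelines require, is a separate decision that the sketch does not attempt.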

Tags

  • All 17 universal POS categories are relevant in Korean, including particles (PART). At present, hundreds of word types are tagged PART. This is a legacy of an existing Korean morphological analyzer and many of these words should probably belong to another category in UD; however, the exact list has yet to be worked out.
  • The following words are treated as auxiliaries (AUX):
    • The affirmative copula 이 i (“to be”) is written as a suffix to the nominal predicate but it is treated as a separate auxiliary verb in UD.
    • 않 anh (“to not be”) is the negative copula or an auxiliary in a negative clause.
    • 있 iss (“to be”) is an auxiliary in affirmative clauses.
    • 하 ha (“must, should”) is a necessitative modal auxiliary.
    • 싶 sip (“will, to want”) is a desiderative modal auxiliary.

Features


Instruction: Describe inherent and inflectional features for major word classes (at least NOUN and VERB). Describe other noteworthy features. Include links to language-specific feature definitions if any.


Syntax

Core Arguments, Oblique Arguments and Adjuncts

  • Korean uses a nominative-accusative alignment. Direct objects are marked by the accusative morpheme 을 eul.
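As a small illustration of accusative marking, the sketch below attaches the accusative morpheme to a noun. It relies on one fact the text above does not mention and that is added here from general Korean grammar: 을 eul appears after a syllable with a final consonant, while its variant 를 reul appears after a syllable ending in a vowel. The function names are hypothetical, and the sketch only handles nouns that end in a precomposed Hangul syllable.

```python
HANGUL_BASE = 0xAC00  # first precomposed Hangul syllable (가)
HANGUL_LAST = 0xD7A3  # last precomposed Hangul syllable (힣)
FINALS_PER_BLOCK = 28  # final-consonant slots per syllable, slot 0 = none


def has_final_consonant(syllable):
    """Return True if a precomposed Hangul syllable ends in a consonant."""
    code = ord(syllable)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        raise ValueError("not a precomposed Hangul syllable: %r" % syllable)
    return (code - HANGUL_BASE) % FINALS_PER_BLOCK != 0


def mark_accusative(noun):
    """Attach 을 after a final consonant, 를 after a final vowel."""
    return noun + ("을" if has_final_consonant(noun[-1]) else "를")


print(mark_accusative("책"))    # 책 ends in a consonant
print(mark_accusative("학교"))  # 교 ends in a vowel
```

The reverse direction, detecting objects by checking whether a word ends in 을 or 를, is unreliable because ordinary words can end in those syllables, so the dependency relation in the treebank remains the authoritative signal.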

Relations Overview

Treebanks

There are 3 Korean UD treebanks: