Commit

Update index.md
masayu-a authored Aug 15, 2020
1 parent f2e1516 commit c300639
Showing 1 changed file with 32 additions and 14 deletions.
46 changes: 32 additions & 14 deletions _ja/index.md
@@ -10,16 +10,11 @@ udver: '2'

* Japanese has no obvious word boundaries, so we need a definition of words.
As the word definition for Universal Dependencies (UD), we adopt the short-unit word
(SUW) [1]. SUW is also adopted to tokenize sentences in Balanced corpus of
contemporary written Japanese (BCCWJ) [2] containing more than 60,000 sentences
(SUW) by NINJAL [1,3]. SUW is also used to tokenize sentences in the Balanced Corpus of
Contemporary Written Japanese (BCCWJ) [2], which contains more than 50,000 sentences
in various domains and it has been shown that the SUW definition covers various
language phenomena in real texts.

* Many SUWs correspond to a single English word but they tend to be shorter than
English counterparts. An example is "フランス 語" (French; French language).
For detailed definition please refer to [3] written in Japanese.

* The automatic tokenization accuracy is more than 98% on in-domain data (BCCWJ) [4].

---

@@ -37,11 +32,6 @@ Language Resources and Evaluation Vol. 48 345-371, May 2014.
小椋 秀樹, 小磯 花絵, 冨士池 優美, 宮内 佐夜香, 小西 光, and 原 裕,
独立行政法人国立国語研究所, 2011.

[4] Language Resource Addition: Dictionary or Corpus?,
Shinsuke Mori and Graham Neubig,
In Proceedings of the Nineth International Conference on Language Resources and Evaluation, pp. 1631-1636, 2014.


<!-- **Instruction**: Describe the general rules for delimiting words (for example, based on whitespace and punctuation) and exceptions to these rules. Specify whether words with spaces and/or multiword tokens occur. Include links to further language-specific documentation if available.-->

---
@@ -50,7 +40,18 @@ In Proceedings of the Nineth International Conference on Language Resources and

### Tags

* to be described.
The UD PoS tags in Japanese are converted from the UniDic PoS tagset.

UniDic defines two layers of PoS tagsets, one for Short Unit Words and the other for Long Unit Words.
The PoS tagset for Short Unit Words is a 'lexicon-based label' (語彙主義) tagset in which each PoS label covers all possible usages of the word across contexts.
In contrast, BCCWJ annotates the 'usage' of each PoS as additional PoS information.
The PoS tagset for Long Unit Words uses 'usage-based labels' (用法主義) that are disambiguated by contextual information.
[(小椋ほか 2010a)](http://pj.ninjal.ac.jp/corpus_center/bccwj/doc/report/JC-D-10-05-01.pdf)
[(小椋ほか 2010b)](http://pj.ninjal.ac.jp/corpus_center/bccwj/doc/report/JC-D-10-05-02.pdf)
Note that the term 'usage-based' here does not mean the same as in Langacker's Usage-Based model.

- The English Translation of POS Tagset by Dr. Irena Srdanovic
[link](https://gist.github.com/masayu-a/e3eee0637c07d4019ec9)

---
<!-- **Instruction**: Specify any unused tags. Explain what words are tagged as PART. Describe how the AUX-VERB and DET-PRON distinctions are drawn, and specify whether there are (de)verbal forms tagged as ADJ, ADV or NOUN. Include links to language-specific tag definitions if any.-->
@@ -68,7 +69,24 @@ In Proceedings of the Nineth International Conference on Language Resources and

## Syntax

* to be described.
Japanese syntactic dependency has the following properties.

* Strictly Head Final:
Bunsetsu-based dependencies in Japanese are strictly head final except for apposition and anastrophe (倒置).

* Projective:
Bunsetsu-based dependencies in Japanese are projective except for apposition and non-constituent conjunct coordinations (部分並列).

* Arrow from modifier to head:
In the Japanese NLP community, we draw dependency arrows from modifier to head.
This is the opposite of the standard elsewhere in the world.
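
The head-final and projective properties can be checked mechanically. The following is a minimal sketch in Python, assuming a sentence is given as a list `heads` in which `heads[i]` is the index of the bunsetsu that bunsetsu `i` depends on and `-1` marks the root. This representation, the function names, and the example sentence are assumptions made for illustration and are not part of any UD or BCCWJ tool. Projectivity is treated here simply as the absence of crossing arcs.

```python
from typing import List


def is_head_final(heads: List[int]) -> bool:
    """Return True if every non-root bunsetsu depends on a later bunsetsu."""
    return all(h == -1 or h > i for i, h in enumerate(heads))


def is_projective(heads: List[int]) -> bool:
    """Return True if no two dependency arcs cross each other."""
    # Store each arc as (left end, right end), ignoring the root.
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if h != -1]
    for a, b in arcs:
        for c, d in arcs:
            # Arcs cross when exactly one end of (c, d) lies strictly inside (a, b).
            if a < c < b < d:
                return False
    return True


# Illustrative sentence (not taken from a corpus): 太郎が 赤い 本を 読んだ
# 太郎が -> 読んだ, 赤い -> 本を, 本を -> 読んだ, 読んだ is the root.
example = [3, 2, 3, -1]
print(is_head_final(example), is_projective(example))
```

Apposition and anastrophe would show up as cases where `is_head_final` returns `False`, and non-constituent conjunct coordination as cases where `is_projective` returns `False`.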

We have several annotation schemas for dependency annotation. They are labelled but contain very limited syntactic information.
Some of the syntactic information labelled in UD is found in case frame or semantic role annotation and is only available in Japanese (see next section).

* Conversion from BCCWJ-DepPara schema:

The BCCWJ-DepPara schema has two parts: (1) bunsetsu-based dependency using four labels (D for normal dependency, F for filler, no head, or face mark, Z for sentence boundary in nested sentences, and B for resolution of discrepancy between bunsetsu units), and (2) nested coordination structure and apposition annotation, as in '[Coordination Annotation for the Penn Treebank](https://catalog.ldc.upenn.edu/LDC2015T08)'.
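
The four bunsetsu dependency labels can be represented as a small enumeration. The sketch below is illustrative only: the label descriptions come from the text above, while the `"3D"`-style link string (a head index followed by a label letter), the `Link` type, and `parse_link` are hypothetical and are not claimed to match the actual BCCWJ-DepPara file format.

```python
import re
from enum import Enum
from typing import NamedTuple


class DepLabel(Enum):
    """The four bunsetsu dependency labels described in the text."""
    D = "normal dependency"
    F = "filler, no head, or face mark"
    Z = "sentence boundary in nested sentences"
    B = "resolution of discrepancy between bunsetsu units"


class Link(NamedTuple):
    head: int        # index of the head bunsetsu (-1 for no head)
    label: DepLabel  # one of D, F, Z, B


# Hypothetical link notation: an integer head index followed by one label letter.
_LINK_PATTERN = re.compile(r"(-?\d+)([DFZB])")


def parse_link(text: str) -> Link:
    """Parse a link string such as "3D" into a Link."""
    match = _LINK_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid link string: {text!r}")
    return Link(int(match.group(1)), DepLabel[match.group(2)])


link = parse_link("3D")
print(link.head, link.label.name, link.label.value)
```

A conversion to UD would need to treat each label differently, for example by attaching only D links as ordinary dependencies, but the rules for that conversion are not specified here.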

---
<!-- **Instruction**: Give criteria for identifying core arguments (subjects and objects), and describe the range of copula constructions in nonverbal clauses. List all subtype relations used. Include links to language-specific relations definitions if any. -->