Cedict reader #138

jlowryduda · 2017-10-18T16:00:56Z

No description provided.

rspeer · 2017-10-31T20:18:57Z

conceptnet5/readers/cc_cedict.py

+DATE_RANGE_REGEX = re.compile(r'(.+?)\s\(.+\d.+\),')  # date range
+PAREN_REGEX = re.compile(r'\(.+?\)')  # parenthesis
+CHINESE_CHAR_REGEX = re.compile(r'([\u4e00-\u9fff]+[\|·]?)+')  # Chinese characters
+BRACKETS_REGEX = re.compile(r'\[.+\]')  # pronunciation


I think BRACKETS_REGEX is too greedy and should have a ? after the + like PAREN_REGEX does.

I believe that in this definition:

一甲 一甲 [yi1 jia3] /1st rank or top three candidates who passed the imperial examination (i.e. 狀元|状元[zhuang4 yuan2], 榜眼[bang3 yan3], and 探花[tan4 hua1], respectively)/

it will match this text:

[zhuang4 yuan2], 榜眼[bang3 yan3], and 探花[tan4 hua1]

jlowryduda · 2017-11-01

This particular definition would actually have everything inside the parentheses removed before the brackets are matched, but the problem did occur in a couple of other definitions, so I changed it.
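For readers following the thread, here is a minimal sketch of the greedy-versus-lazy difference discussed above. It uses only the standard `re` module. The name `LAZY_BRACKETS_REGEX` and the choice to run both patterns on the parenthesized fragment are assumptions made for illustration, not the code that was committed.

```python
import re

# Greedy pattern, as it appears in the diff under review.
BRACKETS_REGEX = re.compile(r'\[.+\]')
# Hypothetical lazy variant, following the suggestion to add a `?`.
LAZY_BRACKETS_REGEX = re.compile(r'\[.+?\]')

# Fragment taken from the example definition quoted in the review comment.
fragment = '狀元|状元[zhuang4 yuan2], 榜眼[bang3 yan3], and 探花[tan4 hua1]'

# The greedy pattern runs from the first '[' to the last ']'.
greedy = BRACKETS_REGEX.findall(fragment)
# The lazy pattern stops at the nearest ']' each time.
lazy = LAZY_BRACKETS_REGEX.findall(fragment)

# Stripping pronunciations shows the practical consequence.
greedy_stripped = BRACKETS_REGEX.sub('', fragment)
lazy_stripped = LAZY_BRACKETS_REGEX.sub('', fragment)

print(greedy)
print(lazy)
print(greedy_stripped)
print(lazy_stripped)
```

With the greedy pattern, the words between the first and last bracket are removed along with the pronunciations. The lazy pattern removes only the three bracketed pronunciations.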

rspeer · 2017-10-31T20:27:47Z

conceptnet5/readers/cc_cedict.py

+LINE_REGEX = re.compile(r'(.+)\s(.+)\[.+\]\s/(.+)/')  # separate traditional and simplified words
+DATE_RANGE_REGEX = re.compile(r'(.+?)\s\(.+\d.+\),')  # date range
+PAREN_REGEX = re.compile(r'\(.+?\)')  # parenthesis
+CHINESE_CHAR_REGEX = re.compile(r'([\u4e00-\u9fff]+[\|·]?)+')  # Chinese characters


CHINESE_CHAR_REGEX appears to exclude the range U+3400..U+4DBF (CJK Unified Ideographs Extension A), which appears in entries such as:

㐌 㐌 [ta1] /variant of 它[ta1]/

For what it's worth, it also excludes the other CJK extensions with codepoints U+20000 and up, but CEDICT never uses those anyway.

jlowryduda · 2017-11-01

That was the case for a couple of definitions, so I ended up switching to regex.compile('([\p{IsIdeo}]+[\|·]?)+'), per your suggestion.
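A minimal sketch of the coverage gap described above, using only the standard `re` module. The thread settled on the third-party `regex` module with a Unicode property class. The explicit Extension A range below is a stand-in chosen so the example has no dependencies, and the names `NARROW_REGEX` and `WIDER_REGEX` are hypothetical.

```python
import re

# Original pattern from the diff: basic CJK Unified Ideographs block only.
NARROW_REGEX = re.compile(r'([\u4e00-\u9fff]+[\|·]?)+')
# Illustrative stand-in that also covers Extension A (U+3400..U+4DBF).
# This is an assumption for the sketch, not the committed fix.
WIDER_REGEX = re.compile(r'([\u3400-\u4dbf\u4e00-\u9fff]+[\|·]?)+')

headword = '㐌'                    # U+340C, in Extension A
definition = 'variant of 它[ta1]'  # 它 is U+5B83, in the basic block

# The narrow pattern finds nothing in the Extension A headword.
narrow_hit = NARROW_REGEX.search(headword)
# The wider pattern matches it.
wider_hit = WIDER_REGEX.search(headword)
# Both patterns still find the basic-block character in the definition.
narrow_def = NARROW_REGEX.search(definition)
wider_def = WIDER_REGEX.search(definition)

print(narrow_hit)
print(wider_hit.group(0))
print(narrow_def.group(0), wider_def.group(0))
```

An explicit range like this still misses the extensions at U+20000 and above. A property-based class such as the one adopted in the thread avoids enumerating blocks by hand.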

jlowryduda added 7 commits September 19, 2017 17:43

add cc_cedict to Snakefile, cli

0ae1ffc

rules capturing people, measure words, variants

883041e

make edges of extracted information

57c4bc3

more documentation, style changes

c4c53bd

compile all regex

4cbe53d

load from a gz file instead of a text file

349be0f

add newline at the end of a file

7c93cec

rspeer reviewed Oct 31, 2017


better Han characters regex, less greedy bracket regex

591c045

rspeer merged commit 620b713 into master Nov 15, 2017

rspeer deleted the cedict-reader branch November 15, 2017 18:15


jlowryduda commented Oct 18, 2017

rspeer Oct 31, 2017

jlowryduda Nov 1, 2017

rspeer Oct 31, 2017

rspeer Oct 31, 2017

jlowryduda Nov 1, 2017
