This project converts JMdictFurigana alignment data into bracketed furigana (e.g. 学校 → 学[がっ] 校[こう]) and then converts 音読み readings to katakana using KANJIDIC2. The result is written to a CSV file. The tooling is general-purpose and can be used for other dictionary processing workflows.
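As a rough illustration of the conversion described above, the sketch below builds a bracketed string from one JMdictFurigana-style entry and switches selected readings to katakana. It assumes each segment is a dict with a `ruby` key and an optional `rt` key. The function names `hira_to_kata` and `to_bracketed`, the `on_indexes` parameter, and the spacing rule are hypothetical and are not taken from onkunfuri.py.

```python
from __future__ import annotations


def hira_to_kata(text: str) -> str:
    """Shift hiragana (U+3041..U+3096) to the matching katakana code points."""
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )


def to_bracketed(furigana: list[dict], on_indexes: frozenset[int] = frozenset()) -> str:
    """Render segments as 'kanji[reading]'; kana-only segments pass through.

    on_indexes lists the segment positions assumed to be on-yomi; their
    readings are converted to katakana. A space is placed before every
    bracketed segment except at the start (an assumption about the format).
    """
    out = ""
    for i, seg in enumerate(furigana):
        rt = seg.get("rt")
        if rt is None:  # plain kana segment, no reading attached
            out += seg["ruby"]
            continue
        if i in on_indexes:
            rt = hira_to_kata(rt)
        if out:
            out += " "
        out += f"{seg['ruby']}[{rt}]"
    return out


gakkou = [{"ruby": "学", "rt": "がっ"}, {"ruby": "校", "rt": "こう"}]
print(to_bracketed(gakkou))                     # 学[がっ] 校[こう]
print(to_bracketed(gakkou, frozenset({0, 1})))  # 学[ガッ] 校[コウ]
```

Deciding which segment positions count as on-yomi is the classification step; here the positions are simply passed in.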
Mistakes and misclassification reports are welcome — please open an issue.
- can account for 促音 readings
- correctly identifies readings of repeated kanji using the 々 character
- correctly handles voiced readings (はん→ばん)
- normalises JMdictFurigana entries to hiragana to prevent premature false negatives
- deals with non-official renyoukei (noun-from-verb) readings
- can add manual additional_kanji_readings.csv and manual_onkunyomi.csv files to fix individual cases
- correctly handles mizenkei (上げる→上げ)
- filters out entries that don't contain kanji or that consist of meta-kanji (々〆)
- handles abbreviated kun readings misclassified as onyomi
  - うなぎ丼[どんぶり] (correct)
  - うな丼[ドン] (abbreviation of kun reading mistaken as rendaku of on reading) (wrong)
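The abbreviated-kun case can be detected with a proper-prefix test. The sketch below is one possible reading of that check, not the project's actual code: the function name `is_suspect_abbrev` is hypothetical, readings are assumed to be in hiragana already, and the `.` and `-` markers are stripped on the assumption that kun readings arrive in KANJIDIC2 notation.

```python
from __future__ import annotations


def is_suspect_abbrev(reading: str, kun_readings: list[str]) -> bool:
    """True if `reading` is a proper prefix of at least one kun reading.

    'Proper' means strictly shorter: a reading equal to the full kun
    reading is an ordinary kun match, not an abbreviation.
    """
    for kun in kun_readings:
        # Assumption: drop KANJIDIC2-style okurigana and affix markers.
        plain = kun.replace(".", "").replace("-", "")
        if plain.startswith(reading) and len(plain) > len(reading):
            return True
    return False


# Illustrative data only: どん was first matched as an on reading of 丼,
# but it is also a proper prefix of the kun reading どんぶり.
print(is_suspect_abbrev("どん", ["どんぶり"]))      # True
print(is_suspect_abbrev("どんぶり", ["どんぶり"]))  # False
```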
| tag | meaning |
|---|---|
| on | at least one kanji[reading] segment was classified as on-yomi. converted to katakana. |
| kun | at least one kanji[reading] segment was classified as kun-yomi. no change. |
| unknown | (1) one or more kanji[reading] segments could not be matched to either on or kun readings in KANJIDIC2. no change. (2) one or more 々 characters were found but no repeated kanji could be identified. |
| ambiguous | one or more kanji[reading] segments matched both the computed on and kun variant sets for that kanji. unique classification not possible. no change. |
| irregular | (1) the mapping included a multi-kanji bracket segment (e.g. 今日[きょう]). (2) the dictionary bracket-building mapping had a multi-character span (e.g. 3-4), assumed to be an ateji, jukujikun or nanori reading. classification not attempted. |
| source_katakana | the original bracket reading in the source dictionary contained katakana. usually indicates gairaigo. no change. |
| manual_onkunyomi | a word-specific override from manual_onkunyomi.csv was applied. |
| additional_kanji_reading | a kanji-level override from additional_kanji_readings.csv was applied. |
| suspect_abbrev | an on-classified kanji reading that is a proper prefix of one or more of that kanji's kun readings was identified. used to find likely abbreviations of kun readings that may be misclassified as on. no change. |
| abbrev_on_to_kun | treated a suspect_abbrev case as kun. automatically changed classification to kun. |
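To make the on, kun, ambiguous and unknown tags more concrete, here is a minimal per-segment sketch. It assumes both reading lists are already normalised to hiragana and that kun readings are given as stems. The variant rules shown (first-kana voicing and a final きくちつ becoming っ) are a simplification chosen for illustration, and `variants` and `classify_segment` are hypothetical names. The tags in the table are assigned per entry, so the real tool aggregates results like these across all segments of a word.

```python
from __future__ import annotations

# First-kana voicing candidates: the k/s/t rows have one voiced form,
# the h row has a voiced (b) and a half-voiced (p) form.
VOICING = {u: (v,) for u, v in zip("かきくけこさしすせそたちつてと",
                                   "がぎぐげござじずぜぞだぢづでど")}
VOICING.update({u: (b, p) for u, b, p in zip("はひふへほ", "ばびぶべぼ", "ぱぴぷぺぽ")})


def variants(reading: str) -> set[str]:
    """Return the reading plus its sokuon and voiced-initial variants."""
    if not reading:
        return set()
    out = {reading}
    # Sokuon: a final き/く/ち/つ may contract to っ (がく -> がっ).
    if len(reading) > 1 and reading[-1] in "きくちつ":
        out.add(reading[:-1] + "っ")
    # Voicing of the first kana (はん -> ばん, ぱん).
    for r in list(out):
        for voiced in VOICING.get(r[0], ()):
            out.add(voiced + r[1:])
    return out


def classify_segment(reading: str, on_readings: list[str], kun_readings: list[str]) -> str:
    """Classify one kanji[reading] segment against the two variant sets."""
    on_set = set().union(*(variants(r) for r in on_readings))
    kun_set = set().union(*(variants(r) for r in kun_readings))
    in_on, in_kun = reading in on_set, reading in kun_set
    if in_on and in_kun:
        return "ambiguous"
    if in_on:
        return "on"
    if in_kun:
        return "kun"
    return "unknown"


print(classify_segment("がっ", ["がく"], ["まな"]))  # on
print(classify_segment("まな", ["がく"], ["まな"]))  # kun
```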
| version | count | tag:unknown | tag:ambiguous | tag:irregular | -tag:unknown -tag:ambiguous | tag:unknown or tag:ambiguous | overlap |
|---|---|---|---|---|---|---|---|
| poc | 229,833 | 34,410 (14.97%) | 5,915 (2.57%) | 6,534 (2.84%) | 189,926 (82.64%) | 39,907 (17.36%) | 418 (0.18%) |
| 令和8年1月26日 | 228,277 | 5,825 (2.55%) | 6,097 (2.67%) | 6,534 (2.86%) | 216,407 (94.80%) | 11,870 (5.20%) | 52 (0.02%) |
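The two combined columns in the table above can be derived from the others by inclusion-exclusion: entries tagged unknown or ambiguous equal unknown plus ambiguous minus the overlap, and the remainder of the total carries neither tag. The short check below reproduces the published figures. The helper name `summarise` is hypothetical and is not part of the repository.

```python
from __future__ import annotations


def summarise(count: int, unknown: int, ambiguous: int, overlap: int) -> dict:
    """Derive the 'either' and 'neither' columns from the raw tag counts."""
    either = unknown + ambiguous - overlap  # entries counted twice are removed once
    neither = count - either
    return {
        "either": either,
        "either_pct": round(100 * either / count, 2),
        "neither": neither,
        "neither_pct": round(100 * neither / count, 2),
    }


poc = summarise(229_833, 34_410, 5_915, 418)
latest = summarise(228_277, 5_825, 6_097, 52)
print(poc["either"], poc["neither"])        # 39907 189926
print(latest["either"], latest["neither"])  # 11870 216407
```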
Dictionary entries that are believed to be input errors or furigana misalignments are filtered out before classification and are not included in the final .csv.
All source code in this repository is licensed under the MIT License. See LICENSE.
This project does not bundle any upstream dictionary files. The build workflow downloads dictionary data from the original projects and generates outputs from that data.
- KANJIDIC2 (kanji readings; 音読み/訓読み metadata), from the EDRDG KANJIDIC Project
- JMdictFurigana (furigana alignment data used to split readings by character), from the JMdictFurigana project
Please refer to the upstream projects for the authoritative licenses, terms of use and attribution requirements.
This repository publishes *.csv and *.apkg files generated by the provided onkunfuri.py script. These generated files are derived from upstream dictionary data such as JMdictFurigana, KANJIDIC2, and related EDRDG resources. As a result:
- The MIT License applies only to the code, not to the upstream dictionary data.
- The generated CSV and APKG files may be considered derivative works and are therefore distributed under the license terms of the upstream sources, including any attribution requirements and restrictions.
- If you redistribute the generated outputs, you are responsible for complying with the upstream licenses and terms.
This project is provided “as is”, without warranty of any kind. Dictionary-derived readings and 音/訓 classifications may contain errors or ambiguities such as, but not limited to, 熟字訓, 当て字, 連濁 and historical or ambiguous readings. Always verify critical information against authoritative references.