Should Japanese verbs be segmented by morpheme for selection? #212

Open
r12a opened this issue May 22, 2020 · 6 comments
Labels
i:segmentation Grapheme/word segmentation & selection l:ja Japanese question Questions about how Japanese works. These issues should be tracked in i18n-activity tracker. s:jpan Japanese script

Comments

@r12a
Contributor

r12a commented May 22, 2020

I've been exploring the results of double-clicking on Japanese text. See a summary of the results of some exploratory tests. See https://github.com/w3c/character_phrase_tests/issues/30 for a discussion about what happens when clicking inside verbs.

Firefox only selects the run of adjacent characters that belong to the same Unicode block when you double-click in the middle of a sentence.

Chrome and Safari, however, apply some logic to the resulting selection, so that if you click inside the word 歩きました (depending on where you click) the browser highlights one of the following morphological segments, 歩き, ま, or した.

The latter is what ICU does. See the ICU segmentation demo page.

My question is whether that's what the average Japanese user wants to happen. Are they happy to be able to select the word root (including incidental hiragana, as in 歩き), or do they get frustrated that they constantly have to extend the selection to get what they want, i.e. the whole word?
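To make the Firefox behaviour described above concrete, here is a minimal Python sketch of run-based selection. It is only an illustration of the observed behaviour, not Firefox's actual code: the function names `script_class` and `select_run` are hypothetical, and the block ranges are a simplification (everything outside the hiragana, katakana, and main CJK ideograph blocks is lumped together as "other").

```python
def script_class(ch: str) -> str:
    """Classify a character by Unicode block (simplified, assumed ranges)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"


def select_run(text: str, index: int) -> str:
    """Return the run of same-class characters surrounding text[index]."""
    cls = script_class(text[index])
    # Extend the selection leftwards while the class stays the same.
    start = index
    while start > 0 and script_class(text[start - 1]) == cls:
        start -= 1
    # Extend the selection rightwards while the class stays the same.
    end = index + 1
    while end < len(text) and script_class(text[end]) == cls:
        end += 1
    return text[start:end]


sentence = "私は公園を歩きました。"
print(select_run(sentence, 5))  # double-click on 歩 -> 歩
print(select_run(sentence, 7))  # double-click on ま -> きました
```

Under this model, clicking on the kanji selects only 歩, while clicking anywhere in the hiragana tail selects きました as a single run, with no morphological analysis involved.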

@r12a r12a added question Questions about how Japanese works. These issues should be tracked in i18n-activity tracker. i:segmentation Grapheme/word segmentation & selection labels May 22, 2020
@kidayasuo
Contributor

A good question. I feel the current behaviour (of Safari / Chrome; Firefox is ancient) generally works as I expect, especially for particles and words that have an inflection. These are the units I work with when I edit text for improvements.

Once in a while I am frustrated by compound kanji words. The browsers typically dissect them into the smallest chunks, but often the unit I want to edit or replace is the whole compound word. English has the same issue with compound words, however. The city name “Palo Alto” is two words, yet they are inseparable; “Palo Alto town hall” is four words, and so on. The only difference is that word boundaries are not visible (and can be ambiguous) in Japanese text.

@xfq
Member

xfq commented May 24, 2020

> Once in a while I am frustrated by compound kanji words. The browsers typically dissect them into the smallest chunks, but often the unit I want to edit or replace is the whole compound word. English has the same issue with compound words, however. The city name “Palo Alto” is two words, yet they are inseparable; “Palo Alto town hall” is four words, and so on. The only difference is that word boundaries are not visible (and can be ambiguous) in Japanese text.

I agree that a greedy match is a good default behavior. However, there are also cases where I just want to match one stem of a compound (like 田舎 in 田舎育ち), and there a lazy match would help, so I think it would be useful to make the rules customizable.
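As a rough illustration of the greedy versus lazy distinction, the Python sketch below matches against a tiny dictionary. Everything in it is an assumption made for the example: `TOY_DICTIONARY` and `match_at` are hypothetical names, and real segmenters rely on much larger lexicons and statistical models rather than a plain set lookup.

```python
# Hypothetical toy dictionary; real segmenters use far larger lexicons.
TOY_DICTIONARY = {"田舎", "育ち", "田舎育ち"}


def match_at(text: str, start: int, greedy: bool) -> str:
    """Return a dictionary word that begins at position `start`.

    greedy=True picks the longest match, greedy=False the shortest.
    Falls back to a single character when no dictionary word matches.
    """
    candidates = [
        text[start:end]
        for end in range(start + 1, len(text) + 1)
        if text[start:end] in TOY_DICTIONARY
    ]
    if not candidates:
        return text[start]
    return max(candidates, key=len) if greedy else min(candidates, key=len)


phrase = "田舎育ちです"
print(match_at(phrase, 0, greedy=True))   # 田舎育ち (whole compound)
print(match_at(phrase, 0, greedy=False))  # 田舎 (first stem only)
```

A customizable rule could be as small as the `greedy` flag here: the same lookup serves both the "whole compound" and the "single stem" use cases.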

@kidayasuo
Contributor

I agree. Given that expanding the range is easier than reducing it, I think the current behaviour of matching a smaller semantic unit is a reasonable one.

@r12a
Contributor Author

r12a commented May 27, 2020

The thing i was particularly curious about is not so much the separation of 歩き from ました, which i can also see some usefulness for, although it's not the sort of thing that's done in most languages (for example it doesn't happen in Korean). I was particularly curious to know whether also separating ま from した was useful or irritating (to the average Web user)? I can see the logic, but i'd be quite interested to hear if ordinary users appreciate the morphological segmentation that is applied to Japanese.

@kidayasuo
Contributor

I think you need to ask elsewhere (than GitHub) if you want to hear what “ordinary users” say ;)

A challenge of asking ordinary users is that they typically do not know or remember what they want or do unless they had an extremely pleasant (unlikely with selecting a range) or unpleasant experience. The best way would be to observe what they do rather than just to ask. The flip side is experts. They will explain from their knowledge, often ignoring what they might actually feel.

It is possible that the separation of ま/した is not optimal. Allowing した to be selected seems reasonable if you want to change it to ません, though that might be because I know the grammar; I am certainly not an ordinary user. It can still be non-intuitive to ordinary users.

Also, if you take the ability of input methods into account, selecting a wider range, e.g. a “bunsetsu” segment unit, might make more sense at this point. Because current input methods are not good at converting such a short string, users would often need to type it as a complete bunsetsu even if they had initially selected a shorter range. Future input methods might solve this issue by looking at the surrounding text. Some input methods already have this ability, but it is still limited.

Yes, this is an interesting area to explore.

@murata2makoto

@r12a Wakati-gaki is relevant, since it inserts spaces between small units.

There are several sets of rules for wakati-gaki. I once studied them and found that there is no consensus. I even found that elementary school textbooks are not always consistent. Moreover, dictionaries for computers sometimes include some compound words but not others.

@r12a r12a added s:jpan Japanese script l:ja Japanese labels Jul 10, 2024