
Conversation


@Heliozoa Heliozoa commented Sep 28, 2022

The sentence splitting wasn't working properly with mecab. It seems like the idea was to use the EOS markers, but their name is misleading and mecab doesn't actually split sentences. For example

echo "プリベット通り四番地の住人ダーズリー夫妻は、「おかげさまで、私どもはどこから見てもまともな人間です」というのが自慢だった。不思議とか神秘とかそんな非常識はまるっきり認めない人種で、まか不思議な出来事が彼らの周辺で起こるなんて、とうてい考えられなかった。" | mecab  -F %m\\t%t\\t%h\\n -U %m\\t%t\\t%h\\n -E EOS\\t3\\t7\\n

will output the following (the three columns are the surface form from %m, the character-type ID from %t, and the part-of-speech ID from %h):

プリベット      7       38
通り    2       51
四      8       48
番地    2       53
の      6       24
住人    2       38
ダーズリー      7       38
夫妻    2       38
は      6       16
、      3       9
「      3       5
おかげ  6       38
さ      6       57
まで    6       21
、      3       9
私      2       59
ども    6       51
は      6       16
どこ    6       59
から    6       13
見      2       31
て      6       18
も      6       16
まとも  6       40
な      6       25
人間    2       38
です    6       25
」      3       6
という  6       15
の      6       63
が      6       13
自慢    2       36
だっ    6       25
た      6       25
。      3       7
不思議  2       40
とか    6       23
神秘    2       40
とか    6       23
そんな  6       68
非常識  2       38
は      6       16
まるっきり      6       34
認め    2       31
ない    6       25
人種    2       38
で      6       13
、      3       9
ま      6       2
か      6       22
不思議  2       40
な      6       25
出来事  2       38
が      6       13
彼ら    2       59
の      6       24
周辺    2       38
で      6       13
起こる  2       31
なんて  6       21
、      3       9
とうてい        6       34
考え    2       31
られ    6       32
なかっ  6       25
た      6       25
。      3       7
EOS     3       7

where the only EOS is at the very end to mark the end of that "chunk" of input, even though the text contains two sentences.

The workaround I wrote is pretty hacky, but it seems to work well. A node_type of 3 signifies a punctuation mark, and I wrote a check to skip characters that shouldn't end a sentence, like commas and quotation marks. The code also handles repeated punctuation marks properly, like a sentence ending in ?!. I'm not very experienced with MySQL or PHP, so the only way I could do this was to stuff the variable assignments into the calculations, which doesn't look great...
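
Roughly, the idea looks like this as a simplified PHP sketch (not the actual PR code; the function name and the list of sentence-ending marks are just for illustration). It expects mecab output produced with -F "%m\t%t\t%h\n", i.e. surface form, character-type ID and part-of-speech ID per line:

<?php

/**
 * Simplified sketch of the splitting idea, not the code in this PR.
 * Expects mecab output with one "surface\tchar-type\tpos-id" line per token.
 */
function splitIntoSentences(string $mecabOutput): array
{
    // Only these marks may close a sentence; commas and quotation
    // marks are deliberately excluded.
    $enders = ['。', '？', '！', '?', '!'];
    $sentences = [];
    $current = '';
    $endingSeen = false;

    foreach (explode("\n", trim($mecabOutput)) as $line) {
        $cols = explode("\t", $line);
        if (count($cols) < 3 || $cols[0] === 'EOS') {
            continue; // ignore malformed lines and mecab's trailing EOS marker
        }
        [$surface, $charType] = $cols;
        // Character type 3 marks punctuation/symbols.
        $isEnder = (int) $charType === 3 && in_array($surface, $enders, true);

        // The sentence is closed by the first non-ending token that follows
        // one or more enders, so runs like "？！" stay with their sentence.
        if ($endingSeen && !$isEnder) {
            $sentences[] = $current;
            $current = '';
        }
        $current .= $surface;
        $endingSeen = $isEnder;
    }
    if ($current !== '') {
        $sentences[] = $current;
    }
    return $sentences;
}

Feeding the example output above into this should yield the two sentences of the sample text.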

I also had problems with the EOS markers appearing in the final text, so I just dropped them altogether since they're not used in this new code. If this looks acceptable, I can clean up the surrounding code a little and remove some of the EOS checks.

edit: I added a commit that handles more edge cases, but now it looks even worse. It also doesn't handle sentences that are separated from others by newlines, such as a line that's surrounded entirely by quotation marks. I don't think there's any way to detect them from mecab's output since it doesn't retain newlines, so the proper solution to this would be to do the word splitting in PHP. I'm not sure how to implement that, though.


HugoFara commented Oct 14, 2022

Hi!

Sorry for the long wait. I reviewed your commits and tried to understand the implications, since the MeCab integration is quite a fragile part of LWT.

I'm not sure what you're trying to achieve, though. Let's take your text as an example (I separated the second sentence with a "\n" character):
[screenshot]

Here is the input text:

プリベット通り四番地の住人ダーズリー夫妻は、「おかげさまで、私どもはどこから見てもまともな人間です」というのが自慢だった。不思議とか神秘とかそんな非常識はまるっきり認めない人種で、まか不思議な出来事が彼らの周辺で起こるなんて、とうてい考えられなかった。

The following text is manually split:

プリベット通り四番地の住人ダーズリー夫妻は、「おかげさまで、私どもはどこから見てもまともな人間です」というのが自慢だった。

不思議とか神秘とかそんな非常識はまるっきり認めない人種で、まか不思議な出来事が彼らの周辺で起こるなんて、とうてい考えられなかった。

The "EOS" character is interpreted as a line feed character ("\n"), and we don't want a line feed in the parsed sentences if there was no line feed in the original one, for the sake of structure preservation. So for now, I don't really understand what you want to solve.

I also tried to pull your branch; however, syntactic characters (punctuation) are now identified as words, which would be an issue:
[screenshot]

@HugoFara HugoFara added the "question (Further information is requested)" label on Oct 14, 2022
@Heliozoa Heliozoa (Author) commented

This is the status quo on the master branch for me, with the default language wizard Japanese settings. The text

プリベット通り四番地の住人ダーズリー夫妻は、「おかげさまで、私どもはどこから見てもまともな人間です」というのが自慢だった。不思議とか神秘とかそんな非常識はまるっきり認めない人種で、まか不思議な出来事が彼らの周辺で起こるなんて、とうてい考えられなかった。

ダーズリー氏は、穴あけドリルを製造しているグラニングズ社の社長だ。ずんぐりと肉づきがよい体型のせいで、首がほとんどない。そのかわり巨大な口髭が目立っていた。奥さんの方はやせて、金髪で、なんと首の長さが普通の人の二倍はある。垣根越しにご近所の様子を詮索するのが趣味だったので、鶴のような首は実に便利だった。ダーズリー夫妻にはダドリーという男の子がいた。どこを探したってこんなにできのいい子はいやしない、というのが二人の親バカの意見だった。

そんな絵に描いたように満ち足りたダーズリー家にも、たった一つ秘密があった。なにより怖いのは、誰かにその秘密を嗅ぎつけられることだった。

――あのポッター一家のことが誰かに知られてしまったら一巻の終わりだ。

ポッター夫人はダーズリー夫人の実の妹だが、二人はここ数年一度も会ってはいなかった。それどころか、ダーズリー夫人は妹などいないというふりをしていた。なにしろ、妹もそのろくでなしの夫も、ダーズリー家の家風とはまるっきり正反対だったからだ。

――ポッター一家が不意にこのあたりに現れたら、ご近所の人たちがなんと言うか、考えただけでも身の毛がよだつ。

ポッター家にも小さな男の子がいることを、ダーズリー夫妻は知ってはいたが、ただの一度も会ったことがない。

――そんな子と、うちのダドリーがかかわり合いになるなんて......。

それもポッター一家を遠ざけている理由の一つだった。

when reading looks like
[screenshot]
all as one paragraph, with EOS inserted in between and punctuation recognised as words.

And the sentence splitting doesn't work; it functions more like a paragraph split:
[screenshot]

In the PR branch it looks like this for me while reading:
[screenshot]

And the sentence split works:
[screenshot]

@HugoFara HugoFara (Owner) commented

Mmmmh I think I have a better understanding of the issue now.

Basically, one line feed (\n) is swallowed by the parser, so you need two line feeds to get separated sentences, which explains the big blobs of text you get on the master branch. However, after more experiments, it seems your text is a kind of LWT breaker, because the parsing is really sensitive to side effects 😅

Currently, I think the best solution is to review the MeCab parsing. It is getting increasingly complex but yields poor results. I will try some improvements next week; if that does not work, I will simply use your PR.

In the meantime, texts with simple structure work fine, so a temporary solution may be manual editing...

@HugoFara HugoFara added the "bug (Something isn't working)" label and removed the "question (Further information is requested)" label on Oct 30, 2022
HugoFara added a commit that referenced this pull request Dec 16, 2022
We use EOP (end-of-paragraph) markers instead of EOS (end-of-sentence).
Parsing using PHP is a bit slower, but results are better.
After 3 months of trying, I do not advise using SQL parsing.
@HugoFara HugoFara (Owner) commented

Hi, long time no see!

As expected, it took me a long time to review the Japanese parsing using SQL, and as often happens with code that doesn't follow the KISS principle, the best solution was to tear everything down and redo it.

As far as I investigated, the code was using a lot of SQL glitches to assign variables, which made debugging very complicated. I rewrote the parser in pure PHP; the data are then sent to SQL. It's a bit slower, but texts are actually parsed as expected, which is better. I never managed to get such results with the SQL parser anyway.
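
The core of the new approach, as a simplified sketch (not the actual LWT code; the names and the placeholder character are only illustrative): since mecab does not preserve newlines, paragraph breaks are encoded as a placeholder token before the text is handed to mecab, and the EOP boundaries are recovered from the token stream afterwards.

<?php

// Simplified sketch only, not LWT's implementation.
// mecab drops newlines, so paragraph breaks are encoded as a placeholder
// token before parsing and turned back into boundaries afterwards.
const EOP_MARKER = '¶'; // any token that cannot occur in the text works

function markParagraphs(string $text): string
{
    // Append the placeholder to every line so the break survives mecab.
    return preg_replace('/\R/u', ' ' . EOP_MARKER . "\n", $text);
}

function groupByParagraph(string $mecabOutput): array
{
    $paragraphs = [[]];
    foreach (explode("\n", trim($mecabOutput)) as $line) {
        $surface = explode("\t", $line)[0];
        if ($surface === 'EOS') {
            continue;           // end of a mecab chunk, not a text boundary
        }
        if ($surface === EOP_MARKER) {
            $paragraphs[] = []; // placeholder token: start a new paragraph
            continue;
        }
        $paragraphs[count($paragraphs) - 1][] = $surface;
    }
    // Drop paragraphs that ended up empty (e.g. consecutive blank lines).
    return array_values(array_filter($paragraphs));
}

Sentence splitting can then be done per paragraph in PHP before the data is sent to SQL.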

Here is the result:
[screenshot]

Thanks for the bug report and the PR. Have a great day!

@HugoFara HugoFara closed this Dec 27, 2022
@Heliozoa Heliozoa deleted the sentence-split branch December 27, 2022 12:17