
Conversation


@Heliozoa Heliozoa commented Sep 28, 2022

The sentence splitting wasn't working properly with mecab. It seems like the idea was to use the EOS markers, but their name is misleading and mecab doesn't actually split sentences. For example

echo "プリベット通り四番地の住人ダーズリー夫妻は、「おかげさまで、私どもはどこから見てもまともな人間です」というのが自慢だった。不思議とか神秘とかそんな非常識はまるっきり認めない人種で、まか不思議な出来事が彼らの周辺で起こるなんて、とうてい考えられなかった。" | mecab  -F %m\\t%t\\t%h\\n -U %m\\t%t\\t%h\\n -E EOS\\t3\\t7\\n

will output the following (the three columns are the surface form from %m, the character-type ID from %t, and the part-of-speech ID from %h):

プリベット      7       38
通り    2       51
四      8       48
番地    2       53
の      6       24
住人    2       38
ダーズリー      7       38
夫妻    2       38
は      6       16
、      3       9
「      3       5
おかげ  6       38
さ      6       57
まで    6       21
、      3       9
私      2       59
ども    6       51
は      6       16
どこ    6       59
から    6       13
見      2       31
て      6       18
も      6       16
まとも  6       40
な      6       25
人間    2       38
です    6       25
」      3       6
という  6       15
の      6       63
が      6       13
自慢    2       36
だっ    6       25
た      6       25
。      3       7
不思議  2       40
とか    6       23
神秘    2       40
とか    6       23
そんな  6       68
非常識  2       38
は      6       16
まるっきり      6       34
認め    2       31
ない    6       25
人種    2       38
で      6       13
、      3       9
ま      6       2
か      6       22
不思議  2       40
な      6       25
出来事  2       38
が      6       13
彼ら    2       59
の      6       24
周辺    2       38
で      6       13
起こる  2       31
なんて  6       21
、      3       9
とうてい        6       34
考え    2       31
られ    6       32
なかっ  6       25
た      6       25
。      3       7
EOS     3       7

where the only EOS is at the very end to mark the end of that "chunk" of input, even though the text contains two sentences.

The workaround I wrote is pretty hacky, but it seems to work well. A node_type of 3 signifies a punctuation mark, and I wrote a check to skip characters that shouldn't end a sentence, like commas and quotation marks. The code also handles repeated punctuation marks properly, like a sentence ending in ?!. I'm not very experienced with MySQL or PHP, so the only way I could do this was to stuff the variable assignments into the calculations, which doesn't look great...
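
Roughly, the idea looks like this as a simplified PHP sketch (not the actual PR code; the function name and the list of sentence-ending marks are just for illustration). It expects mecab output produced with -F "%m\t%t\t%h\n", i.e. surface form, character-type ID and part-of-speech ID per line:

<?php

/**
 * Simplified sketch of the splitting idea, not the code in this PR.
 * Expects mecab output with one "surface\tchar-type\tpos-id" line per token.
 */
function splitIntoSentences(string $mecabOutput): array
{
    // Only these marks may close a sentence; commas and quotation
    // marks are deliberately excluded.
    $enders = ['。', '？', '！', '?', '!'];
    $sentences = [];
    $current = '';
    $endingSeen = false;

    foreach (explode("\n", trim($mecabOutput)) as $line) {
        $cols = explode("\t", $line);
        if (count($cols) < 3 || $cols[0] === 'EOS') {
            continue; // ignore malformed lines and mecab's trailing EOS marker
        }
        [$surface, $charType] = $cols;
        // Character type 3 marks punctuation/symbols.
        $isEnder = (int) $charType === 3 && in_array($surface, $enders, true);

        // The sentence is closed by the first non-ending token that follows
        // one or more enders, so runs like "？！" stay with their sentence.
        if ($endingSeen && !$isEnder) {
            $sentences[] = $current;
            $current = '';
        }
        $current .= $surface;
        $endingSeen = $isEnder;
    }
    if ($current !== '') {
        $sentences[] = $current;
    }
    return $sentences;
}

Feeding the example output above into this should yield the two sentences of the sample text.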

I also had problems with the EOS markers appearing in the final text, so I just dropped them altogether since they're not used in this new code. If this looks acceptable, I can clean up the surrounding code a little and remove some of the EOS checks.

edit: I added a commit that handles more edge cases, but now it looks even worse. It also doesn't handle sentences that are separated from others by newlines, such as a line that's surrounded entirely by quotation marks. I don't think there's any way to detect them from mecab's output since it doesn't retain newlines, so the proper solution to this would be to do the word splitting in PHP. I'm not sure how to implement that, though.


HugoFara commented Oct 14, 2022

Hi!

Sorry for the long wait. I reviewed your commits and tried to understand the implications, since the MeCab integration is quite a fragile part of LWT.

I'm not sure what you're trying to achieve, though. Let's take your text as an example (I separated the second sentence with a "\n" character):
[screenshot]

Here is the input text:

プリベット通り四番地の住人ダーズリー夫妻は、「おかげさまで、私どもはどこから見てもまともな人間です」というのが自慢だった。不思議とか神秘とかそんな非常識はまるっきり認めない人種で、まか不思議な出来事が彼らの周辺で起こるなんて、とうてい考えられなかった。

The following text is manually split:

プリベット通り四番地の住人ダーズリー夫妻は、「おかげさまで、私どもはどこから見てもまともな人間です」というのが自慢だった。

不思議とか神秘とかそんな非常識はまるっきり認めない人種で、まか不思議な出来事が彼らの周辺で起こるなんて、とうてい考えられなかった。

The "EOS" character is interpreted as a line feed character ("\n"), and we don't want a line feed in the parsed sentences if there was no line feed in the original one, for the sake of structure preservation. So for now, I don't really understand what you want to solve.

I also tried to pull your branch; however, syntactic characters (punctuation) are now identified as words, which would be an issue:
[screenshot]

@HugoFara HugoFara added the "question (Further information is requested)" label on Oct 14, 2022
@Heliozoa Heliozoa (Author) commented

This is the status quo on the master branch for me, with the default language wizard Japanese settings. The text

プリベット通り四番地の住人ダーズリー夫妻は、「おかげさまで、私どもはどこから見てもまともな人間です」というのが自慢だった。不思議とか神秘とかそんな非常識はまるっきり認めない人種で、まか不思議な出来事が彼らの周辺で起こるなんて、とうてい考えられなかった。

ダーズリー氏は、穴あけドリルを製造しているグラニングズ社の社長だ。ずんぐりと肉づきがよい体型のせいで、首がほとんどない。そのかわり巨大な口髭が目立っていた。奥さんの方はやせて、金髪で、なんと首の長さが普通の人の二倍はある。垣根越しにご近所の様子を詮索するのが趣味だったので、鶴のような首は実に便利だった。ダーズリー夫妻にはダドリーという男の子がいた。どこを探したってこんなにできのいい子はいやしない、というのが二人の親バカの意見だった。

そんな絵に描いたように満ち足りたダーズリー家にも、たった一つ秘密があった。なにより怖いのは、誰かにその秘密を嗅ぎつけられることだった。

――あのポッター一家のことが誰かに知られてしまったら一巻の終わりだ。

ポッター夫人はダーズリー夫人の実の妹だが、二人はここ数年一度も会ってはいなかった。それどころか、ダーズリー夫人は妹などいないというふりをしていた。なにしろ、妹もそのろくでなしの夫も、ダーズリー家の家風とはまるっきり正反対だったからだ。

――ポッター一家が不意にこのあたりに現れたら、ご近所の人たちがなんと言うか、考えただけでも身の毛がよだつ。

ポッター家にも小さな男の子がいることを、ダーズリー夫妻は知ってはいたが、ただの一度も会ったことがない。

――そんな子と、うちのダドリーがかかわり合いになるなんて......。

それもポッター一家を遠ざけている理由の一つだった。

when reading looks like
[screenshot]
all as one paragraph, with EOS inserted in between and punctuation recognised as words.

And the sentence splitting doesn't work; it functions more like a paragraph split:
[screenshot]

In the PR branch it looks like this for me while reading:
[screenshot]

And the sentence split works:
[screenshot]

@HugoFara HugoFara (Owner) commented

Mmmmh I think I have a better understanding of the issue now.

Basically, one line feed (\n) is swallowed by the parser, so you need two line feeds to get separated sentences, which explains the big blobs of text you get on the master branch. However, after more experiments, it seems your text is a kind of LWT breaker, because the parsing is really sensitive to side effects 😅

Currently, I think the best solution is to review the MeCab parsing. It is getting increasingly complex but yields poor results. I will try some improvements next week; if that does not work, I will simply use your PR.

In the meantime, texts with simple structure work fine, so a temporary solution may be manual editing...

@HugoFara HugoFara added the "bug (Something isn't working)" label and removed the "question (Further information is requested)" label on Oct 30, 2022
HugoFara added a commit that referenced this pull request Dec 16, 2022
We use EOP (end-of-paragraph) markers instead of EOS (end-of-sentence).
Parsing using PHP is a bit slower, but results are better.
After 3 months of trying, I do not advise using SQL parsing.
@HugoFara HugoFara (Owner) commented

Hi, long time no see!

As expected, it took me a long time to review the Japanese parsing using SQL, and as often happens with code that doesn't follow the KISS principle, the best solution was to tear everything down and redo it.

As far as I investigated, the code was using a lot of SQL glitches to assign variables, which made debugging very complicated. I rewrote the parser in pure PHP; the data are then sent to SQL. It's a bit slower, but texts are actually parsed as expected, which is better. I never managed to get such results with the SQL parser anyway.
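
The core of the new approach, as a simplified sketch (not the actual LWT code; the names and the placeholder character are only illustrative): since mecab does not preserve newlines, paragraph breaks are encoded as a placeholder token before the text is handed to mecab, and the EOP boundaries are recovered from the token stream afterwards.

<?php

// Simplified sketch only, not LWT's implementation.
// mecab drops newlines, so paragraph breaks are encoded as a placeholder
// token before parsing and turned back into boundaries afterwards.
const EOP_MARKER = '¶'; // any token that cannot occur in the text works

function markParagraphs(string $text): string
{
    // Append the placeholder to every line so the break survives mecab.
    return preg_replace('/\R/u', ' ' . EOP_MARKER . "\n", $text);
}

function groupByParagraph(string $mecabOutput): array
{
    $paragraphs = [[]];
    foreach (explode("\n", trim($mecabOutput)) as $line) {
        $surface = explode("\t", $line)[0];
        if ($surface === 'EOS') {
            continue;           // end of a mecab chunk, not a text boundary
        }
        if ($surface === EOP_MARKER) {
            $paragraphs[] = []; // placeholder token: start a new paragraph
            continue;
        }
        $paragraphs[count($paragraphs) - 1][] = $surface;
    }
    // Drop paragraphs that ended up empty (e.g. consecutive blank lines).
    return array_values(array_filter($paragraphs));
}

Sentence splitting can then be done per paragraph in PHP before the data is sent to SQL.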

Here is the result:
[screenshot]

Thanks for the bug report and the PR. Have a great day!

@HugoFara HugoFara closed this Dec 27, 2022
@Heliozoa Heliozoa deleted the sentence-split branch December 27, 2022 12:17