Strange results for Chinese with Japanese #38

71sprite · 2023-04-24T09:25:17Z

To reproduce:

package main

import (
	"fmt"

	"github.com/pemistahl/lingua-go"
)

func main() {
	detector := lingua.NewLanguageDetectorBuilder().
		FromAllLanguages().
		Build()

	text := "上海大学是一个好大学. わー!"
	if language, exists := detector.DetectLanguageOf(text); exists {
		fmt.Println(language.String()) // Japanese
	}
}

Expected:
The detector should return Chinese for this input.

https://github.com/pemistahl/lingua-go/blob/main/detector.go#L467

This happens because the code at that line returns Japanese as soon as any character from japaneseCharacterSet appears in the text. I'm not sure whether this is intended.
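To illustrate why a single kana character is enough to flip the result under such a rule, here is a minimal sketch using only the Go standard library. The helper containsKana is hypothetical and is not lingua-go's code; it only mimics the "any Japanese character present" check described above.

```go
package main

import (
	"fmt"
	"unicode"
)

// containsKana is a hypothetical helper (not part of lingua-go).
// It reports whether text has at least one Hiragana or Katakana rune.
func containsKana(text string) bool {
	for _, r := range text {
		if unicode.In(r, unicode.Hiragana, unicode.Katakana) {
			return true
		}
	}
	return false
}

func main() {
	// Ten Han characters followed by a single Hiragana character.
	text := "上海大学是一个好大学. わー!"
	fmt.Println(containsKana(text)) // true
}
```

With an any-match rule, the single わ decides the outcome even though every other letter in the sentence is a Han character.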

Thanks for awesome work!


pemistahl · 2023-04-25T07:22:18Z

Hi @71sprite, thanks for your request.

I'm aware of how difficult it is to distinguish Chinese from Japanese correctly. These are actually the most difficult languages to detect. I will try to improve the algorithm, but since I don't speak either language, that is not easy. If you speak these languages and have ideas for heuristics to implement, I will be glad to read about them.

71sprite · 2023-04-26T08:24:15Z

I have also read some documentation (List_of_Unicode_characters), and it is indeed impossible to distinguish Chinese, Japanese and Korean accurately. Perhaps we can decide based on the Unicode range:

// isChinese reports whether c falls into one of the CJK ideograph blocks.
// The same ideographs are also used as kanji in Japanese text.
func isChinese(c rune) bool {
	return (c >= '\u3400' && c <= '\u4dbf') || // CJK Unified Ideographs Extension A
		(c >= '\u4e00' && c <= '\u9fff') || // CJK Unified Ideographs
		(c >= '\uf900' && c <= '\ufaff') // CJK Compatibility Ideographs
}

// isJapanese reports whether c is kana or one of the other code points listed below.
// Kanji in the main CJK ideograph blocks are not covered here.
func isJapanese(c rune) bool {
	return (c >= '\u3021' && c <= '\u3029') || // Hangzhou numerals (CJK Symbols and Punctuation)
		(c >= '\u3040' && c <= '\u309f') || // Hiragana
		(c >= '\u30a0' && c <= '\u30ff') || // Katakana
		(c >= '\u31f0' && c <= '\u31ff') || // Katakana Phonetic Extensions
		(c >= '\uf900' && c <= '\ufaff') // CJK Compatibility Ideographs (also matched by isChinese)
}
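One way such range checks could be combined is to count characters per script and decide by proportion instead of by a single match. The sketch below is only an illustration under stated assumptions: guessByScriptCounts and the 0.2 threshold are hypothetical, and it uses the script tables from Go's standard unicode package in place of hand-written ranges. It is not what lingua-go does.

```go
package main

import (
	"fmt"
	"unicode"
)

// guessByScriptCounts is a hypothetical helper (not part of lingua-go).
// It counts kana and Han runes and answers "Japanese" only when kana make up
// at least minKanaShare of those runes. The threshold is an assumed tuning knob.
func guessByScriptCounts(text string, minKanaShare float64) string {
	var kana, han int
	for _, r := range text {
		switch {
		case unicode.In(r, unicode.Hiragana, unicode.Katakana):
			kana++
		case unicode.Is(unicode.Han, r):
			han++
		}
	}
	total := kana + han
	if total == 0 {
		return "Unknown" // no CJK letters to judge by
	}
	if float64(kana)/float64(total) >= minKanaShare {
		return "Japanese"
	}
	return "Chinese"
}

func main() {
	fmt.Println(guessByScriptCounts("上海大学是一个好大学. わー!", 0.2)) // Chinese
	fmt.Println(guessByScriptCounts("わたしは学生です", 0.2))          // Japanese
}
```

A known limitation of this kind of rule: Japanese text written almost entirely in kanji, such as a short headline, contains few or no kana and would be counted as Chinese.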

lyricat · 2023-07-22T14:50:40Z

As a speaker of Chinese and Japanese, I vote for @71sprite

pemistahl · 2024-10-01T19:00:24Z

Closed in favor of #68. The next version of the Rust implementation, 1.7.0, will improve the distinction between Chinese and Japanese and is planned for release later this year.

pemistahl closed this as completed Oct 1, 2024

JellyBrick mentioned this issue Dec 29, 2024

feat(synced-lyrics): romanization th-ch/youtube-music#2790

Open
