Strange results for Chinese with Japanese #38

Closed
71sprite opened this issue Apr 24, 2023 · 4 comments


71sprite commented Apr 24, 2023

To reproduce:

package main

import (
	"fmt"

	"github.com/pemistahl/lingua-go"
)

func main() {
	detector := lingua.NewLanguageDetectorBuilder().
		FromAllLanguages().
		Build()

	text := "上海大学是一个好大学. わー!"
	if language, exists := detector.DetectLanguageOf(text); exists {
		fmt.Println(language.String()) // Japanese
	}
}

Expected:
Chinese should be detected for this input.

https://github.com/pemistahl/lingua-go/blob/main/detector.go#L467

This seems to happen because the code at that line returns Japanese as soon as any character from japaneseCharacterSet is present. I'm unsure whether this is intended.
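
To make the reported effect concrete, here is a minimal sketch of an "any kana means Japanese" check. It is based only on the description above, not on the lingua-go source; `anyKana` is a hypothetical helper, and it uses the standard library's `unicode.Hiragana` and `unicode.Katakana` script tables rather than the library's own japaneseCharacterSet.

```go
package main

import (
	"fmt"
	"unicode"
)

// anyKana is a hypothetical helper, not part of lingua-go. It reports
// whether text contains at least one Hiragana or Katakana character.
func anyKana(text string) bool {
	for _, r := range text {
		if unicode.In(r, unicode.Hiragana, unicode.Katakana) {
			return true
		}
	}
	return false
}

func main() {
	// Ten Han characters alone do not trigger the check...
	fmt.Println(anyKana("上海大学是一个好大学."))
	// ...but a single trailing hiragana character does.
	fmt.Println(anyKana("上海大学是一个好大学. わー!"))
}
```

With a rule of this shape, one kana character outweighs any number of Han characters, which would explain why the mostly Chinese sentence is reported as Japanese.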

Thanks for the awesome work!

@pemistahl
Owner

Hi @71sprite, thanks for your request.

I'm aware of how difficult it is to recognize Chinese and Japanese correctly. These are actually the most difficult languages. I will try to improve the algorithm, but as I don't speak these languages, it's not easy. If you speak them and have ideas for heuristics to implement, I will be glad to read about them.


71sprite commented Apr 26, 2023

I have also read some documentation (List_of_Unicode_characters). It is indeed impossible to accurately distinguish among Chinese, Japanese and Korean. Perhaps we can judge according to the Unicode range.

func isChinese(c rune) bool {
	// Chinese Unicode range (block ends extended to the full Unicode blocks)
	return (c >= '\u3400' && c <= '\u4dbf') || // CJK Unified Ideographs Extension A
		(c >= '\u4e00' && c <= '\u9fff') || // CJK Unified Ideographs
		(c >= '\uf900' && c <= '\ufaff') // CJK Compatibility Ideographs
}

func isJapanese(c rune) bool {
	// Japanese Unicode range
	return (c >= '\u3021' && c <= '\u3029') || // Hangzhou numerals (CJK Symbols and Punctuation)
		(c >= '\u3040' && c <= '\u309f') || // Hiragana
		(c >= '\u30a0' && c <= '\u30ff') || // Katakana
		(c >= '\u31f0' && c <= '\u31ff') || // Katakana Phonetic Extensions
		(c >= '\uf900' && c <= '\ufaff') // CJK Compatibility Ideographs (also matched by isChinese)
}
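
One way such range checks could be combined into a decision is to count characters instead of stopping at the first kana. The following is only a sketch under stated assumptions: `guessHanLanguage` and its `minKanaShare` threshold are hypothetical, the value 0.2 is arbitrary, and the standard library's `unicode.Han`, `unicode.Hiragana` and `unicode.Katakana` script tables stand in for the hand-written ranges. It is not lingua-go's algorithm.

```go
package main

import (
	"fmt"
	"unicode"
)

// guessHanLanguage is a hypothetical helper, not lingua-go API. It counts
// kana and Han characters and answers "Japanese" only when kana make up at
// least minKanaShare of those characters. Other characters are ignored.
func guessHanLanguage(text string, minKanaShare float64) string {
	var kana, han int
	for _, r := range text {
		switch {
		case unicode.In(r, unicode.Hiragana, unicode.Katakana):
			kana++
		case unicode.Is(unicode.Han, r):
			han++
		}
	}
	total := kana + han
	if total == 0 {
		return "Unknown" // no Han or kana characters to judge by
	}
	if float64(kana)/float64(total) >= minKanaShare {
		return "Japanese"
	}
	return "Chinese"
}

func main() {
	// The threshold 0.2 is an assumption chosen for illustration only.
	fmt.Println(guessHanLanguage("上海大学是一个好大学. わー!", 0.2))
	fmt.Println(guessHanLanguage("これは日本語の文章です。", 0.2))
}
```

A ratio like this still cannot classify Japanese text written entirely in kanji, so it would at best complement a statistical model rather than replace it.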

lyricat commented Jul 22, 2023

As a speaker of Chinese and Japanese, I vote for @71sprite

@pemistahl
Owner

Closed in favor of #68. The Rust implementation will contain improvements to the distinction between Chinese and Japanese in the next version, 1.7.0, to be released later this year.
