feat: No numpy required #8


Merged · 6 commits · Jan 9, 2025
2 changes: 1 addition & 1 deletion .gitignore
@@ -159,4 +159,4 @@ cython_debug/
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/
50 changes: 26 additions & 24 deletions README.md
@@ -6,21 +6,20 @@

## Overview

**`fast-langdetect`** is an ultra-fast, highly accurate language detection library based on FastText, a library developed by Facebook. It is 80x faster than conventional methods and delivers up to 95% accuracy.

- Supports Python `3.9` to `3.12`.
- Works offline in low-memory mode.
- No `numpy` required (thanks to @dalf).

> ### Background
>
> This project builds upon [zafercavdar/fasttext-langdetect](https://github.com/zafercavdar/fasttext-langdetect#benchmark) with enhancements in packaging.
> For more information about the underlying model, see the official FastText documentation: [Language Identification](https://fasttext.cc/docs/en/language-identification.html).

> ### Possible memory usage
>
> *This library requires at least **200MB** of memory in low-memory mode.*

## Installation 💻

@@ -40,20 +39,17 @@ pdm add fast-langdetect

## Usage 🖥️

When accuracy matters, do not rely on the small model's predictions; pass `low_memory=False` to download and use the larger model.

### Prerequisites

- Remove `\n` characters from the input string before calling the function.
- Accuracy drops when the sample is too long or too short (for example, very short Chinese text may be predicted as Japanese).
- The model will be downloaded to the `/tmp/fasttext-langdetect` directory on first use.

### Native API (Recommended)

```python
from fast_langdetect import detect, detect_multilingual

# Single language detection
Expand All @@ -69,15 +65,17 @@ multiline_text = """
Hello, world!
This is a multiline text.
But we need to remove `\n` characters or it will raise a ValueError.
REMOVE \n
"""
multiline_text = multiline_text.replace("\n", "")
print(detect(multiline_text))
# Output: {'lang': 'en', 'score': 0.8509423136711121}

print(detect("Привет, мир!")["lang"])
# Output: ru

# Multi-language detection with low-memory mode enabled
# (accuracy is lower than with the full model)
print(detect_multilingual("Hello, world!你好世界!Привет, мир!"))
# Output: [{'lang': 'ja', 'score': 0.32009604573249817}, {'lang': 'uk', 'score': 0.27781224250793457}, {'lang': 'zh', 'score': 0.17542070150375366}, {'lang': 'sr', 'score': 0.08751443773508072}, {'lang': 'bg', 'score': 0.05222449079155922}]

@@ -86,6 +84,10 @@
print(detect_multilingual("Hello, world!你好世界!Привет, мир!", low_memory=False))
# Output: [{'lang': 'ru', 'score': 0.39008623361587524}, {'lang': 'zh', 'score': 0.18235979974269867}, {'lang': 'ja', 'score': 0.08473210036754608}, {'lang': 'sr', 'score': 0.057975586503744125}, {'lang': 'en', 'score': 0.05422825738787651}]
```

#### Fallbacks

We provide a fallback mechanism: when `use_strict_mode=False`, if the **large model** (`low_memory=False`) fails to load, the library falls back to the offline **small model** to complete the prediction.
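
A minimal sketch of this behaviour, assuming `detect` accepts `use_strict_mode` as a keyword argument as described above:

```python
from fast_langdetect import detect

# Request the large model; with use_strict_mode=False (assumed default),
# a failure to load it falls back to the offline small model instead of raising.
result = detect("Hello, world!", low_memory=False, use_strict_mode=False)
print(result)  # e.g. {'lang': 'en', 'score': 0.98}
```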

### Convenient `detect_language` Function

```python
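# NOTE: the full example is collapsed in this diff view. A minimal usage
# sketch (assumption: detect_language returns an upper-cased language code
# such as "EN", matching its use in feature_test/lingua_t.py below):
from fast_langdetect import detect_language
print(detect_language("Hello, world!"))  # e.g. "EN"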
@@ -135,4 +137,4 @@ models
journal={arXiv preprint arXiv:1612.03651},
year={2016}
}
```
59 changes: 59 additions & 0 deletions feature_test/lingua_t.py
@@ -0,0 +1,59 @@
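# Side-by-side comparison of fast_langdetect and the lingua library.
# Two lingua detectors are built below: one in low-accuracy (low-memory)
# mode and one full-accuracy, both with preloaded language models.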
from lingua import LanguageDetectorBuilder

from fast_langdetect import detect_language, detect_multilingual

low_mem_detector = (LanguageDetectorBuilder
.from_all_languages()
.with_low_accuracy_mode()
.with_preloaded_language_models()
.build())
detector = (LanguageDetectorBuilder
.from_all_languages()
.with_preloaded_language_models()
.build())
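
# Single-language detection: fast_langdetect vs. lingua's low-memory detector.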
ja_sentence = "こんにちは世界"
print(detect_language(ja_sentence))
print(low_mem_detector.detect_language_of(ja_sentence).iso_code_639_1.name)
print("===")
ko_sentence = "안녕하세요 세계"
print(detect_language(ko_sentence))
print(low_mem_detector.detect_language_of(ko_sentence).iso_code_639_1.name)
print("===")
fr_sentence = "Bonjour le monde"
print(detect_language(fr_sentence))
print(low_mem_detector.detect_language_of(fr_sentence).iso_code_639_1.name)
print("===")
de_sentence = "Hallo Welt"
print(detect_language(de_sentence))
print(low_mem_detector.detect_language_of(de_sentence).iso_code_639_1.name)
print("===")
zh_sentence = "這些機構主辦的課程,多以基本電腦使用為主,例如文書處理、中文輸入、互聯網應用等"
print(detect_language(zh_sentence))
print(low_mem_detector.detect_language_of(zh_sentence).iso_code_639_1.name)
print("===")
es_sentence = "Hola mundo"
print(detect_language(es_sentence))
print(low_mem_detector.detect_language_of(es_sentence).iso_code_639_1.name)
print("===")

sentence = "こんにちは世界"
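# Multi-language detection comparison.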
for result in detector.detect_multiple_languages_of(sentence):
print(result.language)
print("===")
sentence = """
こんにちは世界
안녕하세요 세계
Hallo Welt
這些機構主辦的課程,多以基本電腦使用為主,例如文書處理、中文輸入、互聯網應用等
Bonjour le monde
"""
langs = detect_multilingual(sentence.replace("\n", " "), low_memory=False)
for lang in langs:
print(lang)
confidence_values = detector.compute_language_confidence_values(sentence)
for confidence in confidence_values:
if confidence.value > 0:
print(f"{confidence.language.iso_code_639_1.name}: {confidence.value:.2f}")
print("===")
for result in low_mem_detector.detect_multiple_languages_of(sentence):
print(result.language.iso_code_639_1.name)