**`fast-langdetect`** is an ultra-fast and highly accurate language detection library based on FastText, a library developed by Facebook. It is up to 80x faster than conventional methods and delivers up to 95% accuracy.
- Supports Python `3.9` to `3.13`.
- Works offline with the lite model
- No `numpy` required (thanks to @dalf).
> ### Background
>
> This project builds upon [zafercavdar/fasttext-langdetect](https://github.com/zafercavdar/fasttext-langdetect#benchmark) with enhancements in packaging.
> For more information about the underlying model, see the official FastText documentation: [Language Identification](https://fasttext.cc/docs/en/language-identification.html).
> ### Memory note
>
> The lite model runs offline and is memory-friendly; the full model is larger and offers higher accuracy.
>
> Approximate memory usage (RSS after load):
> - Lite: ~45–60 MB
> - Full: ~170–210 MB
> - Auto: tries full first, falls back to lite only on MemoryError.
>
> Notes:
> - Measurements vary by Python version, OS, allocator, and import graph; treat these as practical ranges.
> - Validate on your system if constrained; see `examples/memory_usage_check.py` (credit: script by github@JackyHe398).
> - Run memory checks in a clean terminal session. IDEs/REPLs may preload frameworks and inflate peak RSS (ru_maxrss),
> leading to very large peaks with near-zero deltas.
>
> Choose the model that best fits your constraints.
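To sanity-check these numbers on your own machine, here is a minimal standard-library sketch (the repository's `examples/memory_usage_check.py` is the fuller version; note that `ru_maxrss` units differ by OS):

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Peak RSS of the current process in MB (ru_maxrss is KB on Linux, bytes on macOS)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak / (1024 * 1024) if sys.platform == "darwin" else peak / 1024

baseline = peak_rss_mb()
# Load the model you want to measure here, e.g.:
# from fast_langdetect import detect
# detect("Hello world", model="full")
print(f"peak RSS delta: {peak_rss_mb() - baseline:.1f} MB")
```

As the note above says, run this in a clean terminal session rather than an IDE or REPL to avoid inflated peaks.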
## Installation 💻
## Usage 🖥️
For higher accuracy, prefer the full model via `detect(text, model='full')`. For robust behavior under memory pressure, use `detect(text, model='auto')`, which falls back to the lite model only on `MemoryError`.
### Prerequisites
- Setting `FTLANG_CACHE` environment variable
- Using `LangDetectConfig(cache_dir="your/path")`
### Simple Usage (Recommended)
Call a model explicitly for clear, predictable behavior, and use `k` to get multiple candidates. The function always returns a list of results:
```python
from fast_langdetect import detect, LangDetector, LangDetectConfig

# Lite model (offline, smaller, faster) — never falls back
result = detect("Hello, world!", model='lite', k=1)
print(result)

# Custom configuration
config = LangDetectConfig(cache_dir="/custom/cache/path", model="auto")  # Custom cache + default model
detector = LangDetector(config)

# Omit model to use config.model; pass model to override
result = detector.detect("Hello world", k=1)
print(result)  # [{'lang': 'en', 'score': 0.98}]

# Multiline text is handled automatically (newlines are replaced)
multiline_text = "Hello, world!\nThis is a multiline text."
print(detect(multiline_text, k=1))
# Output: [{'lang': 'en', 'score': 0.85}]

# Multi-language detection
results = detect(
    "Hello 世界 こんにちは",
    model='auto',
    k=3,  # Return top 3 languages (auto model loading)
)
print(results)
# Output: [
#     ...
# ]
```
#### Fallback Policy (Keep It Simple)
- Only `MemoryError` triggers fallback (in `model='auto'`): when loading the full model runs out of memory, it falls back to the lite model.
- I/O/network/permission/path/integrity errors raise standard exceptions (e.g., `FileNotFoundError`, `PermissionError`) or library-specific errors where applicable — no silent fallback.
- `model='lite'` and `model='full'` never fall back by design.
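The policy is easy to reason about in isolation. Here is a stand-alone sketch of the rule the bullets describe (not the library's internals):

```python
def load_with_auto_policy(load_full, load_lite):
    """'auto' policy: try the full model; fall back to lite only on MemoryError.

    Any other exception (I/O, permission, path, integrity) propagates unchanged.
    """
    try:
        return load_full()
    except MemoryError:
        return load_lite()
```

With `model='lite'` or `model='full'` there is no such try/except at all; failures surface directly.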
#### Errors
- Base error: `FastLangdetectError` (library-specific failures).
- Model loading failures: `ModelLoadError`.
- Standard Python exceptions (e.g., `ValueError`, `TypeError`, `FileNotFoundError`, `MemoryError`) propagate when they are not library-specific.
```python
# Override the configured model per call (assumes `detector` from the configuration example above)
result = detector.detect("Hello world", model='auto', k=1)
```
### Splitting Text by Language 🌐
- When truncation happens, a WARNING is logged because it may reduce accuracy.
- `max_input_length=80` truncates overly long inputs; set it to `None` to disable truncation.
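The truncation rule itself is simple; a stand-alone sketch of the behavior described above (not the library's code):

```python
import logging

def maybe_truncate(text: str, max_input_length=80) -> str:
    """Truncate overly long input; max_input_length=None disables truncation."""
    if max_input_length is not None and len(text) > max_input_length:
        # Truncation may reduce accuracy, hence the WARNING
        logging.warning("Input truncated to %d characters", max_input_length)
        return text[:max_input_length]
    return text
```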
### Cache Directory Behavior
- Default cache: if `cache_dir` is not set, models are stored under a system temp-based directory specified by `FTLANG_CACHE` or an internal default. This directory is created automatically when needed.
- User-provided `cache_dir`: if you set `LangDetectConfig(cache_dir=...)` to a path that does not exist, the library raises `FileNotFoundError` instead of silently creating or using another location. Create the directory yourself if that’s intended.
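In practice this means creating a custom cache directory yourself before handing it to the library; a short sketch (the path here is just an example — any writable directory works):

```python
import os
from pathlib import Path

# Example location only; pick any writable directory
cache = Path.home() / ".cache" / "fast_langdetect"
cache.mkdir(parents=True, exist_ok=True)  # avoids FileNotFoundError from the library

# Either point the environment variable at it...
os.environ["FTLANG_CACHE"] = str(cache)
# ...or pass it explicitly: LangDetectConfig(cache_dir=str(cache))
```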
### Advanced Options (Optional)
The constructor exposes a few advanced knobs (`proxy`, `normalize_input`, `max_input_length`). These are rarely needed for typical usage and can be ignored. Prefer `detect(..., model=...)` unless you know you need them.
```python
# Assumes code_to_english_name is importable from the package root
from fast_langdetect import detect, code_to_english_name

result = detect("Olá mundo", model='full', k=1)
print(code_to_english_name(result[0]["lang"]))  # Portuguese (Brazil) or Portuguese
```
Alternatively, `pycountry` can be used for ISO 639 lookups (install with `pip install pycountry`), combined with a small override dict for non-standard tags like `pt-br`, `zh-cn`, `yue`, etc.
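A sketch of that approach (the override dict and its entries are illustrative; assumes `pip install pycountry` for the non-override path):

```python
# Overrides for non-standard tags that ISO 639 lookups won't resolve as-is
OVERRIDES = {
    "pt-br": "Portuguese (Brazil)",
    "zh-cn": "Chinese (Simplified)",
    "yue": "Cantonese",
}

def english_name(tag: str) -> str:
    """Map a detected language tag to an English name, preferring overrides."""
    if tag in OVERRIDES:
        return OVERRIDES[tag]
    import pycountry  # imported lazily so the override path works without it
    lang = pycountry.languages.get(alpha_2=tag) or pycountry.languages.get(alpha_3=tag)
    return lang.name if lang else tag
```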