
Commit 9a828e4

Merge pull request #22 from LlmKira/fix/0915
♻️ refactor(tests/infer): Unify detect functions, make the API more intuitive and avoid implicit fallbacks
2 parents c5bd901 + 197ed22 commit 9a828e4

File tree

10 files changed: +383 -352 lines changed


README.md

Lines changed: 90 additions & 55 deletions
@@ -9,17 +9,30 @@
 **`fast-langdetect`** is an ultra-fast and highly accurate language detection library based on FastText, a library developed by Facebook. Its incredible speed and accuracy make it 80x faster than conventional methods and deliver up to 95% accuracy.

 - Supported Python `3.9` to `3.13`.
-- Works offline in low memory mode
+- Works offline with the lite model
 - No `numpy` required (thanks to @dalf).

 > ### Background
 >
 > This project builds upon [zafercavdar/fasttext-langdetect](https://github.com/zafercavdar/fasttext-langdetect#benchmark) with enhancements in packaging.
 > For more information about the underlying model, see the official FastText documentation: [Language Identification](https://fasttext.cc/docs/en/language-identification.html).

-> ### Possible memory usage
+> ### Memory note
 >
-> *This library requires at least **200MB memory** in low-memory mode.*
+> The lite model runs offline and is memory-friendly; the full model is larger and offers higher accuracy.
+>
+> Approximate memory usage (RSS after load):
+> - Lite: ~45–60 MB
+> - Full: ~170–210 MB
+> - Auto: tries full first, falls back to lite only on MemoryError.
+>
+> Notes:
+> - Measurements vary by Python version, OS, allocator, and import graph; treat these as practical ranges.
+> - Validate on your system if constrained; see `examples/memory_usage_check.py` (credit: script by github@JackyHe398).
+> - Run memory checks in a clean terminal session. IDEs/REPLs may preload frameworks and inflate peak RSS (ru_maxrss),
+>   leading to very large peaks with near-zero deltas.
+>
+> Choose the model that best fits your constraints.
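The auto fallback in the note above can be exercised directly. Below is a minimal sketch, assuming a Unix-like system (it uses `resource.RLIMIT_AS`, like the bundled example script); the cap value is illustrative, not a guarantee of where the full model fails:

```python
# Minimal sketch: cap the address space, then rely on model='auto' to
# absorb the MemoryError raised while loading the full model.
# Unix-like systems only; the 150 MB cap is an illustrative value.
import resource

limit_bytes = 150 * 1024 * 1024
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

from fast_langdetect import detect

# Tries the full model first; on MemoryError it falls back to the lite model.
print(detect("Hello", model='auto', k=1))
```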

 ## Installation 💻

@@ -39,7 +52,7 @@ pdm add fast-langdetect

 ## Usage 🖥️

-In scenarios **where accuracy is important**, you should not rely on the detection results of small models, use `low_memory=False` to download larger models!
+For higher accuracy, prefer the full model via `detect(text, model='full')`. For robust behavior under memory pressure, use `detect(text, model='auto')`, which falls back to the lite model only on MemoryError.

 ### Prerequisites

@@ -48,42 +61,73 @@ In scenarios **where accuracy is important**, you should not rely on the detecti
 - Setting `FTLANG_CACHE` environment variable
 - Using `LangDetectConfig(cache_dir="your/path")`

+### Simple Usage (Recommended)
+
+Select the model explicitly for clear, predictable behavior, and use `k` to get multiple candidates. The function always returns a list of results:
+
+```python
+from fast_langdetect import detect
+
+# Lite model (offline, smaller, faster) — never falls back
+print(detect("Hello", model='lite', k=1))  # -> [{'lang': 'en', 'score': ...}]
+
+# Full model (downloaded to cache, higher accuracy) — never falls back
+print(detect("Hello", model='full', k=1))  # -> [{'lang': 'en', 'score': ...}]
+
+# Auto mode: try full, fall back to lite only on MemoryError
+print(detect("Hello", model='auto', k=1))  # -> [{'lang': 'en', 'score': ...}]
+
+# Multilingual: top 3 candidates (always a list)
+print(detect("Hello 世界 こんにちは", model='auto', k=3))
+```
+
+If you need a custom cache directory, pass `LangDetectConfig`:
+
+```python
+from fast_langdetect import LangDetectConfig, detect
+
+cfg = LangDetectConfig(cache_dir="/custom/cache/path")
+print(detect("Hello", model='full', config=cfg))
+
+# Set a default model via config and let calls omit model
+cfg_lite = LangDetectConfig(model="lite")
+print(detect("Hello", config=cfg_lite))                # uses lite by default
+print(detect("Bonjour", config=cfg_lite))              # uses lite by default
+print(detect("Hello", model='full', config=cfg_lite))  # per-call override to full
+```

 ### Native API (Recommended)

 ```python
-from fast_langdetect import detect, detect_multilingual, LangDetector, LangDetectConfig, DetectError
+from fast_langdetect import detect, LangDetector, LangDetectConfig

-# Simple detection
-print(detect("Hello, world!"))
-# Output: {'lang': 'en', 'score': 0.12450417876243591}
+# Simple detection (uses the config's default model if not provided; defaults to 'auto')
+print(detect("Hello, world!", k=1))
+# Output: [{'lang': 'en', 'score': 0.98}]

-# Using large model for better accuracy
-print(detect("Hello, world!", low_memory=False))
-# Output: {'lang': 'en', 'score': 0.98765432109876}
+# Using the full model for better accuracy
+print(detect("Hello, world!", model='full', k=1))
+# Output: [{'lang': 'en', 'score': 0.99}]

-# Custom configuration with fallback mechanism
-config = LangDetectConfig(
-    cache_dir="/custom/cache/path",  # Custom model cache directory
-    allow_fallback=True  # Enable fallback to small model if large model fails
-)
+# Custom configuration
+config = LangDetectConfig(cache_dir="/custom/cache/path", model="auto")  # Custom cache + default model
 detector = LangDetector(config)

-try:
-    result = detector.detect("Hello world", low_memory=False)
-    print(result)  # {'lang': 'en', 'score': 0.98}
-except DetectError as e:
-    print(f"Detection failed: {e}")
+# Omit model to use config.model; pass model to override
+result = detector.detect("Hello world", k=1)
+print(result)  # [{'lang': 'en', 'score': 0.98}]

 # Multiline text is handled automatically (newlines are replaced)
 multiline_text = "Hello, world!\nThis is a multiline text."
-print(detect(multiline_text))
-# Output: {'lang': 'en', 'score': 0.85}
+print(detect(multiline_text, k=1))
+# Output: [{'lang': 'en', 'score': 0.85}]

 # Multi-language detection
-results = detect_multilingual(
-    "Hello 世界 こんにちは",
-    low_memory=False,  # Use large model for better accuracy
-    k=3  # Return top 3 languages
+results = detect(
+    "Hello 世界 こんにちは",
+    model='auto',
+    k=3  # Return top 3 languages (auto model loading)
 )
 print(results)
 # Output: [
@@ -93,26 +137,17 @@ print(results)
 # ]
 ```

-#### Fallbacks
+#### Fallback Policy (Keep It Simple)

-We provide a fallback mechanism: when `allow_fallback=True`, if the program fails to load the **large model** (`low_memory=False`), it will fall back to the offline **small model** to complete the prediction task.
+- Only `MemoryError` triggers fallback (with `model='auto'`): when loading the full model runs out of memory, detection falls back to the lite model (see the sketch after the Errors list below).
+- I/O, network, permission, path, and integrity errors raise standard exceptions (e.g., `FileNotFoundError`, `PermissionError`) or library-specific errors where applicable — no silent fallback.
+- `model='lite'` and `model='full'` never fall back by design.

-```python
-# Disable fallback - will raise error if large model fails to load
-# But fallback disabled when custom_model_path is not None, because its a custom model, we will directly use it.
-import tempfile
-config = LangDetectConfig(
-    allow_fallback=False,
-    custom_model_path=None,
-    cache_dir=tempfile.gettempdir(),
-)
-detector = LangDetector(config)
+#### Errors

-try:
-    result = detector.detect("Hello world", low_memory=False)
-except DetectError as e:
-    print("Model loading failed and fallback is disabled")
-```
+- Base error: `FastLangdetectError` (library-specific failures).
+- Model loading failures: `ModelLoadError`.
+- Standard Python exceptions (e.g., `ValueError`, `TypeError`, `FileNotFoundError`, `MemoryError`) propagate when they are not library-specific.
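Putting the two lists together, a minimal handling sketch; it assumes the exception classes are importable from the package root (the diff names them but does not show their import path), and the failure paths are illustrative:

```python
from fast_langdetect import detect
# Assumed import location for the exceptions named in the Errors list above.
from fast_langdetect import FastLangdetectError, ModelLoadError

try:
    # model='full' never falls back, so a failed load surfaces as an exception.
    print(detect("Hello, world!", model='full', k=1))
except MemoryError:
    # Only model='auto' absorbs this; with model='full' you handle it yourself.
    print(detect("Hello, world!", model='lite', k=1))
except ModelLoadError as e:
    # Library-specific load failure (listed above as distinct from the base error).
    print(f"Model loading failed: {e}")
except FastLangdetectError as e:
    print(f"Library-specific failure: {e}")
```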

 ### Convenient `detect_language` Function

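The body of this section is unchanged and therefore elided from the diff; for context, a usage sketch, with the return format assumed from the original README (a bare upper-case language code):

```python
# Usage sketch; the exact return format is assumed, not shown in this diff.
from fast_langdetect import detect_language

print(detect_language("Hello, world!"))  # -> "EN" (assumed format)
print(detect_language("你好,世界!"))      # -> "ZH" (assumed format)
```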
@@ -134,12 +169,9 @@ print(detect_language("你好,世界!"))

 ```python
 # Load model from local file
-config = LangDetectConfig(
-    custom_model_path="/path/to/your/model.bin",  # Use local model file
-    disable_verify=True  # Skip MD5 verification
-)
+config = LangDetectConfig(custom_model_path="/path/to/your/model.bin")
 detector = LangDetector(config)
-result = detector.detect("Hello world")
+result = detector.detect("Hello world", model='auto', k=1)
 ```

 ### Splitting Text by Language 🌐
@@ -166,11 +198,14 @@ print(detector.detect("Some very long text..."))
 - When truncation happens, a WARNING is logged because it may reduce accuracy.
 - `max_input_length=80` truncates overly long inputs; set `None` to disable if you prefer no truncation.

-### Fallback Behavior
+### Cache Directory Behavior
+
+- Default cache: if `cache_dir` is not set, models are stored under a system temp-based directory specified by `FTLANG_CACHE` or an internal default. This directory is created automatically when needed.
+- User-provided `cache_dir`: if you set `LangDetectConfig(cache_dir=...)` to a path that does not exist, the library raises `FileNotFoundError` instead of silently creating or using another location. Create the directory yourself if that’s intended (see the sketch below).
+
+### Advanced Options (Optional)

-- As of the latest change, the library only falls back to the bundled small model when a MemoryError occurs while loading the large model.
-- For other errors (e.g., I/O/permission errors, corrupted files, invalid paths), the error is raised as `DetectError` so you can diagnose the root cause quickly.
-- This avoids silently masking real issues and prevents unnecessary re-downloads that can slow execution.
+The constructor exposes a few advanced knobs (`proxy`, `normalize_input`, `max_input_length`). These are rarely needed for typical usage and can be ignored. Prefer `detect(..., model=...)` unless you know you need them.

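A sketch combining the two sections above: pre-creating a user-provided cache directory and setting the advanced knobs. The parameter names come from this README; the values and the proxy semantics are illustrative assumptions:

```python
# Pre-create a custom cache directory (per the note above, the library
# does not create a user-provided cache_dir for you), then pass the
# advanced knobs. All values here are illustrative.
from pathlib import Path
from fast_langdetect import LangDetectConfig, detect

cache = Path.home() / ".cache" / "fast_langdetect"
cache.mkdir(parents=True, exist_ok=True)  # avoids the FileNotFoundError above
# (Equivalently, point the FTLANG_CACHE environment variable at this path.)

cfg = LangDetectConfig(
    cache_dir=str(cache),
    model="full",
    proxy="http://127.0.0.1:7890",  # optional download proxy (assumed semantics)
    normalize_input=True,           # normalization knob named above
    max_input_length=None,          # disable the default 80-char truncation
)
print(detect("Hello, world!", config=cfg, k=1))
```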
 ### Language Codes → English Names

@@ -209,8 +244,8 @@ def code_to_english_name(code: str) -> str:

 # Usage
 from fast_langdetect import detect
-result = detect("Olá mundo", low_memory=False)
-print(code_to_english_name(result["lang"]))  # Portuguese (Brazil) or Portuguese
+result = detect("Olá mundo", model='full', k=1)
+print(code_to_english_name(result[0]["lang"]))  # Portuguese (Brazil) or Portuguese
 ```

 Alternatively, `pycountry` can be used for ISO 639 lookups (install with `pip install pycountry`), combined with a small override dict for non-standard tags like `pt-br`, `zh-cn`, `yue`, etc.
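A sketch of that `pycountry` route; the helper name and the override entries are illustrative:

```python
# Hypothetical helper: ISO 639 lookup via pycountry plus a small override
# dict for non-standard tags. Override entries are illustrative.
import pycountry

OVERRIDES = {
    "pt-br": "Portuguese (Brazil)",
    "zh-cn": "Chinese (Simplified)",
    "yue": "Cantonese",
}

def code_to_name(code: str) -> str:
    code = code.lower()
    if code in OVERRIDES:
        return OVERRIDES[code]
    # Try two-letter (ISO 639-1), then three-letter (ISO 639-3) lookups.
    lang = pycountry.languages.get(alpha_2=code) or pycountry.languages.get(alpha_3=code)
    return lang.name if lang else code

print(code_to_name("pt-br"))  # Portuguese (Brazil)
print(code_to_name("de"))     # German
```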

examples/memory_usage_check.py

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
+#!/usr/bin/env python3
+"""
+Measure memory behavior when loading fast-langdetect models.
+
+Credit: script prepared by github@JackyHe398 (adapted for examples/).
+
+Examples
+
+# Check lite model without limiting memory
+python examples/memory_usage_check.py --model lite
+
+# Check full model with a 200 MB limit (should pass on many systems)
+python examples/memory_usage_check.py --model full --limit-mb 200
+
+# Force fallback or failure by using a tight limit
+python examples/memory_usage_check.py --model full --limit-mb 100
+
+Notes
+- RSS measurement uses ru_maxrss, which is OS-dependent (kB on Linux, bytes on macOS).
+- Address space limits rely on resource.RLIMIT_AS (primarily effective on Unix-like systems).
+- For accurate results, run this script from a clean terminal session. Running inside IDEs/REPLs can inflate the
+  process peak RSS before the script runs, making ru_maxrss appear very large with ~0 delta.
+"""
+
+import argparse
+import os
+import platform
+import resource
+import sys
+from typing import Optional
+
+try:
+    from fast_langdetect import detect
+except Exception:  # pragma: no cover
+    # Support running from repo root without installation
+    sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))
+    from fast_langdetect import detect  # type: ignore
+
+
+def set_address_space_limit(limit_mb: Optional[int]) -> None:
+    # Optional[int] (not `int | None`) keeps the script importable on Python 3.9.
+    if limit_mb is None:
+        return
+    limit_bytes = int(limit_mb) * 1024 * 1024
+    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))
+
+
+def format_ru_maxrss_mb(val: int) -> float:
+    """Convert ru_maxrss to MB based on OS semantics.
+
+    - Linux: ru_maxrss is in kilobytes
+    - macOS (Darwin): ru_maxrss is in bytes
+    - BSDs often follow macOS/bytes; treat non-Linux as bytes by default
+    """
+    system = platform.system()
+    if system == "Linux":
+        return val / 1024.0
+    # Darwin, FreeBSD, etc.: assume bytes
+    return val / (1024.0 * 1024.0)
+
+
+def current_rss_mb() -> Optional[float]:
+    """Return current RSS in MB if available; otherwise None.
+
+    Priority:
+    1) psutil (if installed)
+    2) /proc/self/status (Linux)
+    """
+    try:
+        import psutil  # type: ignore
+
+        p = psutil.Process()
+        return p.memory_info().rss / (1024.0 * 1024.0)
+    except Exception:
+        pass
+
+    if platform.system() == "Linux":
+        try:
+            with open("/proc/self/status", "r") as f:
+                for line in f:
+                    if line.startswith("VmRSS:"):
+                        parts = line.split()
+                        # Example: VmRSS:    123456 kB
+                        if len(parts) >= 2:
+                            kb = float(parts[1])
+                            return kb / 1024.0
+        except Exception:
+            pass
+    return None
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description="Check fast-langdetect memory usage and limits.")
+    parser.add_argument("--model", choices=["lite", "full", "auto"], default="auto")
+    parser.add_argument("--limit-mb", type=int, default=None, help="Set RLIMIT_AS in MB (Unix-like only)")
+    parser.add_argument("--text", default="Hello world", help="Text to detect")
+    parser.add_argument("--k", type=int, default=1, help="Top-k predictions")
+    args = parser.parse_args()
+
+    set_address_space_limit(args.limit_mb)
+
+    print(f"Model: {args.model}")
+    if args.limit_mb is not None:
+        print(f"Address space limit: {args.limit_mb} MB")
+
+    peak_before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
+    curr_before = current_rss_mb()
+    try:
+        res = detect(args.text, model=args.model, k=args.k)
+    except MemoryError:
+        print("MemoryError: model load or inference exceeded limit.")
+        return 2
+    peak_after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
+    curr_after = current_rss_mb()
+
+    peak_used_mb = max(0.0, format_ru_maxrss_mb(peak_after) - format_ru_maxrss_mb(peak_before))
+    peak_mb = format_ru_maxrss_mb(peak_after)
+
+    print(f"Result: {res}")
+    print(f"Peak RSS (ru_maxrss): ~{peak_mb:.1f} MB")
+    print(f"Approx. peak delta: ~{peak_used_mb:.1f} MB")
+    if curr_before is not None and curr_after is not None:
+        print(f"Current RSS before: ~{curr_before:.1f} MB; after: ~{curr_after:.1f} MB; delta: ~{(curr_after - curr_before):.1f} MB")
+    else:
+        print("Current RSS: psutil or /proc not available; showing peak only.")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
