Commit 76d7a5c

Chore/change lang detection logging level to avoid warning log spamming (#4078)
This PR changes the log line for defaulting short text to English from warning to debug level.
- This log is not emitted because the logic failed or an exception was handled.
- Short text can be common, so the original code can produce many warning logs. That spams the warning log and can cause users to miss other important warning-level logs.
1 parent 0d20f6a commit 76d7a5c

3 files changed: 13 additions and 3 deletions

CHANGELOG.md

Lines changed: 11 additions & 1 deletion
@@ -1,3 +1,13 @@
+## 0.18.14-dev0
+
+### Enhancements
+
+### Features
+
+### Fixes
+
+- **change short text language detection log to debug** reduce warning level log spamming
+
 ## 0.18.13
 
 ### Enhancements
@@ -6,7 +16,7 @@
 
 ### Fixes
 
-- **Parse a wider variety of date formats in email headers** The `partition_email` function is now more robust to non-standard date formats, including ISO-8601 dates with "Z" suffixes. This prevents `ValueError` exceptions when partitioning emails with these date formats.
+- **Parse a wider variety of date formats in email headers** The `partition_email` function is now more robust to non-standard date formats, including ISO-8601 dates with "Z" suffixes. This prevents `ValueError` exceptions when partitioning emails with these date formats.
 
 ## 0.18.12
 

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.18.13" # pragma: no cover
+__version__ = "0.18.14-dev0" # pragma: no cover

unstructured/partition/common/lang.py

Lines changed: 1 addition & 1 deletion
@@ -403,7 +403,7 @@ def detect_languages(
     # If text contains special characters (like ñ, å, or Korean/Mandarin/etc.) it will NOT default
     # to English. It will default to English if text is only ascii characters and is short.
     if re.match(r"^[\x00-\x7F]+$", text) and len(text.split()) < 5:
-        logger.warning(f'short text: "{text}". Defaulting to English.')
+        logger.debug(f'short text: "{text}". Defaulting to English.')
         return ["eng"]
 
     # set seed for deterministic langdetect outputs
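
The effect of this change can be sketched in isolation. The snippet below is a minimal illustration, not the library's code: the logger name `lang_sketch`, the `ListHandler` class, and the `defaults_to_english` helper are hypothetical names introduced here. Only the regular expression, the word-count threshold, and the log message are taken from the diff above. It shows that the short-text message is dropped when the logger is set to WARNING and appears only when the level is lowered to DEBUG.

```python
import logging
import re

# Hypothetical logger name for this sketch; not the library's actual logger.
logger = logging.getLogger("lang_sketch")


class ListHandler(logging.Handler):
    """Collect log messages in memory so the effect of the level is visible."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def defaults_to_english(text: str) -> bool:
    """Hypothetical helper mirroring the check in the diff.

    Returns True when the text is ASCII-only and has fewer than 5
    whitespace-separated words, logging the decision at debug level.
    """
    if re.match(r"^[\x00-\x7F]+$", text) and len(text.split()) < 5:
        logger.debug(f'short text: "{text}". Defaulting to English.')
        return True
    return False


handler = ListHandler()
logger.addHandler(handler)
logger.propagate = False  # keep the sketch's output out of the root logger

# At WARNING level the debug message is filtered out.
logger.setLevel(logging.WARNING)
defaults_to_english("hello world")
at_warning = list(handler.messages)

# At DEBUG level the same call records the message.
logger.setLevel(logging.DEBUG)
defaults_to_english("hello world")
at_debug = list(handler.messages)
```

Under these assumptions, a user who still wants to see the short-text message would need to lower the relevant logger's level to DEBUG. With the default WARNING threshold it no longer appears.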
