Commit cc635c9
⚡️ Speed up function under_non_alpha_ratio by 76% (#4079)
### 📄 76% (0.76x) speedup for ***`under_non_alpha_ratio` in `unstructured/partition/text_type.py`***

⏱️ Runtime : **`9.53 milliseconds`** **→** **`5.41 milliseconds`** (best of `91` runs)

### 📝 Explanation and details

Here's an optimized version of your function. Major improvements:

- Only **one pass** through the text string instead of two list comprehensions (saves a ton of memory and CPU).
- No lists are constructed, only simple integer counters.
- `char.strip()` is only used to check for non-space; you can check explicitly for that.

Here's the optimized code with all original comments retained.

This approach processes the string only **once** and uses **O(1) memory** (just two ints). The use of `char.isspace()` is a fast way to check for all Unicode whitespace, just as before. This will significantly speed up your function and eliminate almost all time spent in the original two list comprehensions.

✅ **Correctness verification report:**

| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **21 Passed** |
| 🌀 Generated Regression Tests | ✅ **80 Passed** |
| ⏪ Replay Tests | ✅ **594 Passed** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | 100.0% |

<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:------------------------------------------------------------------------------------------------|:--------------|:---------------|:----------|
| `partition/test_text_type.py::test_under_non_alpha_ratio_zero_divide` | 1.14μs | 991ns | ✅15.1% |
| `test_tracer_py__replay_test_0.py::test_unstructured_partition_text_type_under_non_alpha_ratio` | 820μs | 412μs | ✅98.8% |
| `test_tracer_py__replay_test_2.py::test_unstructured_partition_text_type_under_non_alpha_ratio` | 5.62ms | 3.21ms | ✅75.3% |
| `test_tracer_py__replay_test_3.py::test_unstructured_partition_text_type_under_non_alpha_ratio` | 1.95ms | 1.09ms | ✅79.2% |
</details>

<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>

```python
from __future__ import annotations

# imports
import pytest  # used for our unit tests
from unstructured.partition.text_type import under_non_alpha_ratio

# unit tests

# -------------------------
# BASIC TEST CASES
# -------------------------

def test_all_alpha_below_threshold():
    # All alphabetic, so ratio is 1.0, which is not < threshold (default 0.5)
    codeflash_output = not under_non_alpha_ratio("HelloWorld")  # 2.48μs -> 1.63μs (52.1% faster)

def test_all_alpha_above_threshold():
    # All alphabetic, but threshold is 1.1, so ratio is < threshold
    codeflash_output = under_non_alpha_ratio("HelloWorld", threshold=1.1)  # 2.19μs -> 1.47μs (49.1% faster)

def test_all_non_alpha():
    # All non-alpha, so ratio is 0, which is < threshold
    codeflash_output = under_non_alpha_ratio("1234567890!@#$%^&*()_+-=[]{}|;':,.<>/?", threshold=0.5)  # 3.38μs -> 2.00μs (68.8% faster)

def test_mixed_alpha_non_alpha_below_threshold():
    # 4 alpha, 6 non-alpha (excluding spaces): ratio = 4/10 = 0.4 < 0.5
    codeflash_output = under_non_alpha_ratio("a1b2c3d4!!", threshold=0.5)  # 2.16μs -> 1.44μs (50.7% faster)

def test_mixed_alpha_non_alpha_above_threshold():
    # 6 alpha, 2 non-alpha: ratio = 6/8 = 0.75 > 0.5, so not under threshold
    codeflash_output = not under_non_alpha_ratio("abCD12ef", threshold=0.5)  # 2.04μs -> 1.43μs (42.9% faster)

def test_spaces_are_ignored():
    # Only 'a', 'b', 'c', '1', '2', '3' are counted (spaces ignored)
    # 3 alpha, 3 non-alpha: ratio = 3/6 = 0.5, not < threshold
    codeflash_output = not under_non_alpha_ratio("a b c 1 2 3", threshold=0.5)  # 2.16μs -> 1.39μs (55.3% faster)
    # If threshold is 0.6, ratio 0.5 < 0.6, so True
    codeflash_output = under_non_alpha_ratio("a b c 1 2 3", threshold=0.6)  # 1.25μs -> 705ns (77.7% faster)

def test_threshold_edge_case_exact():
    # 2 alpha, 2 non-alpha: ratio = 2/4 = 0.5, not < threshold
    codeflash_output = not under_non_alpha_ratio("a1b2", threshold=0.5)  # 1.76μs -> 1.26μs (39.2% faster)
    # If threshold is 0.51, ratio 0.5 < 0.51, so True
    codeflash_output = under_non_alpha_ratio("a1b2", threshold=0.51)  # 781ns -> 541ns (44.4% faster)

# -------------------------
# EDGE TEST CASES
# -------------------------

def test_empty_string():
    # Empty string should always return False
    codeflash_output = under_non_alpha_ratio("")  # 450ns -> 379ns (18.7% faster)

def test_only_spaces():
    # Only spaces, so total_count == 0, should return False
    codeflash_output = under_non_alpha_ratio(" ")  # 1.24μs -> 822ns (50.9% faster)

def test_only_newlines_and_tabs():
    # Only whitespace, so total_count == 0, should return False
    codeflash_output = under_non_alpha_ratio("\n\t \t")  # 1.16μs -> 745ns (55.7% faster)

def test_only_one_alpha():
    # Single alpha, total_count == 1, ratio == 1.0
    codeflash_output = not under_non_alpha_ratio("A")  # 1.43μs -> 1.06μs (35.5% faster)
    codeflash_output = under_non_alpha_ratio("A", threshold=1.1)  # 689ns -> 589ns (17.0% faster)

def test_only_one_non_alpha():
    # Single non-alpha, total_count == 1, ratio == 0.0
    codeflash_output = under_non_alpha_ratio("1")  # 1.27μs -> 937ns (35.4% faster)
    codeflash_output = under_non_alpha_ratio("!", threshold=0.1)  # 775ns -> 550ns (40.9% faster)

def test_unicode_alpha_and_non_alpha():
    # Unicode alpha: 'é', 'ü', 'ß' are isalpha()
    # Unicode non-alpha: '1', '!', '。'
    # 3 alpha, 3 non-alpha, ratio = 0.5
    codeflash_output = not under_non_alpha_ratio("éüß1!。", threshold=0.5)  # 3.03μs -> 2.21μs (37.3% faster)
    codeflash_output = under_non_alpha_ratio("éüß1!。", threshold=0.6)  # 1.11μs -> 789ns (40.2% faster)

def test_mixed_with_whitespace():
    # Alpha: a, b, c; Non-alpha: 1, 2, 3; Spaces ignored
    codeflash_output = under_non_alpha_ratio(" a 1 b 2 c 3 ", threshold=0.6)  # 2.46μs -> 1.60μs (54.0% faster)

def test_threshold_zero():
    # No ratio is < 0, so the result is always False, even for all non-alpha input
    codeflash_output = not under_non_alpha_ratio("abc123", threshold=0.0)  # 2.19μs -> 1.49μs (46.5% faster)
    # All non-alpha: ratio = 0, not < 0, so False
    codeflash_output = not under_non_alpha_ratio("123", threshold=0.0)  # 859ns -> 594ns (44.6% faster)

def test_threshold_one():
    # Any ratio < 1.0 should return True if not all alpha
    codeflash_output = under_non_alpha_ratio("abc123", threshold=1.0)  # 1.94μs -> 1.24μs (56.4% faster)
    # All alpha: ratio = 1.0, not < 1.0, so False
    codeflash_output = not under_non_alpha_ratio("abcdef", threshold=1.0)  # 1.17μs -> 602ns (93.7% faster)

def test_leading_trailing_whitespace():
    # Whitespace should be ignored
    codeflash_output = under_non_alpha_ratio(" a1b2c3 ", threshold=0.6)  # 2.32μs -> 1.44μs (61.3% faster)

def test_only_symbols():
    # Only symbols, ratio = 0, so < threshold
    codeflash_output = under_non_alpha_ratio("!@#$%^&*", threshold=0.5)  # 1.89μs -> 1.15μs (65.0% faster)

def test_long_string_all_spaces_and_newlines():
    # All whitespace, should return False
    codeflash_output = under_non_alpha_ratio(" \n " * 100)  # 11.4μs -> 3.65μs (213% faster)

def test_single_space():
    # Single space, should return False
    codeflash_output = under_non_alpha_ratio(" ")  # 1.04μs -> 741ns (39.8% faster)

def test_non_ascii_non_alpha():
    # Non-ASCII, non-alpha (emoji)
    codeflash_output = under_non_alpha_ratio("😀😀😀", threshold=0.5)  # 2.42μs -> 1.79μs (34.9% faster)

def test_mixed_emojis_and_alpha():
    # 2 alpha, 2 emoji: ratio = 2/4 = 0.5
    codeflash_output = not under_non_alpha_ratio("a😀b😀", threshold=0.5)  # 2.24μs -> 1.50μs (49.3% faster)
    codeflash_output = under_non_alpha_ratio("a😀b😀", threshold=0.6)  # 948ns -> 658ns (44.1% faster)

# -------------------------
# LARGE SCALE TEST CASES
# -------------------------

def test_large_all_alpha():
    # 1000 alpha, ratio = 1.0
    s = "a" * 1000
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.5)  # 47.9μs -> 33.4μs (43.4% faster)

def test_large_all_non_alpha():
    # 1000 non-alpha, ratio = 0.0
    s = "1" * 1000
    codeflash_output = under_non_alpha_ratio(s, threshold=0.5)  # 40.9μs -> 24.9μs (64.3% faster)

def test_large_mixed_half_and_half():
    # 500 alpha, 500 non-alpha, ratio = 0.5
    s = "a" * 500 + "1" * 500
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.5)  # 45.2μs -> 28.4μs (59.0% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.6)  # 43.3μs -> 27.8μs (55.9% faster)

def test_large_with_spaces_ignored():
    # 400 alpha, 400 non-alpha, 200 spaces (should be ignored)
    s = "a" * 400 + "1" * 400 + " " * 200
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.5)  # 43.5μs -> 24.5μs (77.4% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.6)  # 41.8μs -> 23.8μs (75.6% faster)

def test_large_unicode_mixed():
    # 300 unicode alpha, 300 unicode non-alpha, 400 ascii alpha
    s = "é" * 300 + "😀" * 300 + "a" * 400
    # alpha: 300 (é) + 400 (a) = 700, non-alpha: 300 (😀), total = 1000
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.8)  # 65.3μs -> 36.8μs (77.6% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.71)  # 59.0μs -> 34.8μs (69.8% faster)
    # ratio = 700/1000 = 0.7

def test_large_threshold_zero_one():
    # All alpha, threshold=0.0, should be False
    s = "b" * 999
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.0)  # 46.2μs -> 33.1μs (39.4% faster)
    # All non-alpha, threshold=1.0, should be True
    s = "!" * 999
    codeflash_output = under_non_alpha_ratio(s, threshold=1.0)  # 40.4μs -> 24.3μs (66.0% faster)

def test_large_string_with_whitespace_only():
    # 1000 spaces, should return False
    s = " " * 1000
    codeflash_output = under_non_alpha_ratio(s)  # 33.7μs -> 10.0μs (236% faster)

def test_large_string_with_mixed_whitespace_and_chars():
    # 333 alpha, 333 non-alpha, 334 whitespace (ignored)
    s = "a" * 333 + "1" * 333 + " " * 334
    # total_count = 666, alpha = 333, ratio = 0.5
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.5)  # 41.2μs -> 21.8μs (89.4% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.51)  # 40.1μs -> 21.0μs (90.8% faster)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from __future__ import annotations

# imports
import pytest  # used for our unit tests
from unstructured.partition.text_type import under_non_alpha_ratio

# unit tests

# --- Basic Test Cases ---

def test_all_alpha_default_threshold():
    # All alphabetic, should be False (ratio = 1.0, not under 0.5)
    codeflash_output = under_non_alpha_ratio("HelloWorld")  # 3.09μs -> 2.04μs (51.3% faster)

def test_all_non_alpha_default_threshold():
    # All non-alpha (punctuation), should be True (ratio = 0.0)
    codeflash_output = under_non_alpha_ratio("!!!???---")  # 2.02μs -> 1.24μs (63.3% faster)

def test_mixed_alpha_non_alpha_default_threshold():
    # 5 alpha, 5 non-alpha, ratio = 0.5, should be False (not under threshold)
    codeflash_output = under_non_alpha_ratio("abc12!@#de")  # 2.20μs -> 1.37μs (61.1% faster)

def test_mixed_alpha_non_alpha_just_under_threshold():
    # 2 alpha, 3 non-alpha, ratio = 0.4, should be True (under threshold)
    codeflash_output = under_non_alpha_ratio("a1!b2")  # 1.67μs -> 1.19μs (40.9% faster)

def test_spaces_are_ignored():
    # Spaces should not count toward total_count
    # 3 alpha, 2 non-alpha, 2 spaces; ratio = 3/5 = 0.6, should be False
    codeflash_output = under_non_alpha_ratio("a b! c?")  # 1.88μs -> 1.19μs (57.4% faster)

def test_threshold_parameter():
    # 2 alpha, 3 non-alpha, total=5, ratio=0.4
    # threshold=0.3 -> False (not under), threshold=0.5 -> True (under)
    codeflash_output = under_non_alpha_ratio("a1!b2", threshold=0.3)  # 1.71μs -> 1.29μs (32.4% faster)
    codeflash_output = under_non_alpha_ratio("a1!b2", threshold=0.5)  # 900ns -> 536ns (67.9% faster)

# --- Edge Test Cases ---

def test_empty_string():
    # Empty string should return False
    codeflash_output = under_non_alpha_ratio("")  # 425ns -> 367ns (15.8% faster)

def test_only_spaces():
    # Only spaces, total_count=0, should return False
    codeflash_output = under_non_alpha_ratio(" ")  # 1.16μs -> 776ns (49.9% faster)

def test_only_alpha_with_spaces():
    # Only alpha and spaces, ratio=1.0, should return False
    codeflash_output = under_non_alpha_ratio("a b c d e")  # 1.79μs -> 1.24μs (44.3% faster)

def test_only_non_alpha_with_spaces():
    # Only non-alpha and spaces, ratio=0.0, should return True
    codeflash_output = under_non_alpha_ratio("! @ # $ %")  # 1.73μs -> 1.08μs (59.6% faster)

def test_single_alpha():
    # Single alpha, ratio=1.0, should return False
    codeflash_output = under_non_alpha_ratio("A")  # 1.38μs -> 1.03μs (33.9% faster)

def test_single_non_alpha():
    # Single non-alpha, ratio=0.0, should return True
    codeflash_output = under_non_alpha_ratio("?")  # 1.23μs -> 978ns (26.1% faster)

def test_single_space():
    # Single space, total_count=0, should return False
    codeflash_output = under_non_alpha_ratio(" ")  # 1.00μs -> 729ns (37.7% faster)

def test_all_digits():
    # All digits, ratio=0.0, should return True
    codeflash_output = under_non_alpha_ratio("1234567890")  # 2.00μs -> 1.21μs (65.4% faster)

def test_unicode_alpha():
    # Unicode alphabetic characters (e.g. accented letters)
    # 3 alpha, 2 non-alpha, ratio=0.6, should be False
    codeflash_output = under_non_alpha_ratio("éàü!!")  # 2.19μs -> 1.58μs (39.3% faster)

def test_unicode_non_alpha():
    # Unicode non-alpha (emoji, symbols)
    # 2 non-alpha, 2 alpha, ratio=0.5, should be False
    codeflash_output = under_non_alpha_ratio("a😀b!")  # 2.46μs -> 1.69μs (45.9% faster)

def test_threshold_1_0():
    # threshold=1.0, any string with <100% alpha should return True
    # 2 alpha, 2 non-alpha, ratio=0.5 < 1.0
    codeflash_output = under_non_alpha_ratio("a1b2", threshold=1.0)  # 1.73μs -> 1.36μs (27.4% faster)

def test_threshold_0_0():
    # threshold=0.0, no string returns True because no ratio is < 0.0
    # 2 alpha, 2 non-alpha, ratio=0.5 > 0.0
    codeflash_output = under_non_alpha_ratio("a1b2", threshold=0.0)  # 1.73μs -> 1.29μs (33.7% faster)
    # All non-alpha, ratio=0.0 == 0.0, should be False (not under threshold)
    codeflash_output = under_non_alpha_ratio("1234", threshold=0.0)  # 883ns -> 653ns (35.2% faster)

def test_threshold_exactly_equal():
    # Ratio equals threshold: should return False (not under threshold)
    # 2 alpha, 2 non-alpha, ratio=0.5 == threshold
    codeflash_output = under_non_alpha_ratio("a1b2", threshold=0.5)  # 1.60μs -> 1.20μs (33.1% faster)

def test_tabs_and_newlines_ignored():
    # Tabs and newlines are whitespace, so ignored
    # 2 alpha, 1 non-alpha, 2 whitespace, ratio=2/3 ~ 0.667, should be False
    codeflash_output = under_non_alpha_ratio("a\tb\n!")  # 1.80μs -> 1.16μs (55.6% faster)

def test_long_repeated_pattern():
    # 500 alpha, 500 non-alpha, ratio=0.5, should be False
    s = "a1" * 500
    codeflash_output = under_non_alpha_ratio(s)  # 45.3μs -> 29.1μs (55.5% faster)
    # 499 alpha, 501 non-alpha, ratio=499/1000=0.499, should be True
    s2 = "a1" * 499 + "1!"
    codeflash_output = under_non_alpha_ratio(s2)  # 43.8μs -> 28.0μs (56.3% faster)

# --- Large Scale Test Cases ---

def test_large_all_alpha():
    # 1000 alphabetic characters, ratio=1.0, should be False
    s = "a" * 1000
    codeflash_output = under_non_alpha_ratio(s)  # 46.5μs -> 32.9μs (41.2% faster)

def test_large_all_non_alpha():
    # 1000 non-alpha characters, ratio=0.0, should be True
    s = "!" * 1000
    codeflash_output = under_non_alpha_ratio(s)  # 41.5μs -> 24.6μs (68.4% faster)

def test_large_half_alpha_half_non_alpha():
    # 500 alpha, 500 non-alpha, ratio=0.5, should be False
    s = ("a!" * 500)
    codeflash_output = under_non_alpha_ratio(s)  # 44.6μs -> 28.6μs (56.1% faster)

def test_large_sparse_alpha():
    # 10 alpha, 990 non-alpha, ratio=0.01, should be True
    s = "a" + "!" * 99
    s = s * 10  # 10 alpha, 990 non-alpha
    codeflash_output = under_non_alpha_ratio(s)  # 41.4μs -> 25.0μs (65.6% faster)

def test_large_sparse_non_alpha():
    # 990 alpha, 10 non-alpha, ratio=0.99, should be False
    s = "a" * 99 + "!"  # 99 alpha, 1 non-alpha
    s = s * 10  # 990 alpha, 10 non-alpha
    codeflash_output = under_non_alpha_ratio(s)  # 46.2μs -> 33.0μs (39.9% faster)

def test_large_with_spaces():
    # 500 alpha, 500 non-alpha, 100 spaces (should be ignored)
    s = ("a!" * 500) + (" " * 100)
    codeflash_output = under_non_alpha_ratio(s)  # 47.7μs -> 29.7μs (60.6% faster)

def test_large_thresholds():
    # 600 alpha, 400 non-alpha, ratio=0.6
    s = "a" * 600 + "!" * 400
    codeflash_output = under_non_alpha_ratio(s, threshold=0.5)  # 44.5μs -> 29.4μs (51.1% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.7)  # 42.6μs -> 28.5μs (49.4% faster)

# --- Additional Robustness Tests ---

def test_mixed_case_and_symbols():
    # Mixed uppercase, lowercase, digits, symbols
    # 3 alpha, 3 non-alpha, ratio=0.5, should be False
    codeflash_output = under_non_alpha_ratio("A1b2C!")  # 1.79μs -> 1.17μs (53.7% faster)

def test_realistic_sentence():
    # Realistic sentence, mostly alpha, some punctuation
    # 24 alpha, 2 non-alpha (comma, period), ratio=24/26 ~ 0.923, should be False
    codeflash_output = under_non_alpha_ratio("Hello, this is a test sentence.")  # 3.39μs -> 1.94μs (74.3% faster)

def test_realistic_break_line():
    # Typical break line, mostly non-alpha
    # 5 alpha, 8 non-alpha, ratio=5/13 ~ 0.385, should be True
    codeflash_output = under_non_alpha_ratio("----BREAK----")  # 2.08μs -> 1.32μs (57.9% faster)

def test_space_heavy_string():
    # Spaces should be ignored, only non-space chars count
    # 2 alpha, 2 non-alpha, 10 spaces, ratio=2/4=0.5, should be False
    codeflash_output = under_non_alpha_ratio(" a ! b ? ")  # 2.25μs -> 1.29μs (74.3% faster)

def test_only_whitespace_variety():
    # Only tabs, spaces, newlines, should return False
    codeflash_output = under_non_alpha_ratio(" \t\n\r")  # 1.08μs -> 714ns (51.4% faster)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

</details>

To edit these changes `git checkout codeflash/optimize-under_non_alpha_ratio-mcgm6dor` and push.

[![Codeflash](https://img.shields.io/badge/Optimized%20with-Codeflash-yellow?style=flat&color=%23ffc428&logo=)](https://codeflash.ai)

---------

Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
1 parent 76d7a5c commit cc635c9

2 files changed: +12 -3 lines changed

CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -2,12 +2,15 @@
 
 ### Enhancements
 
+- Speed up function `under_non_alpha_ratio` by 76% (codeflash)
+
 ### Features
 
 ### Fixes
 
 - **change short text language detection log to debug** reduce warning level log spamming
 
+
 ## 0.18.13
 
 ### Enhancements

unstructured/partition/text_type.py

Lines changed: 9 additions & 3 deletions
@@ -245,11 +245,17 @@ def under_non_alpha_ratio(text: str, threshold: float = 0.5):
         If the proportion of non-alpha characters exceeds this threshold, the function
         returns False
     """
-    if len(text) == 0:
+    if not text:
         return False
 
-    alpha_count = len([char for char in text if char.strip() and char.isalpha()])
-    total_count = len([char for char in text if char.strip()])
+    alpha_count = 0
+    total_count = 0
+    for char in text:
+        if not char.isspace():
+            total_count += 1
+            if char.isalpha():
+                alpha_count += 1
+
     return ((alpha_count / total_count) < threshold) if total_count > 0 else False
 
 

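The change above swaps two list comprehensions for a single counting loop, so the main thing to confirm is that both versions return the same result. The following is a minimal, standalone sketch of such a check. The two function bodies are copied from the removed and added lines of the diff; the names `under_non_alpha_ratio_original`, `under_non_alpha_ratio_optimized`, `implementations_agree`, `time_both`, and `SAMPLES` are hypothetical helpers introduced only for this illustration and are not part of `unstructured`.

```python
import timeit


def under_non_alpha_ratio_original(text: str, threshold: float = 0.5) -> bool:
    """Pre-change logic from the diff: two list comprehensions over the text."""
    if len(text) == 0:
        return False
    alpha_count = len([char for char in text if char.strip() and char.isalpha()])
    total_count = len([char for char in text if char.strip()])
    return ((alpha_count / total_count) < threshold) if total_count > 0 else False


def under_non_alpha_ratio_optimized(text: str, threshold: float = 0.5) -> bool:
    """Post-change logic from the diff: one pass with two integer counters."""
    if not text:
        return False
    alpha_count = 0
    total_count = 0
    for char in text:
        if not char.isspace():  # whitespace is excluded from both counts
            total_count += 1
            if char.isalpha():
                alpha_count += 1
    return ((alpha_count / total_count) < threshold) if total_count > 0 else False


# Hypothetical sample inputs: empty, whitespace-only, ASCII, Unicode, emoji,
# and non-ASCII whitespace (no-break space and em space).
SAMPLES = ["", "   ", "HelloWorld", "a1b2", "----BREAK----", "a\tb\n!",
           "éüß1!。", "a😀b😀", "\u00a0\u2003x"]


def implementations_agree(samples, thresholds=(0.0, 0.5, 0.51, 1.0)) -> bool:
    """Return True if both versions give the same answer for every pair."""
    return all(
        under_non_alpha_ratio_original(s, t) == under_non_alpha_ratio_optimized(s, t)
        for s in samples
        for t in thresholds
    )


def time_both(text: str, number: int = 200):
    """Time both versions on one input; returns (original_s, optimized_s)."""
    t_old = timeit.timeit(lambda: under_non_alpha_ratio_original(text), number=number)
    t_new = timeit.timeit(lambda: under_non_alpha_ratio_optimized(text), number=number)
    return t_old, t_new


if __name__ == "__main__":
    print("agree:", implementations_agree(SAMPLES))
    print("timings (s):", time_both("a1 !" * 1000))
```

Timings from `time_both` depend on the machine and Python version, so they may not match the figures reported in the commit message. The agreement check relies on `char.strip()` being falsy for exactly the single characters for which `char.isspace()` is true.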