Commit cc635c9
⚡️ Speed up function `under_non_alpha_ratio` by 76% (#4079)
### 📄 76% (0.76x) speedup for ***`under_non_alpha_ratio` in
`unstructured/partition/text_type.py`***
⏱️ Runtime : **`9.53 milliseconds`** **→** **`5.41 milliseconds`** (best
of `91` runs)
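The headline percentage can be reproduced from the two runtimes. The sketch below assumes the speedup is defined as `(original - optimized) / optimized`, which is consistent with the reported figures; the helper name `speedup_percent` is hypothetical and not part of the tooling.

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Relative speedup in percent (assumed definition, see lead-in)."""
    return (original - optimized) / optimized * 100.0


# Runtimes in milliseconds, taken from the report above.
overall = speedup_percent(9.53, 5.41)
print(f"{overall:.1f}% faster")  # roughly 76%
```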
### 📝 Explanation and details
Here's an optimized version of your function.
Major improvements:
- Only **one pass** through the text string instead of two list
comprehensions (saves memory and CPU time).
- No lists are constructed, only simple integer counters.
- `char.strip()` was only used to test whether a character is
non-whitespace; `char.isspace()` checks that explicitly.
Here's the optimized code with all original comments retained.
This approach processes the string only **once** and uses **O(1)
memory** (just two ints). The use of `char.isspace()` is a fast way to
check for all Unicode whitespace, just as before. This will
significantly speed up your function and eliminate almost all time spent
in the original two list comprehensions.
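The code itself is not reproduced in this text, so the following is only a minimal sketch of the change as described above, not the committed implementation. Both function names are hypothetical, and the two-pass version is a reconstruction that assumes the original used two list comprehensions with `char.strip()`.

```python
def under_non_alpha_ratio_two_pass(text: str, threshold: float = 0.5) -> bool:
    """Assumed shape of the original: two list comprehensions over the text."""
    if len(text) == 0:
        return False
    # First pass: non-whitespace alphabetic characters.
    alpha_count = len([char for char in text if char.strip() and char.isalpha()])
    # Second pass: all non-whitespace characters.
    total_count = len([char for char in text if char.strip()])
    if total_count == 0:
        return False
    return alpha_count / total_count < threshold


def under_non_alpha_ratio_single_pass(text: str, threshold: float = 0.5) -> bool:
    """Sketch of the optimized idea: one loop, two integer counters."""
    if not text:
        return False
    alpha_count = 0
    total_count = 0
    for char in text:
        if char.isspace():
            continue  # whitespace does not count toward the total
        total_count += 1
        if char.isalpha():
            alpha_count += 1
    if total_count == 0:
        return False  # whitespace-only input: avoid dividing by zero
    return alpha_count / total_count < threshold


# Spot-check that both versions agree on a few inputs from the tests below.
samples = ["HelloWorld", "a1b2c3d4!!", "a b c 1 2 3", "", "     ", "----BREAK----"]
agree = all(
    under_non_alpha_ratio_two_pass(s) == under_non_alpha_ratio_single_pass(s)
    for s in samples
)
```

The single-pass version trades two temporary lists for two counters, which is where the O(1) memory claim comes from.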
✅ **Correctness verification report:**
| Test                        | Status            |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **21 Passed** |
| 🌀 Generated Regression Tests | ✅ **80 Passed** |
| ⏪ Replay Tests | ✅ **594 Passed** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
|📊 Tests Coverage       | 100.0% |
<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:------------------------------------------------------------------------------------------------|:--------------|:---------------|:----------|
| `partition/test_text_type.py::test_under_non_alpha_ratio_zero_divide` | 1.14μs | 991ns | ✅15.1% |
| `test_tracer_py__replay_test_0.py::test_unstructured_partition_text_type_under_non_alpha_ratio` | 820μs | 412μs | ✅98.8% |
| `test_tracer_py__replay_test_2.py::test_unstructured_partition_text_type_under_non_alpha_ratio` | 5.62ms | 3.21ms | ✅75.3% |
| `test_tracer_py__replay_test_3.py::test_unstructured_partition_text_type_under_non_alpha_ratio` | 1.95ms | 1.09ms | ✅79.2% |
</details>
<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>
```python
from __future__ import annotations
# imports
import pytest  # used for our unit tests
from unstructured.partition.text_type import under_non_alpha_ratio
# unit tests
# -------------------------
# BASIC TEST CASES
# -------------------------
def test_all_alpha_below_threshold():
    # All alphabetic, so ratio is 1.0, which is not < threshold (default 0.5)
    codeflash_output = not under_non_alpha_ratio("HelloWorld") # 2.48μs -> 1.63μs (52.1% faster)
def test_all_alpha_above_threshold():
    # All alphabetic, but threshold is 1.1, so ratio is < threshold
    codeflash_output = under_non_alpha_ratio("HelloWorld", threshold=1.1) # 2.19μs -> 1.47μs (49.1% faster)
def test_all_non_alpha():
    # All non-alpha, so ratio is 0, which is < threshold
    codeflash_output = under_non_alpha_ratio("1234567890!@#$%^&*()_+-=[]{}|;':,.<>/?", threshold=0.5) # 3.38μs -> 2.00μs (68.8% faster)
def test_mixed_alpha_non_alpha_below_threshold():
    # 4 alpha, 6 non-alpha (excluding spaces): ratio = 4/10 = 0.4 < 0.5
    codeflash_output = under_non_alpha_ratio("a1b2c3d4!!", threshold=0.5) # 2.16μs -> 1.44μs (50.7% faster)
def test_mixed_alpha_non_alpha_above_threshold():
    # 6 alpha, 2 non-alpha: ratio = 6/8 = 0.75 > 0.5, so not under threshold
    codeflash_output = not under_non_alpha_ratio("abCD12ef", threshold=0.5) # 2.04μs -> 1.43μs (42.9% faster)
def test_spaces_are_ignored():
    # Only 'a', 'b', 'c', '1', '2', '3' are counted (spaces ignored)
    # 3 alpha, 3 non-alpha: ratio = 3/6 = 0.5, not < threshold
    codeflash_output = not under_non_alpha_ratio("a b c 1 2 3", threshold=0.5) # 2.16μs -> 1.39μs (55.3% faster)
    # If threshold is 0.6, ratio 0.5 < 0.6, so True
    codeflash_output = under_non_alpha_ratio("a b c 1 2 3", threshold=0.6) # 1.25μs -> 705ns (77.7% faster)
def test_threshold_edge_case_exact():
    # 2 alpha, 2 non-alpha: ratio = 2/4 = 0.5, not < threshold
    codeflash_output = not under_non_alpha_ratio("a1b2", threshold=0.5) # 1.76μs -> 1.26μs (39.2% faster)
    # If threshold is 0.51, ratio 0.5 < 0.51, so True
    codeflash_output = under_non_alpha_ratio("a1b2", threshold=0.51) # 781ns -> 541ns (44.4% faster)
# -------------------------
# EDGE TEST CASES
# -------------------------
def test_empty_string():
    # Empty string should always return False
    codeflash_output = under_non_alpha_ratio("") # 450ns -> 379ns (18.7% faster)
def test_only_spaces():
    # Only spaces, so total_count == 0, should return False
    codeflash_output = under_non_alpha_ratio("     ") # 1.24μs -> 822ns (50.9% faster)
def test_only_newlines_and_tabs():
    # Only whitespace, so total_count == 0, should return False
    codeflash_output = under_non_alpha_ratio("\n\t  \t") # 1.16μs -> 745ns (55.7% faster)
def test_only_one_alpha():
    # Single alpha, total_count == 1, ratio == 1.0
    codeflash_output = not under_non_alpha_ratio("A") # 1.43μs -> 1.06μs (35.5% faster)
    codeflash_output = under_non_alpha_ratio("A", threshold=1.1) # 689ns -> 589ns (17.0% faster)
def test_only_one_non_alpha():
    # Single non-alpha, total_count == 1, ratio == 0.0
    codeflash_output = under_non_alpha_ratio("1") # 1.27μs -> 937ns (35.4% faster)
    codeflash_output = under_non_alpha_ratio("!", threshold=0.1) # 775ns -> 550ns (40.9% faster)
def test_unicode_alpha_and_non_alpha():
    # Unicode alpha: 'é', 'ü', 'ß' are isalpha()
    # Unicode non-alpha: '1', '!', '。'
    # 3 alpha, 3 non-alpha, ratio = 0.5
    codeflash_output = not under_non_alpha_ratio("éüß1!。", threshold=0.5) # 3.03μs -> 2.21μs (37.3% faster)
    codeflash_output = under_non_alpha_ratio("éüß1!。", threshold=0.6) # 1.11μs -> 789ns (40.2% faster)
def test_mixed_with_whitespace():
    # Alpha: a, b, c; Non-alpha: 1, 2, 3; Spaces ignored
    codeflash_output = under_non_alpha_ratio(" a 1 b 2 c 3 ", threshold=0.6) # 2.46μs -> 1.60μs (54.0% faster)
def test_threshold_zero():
    # No ratio is < 0, so the result is always False at threshold=0.0
    codeflash_output = not under_non_alpha_ratio("abc123", threshold=0.0) # 2.19μs -> 1.49μs (46.5% faster)
    # All non-alpha: ratio = 0, not < 0, so False
    codeflash_output = not under_non_alpha_ratio("123", threshold=0.0) # 859ns -> 594ns (44.6% faster)
def test_threshold_one():
    # Any ratio < 1.0 should return True if not all alpha
    codeflash_output = under_non_alpha_ratio("abc123", threshold=1.0) # 1.94μs -> 1.24μs (56.4% faster)
    # All alpha: ratio = 1.0, not < 1.0, so False
    codeflash_output = not under_non_alpha_ratio("abcdef", threshold=1.0) # 1.17μs -> 602ns (93.7% faster)
def test_leading_trailing_whitespace():
    # Whitespace should be ignored
    codeflash_output = under_non_alpha_ratio("   a1b2c3   ", threshold=0.6) # 2.32μs -> 1.44μs (61.3% faster)
def test_only_symbols():
    # Only symbols, ratio = 0, so < threshold
    codeflash_output = under_non_alpha_ratio("!@#$%^&*", threshold=0.5) # 1.89μs -> 1.15μs (65.0% faster)
def test_long_string_all_spaces_and_newlines():
    # All whitespace, should return False
    codeflash_output = under_non_alpha_ratio(" \n " * 100) # 11.4μs -> 3.65μs (213% faster)
def test_single_space():
    # Single space, should return False
    codeflash_output = under_non_alpha_ratio(" ") # 1.04μs -> 741ns (39.8% faster)
def test_non_ascii_non_alpha():
    # Non-ASCII, non-alpha (emoji)
    codeflash_output = under_non_alpha_ratio("😀😀😀", threshold=0.5) # 2.42μs -> 1.79μs (34.9% faster)
def test_mixed_emojis_and_alpha():
    # 2 alpha, 2 emoji: ratio = 2/4 = 0.5
    codeflash_output = not under_non_alpha_ratio("a😀b😀", threshold=0.5) # 2.24μs -> 1.50μs (49.3% faster)
    codeflash_output = under_non_alpha_ratio("a😀b😀", threshold=0.6) # 948ns -> 658ns (44.1% faster)
# -------------------------
# LARGE SCALE TEST CASES
# -------------------------
def test_large_all_alpha():
    # 1000 alpha, ratio = 1.0
    s = "a" * 1000
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.5) # 47.9μs -> 33.4μs (43.4% faster)
def test_large_all_non_alpha():
    # 1000 non-alpha, ratio = 0.0
    s = "1" * 1000
    codeflash_output = under_non_alpha_ratio(s, threshold=0.5) # 40.9μs -> 24.9μs (64.3% faster)
def test_large_mixed_half_and_half():
    # 500 alpha, 500 non-alpha, ratio = 0.5
    s = "a" * 500 + "1" * 500
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.5) # 45.2μs -> 28.4μs (59.0% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.6) # 43.3μs -> 27.8μs (55.9% faster)
def test_large_with_spaces_ignored():
    # 400 alpha, 400 non-alpha, 200 spaces (should be ignored)
    s = "a" * 400 + "1" * 400 + " " * 200
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.5) # 43.5μs -> 24.5μs (77.4% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.6) # 41.8μs -> 23.8μs (75.6% faster)
def test_large_unicode_mixed():
    # 300 unicode alpha, 300 unicode non-alpha, 400 ascii alpha
    s = "é" * 300 + "😀" * 300 + "a" * 400
    # alpha: 300 (é) + 400 (a) = 700, non-alpha: 300 (😀), total = 1000, ratio = 700/1000 = 0.7
    # 0.7 < 0.8, so the call returns True and the negated value is False
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.8) # 65.3μs -> 36.8μs (77.6% faster)
    # 0.7 < 0.71, so True
    codeflash_output = under_non_alpha_ratio(s, threshold=0.71) # 59.0μs -> 34.8μs (69.8% faster)
def test_large_threshold_zero_one():
    # All alpha, threshold=0.0, should be False
    s = "b" * 999
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.0) # 46.2μs -> 33.1μs (39.4% faster)
    # All non-alpha, threshold=1.0, should be True
    s = "!" * 999
    codeflash_output = under_non_alpha_ratio(s, threshold=1.0) # 40.4μs -> 24.3μs (66.0% faster)
def test_large_string_with_whitespace_only():
    # 1000 spaces, should return False
    s = " " * 1000
    codeflash_output = under_non_alpha_ratio(s) # 33.7μs -> 10.0μs (236% faster)
def test_large_string_with_mixed_whitespace_and_chars():
    # 333 alpha, 333 non-alpha, 334 whitespace (ignored)
    s = "a" * 333 + "1" * 333 + " " * 334
    # total_count = 666, alpha = 333, ratio = 0.5
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.5) # 41.2μs -> 21.8μs (89.4% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.51) # 40.1μs -> 21.0μs (90.8% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from __future__ import annotations
# imports
import pytest  # used for our unit tests
from unstructured.partition.text_type import under_non_alpha_ratio
# unit tests
# --- Basic Test Cases ---
def test_all_alpha_default_threshold():
    # All alphabetic, should be False (ratio = 1.0, not under 0.5)
    codeflash_output = under_non_alpha_ratio("HelloWorld") # 3.09μs -> 2.04μs (51.3% faster)
def test_all_non_alpha_default_threshold():
    # All non-alpha (punctuation), should be True (ratio = 0.0)
    codeflash_output = under_non_alpha_ratio("!!!???---") # 2.02μs -> 1.24μs (63.3% faster)
def test_mixed_alpha_non_alpha_default_threshold():
    # 5 alpha, 5 non-alpha, ratio = 0.5, should be False (not under threshold)
    codeflash_output = under_non_alpha_ratio("abc12!@#de") # 2.20μs -> 1.37μs (61.1% faster)
def test_mixed_alpha_non_alpha_just_under_threshold():
    # 2 alpha, 3 non-alpha, ratio = 0.4, should be True (under threshold)
    codeflash_output = under_non_alpha_ratio("a1!b2") # 1.67μs -> 1.19μs (40.9% faster)
def test_spaces_are_ignored():
    # Spaces should not count toward total_count
    # 3 alpha, 2 non-alpha, 2 spaces; ratio = 3/5 = 0.6, should be False
    codeflash_output = under_non_alpha_ratio("a b! c?") # 1.88μs -> 1.19μs (57.4% faster)
def test_threshold_parameter():
    # 2 alpha, 3 non-alpha, total=5, ratio=0.4
    # threshold=0.3 -> False (not under), threshold=0.5 -> True (under)
    codeflash_output = under_non_alpha_ratio("a1!b2", threshold=0.3) # 1.71μs -> 1.29μs (32.4% faster)
    codeflash_output = under_non_alpha_ratio("a1!b2", threshold=0.5) # 900ns -> 536ns (67.9% faster)
# --- Edge Test Cases ---
def test_empty_string():
    # Empty string should return False
    codeflash_output = under_non_alpha_ratio("") # 425ns -> 367ns (15.8% faster)
def test_only_spaces():
    # Only spaces, total_count=0, should return False
    codeflash_output = under_non_alpha_ratio("     ") # 1.16μs -> 776ns (49.9% faster)
def test_only_alpha_with_spaces():
    # Only alpha and spaces, ratio=1.0, should return False
    codeflash_output = under_non_alpha_ratio("a b c d e") # 1.79μs -> 1.24μs (44.3% faster)
def test_only_non_alpha_with_spaces():
    # Only non-alpha and spaces, ratio=0.0, should return True
    codeflash_output = under_non_alpha_ratio("! @ # $ %") # 1.73μs -> 1.08μs (59.6% faster)
def test_single_alpha():
    # Single alpha, ratio=1.0, should return False
    codeflash_output = under_non_alpha_ratio("A") # 1.38μs -> 1.03μs (33.9% faster)
def test_single_non_alpha():
    # Single non-alpha, ratio=0.0, should return True
    codeflash_output = under_non_alpha_ratio("?") # 1.23μs -> 978ns (26.1% faster)
def test_single_space():
    # Single space, total_count=0, should return False
    codeflash_output = under_non_alpha_ratio(" ") # 1.00μs -> 729ns (37.7% faster)
def test_all_digits():
    # All digits, ratio=0.0, should return True
    codeflash_output = under_non_alpha_ratio("1234567890") # 2.00μs -> 1.21μs (65.4% faster)
def test_unicode_alpha():
    # Unicode alphabetic characters (e.g. accented letters)
    # 3 alpha, 2 non-alpha, ratio=0.6, should be False
    codeflash_output = under_non_alpha_ratio("éàü!!") # 2.19μs -> 1.58μs (39.3% faster)
def test_unicode_non_alpha():
    # Unicode non-alpha (emoji, symbols)
    # 2 non-alpha, 2 alpha, ratio=0.5, should be False
    codeflash_output = under_non_alpha_ratio("a😀b!") # 2.46μs -> 1.69μs (45.9% faster)
def test_threshold_1_0():
    # threshold=1.0, any string with <100% alpha should return True
    # 2 alpha, 2 non-alpha, ratio=0.5 < 1.0
    codeflash_output = under_non_alpha_ratio("a1b2", threshold=1.0) # 1.73μs -> 1.36μs (27.4% faster)
def test_threshold_0_0():
    # threshold=0.0, no string should return True, since no ratio is < 0.0
    # 2 alpha, 2 non-alpha, ratio=0.5 > 0.0
    codeflash_output = under_non_alpha_ratio("a1b2", threshold=0.0) # 1.73μs -> 1.29μs (33.7% faster)
    # All non-alpha, ratio=0.0 == 0.0, should be False (not under threshold)
    codeflash_output = under_non_alpha_ratio("1234", threshold=0.0) # 883ns -> 653ns (35.2% faster)
def test_threshold_exactly_equal():
    # Ratio equals threshold: should return False (not under threshold)
    # 2 alpha, 2 non-alpha, ratio=0.5 == threshold
    codeflash_output = under_non_alpha_ratio("a1b2", threshold=0.5) # 1.60μs -> 1.20μs (33.1% faster)
def test_tabs_and_newlines_ignored():
    # Tabs and newlines are whitespace, so ignored
    # 2 alpha, 1 non-alpha, 2 whitespace, ratio=2/3 ~ 0.667, should be False
    codeflash_output = under_non_alpha_ratio("a\tb\n!") # 1.80μs -> 1.16μs (55.6% faster)
def test_long_repeated_pattern():
    # 500 alpha, 500 non-alpha, ratio=0.5, should be False
    s = "a1" * 500
    codeflash_output = under_non_alpha_ratio(s) # 45.3μs -> 29.1μs (55.5% faster)
    # 499 alpha, 501 non-alpha, ratio=499/1000=0.499, should be True
    s2 = "a1" * 499 + "1!"
    codeflash_output = under_non_alpha_ratio(s2) # 43.8μs -> 28.0μs (56.3% faster)
# --- Large Scale Test Cases ---
def test_large_all_alpha():
    # 1000 alphabetic characters, ratio=1.0, should be False
    s = "a" * 1000
    codeflash_output = under_non_alpha_ratio(s) # 46.5μs -> 32.9μs (41.2% faster)
def test_large_all_non_alpha():
    # 1000 non-alpha characters, ratio=0.0, should be True
    s = "!" * 1000
    codeflash_output = under_non_alpha_ratio(s) # 41.5μs -> 24.6μs (68.4% faster)
def test_large_half_alpha_half_non_alpha():
    # 500 alpha, 500 non-alpha, ratio=0.5, should be False
    s = ("a!" * 500)
    codeflash_output = under_non_alpha_ratio(s) # 44.6μs -> 28.6μs (56.1% faster)
def test_large_sparse_alpha():
    # 10 alpha, 990 non-alpha, ratio=0.01, should be True
    s = "a" + "!" * 99
    s = s * 10  # 10 alpha, 990 non-alpha
    codeflash_output = under_non_alpha_ratio(s) # 41.4μs -> 25.0μs (65.6% faster)
def test_large_sparse_non_alpha():
    # 990 alpha, 10 non-alpha, ratio=0.99, should be False
    s = "a" * 99 + "!"  # 99 alpha, 1 non-alpha
    s = s * 10  # 990 alpha, 10 non-alpha
    codeflash_output = under_non_alpha_ratio(s) # 46.2μs -> 33.0μs (39.9% faster)
def test_large_with_spaces():
    # 500 alpha, 500 non-alpha, 100 spaces (should be ignored)
    s = ("a!" * 500) + (" " * 100)
    codeflash_output = under_non_alpha_ratio(s) # 47.7μs -> 29.7μs (60.6% faster)
def test_large_thresholds():
    # 600 alpha, 400 non-alpha, ratio=0.6
    s = "a" * 600 + "!" * 400
    codeflash_output = under_non_alpha_ratio(s, threshold=0.5) # 44.5μs -> 29.4μs (51.1% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.7) # 42.6μs -> 28.5μs (49.4% faster)
# --- Additional Robustness Tests ---
def test_mixed_case_and_symbols():
    # Mixed uppercase, lowercase, digits, symbols
    # 3 alpha, 3 non-alpha, ratio=0.5, should be False
    codeflash_output = under_non_alpha_ratio("A1b2C!") # 1.79μs -> 1.17μs (53.7% faster)
def test_realistic_sentence():
    # Realistic sentence, mostly alpha, some punctuation
    # 24 alpha, 2 non-alpha (comma, period), ratio=24/26 ~ 0.923, should be False
    codeflash_output = under_non_alpha_ratio("Hello, this is a test sentence.") # 3.39μs -> 1.94μs (74.3% faster)
def test_realistic_break_line():
    # Typical break line, mostly non-alpha
    # 5 alpha, 8 non-alpha, ratio=5/13 ~ 0.385, should be True
    codeflash_output = under_non_alpha_ratio("----BREAK----") # 2.08μs -> 1.32μs (57.9% faster)
def test_space_heavy_string():
    # Spaces should be ignored, only non-space chars count
    # 2 alpha, 2 non-alpha, 10 spaces, ratio=2/4=0.5, should be False
    codeflash_output = under_non_alpha_ratio(" a ! b ?          ") # 2.25μs -> 1.29μs (74.3% faster)
def test_only_whitespace_variety():
    # Only tabs, spaces, newlines, should return False
    codeflash_output = under_non_alpha_ratio(" \t\n\r") # 1.08μs -> 714ns (51.4% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
</details>
To edit these changes `git checkout
codeflash/optimize-under_non_alpha_ratio-mcgm6dor` and push.
---------
Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
1 parent 76d7a5c, commit cc635c9 (#4079)
2 files changed (+12 -3 lines):
- first file: 3 additions & 0 deletions
- `unstructured/partition/text_type.py`: 9 additions & 3 deletions