@misrasaurabh1
Contributor

📄 76% (0.76x) speedup for under_non_alpha_ratio in unstructured/partition/text_type.py

⏱️ Runtime: 9.53 milliseconds → 5.41 milliseconds (best of 91 runs)
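
The headline percentage appears to be the ratio of the two runtimes minus one. The sketch below shows that arithmetic, assuming this is how the figure is derived; the helper name `speedup_percent` is hypothetical and not part of Codeflash.

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Relative speedup in percent (assumed formula: original / optimized - 1)."""
    return (original / optimized - 1.0) * 100.0


# Runtimes reported above, in milliseconds.
print(round(speedup_percent(9.53, 5.41)))  # 76
```

The same formula reproduces the per-test figures in the listings, for example 2.48μs -> 1.63μs gives roughly 52.1%.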

📝 Explanation and details

Here's an optimized version of your function. Major improvements:

  • Only one pass through the text string instead of two list comprehensions, so each character is examined once rather than twice.
  • No lists are constructed; only two integer counters are kept.
  • `char.strip()` was only used to test whether a character is non-whitespace; `char.isspace()` makes that check explicit.

Here's the optimized code with all original comments retained.

This approach processes the string only once and uses O(1) extra memory (just two integer counters). `char.isspace()` recognizes all Unicode whitespace, so whitespace handling is the same as before. This should significantly speed up the function by eliminating the time spent in the original two list comprehensions.
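
A minimal sketch of the approach described above, not the merged implementation. The function names `under_ratio_two_pass` and `under_ratio_single_pass` are hypothetical, and the two-pass version is reconstructed from the description (two list comprehensions filtered with `char.strip()`) rather than copied from the repository.

```python
def under_ratio_two_pass(text: str, threshold: float = 0.5) -> bool:
    """Assumed original shape: two list comprehensions, whitespace skipped via strip()."""
    if len(text) == 0:
        return False
    alpha_count = len([c for c in text if c.strip() and c.isalpha()])
    total_count = len([c for c in text if c.strip()])
    if total_count == 0:
        return False  # whitespace-only input
    return alpha_count / total_count < threshold


def under_ratio_single_pass(text: str, threshold: float = 0.5) -> bool:
    """Single loop with two integer counters; no intermediate lists."""
    alpha_count = 0
    total_count = 0
    for char in text:
        if char.isspace():
            continue  # whitespace does not count toward the ratio
        total_count += 1
        if char.isalpha():
            alpha_count += 1
    if total_count == 0:
        return False  # empty or whitespace-only input
    return alpha_count / total_count < threshold


# Both versions should agree on every input.
samples = ["HelloWorld", "a1b2c3d4!!", "a b c 1 2 3", "   ", "", "éüß1!。"]
for s in samples:
    assert under_ratio_two_pass(s) == under_ratio_single_pass(s)
```

Returning early when `total_count` is zero keeps the empty and whitespace-only cases returning `False`, which is what the tests below expect.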

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 21 Passed |
| 🌀 Generated Regression Tests | 80 Passed |
| ⏪ Replay Tests | 594 Passed |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `partition/test_text_type.py::test_under_non_alpha_ratio_zero_divide` | 1.14μs | 991ns | ✅15.1% |
| `test_tracer_py__replay_test_0.py::test_unstructured_partition_text_type_under_non_alpha_ratio` | 820μs | 412μs | ✅98.8% |
| `test_tracer_py__replay_test_2.py::test_unstructured_partition_text_type_under_non_alpha_ratio` | 5.62ms | 3.21ms | ✅75.3% |
| `test_tracer_py__replay_test_3.py::test_unstructured_partition_text_type_under_non_alpha_ratio` | 1.95ms | 1.09ms | ✅79.2% |

🌀 Generated Regression Tests and Runtime
from __future__ import annotations

# imports
import pytest  # used for our unit tests
from unstructured.partition.text_type import under_non_alpha_ratio

# unit tests

# -------------------------
# BASIC TEST CASES
# -------------------------

def test_all_alpha_below_threshold():
    # All alphabetic, so ratio is 1.0, which is not < threshold (default 0.5)
    codeflash_output = not under_non_alpha_ratio("HelloWorld") # 2.48μs -> 1.63μs (52.1% faster)

def test_all_alpha_above_threshold():
    # All alphabetic, but threshold is 1.1, so ratio is < threshold
    codeflash_output = under_non_alpha_ratio("HelloWorld", threshold=1.1) # 2.19μs -> 1.47μs (49.1% faster)

def test_all_non_alpha():
    # All non-alpha, so ratio is 0, which is < threshold
    codeflash_output = under_non_alpha_ratio("1234567890!@#$%^&*()_+-=[]{}|;':,.<>/?", threshold=0.5) # 3.38μs -> 2.00μs (68.8% faster)

def test_mixed_alpha_non_alpha_below_threshold():
    # 4 alpha, 6 non-alpha (excluding spaces): ratio = 4/10 = 0.4 < 0.5
    codeflash_output = under_non_alpha_ratio("a1b2c3d4!!", threshold=0.5) # 2.16μs -> 1.44μs (50.7% faster)

def test_mixed_alpha_non_alpha_above_threshold():
    # 6 alpha, 2 non-alpha: ratio = 6/8 = 0.75 > 0.5, so not under threshold
    codeflash_output = not under_non_alpha_ratio("abCD12ef", threshold=0.5) # 2.04μs -> 1.43μs (42.9% faster)

def test_spaces_are_ignored():
    # Only 'a', 'b', 'c', '1', '2', '3' are counted (spaces ignored)
    # 3 alpha, 3 non-alpha: ratio = 3/6 = 0.5, not < threshold
    codeflash_output = not under_non_alpha_ratio("a b c 1 2 3", threshold=0.5) # 2.16μs -> 1.39μs (55.3% faster)
    # If threshold is 0.6, ratio 0.5 < 0.6, so True
    codeflash_output = under_non_alpha_ratio("a b c 1 2 3", threshold=0.6) # 1.25μs -> 705ns (77.7% faster)

def test_threshold_edge_case_exact():
    # 2 alpha, 2 non-alpha: ratio = 2/4 = 0.5, not < threshold
    codeflash_output = not under_non_alpha_ratio("a1b2", threshold=0.5) # 1.76μs -> 1.26μs (39.2% faster)
    # If threshold is 0.51, ratio 0.5 < 0.51, so True
    codeflash_output = under_non_alpha_ratio("a1b2", threshold=0.51) # 781ns -> 541ns (44.4% faster)

# -------------------------
# EDGE TEST CASES
# -------------------------

def test_empty_string():
    # Empty string should always return False
    codeflash_output = under_non_alpha_ratio("") # 450ns -> 379ns (18.7% faster)

def test_only_spaces():
    # Only spaces, so total_count == 0, should return False
    codeflash_output = under_non_alpha_ratio("     ") # 1.24μs -> 822ns (50.9% faster)

def test_only_newlines_and_tabs():
    # Only whitespace, so total_count == 0, should return False
    codeflash_output = under_non_alpha_ratio("\n\t  \t") # 1.16μs -> 745ns (55.7% faster)

def test_only_one_alpha():
    # Single alpha, total_count == 1, ratio == 1.0
    codeflash_output = not under_non_alpha_ratio("A") # 1.43μs -> 1.06μs (35.5% faster)
    codeflash_output = under_non_alpha_ratio("A", threshold=1.1) # 689ns -> 589ns (17.0% faster)

def test_only_one_non_alpha():
    # Single non-alpha, total_count == 1, ratio == 0.0
    codeflash_output = under_non_alpha_ratio("1") # 1.27μs -> 937ns (35.4% faster)
    codeflash_output = under_non_alpha_ratio("!", threshold=0.1) # 775ns -> 550ns (40.9% faster)

def test_unicode_alpha_and_non_alpha():
    # Unicode alpha: 'é', 'ü', 'ß' are isalpha()
    # Unicode non-alpha: '1', '!', '。'
    # 3 alpha, 3 non-alpha, ratio = 0.5
    codeflash_output = not under_non_alpha_ratio("éüß1!。", threshold=0.5) # 3.03μs -> 2.21μs (37.3% faster)
    codeflash_output = under_non_alpha_ratio("éüß1!。", threshold=0.6) # 1.11μs -> 789ns (40.2% faster)

def test_mixed_with_whitespace():
    # Alpha: a, b, c; Non-alpha: 1, 2, 3; Spaces ignored
    codeflash_output = under_non_alpha_ratio(" a 1 b 2 c 3 ", threshold=0.6) # 2.46μs -> 1.60μs (54.0% faster)

def test_threshold_zero():
    # No ratio is < 0.0, so the result is always False at this threshold
    codeflash_output = not under_non_alpha_ratio("abc123", threshold=0.0) # 2.19μs -> 1.49μs (46.5% faster)
    # All non-alpha: ratio = 0, not < 0, so False
    codeflash_output = not under_non_alpha_ratio("123", threshold=0.0) # 859ns -> 594ns (44.6% faster)

def test_threshold_one():
    # Any ratio < 1.0 should return True if not all alpha
    codeflash_output = under_non_alpha_ratio("abc123", threshold=1.0) # 1.94μs -> 1.24μs (56.4% faster)
    # All alpha: ratio = 1.0, not < 1.0, so False
    codeflash_output = not under_non_alpha_ratio("abcdef", threshold=1.0) # 1.17μs -> 602ns (93.7% faster)

def test_leading_trailing_whitespace():
    # Whitespace should be ignored
    codeflash_output = under_non_alpha_ratio("   a1b2c3   ", threshold=0.6) # 2.32μs -> 1.44μs (61.3% faster)

def test_only_symbols():
    # Only symbols, ratio = 0, so < threshold
    codeflash_output = under_non_alpha_ratio("!@#$%^&*", threshold=0.5) # 1.89μs -> 1.15μs (65.0% faster)

def test_long_string_all_spaces_and_newlines():
    # All whitespace, should return False
    codeflash_output = under_non_alpha_ratio(" \n " * 100) # 11.4μs -> 3.65μs (213% faster)

def test_single_space():
    # Single space, should return False
    codeflash_output = under_non_alpha_ratio(" ") # 1.04μs -> 741ns (39.8% faster)

def test_non_ascii_non_alpha():
    # Non-ASCII, non-alpha (emoji)
    codeflash_output = under_non_alpha_ratio("😀😀😀", threshold=0.5) # 2.42μs -> 1.79μs (34.9% faster)

def test_mixed_emojis_and_alpha():
    # 2 alpha, 2 emoji: ratio = 2/4 = 0.5
    codeflash_output = not under_non_alpha_ratio("a😀b😀", threshold=0.5) # 2.24μs -> 1.50μs (49.3% faster)
    codeflash_output = under_non_alpha_ratio("a😀b😀", threshold=0.6) # 948ns -> 658ns (44.1% faster)

# -------------------------
# LARGE SCALE TEST CASES
# -------------------------

def test_large_all_alpha():
    # 1000 alpha, ratio = 1.0
    s = "a" * 1000
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.5) # 47.9μs -> 33.4μs (43.4% faster)

def test_large_all_non_alpha():
    # 1000 non-alpha, ratio = 0.0
    s = "1" * 1000
    codeflash_output = under_non_alpha_ratio(s, threshold=0.5) # 40.9μs -> 24.9μs (64.3% faster)

def test_large_mixed_half_and_half():
    # 500 alpha, 500 non-alpha, ratio = 0.5
    s = "a" * 500 + "1" * 500
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.5) # 45.2μs -> 28.4μs (59.0% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.6) # 43.3μs -> 27.8μs (55.9% faster)

def test_large_with_spaces_ignored():
    # 400 alpha, 400 non-alpha, 200 spaces (should be ignored)
    s = "a" * 400 + "1" * 400 + " " * 200
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.5) # 43.5μs -> 24.5μs (77.4% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.6) # 41.8μs -> 23.8μs (75.6% faster)

def test_large_unicode_mixed():
    # 300 unicode alpha, 300 unicode non-alpha, 400 ascii alpha
    s = "é" * 300 + "😀" * 300 + "a" * 400
    # alpha: 300 (é) + 400 (a) = 700, non-alpha: 300 (😀), total = 1000
    # ratio = 700/1000 = 0.7, which is below both thresholds used here,
    # so under_non_alpha_ratio itself returns True in both calls
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.8) # 65.3μs -> 36.8μs (77.6% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.71) # 59.0μs -> 34.8μs (69.8% faster)

def test_large_threshold_zero_one():
    # All alpha, threshold=0.0, should be False
    s = "b" * 999
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.0) # 46.2μs -> 33.1μs (39.4% faster)
    # All non-alpha, threshold=1.0, should be True
    s = "!" * 999
    codeflash_output = under_non_alpha_ratio(s, threshold=1.0) # 40.4μs -> 24.3μs (66.0% faster)

def test_large_string_with_whitespace_only():
    # 1000 spaces, should return False
    s = " " * 1000
    codeflash_output = under_non_alpha_ratio(s) # 33.7μs -> 10.0μs (236% faster)

def test_large_string_with_mixed_whitespace_and_chars():
    # 333 alpha, 333 non-alpha, 334 whitespace (ignored)
    s = "a" * 333 + "1" * 333 + " " * 334
    # total_count = 666, alpha = 333, ratio = 0.5
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.5) # 41.2μs -> 21.8μs (89.4% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.51) # 40.1μs -> 21.0μs (90.8% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from __future__ import annotations

# imports
import pytest  # used for our unit tests
from unstructured.partition.text_type import under_non_alpha_ratio

# unit tests

# --- Basic Test Cases ---

def test_all_alpha_default_threshold():
    # All alphabetic, should be False (ratio = 1.0, not under 0.5)
    codeflash_output = under_non_alpha_ratio("HelloWorld") # 3.09μs -> 2.04μs (51.3% faster)

def test_all_non_alpha_default_threshold():
    # All non-alpha (punctuation), should be True (ratio = 0.0)
    codeflash_output = under_non_alpha_ratio("!!!???---") # 2.02μs -> 1.24μs (63.3% faster)

def test_mixed_alpha_non_alpha_default_threshold():
    # 5 alpha, 5 non-alpha, ratio = 0.5, should be False (not under threshold)
    codeflash_output = under_non_alpha_ratio("abc12!@#de") # 2.20μs -> 1.37μs (61.1% faster)

def test_mixed_alpha_non_alpha_just_under_threshold():
    # 2 alpha, 3 non-alpha, ratio = 0.4, should be True (under threshold)
    codeflash_output = under_non_alpha_ratio("a1!b2") # 1.67μs -> 1.19μs (40.9% faster)

def test_spaces_are_ignored():
    # Spaces should not count toward total_count
    # 3 alpha, 2 non-alpha, 2 spaces; ratio = 3/5 = 0.6, should be False
    codeflash_output = under_non_alpha_ratio("a b! c?") # 1.88μs -> 1.19μs (57.4% faster)

def test_threshold_parameter():
    # 2 alpha, 3 non-alpha, total=5, ratio=0.4
    # threshold=0.3 -> False (not under), threshold=0.5 -> True (under)
    codeflash_output = under_non_alpha_ratio("a1!b2", threshold=0.3) # 1.71μs -> 1.29μs (32.4% faster)
    codeflash_output = under_non_alpha_ratio("a1!b2", threshold=0.5) # 900ns -> 536ns (67.9% faster)

# --- Edge Test Cases ---

def test_empty_string():
    # Empty string should return False
    codeflash_output = under_non_alpha_ratio("") # 425ns -> 367ns (15.8% faster)

def test_only_spaces():
    # Only spaces, total_count=0, should return False
    codeflash_output = under_non_alpha_ratio("     ") # 1.16μs -> 776ns (49.9% faster)

def test_only_alpha_with_spaces():
    # Only alpha and spaces, ratio=1.0, should return False
    codeflash_output = under_non_alpha_ratio("a b c d e") # 1.79μs -> 1.24μs (44.3% faster)

def test_only_non_alpha_with_spaces():
    # Only non-alpha and spaces, ratio=0.0, should return True
    codeflash_output = under_non_alpha_ratio("! @ # $ %") # 1.73μs -> 1.08μs (59.6% faster)

def test_single_alpha():
    # Single alpha, ratio=1.0, should return False
    codeflash_output = under_non_alpha_ratio("A") # 1.38μs -> 1.03μs (33.9% faster)

def test_single_non_alpha():
    # Single non-alpha, ratio=0.0, should return True
    codeflash_output = under_non_alpha_ratio("?") # 1.23μs -> 978ns (26.1% faster)

def test_single_space():
    # Single space, total_count=0, should return False
    codeflash_output = under_non_alpha_ratio(" ") # 1.00μs -> 729ns (37.7% faster)

def test_all_digits():
    # All digits, ratio=0.0, should return True
    codeflash_output = under_non_alpha_ratio("1234567890") # 2.00μs -> 1.21μs (65.4% faster)

def test_unicode_alpha():
    # Unicode alphabetic characters (e.g. accented letters)
    # 3 alpha, 2 non-alpha, ratio=0.6, should be False
    codeflash_output = under_non_alpha_ratio("éàü!!") # 2.19μs -> 1.58μs (39.3% faster)

def test_unicode_non_alpha():
    # Unicode non-alpha (emoji, symbols)
    # 2 non-alpha, 2 alpha, ratio=0.5, should be False
    codeflash_output = under_non_alpha_ratio("a😀b!") # 2.46μs -> 1.69μs (45.9% faster)

def test_threshold_1_0():
    # threshold=1.0, any string with <100% alpha should return True
    # 2 alpha, 2 non-alpha, ratio=0.5 < 1.0
    codeflash_output = under_non_alpha_ratio("a1b2", threshold=1.0) # 1.73μs -> 1.36μs (27.4% faster)

def test_threshold_0_0():
    # threshold=0.0: no ratio is < 0.0, so every string should return False
    # 2 alpha, 2 non-alpha, ratio=0.5 > 0.0
    codeflash_output = under_non_alpha_ratio("a1b2", threshold=0.0) # 1.73μs -> 1.29μs (33.7% faster)
    # All non-alpha, ratio=0.0 == 0.0, should be False (not under threshold)
    codeflash_output = under_non_alpha_ratio("1234", threshold=0.0) # 883ns -> 653ns (35.2% faster)

def test_threshold_exactly_equal():
    # Ratio equals threshold: should return False (not under threshold)
    # 2 alpha, 2 non-alpha, ratio=0.5 == threshold
    codeflash_output = under_non_alpha_ratio("a1b2", threshold=0.5) # 1.60μs -> 1.20μs (33.1% faster)

def test_tabs_and_newlines_ignored():
    # Tabs and newlines are whitespace, so ignored
    # 2 alpha, 1 non-alpha, 2 whitespace, ratio=2/3 ~ 0.667, should be False
    codeflash_output = under_non_alpha_ratio("a\tb\n!") # 1.80μs -> 1.16μs (55.6% faster)

def test_long_repeated_pattern():
    # 500 alpha, 500 non-alpha, ratio=0.5, should be False
    s = "a1" * 500
    codeflash_output = under_non_alpha_ratio(s) # 45.3μs -> 29.1μs (55.5% faster)
    # 499 alpha, 501 non-alpha, ratio=499/1000=0.499, should be True
    s2 = "a1" * 499 + "1!"
    codeflash_output = under_non_alpha_ratio(s2) # 43.8μs -> 28.0μs (56.3% faster)

# --- Large Scale Test Cases ---

def test_large_all_alpha():
    # 1000 alphabetic characters, ratio=1.0, should be False
    s = "a" * 1000
    codeflash_output = under_non_alpha_ratio(s) # 46.5μs -> 32.9μs (41.2% faster)

def test_large_all_non_alpha():
    # 1000 non-alpha characters, ratio=0.0, should be True
    s = "!" * 1000
    codeflash_output = under_non_alpha_ratio(s) # 41.5μs -> 24.6μs (68.4% faster)

def test_large_half_alpha_half_non_alpha():
    # 500 alpha, 500 non-alpha, ratio=0.5, should be False
    s = ("a!" * 500)
    codeflash_output = under_non_alpha_ratio(s) # 44.6μs -> 28.6μs (56.1% faster)

def test_large_sparse_alpha():
    # 10 alpha, 990 non-alpha, ratio=0.01, should be True
    s = "a" + "!" * 99
    s = s * 10  # 10 alpha, 990 non-alpha
    codeflash_output = under_non_alpha_ratio(s) # 41.4μs -> 25.0μs (65.6% faster)

def test_large_sparse_non_alpha():
    # 990 alpha, 10 non-alpha, ratio=0.99, should be False
    s = "a" * 99 + "!"  # 99 alpha, 1 non-alpha
    s = s * 10  # 990 alpha, 10 non-alpha
    codeflash_output = under_non_alpha_ratio(s) # 46.2μs -> 33.0μs (39.9% faster)

def test_large_with_spaces():
    # 500 alpha, 500 non-alpha, 100 spaces (should be ignored)
    s = ("a!" * 500) + (" " * 100)
    codeflash_output = under_non_alpha_ratio(s) # 47.7μs -> 29.7μs (60.6% faster)

def test_large_thresholds():
    # 600 alpha, 400 non-alpha, ratio=0.6
    s = "a" * 600 + "!" * 400
    codeflash_output = under_non_alpha_ratio(s, threshold=0.5) # 44.5μs -> 29.4μs (51.1% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.7) # 42.6μs -> 28.5μs (49.4% faster)

# --- Additional Robustness Tests ---

def test_mixed_case_and_symbols():
    # Mixed uppercase, lowercase, digits, symbols
    # 3 alpha, 3 non-alpha, ratio=0.5, should be False
    codeflash_output = under_non_alpha_ratio("A1b2C!") # 1.79μs -> 1.17μs (53.7% faster)

def test_realistic_sentence():
    # Realistic sentence, mostly alpha, some punctuation
    # 24 alpha, 2 non-alpha (comma, period), ratio=24/26 ~ 0.923, should be False
    codeflash_output = under_non_alpha_ratio("Hello, this is a test sentence.") # 3.39μs -> 1.94μs (74.3% faster)

def test_realistic_break_line():
    # Typical break line, mostly non-alpha
    # 5 alpha, 8 non-alpha, ratio=5/13 ~ 0.385, should be True
    codeflash_output = under_non_alpha_ratio("----BREAK----") # 2.08μs -> 1.32μs (57.9% faster)

def test_space_heavy_string():
    # Spaces should be ignored, only non-space chars count
    # 2 alpha, 2 non-alpha, 10 spaces, ratio=2/4=0.5, should be False
    codeflash_output = under_non_alpha_ratio(" a ! b ?          ") # 2.25μs -> 1.29μs (74.3% faster)

def test_only_whitespace_variety():
    # Only tabs, spaces, newlines, should return False
    codeflash_output = under_non_alpha_ratio(" \t\n\r") # 1.08μs -> 714ns (51.4% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
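
The alpha and total counts quoted in the test comments can be double-checked with a small helper. This is a sketch for verification only; `alpha_and_total` is a hypothetical name and is not part of `unstructured`. It assumes Python 3.9+ for the `tuple[int, int]` annotation.

```python
def alpha_and_total(text: str) -> tuple[int, int]:
    """Return (alphabetic count, non-whitespace count) for text."""
    non_space = [c for c in text if not c.isspace()]
    alpha = sum(1 for c in non_space if c.isalpha())
    return alpha, len(non_space)


for sample in ["a1b2c3d4!!", "a b c 1 2 3", "a1!b2"]:
    alpha, total = alpha_and_total(sample)
    print(f"{sample!r}: {alpha}/{total} = {alpha / total:.3f}")
# 'a1b2c3d4!!': 4/10 = 0.400
# 'a b c 1 2 3': 3/6 = 0.500
# 'a1!b2': 2/5 = 0.400
```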

To edit these changes, run `git checkout codeflash/optimize-under_non_alpha_ratio-mcgm6dor` and push.


codeflash-ai bot and others added 4 commits June 28, 2025 19:08
Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com>
@cragwolfe cragwolfe enabled auto-merge August 21, 2025 22:22
@cragwolfe cragwolfe added this pull request to the merge queue Aug 21, 2025
Merged via the queue into Unstructured-IO:main with commit cc635c9 Aug 21, 2025
37 checks passed
2 participants