
Conversation

@misrasaurabh1
Contributor

Saurabh's comments - The changes look good, especially because they have been rigorously tested against a variety of cases, which gives me confidence in them.

📄 59% (0.59x) speedup for sentence_count in unstructured/partition/text_type.py

⏱️ Runtime : 190 milliseconds → 119 milliseconds (best of 39 runs)

📝 Explanation and details

Major speedups:

  • Replace list comprehensions with generator expressions in counting scenarios to avoid building intermediate lists.
  • Use a simple word count (str.split() after punctuation removal) rather than an expensive word_tokenize call, since only the token count is needed and punctuation has already been stripped.
  • When a min_length filter is set, avoid calling remove_punctuation and word_tokenize on sentences that can be ruled out cheaply: zero-length text is filtered out immediately.

If you wish to maximize compatibility with sentences containing non-whitespace-separable tokens (e.g. CJK languages), consider further optimization on the token counting line as needed for your domain. Otherwise, str.split() after punctuation removal suffices and is far faster than a full NLP tokenizer.
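
For readers who want the shape of the change without opening the diff, the following is a minimal sketch of what the bullets above describe. It is an illustration, not the merged code: the real function in unstructured/partition/text_type.py also routes skipped sentences through trace_logger.detail, which is elided here.

```python
from typing import Optional

from unstructured.cleaners.core import remove_punctuation
from unstructured.nlp.tokenize import sent_tokenize


def sentence_count_sketch(text: str, min_length: Optional[int] = None) -> int:
    """Count sentences; with min_length, ignore sentences with fewer words."""
    sentences = sent_tokenize(text)
    if not min_length:
        # No length filter: every tokenized sentence counts.
        return len(sentences)
    # A generator expression replaces the old list comprehension, so nothing is
    # materialized just to be counted, and remove_punctuation runs once per
    # sentence. str.split() stands in for the far more expensive word_tokenize
    # call, since only the token count matters here.
    return sum(
        1
        for sentence in sentences
        if sentence.strip()
        and len(remove_punctuation(sentence).split()) >= min_length
    )
```

For example, sentence_count_sketch("One two three. Four five.", min_length=3) tokenizes two sentences but counts only the first, returning 1.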

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 21 Passed
🌀 Generated Regression Tests 92 Passed
⏪ Replay Tests 695 Passed
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage 100.0%
⚙️ Existing Unit Tests and Runtime
Test File::Test Function Original ⏱️ Optimized ⏱️ Speedup
partition/test_text_type.py::test_item_titles 92.5μs 47.8μs ✅93.6%
partition/test_text_type.py::test_sentence_count 47.8μs 4.67μs ✅924%
test_tracer_py__replay_test_0.py::test_unstructured_partition_text_type_sentence_count 4.37ms 2.11ms ✅107%
test_tracer_py__replay_test_2.py::test_unstructured_partition_text_type_sentence_count 13.0ms 4.69ms ✅177%
test_tracer_py__replay_test_3.py::test_unstructured_partition_text_type_sentence_count 5.29ms 3.02ms ✅75.3%
🌀 Generated Regression Tests and Runtime
from __future__ import annotations

import random
import string
import sys
import unicodedata
from functools import lru_cache
from typing import Final, List, Optional

# imports
import pytest  # used for our unit tests
from nltk import sent_tokenize as _sent_tokenize
from nltk import word_tokenize as _word_tokenize
from unstructured.cleaners.core import remove_punctuation
from unstructured.logger import trace_logger
from unstructured.nlp.tokenize import sent_tokenize, word_tokenize
from unstructured.partition.text_type import sentence_count

# unit tests

# --------------------
# BASIC TEST CASES
# --------------------

def test_empty_string():
    # Empty string should return 0 sentences
    codeflash_output = sentence_count("") # 7.66μs -> 6.98μs (9.74% faster)

def test_single_sentence():
    # Single sentence with period
    codeflash_output = sentence_count("This is a sentence.") # 36.7μs -> 9.46μs (288% faster)

def test_single_sentence_no_punctuation():
    # Single sentence, no punctuation (should still count as 1 by NLTK)
    codeflash_output = sentence_count("This is a sentence") # 29.5μs -> 7.63μs (286% faster)

def test_two_sentences():
    # Two sentences separated by period
    codeflash_output = sentence_count("This is one. This is two.") # 72.4μs -> 37.0μs (95.7% faster)

def test_multiple_sentences_with_various_punctuation():
    # Sentences ending with ! and ?
    codeflash_output = sentence_count("Is this working? Yes! It is.") # 93.2μs -> 44.5μs (109% faster)

def test_sentence_with_abbreviation():
    # Abbreviations shouldn't split sentences
    codeflash_output = sentence_count("Dr. Smith went home. He was tired.") # 80.0μs -> 42.3μs (89.0% faster)

def test_sentence_with_ellipsis():
    # Ellipsis should not split sentences
    codeflash_output = sentence_count("Wait... what happened? I don't know.") # 76.9μs -> 42.5μs (81.0% faster)

def test_sentence_with_newlines():
    # Sentences separated by newlines
    codeflash_output = sentence_count("First sentence.\nSecond sentence.\nThird sentence.") # 91.7μs -> 43.4μs (111% faster)

def test_sentence_with_min_length_met():
    # min_length is met for all sentences
    codeflash_output = sentence_count("One two three. Four five six.", min_length=2) # 63.1μs -> 30.2μs (109% faster)

def test_sentence_with_min_length_not_met():
    # Only one sentence meets min_length
    codeflash_output = sentence_count("One. Two three four.", min_length=3) # 62.8μs -> 31.9μs (97.1% faster)

def test_sentence_with_min_length_none_met():
    # No sentence meets min_length
    codeflash_output = sentence_count("A. B.", min_length=2) # 60.9μs -> 32.3μs (88.5% faster)

def test_sentence_with_min_length_equals_length():
    # Sentence with exactly min_length words
    codeflash_output = sentence_count("One two three.", min_length=3) # 27.3μs -> 9.54μs (187% faster)

def test_sentence_with_trailing_space():
    # Sentence with trailing spaces
    codeflash_output = sentence_count("Hello world.   ") # 27.8μs -> 8.60μs (223% faster)

# --------------------
# EDGE TEST CASES
# --------------------

def test_only_punctuation():
    # Only punctuation, no words
    codeflash_output = sentence_count("...!!!") # 39.1μs -> 33.5μs (16.8% faster)

def test_only_whitespace():
    # Only whitespace
    codeflash_output = sentence_count("    \n\t   ") # 5.21μs -> 5.49μs (5.05% slower)

def test_sentence_with_numbers_and_symbols():
    # Sentence with numbers and symbols
    codeflash_output = sentence_count("12345! $%^&*()") # 66.6μs -> 32.7μs (104% faster)

def test_sentence_with_unicode_characters():
    # Sentences with unicode and emoji
    codeflash_output = sentence_count("Hello 😊. How are you?") # 75.9μs -> 37.7μs (102% faster)

def test_sentence_with_mixed_scripts():
    # Sentences with mixed scripts (e.g., English and Japanese)
    codeflash_output = sentence_count("Hello. こんにちは。How are you?") # 71.2μs -> 34.9μs (104% faster)

def test_sentence_with_multiple_spaces():
    # Sentences with irregular spacing
    codeflash_output = sentence_count("This   is   spaced.   And   so   is   this.") # 69.5μs -> 30.3μs (129% faster)

def test_sentence_with_no_word_characters():
    # Only punctuation and numbers
    codeflash_output = sentence_count("... 123 ...") # 42.1μs -> 25.5μs (65.2% faster)

def test_sentence_with_long_word():
    # Sentence with a single long word
    long_word = "a" * 100
    codeflash_output = sentence_count(f"{long_word}.") # 42.0μs -> 7.55μs (457% faster)

def test_sentence_with_long_word_and_min_length():
    # Sentence with long word, min_length > 1
    long_word = "a" * 100
    codeflash_output = sentence_count(f"{long_word}.", min_length=2) # 43.1μs -> 10.7μs (303% faster)

def test_sentence_with_only_abbreviation():
    # Sentence is only an abbreviation
    codeflash_output = sentence_count("U.S.A.") # 23.0μs -> 7.46μs (208% faster)

def test_sentence_with_nonbreaking_space():
    # Sentence with non-breaking space
    text = "Hello\u00A0world. How are you?"
    codeflash_output = sentence_count(text) # 74.4μs -> 37.3μs (99.6% faster)

def test_sentence_with_tab_characters():
    # Sentences separated by tabs
    text = "Hello world.\tHow are you?\tFine."
    codeflash_output = sentence_count(text) # 100μs -> 44.5μs (125% faster)

def test_sentence_with_multiple_punctuation_marks():
    # Sentences ending with multiple punctuation marks
    text = "Wait!! What?? Really..."
    codeflash_output = sentence_count(text) # 88.9μs -> 48.3μs (83.9% faster)

def test_sentence_with_leading_and_trailing_punctuation():
    # Sentence surrounded by punctuation
    text = "...Hello world!..."
    codeflash_output = sentence_count(text) # 25.9μs -> 8.64μs (200% faster)

def test_sentence_with_quotes():
    # Sentences with quotes
    text = '"Hello," she said. "How are you?"'
    codeflash_output = sentence_count(text) # 85.3μs -> 46.6μs (83.1% faster)

def test_sentence_with_parentheses():
    # Sentence with parentheses
    text = "This is a sentence (with parentheses). This is another."
    codeflash_output = sentence_count(text) # 74.7μs -> 33.5μs (123% faster)

def test_sentence_with_semicolons():
    # Semicolons should not split sentences
    text = "This is a sentence; this is not a new sentence."
    codeflash_output = sentence_count(text) # 37.1μs -> 8.74μs (324% faster)

def test_sentence_with_colons():
    # Colons should not split sentences
    text = "This is a sentence: it continues here."
    codeflash_output = sentence_count(text) # 33.2μs -> 8.44μs (294% faster)

def test_sentence_with_dash():
    # Dashes should not split sentences
    text = "This is a sentence - it continues here."
    codeflash_output = sentence_count(text) # 34.8μs -> 8.52μs (309% faster)

def test_sentence_with_multiple_dots():
    # Multiple dots but not ellipsis
    text = "This is a sentence.... This is another."
    codeflash_output = sentence_count(text) # 79.9μs -> 38.2μs (109% faster)

def test_sentence_with_min_length_and_punctuation():
    # min_length with sentences containing only punctuation
    text = "!!! ... ???"
    codeflash_output = sentence_count(text, min_length=1) # 80.2μs -> 77.0μs (4.21% faster)

def test_sentence_with_min_length_and_numbers():
    # min_length with numbers as words
    text = "1 2 3 4. 5 6."
    codeflash_output = sentence_count(text, min_length=4) # 69.6μs -> 34.6μs (101% faster)

def test_sentence_with_min_length_and_unicode():
    # min_length with unicode
    text = "😊 😊 😊 😊. Hello!"
    codeflash_output = sentence_count(text, min_length=4) # 76.1μs -> 41.7μs (82.5% faster)

def test_sentence_with_non_ascii_punctuation():
    # Sentence with non-ASCII punctuation (e.g., Chinese full stop)
    text = "Hello world。How are you?"
    codeflash_output = sentence_count(text) # 31.9μs -> 9.34μs (242% faster)

def test_sentence_with_repeated_newlines():
    # Sentences separated by multiple newlines
    text = "First sentence.\n\n\nSecond sentence."
    codeflash_output = sentence_count(text) # 71.9μs -> 33.4μs (115% faster)

# --------------------
# LARGE SCALE TEST CASES
# --------------------

def test_large_number_of_sentences():
    # 1000 sentences, each "Sentence X."
    n = 1000
    text = " ".join([f"Sentence {i}." for i in range(n)])
    codeflash_output = sentence_count(text) # 21.2ms -> 8.43ms (151% faster)

def test_large_number_of_sentences_with_min_length():
    # 1000 sentences, every even-indexed has 3 words, odd-indexed has 1 word
    n = 1000
    sentences = []
    for i in range(n):
        if i % 2 == 0:
            sentences.append(f"Word1 Word2 Word3.")
        else:
            sentences.append(f"Word.")
    text = " ".join(sentences)
    # Only even-indexed sentences should count for min_length=3
    codeflash_output = sentence_count(text, min_length=3)

def test_large_sentence():
    # One very long sentence (999 words)
    sentence = " ".join(["word"] * 999) + "."
    codeflash_output = sentence_count(sentence) # 1.15ms -> 29.0μs (3854% faster)
    codeflash_output = sentence_count(sentence, min_length=999) # 18.7μs -> 41.1μs (54.5% slower)
    codeflash_output = sentence_count(sentence, min_length=1000) # 18.4μs -> 38.6μs (52.3% slower)

def test_large_text_with_varied_sentence_lengths():
    # 500 short sentences, 500 long sentences (5 and 20 words)
    n_short = 500
    n_long = 500
    short_sentence = "a b c d e."
    long_sentence = " ".join(["word"] * 20) + "."
    text = " ".join([short_sentence]*n_short + [long_sentence]*n_long)
    # min_length=10 should only count long sentences
    codeflash_output = sentence_count(text, min_length=10) # 8.33ms -> 7.46ms (11.7% faster)
    # min_length=1 should count all
    codeflash_output = sentence_count(text, min_length=1) # 412μs -> 722μs (42.9% slower)

def test_large_text_with_unicode_and_punctuation():
    # 1000 sentences, each with emoji and punctuation
    n = 1000
    text = " ".join([f"Hello 😊! How are you?"] * n)
    # Each repetition has 2 sentences
    codeflash_output = sentence_count(text) # 15.8ms -> 15.3ms (3.27% faster)

def test_large_text_with_random_punctuation():
    # 1000 sentences with random punctuation at the end
    n = 1000
    punctuations = [".", "!", "?"]
    text = " ".join([f"Sentence {i}{random.choice(punctuations)}" for i in range(n)])
    codeflash_output = sentence_count(text) # 20.6ms -> 8.02ms (157% faster)

def test_large_text_with_abbreviations():
    # 1000 sentences, some with abbreviations
    n = 1000
    text = " ".join([f"Dr. Smith went home. He was tired."] * (n // 2))
    # Each repetition has 2 sentences
    codeflash_output = sentence_count(text) # 11.6ms -> 11.2ms (3.99% faster)

def test_large_text_with_newlines_and_tabs():
    # 500 sentences separated by newlines, 500 by tabs
    n = 500
    text1 = "\n".join([f"Sentence {i}." for i in range(n)])
    text2 = "\t".join([f"Sentence {i}." for i in range(n, 2*n)])
    text = text1 + "\n" + text2
    codeflash_output = sentence_count(text) # 21.3ms -> 8.57ms (149% faster)

def test_large_text_with_min_length_and_unicode():
    # 1000 sentences, half with 5 emojis, half with 1 emoji
    n = 1000
    text = " ".join(["😊 " * 5 + "." if i % 2 == 0 else "😊." for i in range(n)])
    # min_length=5 should count only even-indexed
    codeflash_output = sentence_count(text, min_length=5) # 7.88ms -> 7.83ms (0.679% faster)
    # min_length=1 should count all
    codeflash_output = sentence_count(text, min_length=1) # 573μs -> 742μs (22.7% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from __future__ import annotations

import string
import sys
import unicodedata
from functools import lru_cache
from typing import Final, List, Optional

# imports
import pytest  # used for our unit tests
from nltk import sent_tokenize as _sent_tokenize
from nltk import word_tokenize as _word_tokenize
from unstructured.partition.text_type import sentence_count


# Dummy trace_logger for test purposes (since real logger is not available)
class DummyLogger:
    def detail(self, msg):
        pass
trace_logger = DummyLogger()
from unstructured.partition.text_type import sentence_count

# unit tests

# ---------------- BASIC TEST CASES ----------------

def test_single_sentence():
    # A simple sentence
    codeflash_output = sentence_count("This is a sentence.") # 34.7μs -> 10.1μs (244% faster)

def test_multiple_sentences():
    # Two distinct sentences
    codeflash_output = sentence_count("This is the first sentence. This is the second.") # 77.6μs -> 33.9μs (129% faster)

def test_sentence_with_min_length_met():
    # Sentence with enough words for min_length
    codeflash_output = sentence_count("This is a long enough sentence.", min_length=5) # 33.7μs -> 10.9μs (209% faster)

def test_sentence_with_min_length_not_met():
    # Sentence with too few words for min_length
    codeflash_output = sentence_count("Too short.", min_length=3) # 28.4μs -> 11.1μs (156% faster)

def test_multiple_sentences_with_min_length():
    # Only one of two sentences meets min_length
    text = "Short. This one is long enough."
    codeflash_output = sentence_count(text, min_length=4) # 73.7μs -> 37.0μs (99.1% faster)

def test_sentence_with_punctuation():
    # Sentence with internal punctuation
    text = "Hello, world! How are you?"
    codeflash_output = sentence_count(text) # 66.2μs -> 29.9μs (121% faster)

def test_sentence_with_abbreviations():
    # Sentence with abbreviation that should not split sentences
    text = "Dr. Smith went to Washington. He arrived at 3 p.m. It was sunny."
    codeflash_output = sentence_count(text) # 117μs -> 61.3μs (92.3% faster)

# ---------------- EDGE TEST CASES ----------------

def test_empty_string():
    # Empty string should yield 0 sentences
    codeflash_output = sentence_count("") # 5.53μs -> 5.59μs (1.02% slower)

def test_whitespace_only():
    # String with only whitespace
    codeflash_output = sentence_count("   ") # 5.11μs -> 5.48μs (6.73% slower)

def test_no_sentence_ending_punctuation():
    # No periods, exclamation or question marks
    codeflash_output = sentence_count("This is not split into sentences") # 35.0μs -> 7.85μs (346% faster)

def test_sentence_with_only_punctuation():
    # String with only punctuation marks
    codeflash_output = sentence_count("!!!...???") # 47.1μs -> 41.7μs (12.9% faster)

def test_sentence_with_newlines():
    # Sentences split by newlines
    text = "First sentence.\nSecond sentence.\n\nThird sentence."
    codeflash_output = sentence_count(text) # 98.5μs -> 47.0μs (110% faster)

def test_sentence_with_multiple_spaces():
    # Sentences separated by multiple spaces
    text = "Sentence one.   Sentence two.      Sentence three."
    codeflash_output = sentence_count(text) # 95.3μs -> 42.1μs (126% faster)

def test_sentence_with_unicode_punctuation():
    # Sentences with unicode punctuation (em dash, ellipsis, etc.)
    text = "Hello… How are you—good?"
    codeflash_output = sentence_count(text) # 32.4μs -> 11.0μs (196% faster)

def test_sentence_with_non_ascii_characters():
    # Sentences with non-ASCII (e.g., accented) characters
    text = "C'est la vie. Voilà!"
    codeflash_output = sentence_count(text) # 73.6μs -> 34.4μs (114% faster)

def test_sentence_with_numbers_and_periods():
    # Numbers with periods should not split sentences
    text = "Version 3.2 is out. Please update."
    codeflash_output = sentence_count(text) # 66.2μs -> 29.1μs (128% faster)

def test_sentence_with_emoji():
    # Sentences with emoji
    text = "I am happy 😊. Are you?"
    codeflash_output = sentence_count(text) # 73.7μs -> 33.8μs (118% faster)

def test_sentence_with_tabs_and_spaces():
    # Sentences separated by tabs and spaces
    text = "First sentence.\tSecond sentence.   Third sentence."
    codeflash_output = sentence_count(text) # 45.6μs -> 44.1μs (3.20% faster)

def test_sentence_with_min_length_zero():
    # min_length=0 should count all sentences
    text = "One. Two. Three."
    codeflash_output = sentence_count(text, min_length=0) # 88.5μs -> 40.5μs (119% faster)

def test_sentence_with_min_length_equals_num_words():
    # min_length equal to the number of words in a sentence
    text = "This is five words."
    codeflash_output = sentence_count(text, min_length=5) # 31.6μs -> 12.3μs (157% faster)

def test_sentence_with_min_length_greater_than_any_sentence():
    # min_length greater than any sentence's word count
    text = "Short. Tiny. Small."
    codeflash_output = sentence_count(text, min_length=10) # 79.2μs -> 47.8μs (65.7% faster)

def test_sentence_with_trailing_and_leading_spaces():
    # Sentences with leading/trailing spaces
    text = "   First sentence. Second sentence.   "
    codeflash_output = sentence_count(text) # 50.5μs -> 29.3μs (72.2% faster)

def test_sentence_with_only_newlines():
    # Only newlines
    text = "\n\n\n"
    codeflash_output = sentence_count(text) # 5.11μs -> 5.40μs (5.31% slower)

def test_sentence_with_multiple_punctuation_marks():
    # Sentences ending with multiple punctuation marks
    text = "What?! Really?! Yes."
    codeflash_output = sentence_count(text) # 99.1μs -> 49.6μs (99.9% faster)

def test_sentence_with_quoted_text():
    # Sentences with quoted text
    text = '"Hello there." She said. "How are you?"'
    codeflash_output = sentence_count(text) # 97.5μs -> 58.4μs (66.9% faster)

def test_sentence_with_parentheses():
    # Sentences with parentheses
    text = "This is a sentence (with extra info). Another sentence."
    codeflash_output = sentence_count(text) # 77.1μs -> 32.4μs (138% faster)

def test_sentence_with_semicolons_and_colons():
    # Semicolons and colons should not split sentences
    text = "First part; still same sentence: more info. Next sentence."
    codeflash_output = sentence_count(text) # 72.1μs -> 28.5μs (153% faster)

def test_sentence_with_single_word():
    # Single word, with and without punctuation
    codeflash_output = sentence_count("Hello.") # 23.6μs -> 7.43μs (218% faster)
    codeflash_output = sentence_count("Hello") # 3.91μs -> 5.14μs (24.0% slower)

def test_sentence_with_multiple_periods():
    # Ellipsis should not split into multiple sentences
    text = "Wait... What happened?"
    codeflash_output = sentence_count(text) # 52.3μs -> 29.7μs (76.4% faster)

def test_sentence_with_uppercase_acronyms():
    # Acronyms with periods should not split sentences
    text = "I work at U.S.A. headquarters. It's nice."
    codeflash_output = sentence_count(text) # 90.6μs -> 50.5μs (79.4% faster)

def test_sentence_with_decimal_numbers():
    # Decimal numbers should not split sentences
    text = "The value is 3.14. That's pi."
    codeflash_output = sentence_count(text) # 75.2μs -> 35.4μs (112% faster)

def test_sentence_with_bullet_points():
    # Bullet points without ending punctuation
    text = "• First item\n• Second item\n• Third item"
    codeflash_output = sentence_count(text) # 35.3μs -> 10.4μs (240% faster)

def test_sentence_with_dash_and_hyphen():
    # Dashes and hyphens should not split sentences
    text = "Well-known fact—it's true. Next sentence."
    codeflash_output = sentence_count(text) # 55.1μs -> 35.1μs (56.9% faster)

# ---------------- LARGE SCALE TEST CASES ----------------

def test_large_text_many_sentences():
    # Test with a large number of sentences
    text = " ".join([f"Sentence number {i}." for i in range(1, 501)])
    codeflash_output = sentence_count(text) # 11.5ms -> 4.39ms (162% faster)

def test_large_text_with_min_length():
    # Large text, only some sentences meet min_length
    text = "Short. " * 500 + "This is a sufficiently long sentence for counting. " * 200
    # Only the long sentences (7 words) should be counted
    codeflash_output = sentence_count(text, min_length=7) # 6.24ms -> 6.19ms (0.716% faster)

def test_large_text_no_sentences():
    # Large text with no sentence-ending punctuation
    text = " ".join(["word"] * 1000)
    codeflash_output = sentence_count(text) # 1.12ms -> 27.9μs (3935% faster)

def test_large_text_all_sentences_filtered_by_min_length():
    # All sentences too short for min_length
    text = "A. B. C. D. " * 250
    codeflash_output = sentence_count(text, min_length=5) # 7.01ms -> 6.93ms (1.12% faster)

def test_large_text_with_varied_sentence_lengths():
    # Mix of short and long sentences
    short = "Hi. " * 300
    long = "This is a longer sentence for testing. " * 100
    text = short + long
    codeflash_output = sentence_count(text, min_length=6) # 3.39ms -> 3.33ms (1.73% faster)

def test_large_text_with_unicode_and_emoji():
    # Large text with unicode and emoji in sentences
    text = "😊 Hello world! " * 400 + "C'est la vie. Voilà! " * 100
    codeflash_output = sentence_count(text) # 5.41ms -> 5.16ms (4.84% faster)

def test_large_text_with_newlines_and_tabs():
    # Large text with newlines and tabs between sentences
    text = "\n".join([f"Sentence {i}.\t" for i in range(1, 501)])
    codeflash_output = sentence_count(text) # 11.0ms -> 4.52ms (144% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-sentence_count-mcglwwcn` and push.

Codeflash

codeflash-ai bot and others added 5 commits June 28, 2025 19:01
Here is your optimized code.
Major speedups:
- Replace list comprehensions with generator expressions in counting scenarios to avoid building intermediate lists.
- Use a simple word count (str.split() after punctuation removal) rather than an expensive word_tokenize call, since only the token count is needed and punctuation has already been stripped.
- When a min_length filter is set, avoid calling remove_punctuation and word_tokenize on sentences that can be ruled out cheaply: zero-length text is filtered out immediately.
- Remove unnecessary import of sent_tokenize and word_tokenize from **unstructured.nlp.tokenize** since we shadow them with the local definitions.

**All docstrings and core function signatures are preserved.**
**All external calls and logging are preserved.**
**All comments are preserved unless implementation has changed.**



**Summary of key changes for speed:**
- Stop double-importing and shadowing tokenize functions.
- Use `str.split()` instead of `word_tokenize` after removing punctuation when only the number of tokens is needed, which is far faster.
- Eliminate creation of temporary word lists purely for counting.
- Only call remove_punctuation once per sentence per iteration.

If you wish to maximize compatibility with sentences containing non-whitespace-separable tokens (e.g. CJK languages), consider further optimization on the token counting line as needed for your domain. Otherwise, `str.split()` after punctuation removal suffices and is far faster than a full NLP tokenizer.
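
One possible direction, sketched purely as an illustration (the helper name, the regex ranges, and the fallback choice are assumptions, not part of this PR): detect CJK characters and only then pay for a real tokenizer.

```python
import re

from unstructured.nlp.tokenize import word_tokenize

# Rough CJK detection: Hiragana/Katakana, CJK Unified Ideographs (plus
# Extension A), and the compatibility ideographs block.
_CJK_RE = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]")


def token_count(sentence: str) -> int:
    """Whitespace split for the common case; tokenizer fallback for CJK text."""
    if _CJK_RE.search(sentence):
        # word_tokenize is still not a true CJK segmenter; a domain-specific
        # tokenizer (e.g. jieba for Chinese) would give more faithful counts.
        return len(word_tokenize(sentence))
    return len(sentence.split())
```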

Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com>
Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com>
@cragwolfe cragwolfe added this pull request to the merge queue Aug 22, 2025
Merged via the queue into Unstructured-IO:main with commit 51425dd Aug 22, 2025
37 checks passed