Commit 51425dd

⚡️ Speed up function sentence_count by 59% (#4080)
Saurabh's comments - The changes look good, especially because they have been rigorously tested with a variety of cases, which makes me feel confident.

### 📄 59% (0.59x) speedup for ***`sentence_count` in `unstructured/partition/text_type.py`***

⏱️ Runtime : **`190 milliseconds`** **→** **`119 milliseconds`** (best of `39` runs)

### 📝 Explanation and details

Major speedups.
- Replace list comprehensions with generator expressions in counting scenarios to avoid building intermediate lists.
- Use a simple word count (split by space or with str.split()) after punctuation removal, rather than expensive word_tokenize call, since only token count is used and punctuation is already stripped.
- Avoid calling remove_punctuation and word_tokenize on already very short sentences if there's a min_length filter: filter quickly if text length is zero.
- If you wish to maximize compatibility with sentences containing non-whitespace-separable tokens (e.g. CJK languages), consider further optimization on the token counting line as needed for your domain. Otherwise, `str.split()` after punctuation removal suffices and is far faster than a full NLP tokenizer.
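The token-counting change listed above can be illustrated with a minimal sketch. This is not the library's code: `strip_punctuation` is a hypothetical stand-in for `remove_punctuation` that only handles ASCII punctuation, and the two `count_words_*` helpers are named purely for illustration.

```python
import string

# Hypothetical stand-in for unstructured's remove_punctuation (ASCII only).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text: str) -> str:
    return text.translate(_PUNCT_TABLE)


def count_words_list(text: str) -> int:
    # Builds a second, filtered list only to take its length.
    return len([tok for tok in strip_punctuation(text).split() if tok != "."])


def count_words_generator(text: str) -> int:
    # Same count; the filtered tokens are consumed one at a time by sum().
    return sum(1 for tok in strip_punctuation(text).split() if tok != ".")


sample = "Dr. Smith went home. He was tired."
print(count_words_list(sample), count_words_generator(sample))

# Caveat from the explanation: text with no whitespace between words
# is seen as a single token by str.split().
print(count_words_generator("こんにちは世界"))
```

Both helpers return the same count; the generator form only avoids the extra filtered list. The last call shows why whitespace splitting may undercount for scripts such as Japanese or Chinese.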
✅ **Correctness verification report:**

| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **21 Passed** |
| 🌀 Generated Regression Tests | ✅ **92 Passed** |
| ⏪ Replay Tests | ✅ **695 Passed** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | 100.0% |

<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---|:---|:---|:---|
| `partition/test_text_type.py::test_item_titles` | 92.5μs | 47.8μs | ✅93.6% |
| `partition/test_text_type.py::test_sentence_count` | 47.8μs | 4.67μs | ✅924% |
| `test_tracer_py__replay_test_0.py::test_unstructured_partition_text_type_sentence_count` | 4.37ms | 2.11ms | ✅107% |
| `test_tracer_py__replay_test_2.py::test_unstructured_partition_text_type_sentence_count` | 13.0ms | 4.69ms | ✅177% |
| `test_tracer_py__replay_test_3.py::test_unstructured_partition_text_type_sentence_count` | 5.29ms | 3.02ms | ✅75.3% |

</details>

<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>

```python
from __future__ import annotations

import random
import string
import sys
import unicodedata
from functools import lru_cache
from typing import Final, List, Optional

# imports
import pytest  # used for our unit tests
from nltk import sent_tokenize as _sent_tokenize
from nltk import word_tokenize as _word_tokenize
from unstructured.cleaners.core import remove_punctuation
from unstructured.logger import trace_logger
from unstructured.nlp.tokenize import sent_tokenize, word_tokenize
from unstructured.partition.text_type import sentence_count

# unit tests

# --------------------
# BASIC TEST CASES
# --------------------

def test_empty_string():
    # Empty string should return 0 sentences
    codeflash_output = sentence_count("") # 7.66μs -> 6.98μs (9.74% faster)

def test_single_sentence():
    # Single sentence with period
    codeflash_output = sentence_count("This is a sentence.") # 36.7μs -> 9.46μs (288% faster)

def test_single_sentence_no_punctuation():
    # Single sentence, no punctuation (should still count as 1 by NLTK)
    codeflash_output = sentence_count("This is a sentence") # 29.5μs -> 7.63μs (286% faster)

def test_two_sentences():
    # Two sentences separated by period
    codeflash_output = sentence_count("This is one. This is two.") # 72.4μs -> 37.0μs (95.7% faster)

def test_multiple_sentences_with_various_punctuation():
    # Sentences ending with ! and ?
    codeflash_output = sentence_count("Is this working? Yes! It is.") # 93.2μs -> 44.5μs (109% faster)

def test_sentence_with_abbreviation():
    # Abbreviations shouldn't split sentences
    codeflash_output = sentence_count("Dr. Smith went home. He was tired.") # 80.0μs -> 42.3μs (89.0% faster)

def test_sentence_with_ellipsis():
    # Ellipsis should not split sentences
    codeflash_output = sentence_count("Wait... what happened? I don't know.") # 76.9μs -> 42.5μs (81.0% faster)

def test_sentence_with_newlines():
    # Sentences separated by newlines
    codeflash_output = sentence_count("First sentence.\nSecond sentence.\nThird sentence.") # 91.7μs -> 43.4μs (111% faster)

def test_sentence_with_min_length_met():
    # min_length is met for all sentences
    codeflash_output = sentence_count("One two three. Four five six.", min_length=2) # 63.1μs -> 30.2μs (109% faster)

def test_sentence_with_min_length_not_met():
    # Only one sentence meets min_length
    codeflash_output = sentence_count("One. Two three four.", min_length=3) # 62.8μs -> 31.9μs (97.1% faster)

def test_sentence_with_min_length_none_met():
    # No sentence meets min_length
    codeflash_output = sentence_count("A. B.", min_length=2) # 60.9μs -> 32.3μs (88.5% faster)

def test_sentence_with_min_length_equals_length():
    # Sentence with exactly min_length words
    codeflash_output = sentence_count("One two three.", min_length=3) # 27.3μs -> 9.54μs (187% faster)

def test_sentence_with_trailing_space():
    # Sentence with trailing spaces
    codeflash_output = sentence_count("Hello world. ") # 27.8μs -> 8.60μs (223% faster)

# --------------------
# EDGE TEST CASES
# --------------------

def test_only_punctuation():
    # Only punctuation, no words
    codeflash_output = sentence_count("...!!!") # 39.1μs -> 33.5μs (16.8% faster)

def test_only_whitespace():
    # Only whitespace
    codeflash_output = sentence_count(" \n\t ") # 5.21μs -> 5.49μs (5.05% slower)

def test_sentence_with_numbers_and_symbols():
    # Sentence with numbers and symbols
    codeflash_output = sentence_count("12345! $%^&*()") # 66.6μs -> 32.7μs (104% faster)

def test_sentence_with_unicode_characters():
    # Sentences with unicode and emoji
    codeflash_output = sentence_count("Hello 😊. How are you?") # 75.9μs -> 37.7μs (102% faster)

def test_sentence_with_mixed_scripts():
    # Sentences with mixed scripts (e.g., English and Japanese)
    codeflash_output = sentence_count("Hello. こんにちは。How are you?") # 71.2μs -> 34.9μs (104% faster)

def test_sentence_with_multiple_spaces():
    # Sentences with irregular spacing
    codeflash_output = sentence_count("This is spaced. And so is this.") # 69.5μs -> 30.3μs (129% faster)

def test_sentence_with_no_word_characters():
    # Only punctuation and numbers
    codeflash_output = sentence_count("... 123 ...") # 42.1μs -> 25.5μs (65.2% faster)

def test_sentence_with_long_word():
    # Sentence with a single long word
    long_word = "a" * 100
    codeflash_output = sentence_count(f"{long_word}.") # 42.0μs -> 7.55μs (457% faster)

def test_sentence_with_long_word_and_min_length():
    # Sentence with long word, min_length > 1
    long_word = "a" * 100
    codeflash_output = sentence_count(f"{long_word}.", min_length=2) # 43.1μs -> 10.7μs (303% faster)

def test_sentence_with_only_abbreviation():
    # Sentence is only an abbreviation
    codeflash_output = sentence_count("U.S.A.") # 23.0μs -> 7.46μs (208% faster)

def test_sentence_with_nonbreaking_space():
    # Sentence with non-breaking space
    text = "Hello\u00A0world. How are you?"
    codeflash_output = sentence_count(text) # 74.4μs -> 37.3μs (99.6% faster)

def test_sentence_with_tab_characters():
    # Sentences separated by tabs
    text = "Hello world.\tHow are you?\tFine."
    codeflash_output = sentence_count(text) # 100μs -> 44.5μs (125% faster)

def test_sentence_with_multiple_punctuation_marks():
    # Sentences ending with multiple punctuation marks
    text = "Wait!! What?? Really..."
    codeflash_output = sentence_count(text) # 88.9μs -> 48.3μs (83.9% faster)

def test_sentence_with_leading_and_trailing_punctuation():
    # Sentence surrounded by punctuation
    text = "...Hello world!..."
    codeflash_output = sentence_count(text) # 25.9μs -> 8.64μs (200% faster)

def test_sentence_with_quotes():
    # Sentences with quotes
    text = '"Hello," she said. "How are you?"'
    codeflash_output = sentence_count(text) # 85.3μs -> 46.6μs (83.1% faster)

def test_sentence_with_parentheses():
    # Sentence with parentheses
    text = "This is a sentence (with parentheses). This is another."
    codeflash_output = sentence_count(text) # 74.7μs -> 33.5μs (123% faster)

def test_sentence_with_semicolons():
    # Semicolons should not split sentences
    text = "This is a sentence; this is not a new sentence."
    codeflash_output = sentence_count(text) # 37.1μs -> 8.74μs (324% faster)

def test_sentence_with_colons():
    # Colons should not split sentences
    text = "This is a sentence: it continues here."
    codeflash_output = sentence_count(text) # 33.2μs -> 8.44μs (294% faster)

def test_sentence_with_dash():
    # Dashes should not split sentences
    text = "This is a sentence - it continues here."
    codeflash_output = sentence_count(text) # 34.8μs -> 8.52μs (309% faster)

def test_sentence_with_multiple_dots():
    # Multiple dots but not ellipsis
    text = "This is a sentence.... This is another."
    codeflash_output = sentence_count(text) # 79.9μs -> 38.2μs (109% faster)

def test_sentence_with_min_length_and_punctuation():
    # min_length with sentences containing only punctuation
    text = "!!! ... ???"
    codeflash_output = sentence_count(text, min_length=1) # 80.2μs -> 77.0μs (4.21% faster)

def test_sentence_with_min_length_and_numbers():
    # min_length with numbers as words
    text = "1 2 3 4. 5 6."
    codeflash_output = sentence_count(text, min_length=4) # 69.6μs -> 34.6μs (101% faster)

def test_sentence_with_min_length_and_unicode():
    # min_length with unicode
    text = "😊 😊 😊 😊. Hello!"
    codeflash_output = sentence_count(text, min_length=4) # 76.1μs -> 41.7μs (82.5% faster)

def test_sentence_with_non_ascii_punctuation():
    # Sentence with non-ASCII punctuation (e.g., Chinese full stop)
    text = "Hello world。How are you?"
    codeflash_output = sentence_count(text) # 31.9μs -> 9.34μs (242% faster)

def test_sentence_with_repeated_newlines():
    # Sentences separated by multiple newlines
    text = "First sentence.\n\n\nSecond sentence."
    codeflash_output = sentence_count(text) # 71.9μs -> 33.4μs (115% faster)

# --------------------
# LARGE SCALE TEST CASES
# --------------------

def test_large_number_of_sentences():
    # 1000 sentences, each "Sentence X."
    n = 1000
    text = " ".join([f"Sentence {i}." for i in range(n)])
    codeflash_output = sentence_count(text) # 21.2ms -> 8.43ms (151% faster)

def test_large_number_of_sentences_with_min_length():
    # 1000 sentences, every even-indexed has 3 words, odd-indexed has 1 word
    n = 1000
    sentences = []
    for i in range(n):
        if i % 2 == 0:
            sentences.append(f"Word1 Word2 Word3.")
        else:
            sentences.append(f"Word.")
    text = " ".join(sentences)
    # Only even-indexed sentences should count for min_length=3
    codeflash_output = sentence_count(text, min_length=3)

def test_large_sentence():
    # One very long sentence (999 words)
    sentence = " ".join(["word"] * 999) + "."
    codeflash_output = sentence_count(sentence) # 1.15ms -> 29.0μs (3854% faster)
    codeflash_output = sentence_count(sentence, min_length=999) # 18.7μs -> 41.1μs (54.5% slower)
    codeflash_output = sentence_count(sentence, min_length=1000) # 18.4μs -> 38.6μs (52.3% slower)

def test_large_text_with_varied_sentence_lengths():
    # 500 short sentences, 500 long sentences (5 and 20 words)
    n_short = 500
    n_long = 500
    short_sentence = "a b c d e."
    long_sentence = " ".join(["word"] * 20) + "."
    text = " ".join([short_sentence]*n_short + [long_sentence]*n_long)
    # min_length=10 should only count long sentences
    codeflash_output = sentence_count(text, min_length=10) # 8.33ms -> 7.46ms (11.7% faster)
    # min_length=1 should count all
    codeflash_output = sentence_count(text, min_length=1) # 412μs -> 722μs (42.9% slower)

def test_large_text_with_unicode_and_punctuation():
    # 1000 sentences, each with emoji and punctuation
    n = 1000
    text = " ".join([f"Hello 😊! How are you?"] * n)
    # Each repetition has 2 sentences
    codeflash_output = sentence_count(text) # 15.8ms -> 15.3ms (3.27% faster)

def test_large_text_with_random_punctuation():
    # 1000 sentences with random punctuation at the end
    n = 1000
    punctuations = [".", "!", "?"]
    text = " ".join([f"Sentence {i}{random.choice(punctuations)}" for i in range(n)])
    codeflash_output = sentence_count(text) # 20.6ms -> 8.02ms (157% faster)

def test_large_text_with_abbreviations():
    # 1000 sentences, some with abbreviations
    n = 1000
    text = " ".join([f"Dr. Smith went home. He was tired."] * (n // 2))
    # Each repetition has 2 sentences
    codeflash_output = sentence_count(text) # 11.6ms -> 11.2ms (3.99% faster)

def test_large_text_with_newlines_and_tabs():
    # 500 sentences separated by newlines, 500 by tabs
    n = 500
    text1 = "\n".join([f"Sentence {i}." for i in range(n)])
    text2 = "\t".join([f"Sentence {i}." for i in range(n, 2*n)])
    text = text1 + "\n" + text2
    codeflash_output = sentence_count(text) # 21.3ms -> 8.57ms (149% faster)

def test_large_text_with_min_length_and_unicode():
    # 1000 sentences, half with 5 emojis, half with 1 emoji
    n = 1000
    text = " ".join(["😊 " * 5 + "." if i % 2 == 0 else "😊." for i in range(n)])
    # min_length=5 should count only even-indexed
    codeflash_output = sentence_count(text, min_length=5) # 7.88ms -> 7.83ms (0.679% faster)
    # min_length=1 should count all
    codeflash_output = sentence_count(text, min_length=1) # 573μs -> 742μs (22.7% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from __future__ import annotations

import string
import sys
import unicodedata
from functools import lru_cache
from typing import Final, List, Optional

# imports
import pytest  # used for our unit tests
from nltk import sent_tokenize as _sent_tokenize
from nltk import word_tokenize as _word_tokenize
from unstructured.partition.text_type import sentence_count


# Dummy trace_logger for test purposes (since real logger is not available)
class DummyLogger:
    def detail(self, msg):
        pass

trace_logger = DummyLogger()
from unstructured.partition.text_type import sentence_count

# unit tests

# ---------------- BASIC TEST CASES ----------------

def test_single_sentence():
    # A simple sentence
    codeflash_output = sentence_count("This is a sentence.") # 34.7μs -> 10.1μs (244% faster)

def test_multiple_sentences():
    # Two distinct sentences
    codeflash_output = sentence_count("This is the first sentence. This is the second.") # 77.6μs -> 33.9μs (129% faster)

def test_sentence_with_min_length_met():
    # Sentence with enough words for min_length
    codeflash_output = sentence_count("This is a long enough sentence.", min_length=5) # 33.7μs -> 10.9μs (209% faster)

def test_sentence_with_min_length_not_met():
    # Sentence with too few words for min_length
    codeflash_output = sentence_count("Too short.", min_length=3) # 28.4μs -> 11.1μs (156% faster)

def test_multiple_sentences_with_min_length():
    # Only one of two sentences meets min_length
    text = "Short. This one is long enough."
    codeflash_output = sentence_count(text, min_length=4) # 73.7μs -> 37.0μs (99.1% faster)

def test_sentence_with_punctuation():
    # Sentence with internal punctuation
    text = "Hello, world! How are you?"
    codeflash_output = sentence_count(text) # 66.2μs -> 29.9μs (121% faster)

def test_sentence_with_abbreviations():
    # Sentence with abbreviation that should not split sentences
    text = "Dr. Smith went to Washington. He arrived at 3 p.m. It was sunny."
    codeflash_output = sentence_count(text) # 117μs -> 61.3μs (92.3% faster)

# ---------------- EDGE TEST CASES ----------------

def test_empty_string():
    # Empty string should yield 0 sentences
    codeflash_output = sentence_count("") # 5.53μs -> 5.59μs (1.02% slower)

def test_whitespace_only():
    # String with only whitespace
    codeflash_output = sentence_count(" ") # 5.11μs -> 5.48μs (6.73% slower)

def test_no_sentence_ending_punctuation():
    # No periods, exclamation or question marks
    codeflash_output = sentence_count("This is not split into sentences") # 35.0μs -> 7.85μs (346% faster)

def test_sentence_with_only_punctuation():
    # String with only punctuation marks
    codeflash_output = sentence_count("!!!...???") # 47.1μs -> 41.7μs (12.9% faster)

def test_sentence_with_newlines():
    # Sentences split by newlines
    text = "First sentence.\nSecond sentence.\n\nThird sentence."
    codeflash_output = sentence_count(text) # 98.5μs -> 47.0μs (110% faster)

def test_sentence_with_multiple_spaces():
    # Sentences separated by multiple spaces
    text = "Sentence one. Sentence two. Sentence three."
    codeflash_output = sentence_count(text) # 95.3μs -> 42.1μs (126% faster)

def test_sentence_with_unicode_punctuation():
    # Sentences with unicode punctuation (em dash, ellipsis, etc.)
    text = "Hello… How are you—good?"
    codeflash_output = sentence_count(text) # 32.4μs -> 11.0μs (196% faster)

def test_sentence_with_non_ascii_characters():
    # Sentences with non-ASCII (e.g., accented) characters
    text = "C'est la vie. Voilà!"
    codeflash_output = sentence_count(text) # 73.6μs -> 34.4μs (114% faster)

def test_sentence_with_numbers_and_periods():
    # Numbers with periods should not split sentences
    text = "Version 3.2 is out. Please update."
    codeflash_output = sentence_count(text) # 66.2μs -> 29.1μs (128% faster)

def test_sentence_with_emoji():
    # Sentences with emoji
    text = "I am happy 😊. Are you?"
    codeflash_output = sentence_count(text) # 73.7μs -> 33.8μs (118% faster)

def test_sentence_with_tabs_and_spaces():
    # Sentences separated by tabs and spaces
    text = "First sentence.\tSecond sentence. Third sentence."
    codeflash_output = sentence_count(text) # 45.6μs -> 44.1μs (3.20% faster)

def test_sentence_with_min_length_zero():
    # min_length=0 should count all sentences
    text = "One. Two. Three."
    codeflash_output = sentence_count(text, min_length=0) # 88.5μs -> 40.5μs (119% faster)

def test_sentence_with_min_length_equals_num_words():
    # min_length equal to the number of words in a sentence
    text = "This is five words."
    codeflash_output = sentence_count(text, min_length=5) # 31.6μs -> 12.3μs (157% faster)

def test_sentence_with_min_length_greater_than_any_sentence():
    # min_length greater than any sentence's word count
    text = "Short. Tiny. Small."
    codeflash_output = sentence_count(text, min_length=10) # 79.2μs -> 47.8μs (65.7% faster)

def test_sentence_with_trailing_and_leading_spaces():
    # Sentences with leading/trailing spaces
    text = " First sentence. Second sentence. "
    codeflash_output = sentence_count(text) # 50.5μs -> 29.3μs (72.2% faster)

def test_sentence_with_only_newlines():
    # Only newlines
    text = "\n\n\n"
    codeflash_output = sentence_count(text) # 5.11μs -> 5.40μs (5.31% slower)

def test_sentence_with_multiple_punctuation_marks():
    # Sentences ending with multiple punctuation marks
    text = "What?! Really?! Yes."
    codeflash_output = sentence_count(text) # 99.1μs -> 49.6μs (99.9% faster)

def test_sentence_with_quoted_text():
    # Sentences with quoted text
    text = '"Hello there." She said. "How are you?"'
    codeflash_output = sentence_count(text) # 97.5μs -> 58.4μs (66.9% faster)

def test_sentence_with_parentheses():
    # Sentences with parentheses
    text = "This is a sentence (with extra info). Another sentence."
    codeflash_output = sentence_count(text) # 77.1μs -> 32.4μs (138% faster)

def test_sentence_with_semicolons_and_colons():
    # Semicolons and colons should not split sentences
    text = "First part; still same sentence: more info. Next sentence."
    codeflash_output = sentence_count(text) # 72.1μs -> 28.5μs (153% faster)

def test_sentence_with_single_word():
    # Single word, with and without punctuation
    codeflash_output = sentence_count("Hello.") # 23.6μs -> 7.43μs (218% faster)
    codeflash_output = sentence_count("Hello") # 3.91μs -> 5.14μs (24.0% slower)

def test_sentence_with_multiple_periods():
    # Ellipsis should not split into multiple sentences
    text = "Wait... What happened?"
    codeflash_output = sentence_count(text) # 52.3μs -> 29.7μs (76.4% faster)

def test_sentence_with_uppercase_acronyms():
    # Acronyms with periods should not split sentences
    text = "I work at U.S.A. headquarters. It's nice."
    codeflash_output = sentence_count(text) # 90.6μs -> 50.5μs (79.4% faster)

def test_sentence_with_decimal_numbers():
    # Decimal numbers should not split sentences
    text = "The value is 3.14. That's pi."
    codeflash_output = sentence_count(text) # 75.2μs -> 35.4μs (112% faster)

def test_sentence_with_bullet_points():
    # Bullet points without ending punctuation
    text = "• First item\n• Second item\n• Third item"
    codeflash_output = sentence_count(text) # 35.3μs -> 10.4μs (240% faster)

def test_sentence_with_dash_and_hyphen():
    # Dashes and hyphens should not split sentences
    text = "Well-known fact—it's true. Next sentence."
    codeflash_output = sentence_count(text) # 55.1μs -> 35.1μs (56.9% faster)

# ---------------- LARGE SCALE TEST CASES ----------------

def test_large_text_many_sentences():
    # Test with a large number of sentences
    text = " ".join([f"Sentence number {i}." for i in range(1, 501)])
    codeflash_output = sentence_count(text) # 11.5ms -> 4.39ms (162% faster)

def test_large_text_with_min_length():
    # Large text, only some sentences meet min_length
    text = "Short. " * 500 + "This is a sufficiently long sentence for counting. " * 200
    # Only the long sentences (7 words) should be counted
    codeflash_output = sentence_count(text, min_length=7) # 6.24ms -> 6.19ms (0.716% faster)

def test_large_text_no_sentences():
    # Large text with no sentence-ending punctuation
    text = " ".join(["word"] * 1000)
    codeflash_output = sentence_count(text) # 1.12ms -> 27.9μs (3935% faster)

def test_large_text_all_sentences_filtered_by_min_length():
    # All sentences too short for min_length
    text = "A. B. C. D. " * 250
    codeflash_output = sentence_count(text, min_length=5) # 7.01ms -> 6.93ms (1.12% faster)

def test_large_text_with_varied_sentence_lengths():
    # Mix of short and long sentences
    short = "Hi. " * 300
    long = "This is a longer sentence for testing. " * 100
    text = short + long
    codeflash_output = sentence_count(text, min_length=6) # 3.39ms -> 3.33ms (1.73% faster)

def test_large_text_with_unicode_and_emoji():
    # Large text with unicode and emoji in sentences
    text = "😊 Hello world! " * 400 + "C'est la vie. Voilà! " * 100
    codeflash_output = sentence_count(text) # 5.41ms -> 5.16ms (4.84% faster)

def test_large_text_with_newlines_and_tabs():
    # Large text with newlines and tabs between sentences
    text = "\n".join([f"Sentence {i}.\t" for i in range(1, 501)])
    codeflash_output = sentence_count(text) # 11.0ms -> 4.52ms (144% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

</details>

To edit these changes `git checkout codeflash/optimize-sentence_count-mcglwwcn` and push.
[![Codeflash](https://img.shields.io/badge/Optimized%20with-Codeflash-yellow?style=flat&color=%23ffc428)](https://codeflash.ai)

---------

Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
1 parent 57cadf8 commit 51425dd

2 files changed

+12 -9 lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 ## 0.18.14-dev0
 
 ### Enhancements
+- Speed up function sentence_count by 59% (codeflash)
 
 - Speed up function `check_for_nltk_package` by 111% (codeflash)
 - Speed up function `under_non_alpha_ratio` by 76% (codeflash)

unstructured/partition/text_type.py

Lines changed: 11 additions & 9 deletions
@@ -219,15 +219,17 @@ def sentence_count(text: str, min_length: Optional[int] = None) -> int:
     sentences = sent_tokenize(text)
     count = 0
     for sentence in sentences:
-        sentence = remove_punctuation(sentence)
-        words = [word for word in word_tokenize(sentence) if word != "."]
-        if min_length and len(words) < min_length:
-            trace_logger.detail(  # type: ignore
-                f"Sentence does not exceed {min_length} word tokens, it will not count toward "
-                "sentence count.\n"
-                f"{sentence}",
-            )
-            continue
+        stripped = remove_punctuation(sentence)
+        # Fast token count after punctuation is removed: just split on whitespace
+        if min_length:
+            word_count = sum(1 for token in stripped.split() if token != ".")
+            if word_count < min_length:
+                trace_logger.detail(  # type: ignore
+                    f"Sentence does not exceed {min_length} word tokens, it will not count toward "
+                    "sentence count.\n"
+                    f"{stripped}",
+                )
+                continue
         count += 1
     return count
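For readers who want to run the new control flow in isolation, the following is a sketch under stated assumptions, not the library function. `split_sentences` is a naive regex stand-in for `sent_tokenize` (it will not handle abbreviations the way NLTK does), `strip_punctuation` is a hypothetical ASCII-only stand-in for `remove_punctuation`, and the trace logging is omitted.

```python
import re
import string
from typing import List, Optional

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def split_sentences(text: str) -> List[str]:
    # Naive stand-in for sent_tokenize: break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def strip_punctuation(text: str) -> str:
    # Hypothetical stand-in for remove_punctuation (ASCII punctuation only).
    return text.translate(_PUNCT_TABLE)


def sentence_count_sketch(text: str, min_length: Optional[int] = None) -> int:
    count = 0
    for sentence in split_sentences(text):
        # As in the diff, punctuation is stripped before the min_length check.
        stripped = strip_punctuation(sentence)
        if min_length:
            # Words are counted only when a minimum length is requested.
            word_count = sum(1 for token in stripped.split() if token != ".")
            if word_count < min_length:
                continue
        count += 1
    return count


print(sentence_count_sketch("This is one. This is two."))
print(sentence_count_sketch("One. Two three four.", min_length=3))
```

When `min_length` is `None` or `0`, the word count is skipped entirely, which is consistent with the largest speedups in the timing tables appearing for calls without `min_length`.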
