
Conversation

@misrasaurabh1
Contributor

Saurabh's comments - The changes look good, especially because they have been rigorously tested against a variety of cases, which gives me confidence in them.

📄 59% (0.59x) speedup for sentence_count in unstructured/partition/text_type.py

⏱️ Runtime : 190 milliseconds → 119 milliseconds (best of 39 runs)

📝 Explanation and details

Major speedups:

  • Replace list comprehensions with generator expressions in counting scenarios to avoid building intermediate lists.
  • Use a simple word count (str.split() after punctuation removal) rather than an expensive word_tokenize call, since only the token count is needed and punctuation has already been stripped.
  • When a min_length filter is set, avoid calling remove_punctuation and word_tokenize on sentences that can be ruled out cheaply: zero-length text is filtered out immediately.

If you wish to maximize compatibility with sentences containing non-whitespace-separable tokens (e.g. CJK languages), consider further optimization on the token counting line as needed for your domain. Otherwise, str.split() after punctuation removal suffices and is far faster than a full NLP tokenizer.
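
For readers who want the shape of the change without opening the diff, the following is a minimal sketch of what the bullets above describe. It is an illustration, not the merged code: the real function in unstructured/partition/text_type.py also routes skipped sentences through trace_logger.detail, which is elided here.

```python
from typing import Optional

from unstructured.cleaners.core import remove_punctuation
from unstructured.nlp.tokenize import sent_tokenize


def sentence_count_sketch(text: str, min_length: Optional[int] = None) -> int:
    """Count sentences; with min_length, ignore sentences with fewer words."""
    sentences = sent_tokenize(text)
    if not min_length:
        # No length filter: every tokenized sentence counts.
        return len(sentences)
    # A generator expression replaces the old list comprehension, so nothing is
    # materialized just to be counted, and remove_punctuation runs once per
    # sentence. str.split() stands in for the far more expensive word_tokenize
    # call, since only the token count matters here.
    return sum(
        1
        for sentence in sentences
        if sentence.strip()
        and len(remove_punctuation(sentence).split()) >= min_length
    )
```

For example, sentence_count_sketch("One two three. Four five.", min_length=3) tokenizes two sentences but counts only the first, returning 1.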

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 21 Passed
🌀 Generated Regression Tests 92 Passed
⏪ Replay Tests 695 Passed
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage 100.0%
⚙️ Existing Unit Tests and Runtime
Test File::Test Function Original ⏱️ Optimized ⏱️ Speedup
partition/test_text_type.py::test_item_titles 92.5μs 47.8μs ✅93.6%
partition/test_text_type.py::test_sentence_count 47.8μs 4.67μs ✅924%
test_tracer_py__replay_test_0.py::test_unstructured_partition_text_type_sentence_count 4.37ms 2.11ms ✅107%
test_tracer_py__replay_test_2.py::test_unstructured_partition_text_type_sentence_count 13.0ms 4.69ms ✅177%
test_tracer_py__replay_test_3.py::test_unstructured_partition_text_type_sentence_count 5.29ms 3.02ms ✅75.3%
🌀 Generated Regression Tests and Runtime
from __future__ import annotations

import random
import string
import sys
import unicodedata
from functools import lru_cache
from typing import Final, List, Optional

# imports
import pytest  # used for our unit tests
from nltk import sent_tokenize as _sent_tokenize
from nltk import word_tokenize as _word_tokenize
from unstructured.cleaners.core import remove_punctuation
from unstructured.logger import trace_logger
from unstructured.nlp.tokenize import sent_tokenize, word_tokenize
from unstructured.partition.text_type import sentence_count

# unit tests

# --------------------
# BASIC TEST CASES
# --------------------

def test_empty_string():
    # Empty string should return 0 sentences
    codeflash_output = sentence_count("") # 7.66μs -> 6.98μs (9.74% faster)

def test_single_sentence():
    # Single sentence with period
    codeflash_output = sentence_count("This is a sentence.") # 36.7μs -> 9.46μs (288% faster)

def test_single_sentence_no_punctuation():
    # Single sentence, no punctuation (should still count as 1 by NLTK)
    codeflash_output = sentence_count("This is a sentence") # 29.5μs -> 7.63μs (286% faster)

def test_two_sentences():
    # Two sentences separated by period
    codeflash_output = sentence_count("This is one. This is two.") # 72.4μs -> 37.0μs (95.7% faster)

def test_multiple_sentences_with_various_punctuation():
    # Sentences ending with ! and ?
    codeflash_output = sentence_count("Is this working? Yes! It is.") # 93.2μs -> 44.5μs (109% faster)

def test_sentence_with_abbreviation():
    # Abbreviations shouldn't split sentences
    codeflash_output = sentence_count("Dr. Smith went home. He was tired.") # 80.0μs -> 42.3μs (89.0% faster)

def test_sentence_with_ellipsis():
    # Ellipsis should not split sentences
    codeflash_output = sentence_count("Wait... what happened? I don't know.") # 76.9μs -> 42.5μs (81.0% faster)

def test_sentence_with_newlines():
    # Sentences separated by newlines
    codeflash_output = sentence_count("First sentence.\nSecond sentence.\nThird sentence.") # 91.7μs -> 43.4μs (111% faster)

def test_sentence_with_min_length_met():
    # min_length is met for all sentences
    codeflash_output = sentence_count("One two three. Four five six.", min_length=2) # 63.1μs -> 30.2μs (109% faster)

def test_sentence_with_min_length_not_met():
    # Only one sentence meets min_length
    codeflash_output = sentence_count("One. Two three four.", min_length=3) # 62.8μs -> 31.9μs (97.1% faster)

def test_sentence_with_min_length_none_met():
    # No sentence meets min_length
    codeflash_output = sentence_count("A. B.", min_length=2) # 60.9μs -> 32.3μs (88.5% faster)

def test_sentence_with_min_length_equals_length():
    # Sentence with exactly min_length words
    codeflash_output = sentence_count("One two three.", min_length=3) # 27.3μs -> 9.54μs (187% faster)

def test_sentence_with_trailing_space():
    # Sentence with trailing spaces
    codeflash_output = sentence_count("Hello world.   ") # 27.8μs -> 8.60μs (223% faster)

# --------------------
# EDGE TEST CASES
# --------------------

def test_only_punctuation():
    # Only punctuation, no words
    codeflash_output = sentence_count("...!!!") # 39.1μs -> 33.5μs (16.8% faster)

def test_only_whitespace():
    # Only whitespace
    codeflash_output = sentence_count("    \n\t   ") # 5.21μs -> 5.49μs (5.05% slower)

def test_sentence_with_numbers_and_symbols():
    # Sentence with numbers and symbols
    codeflash_output = sentence_count("12345! $%^&*()") # 66.6μs -> 32.7μs (104% faster)

def test_sentence_with_unicode_characters():
    # Sentences with unicode and emoji
    codeflash_output = sentence_count("Hello 😊. How are you?") # 75.9μs -> 37.7μs (102% faster)

def test_sentence_with_mixed_scripts():
    # Sentences with mixed scripts (e.g., English and Japanese)
    codeflash_output = sentence_count("Hello. こんにちは。How are you?") # 71.2μs -> 34.9μs (104% faster)

def test_sentence_with_multiple_spaces():
    # Sentences with irregular spacing
    codeflash_output = sentence_count("This   is   spaced.   And   so   is   this.") # 69.5μs -> 30.3μs (129% faster)

def test_sentence_with_no_word_characters():
    # Only punctuation and numbers
    codeflash_output = sentence_count("... 123 ...") # 42.1μs -> 25.5μs (65.2% faster)

def test_sentence_with_long_word():
    # Sentence with a single long word
    long_word = "a" * 100
    codeflash_output = sentence_count(f"{long_word}.") # 42.0μs -> 7.55μs (457% faster)

def test_sentence_with_long_word_and_min_length():
    # Sentence with long word, min_length > 1
    long_word = "a" * 100
    codeflash_output = sentence_count(f"{long_word}.", min_length=2) # 43.1μs -> 10.7μs (303% faster)

def test_sentence_with_only_abbreviation():
    # Sentence is only an abbreviation
    codeflash_output = sentence_count("U.S.A.") # 23.0μs -> 7.46μs (208% faster)

def test_sentence_with_nonbreaking_space():
    # Sentence with non-breaking space
    text = "Hello\u00A0world. How are you?"
    codeflash_output = sentence_count(text) # 74.4μs -> 37.3μs (99.6% faster)

def test_sentence_with_tab_characters():
    # Sentences separated by tabs
    text = "Hello world.\tHow are you?\tFine."
    codeflash_output = sentence_count(text) # 100μs -> 44.5μs (125% faster)

def test_sentence_with_multiple_punctuation_marks():
    # Sentences ending with multiple punctuation marks
    text = "Wait!! What?? Really..."
    codeflash_output = sentence_count(text) # 88.9μs -> 48.3μs (83.9% faster)

def test_sentence_with_leading_and_trailing_punctuation():
    # Sentence surrounded by punctuation
    text = "...Hello world!..."
    codeflash_output = sentence_count(text) # 25.9μs -> 8.64μs (200% faster)

def test_sentence_with_quotes():
    # Sentences with quotes
    text = '"Hello," she said. "How are you?"'
    codeflash_output = sentence_count(text) # 85.3μs -> 46.6μs (83.1% faster)

def test_sentence_with_parentheses():
    # Sentence with parentheses
    text = "This is a sentence (with parentheses). This is another."
    codeflash_output = sentence_count(text) # 74.7μs -> 33.5μs (123% faster)

def test_sentence_with_semicolons():
    # Semicolons should not split sentences
    text = "This is a sentence; this is not a new sentence."
    codeflash_output = sentence_count(text) # 37.1μs -> 8.74μs (324% faster)

def test_sentence_with_colons():
    # Colons should not split sentences
    text = "This is a sentence: it continues here."
    codeflash_output = sentence_count(text) # 33.2μs -> 8.44μs (294% faster)

def test_sentence_with_dash():
    # Dashes should not split sentences
    text = "This is a sentence - it continues here."
    codeflash_output = sentence_count(text) # 34.8μs -> 8.52μs (309% faster)

def test_sentence_with_multiple_dots():
    # Multiple dots but not ellipsis
    text = "This is a sentence.... This is another."
    codeflash_output = sentence_count(text) # 79.9μs -> 38.2μs (109% faster)

def test_sentence_with_min_length_and_punctuation():
    # min_length with sentences containing only punctuation
    text = "!!! ... ???"
    codeflash_output = sentence_count(text, min_length=1) # 80.2μs -> 77.0μs (4.21% faster)

def test_sentence_with_min_length_and_numbers():
    # min_length with numbers as words
    text = "1 2 3 4. 5 6."
    codeflash_output = sentence_count(text, min_length=4) # 69.6μs -> 34.6μs (101% faster)

def test_sentence_with_min_length_and_unicode():
    # min_length with unicode
    text = "😊 😊 😊 😊. Hello!"
    codeflash_output = sentence_count(text, min_length=4) # 76.1μs -> 41.7μs (82.5% faster)

def test_sentence_with_non_ascii_punctuation():
    # Sentence with non-ASCII punctuation (e.g., Chinese full stop)
    text = "Hello world。How are you?"
    codeflash_output = sentence_count(text) # 31.9μs -> 9.34μs (242% faster)

def test_sentence_with_repeated_newlines():
    # Sentences separated by multiple newlines
    text = "First sentence.\n\n\nSecond sentence."
    codeflash_output = sentence_count(text) # 71.9μs -> 33.4μs (115% faster)

# --------------------
# LARGE SCALE TEST CASES
# --------------------

def test_large_number_of_sentences():
    # 1000 sentences, each "Sentence X."
    n = 1000
    text = " ".join([f"Sentence {i}." for i in range(n)])
    codeflash_output = sentence_count(text) # 21.2ms -> 8.43ms (151% faster)

def test_large_number_of_sentences_with_min_length():
    # 1000 sentences, every even-indexed has 3 words, odd-indexed has 1 word
    n = 1000
    sentences = []
    for i in range(n):
        if i % 2 == 0:
            sentences.append(f"Word1 Word2 Word3.")
        else:
            sentences.append(f"Word.")
    text = " ".join(sentences)
    # Only even-indexed sentences should count for min_length=3
    codeflash_output = sentence_count(text, min_length=3)

def test_large_sentence():
    # One very long sentence (999 words)
    sentence = " ".join(["word"] * 999) + "."
    codeflash_output = sentence_count(sentence) # 1.15ms -> 29.0μs (3854% faster)
    codeflash_output = sentence_count(sentence, min_length=999) # 18.7μs -> 41.1μs (54.5% slower)
    codeflash_output = sentence_count(sentence, min_length=1000) # 18.4μs -> 38.6μs (52.3% slower)

def test_large_text_with_varied_sentence_lengths():
    # 500 short sentences, 500 long sentences (5 and 20 words)
    n_short = 500
    n_long = 500
    short_sentence = "a b c d e."
    long_sentence = " ".join(["word"] * 20) + "."
    text = " ".join([short_sentence]*n_short + [long_sentence]*n_long)
    # min_length=10 should only count long sentences
    codeflash_output = sentence_count(text, min_length=10) # 8.33ms -> 7.46ms (11.7% faster)
    # min_length=1 should count all
    codeflash_output = sentence_count(text, min_length=1) # 412μs -> 722μs (42.9% slower)

def test_large_text_with_unicode_and_punctuation():
    # 1000 sentences, each with emoji and punctuation
    n = 1000
    text = " ".join([f"Hello 😊! How are you?"] * n)
    # Each repetition has 2 sentences
    codeflash_output = sentence_count(text) # 15.8ms -> 15.3ms (3.27% faster)

def test_large_text_with_random_punctuation():
    # 1000 sentences with random punctuation at the end
    n = 1000
    punctuations = [".", "!", "?"]
    text = " ".join([f"Sentence {i}{random.choice(punctuations)}" for i in range(n)])
    codeflash_output = sentence_count(text) # 20.6ms -> 8.02ms (157% faster)

def test_large_text_with_abbreviations():
    # 1000 sentences, some with abbreviations
    n = 1000
    text = " ".join([f"Dr. Smith went home. He was tired."] * (n // 2))
    # Each repetition has 2 sentences
    codeflash_output = sentence_count(text) # 11.6ms -> 11.2ms (3.99% faster)

def test_large_text_with_newlines_and_tabs():
    # 500 sentences separated by newlines, 500 by tabs
    n = 500
    text1 = "\n".join([f"Sentence {i}." for i in range(n)])
    text2 = "\t".join([f"Sentence {i}." for i in range(n, 2*n)])
    text = text1 + "\n" + text2
    codeflash_output = sentence_count(text) # 21.3ms -> 8.57ms (149% faster)

def test_large_text_with_min_length_and_unicode():
    # 1000 sentences, half with 5 emojis, half with 1 emoji
    n = 1000
    text = " ".join(["😊 " * 5 + "." if i % 2 == 0 else "😊." for i in range(n)])
    # min_length=5 should count only even-indexed
    codeflash_output = sentence_count(text, min_length=5) # 7.88ms -> 7.83ms (0.679% faster)
    # min_length=1 should count all
    codeflash_output = sentence_count(text, min_length=1) # 573μs -> 742μs (22.7% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from __future__ import annotations

import string
import sys
import unicodedata
from functools import lru_cache
from typing import Final, List, Optional

# imports
import pytest  # used for our unit tests
from nltk import sent_tokenize as _sent_tokenize
from nltk import word_tokenize as _word_tokenize
from unstructured.partition.text_type import sentence_count


# Dummy trace_logger for test purposes (since real logger is not available)
class DummyLogger:
    def detail(self, msg):
        pass
trace_logger = DummyLogger()
from unstructured.partition.text_type import sentence_count

# unit tests

# ---------------- BASIC TEST CASES ----------------

def test_single_sentence():
    # A simple sentence
    codeflash_output = sentence_count("This is a sentence.") # 34.7μs -> 10.1μs (244% faster)

def test_multiple_sentences():
    # Two distinct sentences
    codeflash_output = sentence_count("This is the first sentence. This is the second.") # 77.6μs -> 33.9μs (129% faster)

def test_sentence_with_min_length_met():
    # Sentence with enough words for min_length
    codeflash_output = sentence_count("This is a long enough sentence.", min_length=5) # 33.7μs -> 10.9μs (209% faster)

def test_sentence_with_min_length_not_met():
    # Sentence with too few words for min_length
    codeflash_output = sentence_count("Too short.", min_length=3) # 28.4μs -> 11.1μs (156% faster)

def test_multiple_sentences_with_min_length():
    # Only one of two sentences meets min_length
    text = "Short. This one is long enough."
    codeflash_output = sentence_count(text, min_length=4) # 73.7μs -> 37.0μs (99.1% faster)

def test_sentence_with_punctuation():
    # Sentence with internal punctuation
    text = "Hello, world! How are you?"
    codeflash_output = sentence_count(text) # 66.2μs -> 29.9μs (121% faster)

def test_sentence_with_abbreviations():
    # Sentence with abbreviation that should not split sentences
    text = "Dr. Smith went to Washington. He arrived at 3 p.m. It was sunny."
    codeflash_output = sentence_count(text) # 117μs -> 61.3μs (92.3% faster)

# ---------------- EDGE TEST CASES ----------------

def test_empty_string():
    # Empty string should yield 0 sentences
    codeflash_output = sentence_count("") # 5.53μs -> 5.59μs (1.02% slower)

def test_whitespace_only():
    # String with only whitespace
    codeflash_output = sentence_count("   ") # 5.11μs -> 5.48μs (6.73% slower)

def test_no_sentence_ending_punctuation():
    # No periods, exclamation or question marks
    codeflash_output = sentence_count("This is not split into sentences") # 35.0μs -> 7.85μs (346% faster)

def test_sentence_with_only_punctuation():
    # String with only punctuation marks
    codeflash_output = sentence_count("!!!...???") # 47.1μs -> 41.7μs (12.9% faster)

def test_sentence_with_newlines():
    # Sentences split by newlines
    text = "First sentence.\nSecond sentence.\n\nThird sentence."
    codeflash_output = sentence_count(text) # 98.5μs -> 47.0μs (110% faster)

def test_sentence_with_multiple_spaces():
    # Sentences separated by multiple spaces
    text = "Sentence one.   Sentence two.      Sentence three."
    codeflash_output = sentence_count(text) # 95.3μs -> 42.1μs (126% faster)

def test_sentence_with_unicode_punctuation():
    # Sentences with unicode punctuation (em dash, ellipsis, etc.)
    text = "Hello… How are you—good?"
    codeflash_output = sentence_count(text) # 32.4μs -> 11.0μs (196% faster)

def test_sentence_with_non_ascii_characters():
    # Sentences with non-ASCII (e.g., accented) characters
    text = "C'est la vie. Voilà!"
    codeflash_output = sentence_count(text) # 73.6μs -> 34.4μs (114% faster)

def test_sentence_with_numbers_and_periods():
    # Numbers with periods should not split sentences
    text = "Version 3.2 is out. Please update."
    codeflash_output = sentence_count(text) # 66.2μs -> 29.1μs (128% faster)

def test_sentence_with_emoji():
    # Sentences with emoji
    text = "I am happy 😊. Are you?"
    codeflash_output = sentence_count(text) # 73.7μs -> 33.8μs (118% faster)

def test_sentence_with_tabs_and_spaces():
    # Sentences separated by tabs and spaces
    text = "First sentence.\tSecond sentence.   Third sentence."
    codeflash_output = sentence_count(text) # 45.6μs -> 44.1μs (3.20% faster)

def test_sentence_with_min_length_zero():
    # min_length=0 should count all sentences
    text = "One. Two. Three."
    codeflash_output = sentence_count(text, min_length=0) # 88.5μs -> 40.5μs (119% faster)

def test_sentence_with_min_length_equals_num_words():
    # min_length equal to the number of words in a sentence
    text = "This is five words."
    codeflash_output = sentence_count(text, min_length=5) # 31.6μs -> 12.3μs (157% faster)

def test_sentence_with_min_length_greater_than_any_sentence():
    # min_length greater than any sentence's word count
    text = "Short. Tiny. Small."
    codeflash_output = sentence_count(text, min_length=10) # 79.2μs -> 47.8μs (65.7% faster)

def test_sentence_with_trailing_and_leading_spaces():
    # Sentences with leading/trailing spaces
    text = "   First sentence. Second sentence.   "
    codeflash_output = sentence_count(text) # 50.5μs -> 29.3μs (72.2% faster)

def test_sentence_with_only_newlines():
    # Only newlines
    text = "\n\n\n"
    codeflash_output = sentence_count(text) # 5.11μs -> 5.40μs (5.31% slower)

def test_sentence_with_multiple_punctuation_marks():
    # Sentences ending with multiple punctuation marks
    text = "What?! Really?! Yes."
    codeflash_output = sentence_count(text) # 99.1μs -> 49.6μs (99.9% faster)

def test_sentence_with_quoted_text():
    # Sentences with quoted text
    text = '"Hello there." She said. "How are you?"'
    codeflash_output = sentence_count(text) # 97.5μs -> 58.4μs (66.9% faster)

def test_sentence_with_parentheses():
    # Sentences with parentheses
    text = "This is a sentence (with extra info). Another sentence."
    codeflash_output = sentence_count(text) # 77.1μs -> 32.4μs (138% faster)

def test_sentence_with_semicolons_and_colons():
    # Semicolons and colons should not split sentences
    text = "First part; still same sentence: more info. Next sentence."
    codeflash_output = sentence_count(text) # 72.1μs -> 28.5μs (153% faster)

def test_sentence_with_single_word():
    # Single word, with and without punctuation
    codeflash_output = sentence_count("Hello.") # 23.6μs -> 7.43μs (218% faster)
    codeflash_output = sentence_count("Hello") # 3.91μs -> 5.14μs (24.0% slower)

def test_sentence_with_multiple_periods():
    # Ellipsis should not split into multiple sentences
    text = "Wait... What happened?"
    codeflash_output = sentence_count(text) # 52.3μs -> 29.7μs (76.4% faster)

def test_sentence_with_uppercase_acronyms():
    # Acronyms with periods should not split sentences
    text = "I work at U.S.A. headquarters. It's nice."
    codeflash_output = sentence_count(text) # 90.6μs -> 50.5μs (79.4% faster)

def test_sentence_with_decimal_numbers():
    # Decimal numbers should not split sentences
    text = "The value is 3.14. That's pi."
    codeflash_output = sentence_count(text) # 75.2μs -> 35.4μs (112% faster)

def test_sentence_with_bullet_points():
    # Bullet points without ending punctuation
    text = "• First item\n• Second item\n• Third item"
    codeflash_output = sentence_count(text) # 35.3μs -> 10.4μs (240% faster)

def test_sentence_with_dash_and_hyphen():
    # Dashes and hyphens should not split sentences
    text = "Well-known fact—it's true. Next sentence."
    codeflash_output = sentence_count(text) # 55.1μs -> 35.1μs (56.9% faster)

# ---------------- LARGE SCALE TEST CASES ----------------

def test_large_text_many_sentences():
    # Test with a large number of sentences
    text = " ".join([f"Sentence number {i}." for i in range(1, 501)])
    codeflash_output = sentence_count(text) # 11.5ms -> 4.39ms (162% faster)

def test_large_text_with_min_length():
    # Large text, only some sentences meet min_length
    text = "Short. " * 500 + "This is a sufficiently long sentence for counting. " * 200
    # Only the long sentences (7 words) should be counted
    codeflash_output = sentence_count(text, min_length=7) # 6.24ms -> 6.19ms (0.716% faster)

def test_large_text_no_sentences():
    # Large text with no sentence-ending punctuation
    text = " ".join(["word"] * 1000)
    codeflash_output = sentence_count(text) # 1.12ms -> 27.9μs (3935% faster)

def test_large_text_all_sentences_filtered_by_min_length():
    # All sentences too short for min_length
    text = "A. B. C. D. " * 250
    codeflash_output = sentence_count(text, min_length=5) # 7.01ms -> 6.93ms (1.12% faster)

def test_large_text_with_varied_sentence_lengths():
    # Mix of short and long sentences
    short = "Hi. " * 300
    long = "This is a longer sentence for testing. " * 100
    text = short + long
    codeflash_output = sentence_count(text, min_length=6) # 3.39ms -> 3.33ms (1.73% faster)

def test_large_text_with_unicode_and_emoji():
    # Large text with unicode and emoji in sentences
    text = "😊 Hello world! " * 400 + "C'est la vie. Voilà! " * 100
    codeflash_output = sentence_count(text) # 5.41ms -> 5.16ms (4.84% faster)

def test_large_text_with_newlines_and_tabs():
    # Large text with newlines and tabs between sentences
    text = "\n".join([f"Sentence {i}.\t" for i in range(1, 501)])
    codeflash_output = sentence_count(text) # 11.0ms -> 4.52ms (144% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-sentence_count-mcglwwcn` and push.

Codeflash

codeflash-ai bot and others added 5 commits June 28, 2025 19:01
Here is your optimized code.
Major speedups:
- Replace list comprehensions with generator expressions in counting scenarios to avoid building intermediate lists.
- Use a simple word count (str.split() after punctuation removal) rather than an expensive word_tokenize call, since only the token count is needed and punctuation has already been stripped.
- When a min_length filter is set, avoid calling remove_punctuation and word_tokenize on sentences that can be ruled out cheaply: zero-length text is filtered out immediately.
- Remove unnecessary import of sent_tokenize and word_tokenize from **unstructured.nlp.tokenize** since we shadow them with the local definitions.

**All docstrings and core function signatures are preserved.**
**All external calls and logging are preserved.**
**All comments are preserved unless implementation has changed.**



**Summary of key changes for speed:**
- Stop double-importing and shadowing tokenize functions.
- Use `str.split()` instead of `word_tokenize` after removing punctuation when only the number of tokens is needed, which is far faster.
- Eliminate creation of temporary word lists purely for counting.
- Only call remove_punctuation once per sentence per iteration.

If you wish to maximize compatibility with sentences containing non-whitespace-separable tokens (e.g. CJK languages), consider further optimization on the token counting line as needed for your domain. Otherwise, `str.split()` after punctuation removal suffices and is far faster than a full NLP tokenizer.
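
One possible direction, sketched purely as an illustration (the helper name, the regex ranges, and the fallback choice are assumptions, not part of this PR): detect CJK characters and only then pay for a real tokenizer.

```python
import re

from unstructured.nlp.tokenize import word_tokenize

# Rough CJK detection: Hiragana/Katakana, CJK Unified Ideographs (plus
# Extension A), and the compatibility ideographs block.
_CJK_RE = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]")


def token_count(sentence: str) -> int:
    """Whitespace split for the common case; tokenizer fallback for CJK text."""
    if _CJK_RE.search(sentence):
        # word_tokenize is still not a true CJK segmenter; a domain-specific
        # tokenizer (e.g. jieba for Chinese) would give more faithful counts.
        return len(word_tokenize(sentence))
    return len(sentence.split())
```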

Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com>
Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com>
@cragwolfe cragwolfe added this pull request to the merge queue Aug 22, 2025
Merged via the queue into Unstructured-IO:main with commit 51425dd Aug 22, 2025
37 checks passed