semchunk is a fast and lightweight pure Python library for splitting text into semantically meaningful chunks.
semchunk may be installed with pip:
pip install semchunk
The code snippet below demonstrates how text can be chunked with semchunk:
>>> import semchunk
>>> text = 'The quick brown fox jumps over the lazy dog.'
>>> token_counter = lambda text: len(text.split()) # If using `tiktoken`, you may replace this with `token_counter = lambda text: len(tiktoken.encoding_for_model(model).encode(text))`.
>>> semchunk.chunk(text, chunk_size=2, token_counter=token_counter)
['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
def chunk(
    text: str,
    chunk_size: int,
    token_counter: callable,
) -> list[str]
chunk() splits text into semantically meaningful chunks of a specified size as determined by the provided token counter.
text is the text to be chunked.
chunk_size is the maximum number of tokens a chunk may contain.
token_counter is a callable that takes a string and returns the number of tokens in it.
This function returns a list of chunks, each up to chunk_size tokens long, with any whitespace used to split the text removed.
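The size guarantee described above can be checked with a few lines of plain Python. This is only an illustrative sketch: `word_counter` and `oversized_chunks` are hypothetical helpers and are not part of semchunk. The chunk list is copied from the example output shown earlier.

```python
from typing import Callable


def word_counter(text: str) -> int:
    """Count whitespace-separated words; a stand-in for a real tokenizer."""
    return len(text.split())


def oversized_chunks(
    chunks: list[str],
    chunk_size: int,
    token_counter: Callable[[str], int],
) -> list[str]:
    """Return every chunk whose token count exceeds chunk_size."""
    return [chunk for chunk in chunks if token_counter(chunk) > chunk_size]


# Output of the earlier example, produced with chunk_size=2.
chunks = ['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']

# No chunk exceeds two words, so nothing is reported.
print(oversized_chunks(chunks, chunk_size=2, token_counter=word_counter))  # []

# With a stricter limit of one word, every two-word chunk is reported.
print(oversized_chunks(chunks, chunk_size=1, token_counter=word_counter))
```

Any callable that maps a string to an integer can be passed as the token counter, so the same check works unchanged with a tokenizer-based counter.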
semchunk works by recursively splitting text until every resulting chunk is no larger than the specified chunk size. In particular, it:
- Splits text using the most semantically meaningful splitter possible;
- Recursively splits the resulting chunks until a set of chunks equal to or less than the specified chunk size is produced;
- Merges any chunks that are under the chunk size back together until the chunk size is reached; and
- Reattaches any non-whitespace splitters to the ends of chunks (except the final chunk) if doing so does not bring a chunk over the chunk size; otherwise, adds the non-whitespace splitters as their own chunks.
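The split, recurse and merge steps above can be sketched in a few lines. This is a heavily simplified illustration and not semchunk's actual implementation: it assumes only two splitters (the longest run of whitespace, then individual characters), so the reattachment of non-whitespace splitters is left out. The names `split_once` and `simple_chunk` are hypothetical.

```python
import re
from typing import Callable


def split_once(text: str) -> tuple[str, list[str]]:
    """Split on the longest whitespace run if any, otherwise into characters."""
    runs = re.findall(r'\s+', text)
    if runs:
        splitter = max(runs, key=len)
        return splitter, text.split(splitter)
    return '', list(text)


def simple_chunk(text: str, chunk_size: int, token_counter: Callable[[str], int]) -> list[str]:
    """Recursively split text, merging neighbouring parts while they still fit."""
    if token_counter(text) <= chunk_size or len(text) <= 1:
        return [text] if text else []

    splitter, parts = split_once(text)
    chunks: list[str] = []
    current = None  # The chunk currently being built up by merging.

    for part in parts:
        if not part:
            continue  # Empty strings come from leading or trailing splitters.
        if token_counter(part) > chunk_size:
            # The part alone is too large: flush what we have and recurse.
            if current is not None:
                chunks.append(current)
                current = None
            chunks.extend(simple_chunk(part, chunk_size, token_counter))
            continue
        candidate = part if current is None else current + splitter + part
        if token_counter(candidate) <= chunk_size:
            current = candidate  # Merging keeps the chunk within the limit.
        else:
            chunks.append(current)
            current = part

    if current is not None:
        chunks.append(current)
    return chunks


token_counter = lambda text: len(text.split())
print(simple_chunk('The quick brown fox jumps over the lazy dog.', 2, token_counter))
# ['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
```

For this input the sketch happens to reproduce the output of the earlier example; on text containing punctuation splitters it will generally differ from semchunk, since those splitters are not modelled here.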
To ensure that chunks are as semantically meaningful as possible, semchunk uses the following splitters, in order of precedence:
- The largest sequence of newlines (`\n`) and/or carriage returns (`\r`);
- The largest sequence of tabs;
- The largest sequence of whitespace characters (as defined by regex's `\s` character class);
- Sentence terminators (`.`, `?`, `!` and `*`);
- Clause separators (`;`, `,`, `(`, `)`, `[`, `]`, `“`, `”`, `‘`, `’`, `'`, `"` and `` ` ``);
- Sentence interrupters (`:`, `—` and `…`);
- Word joiners (`/`, `\`, `–`, `&` and `-`); and
- All other characters.
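The precedence order above can be illustrated with a small lookup function. This is a sketch under stated assumptions rather than semchunk's own code: `pick_splitter` and `NON_WHITESPACE_SPLITTERS` are hypothetical names, and the rule that the first matching character within a group wins is an assumption made for the example.

```python
import re

# Non-whitespace splitter groups, in the order listed above.
NON_WHITESPACE_SPLITTERS = (
    '.?!*',              # Sentence terminators.
    ';,()[]“”‘’\'"`',    # Clause separators.
    ':—…',               # Sentence interrupters.
    '/\\–&-',            # Word joiners.
)


def pick_splitter(text: str) -> str:
    """Return the highest-precedence splitter found in text.

    An empty string means no listed splitter is present, so the text
    would have to be split into individual characters.
    """
    # Whitespace rules: newlines/carriage returns, then tabs, then any whitespace.
    for pattern in (r'[\r\n]+', r'\t+', r'\s+'):
        runs = re.findall(pattern, text)
        if runs:
            return max(runs, key=len)  # The largest sequence wins.

    # Assumption: within a group, the first listed character present is used.
    for group in NON_WHITESPACE_SPLITTERS:
        for char in group:
            if char in text:
                return char

    return ''


print(repr(pick_splitter('First paragraph.\n\nSecond one.')))  # '\n\n'
print(repr(pick_splitter('Hello.World!')))                     # '.'
print(repr(pick_splitter('read/write')))                       # '/'
```

Checking whitespace first means a text is broken at paragraph and line boundaries before any punctuation is considered, which matches the ordering given in the list.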
This library is licensed under the MIT License.