semchunk

semchunk is a fast and lightweight pure Python library for splitting text into semantically meaningful chunks.

Installation 📦

semchunk may be installed with pip:

pip install semchunk

Usage 👩‍💻

The code snippet below demonstrates how text can be chunked with semchunk:

>>> import semchunk
>>> text = 'The quick brown fox jumps over the lazy dog.'
>>> token_counter = lambda text: len(text.split()) # If using `tiktoken`, you may replace this with `token_counter = lambda text: len(tiktoken.encoding_for_model(model).encode(text))`.
>>> semchunk.chunk(text, chunk_size=2, token_counter=token_counter)
['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']

Chunk

def chunk(
    text: str,
    chunk_size: int,
    token_counter: callable,
) -> list[str]

chunk() splits text into semantically meaningful chunks of a specified size as determined by the provided token counter.

text is the text to be chunked.

chunk_size is the maximum number of tokens a chunk may contain.

token_counter is a callable that takes a string and returns the number of tokens in it.

This function returns a list of chunks, each at most chunk_size tokens long, with any whitespace used to split the text removed.
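As an illustration of the token_counter contract described above (a callable that takes a string and returns an integer), here is a minimal sketch of two counters. The names `word_counter` and `word_and_punctuation_counter` are hypothetical and not part of semchunk. The `lru_cache` wrapper is an assumption that the counter may be called repeatedly on the same strings during recursive splitting; it is an optional optimisation, not a requirement.

```python
import re
from functools import lru_cache


def word_counter(text: str) -> int:
    """Count whitespace-delimited words (equivalent to the lambda in the usage example)."""
    return len(text.split())


# Hypothetical tokenisation rule: a run of word characters or a single punctuation mark.
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


@lru_cache(maxsize=4096)
def word_and_punctuation_counter(text: str) -> int:
    """Count each word and each punctuation mark as one token, caching repeated inputs."""
    return len(_TOKEN_PATTERN.findall(text))


sample = 'The quick brown fox jumps over the lazy dog.'
print(word_counter(sample))                  # 9
print(word_and_punctuation_counter(sample))  # 10
```

Either function can be passed as the token_counter argument, since both accept a string and return a token count.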

Methodology 🔬

semchunk works by recursively splitting text until every resulting chunk is no larger than the specified chunk size. In particular, it:

1. Splits text using the most semantically meaningful splitter possible;
2. Recursively splits the resulting chunks until every chunk is no larger than the specified chunk size;
3. Merges any chunks that are under the chunk size back together until the chunk size is reached; and
4. Reattaches any non-whitespace splitters to the ends of chunks (except the final chunk) if doing so does not bring those chunks over the chunk size; otherwise, adds the non-whitespace splitters as their own chunks.
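The split, recurse and merge steps above can be sketched roughly as follows. This is a simplified illustration, not semchunk's actual implementation: `sketch_chunk` and `SPLITTERS` are hypothetical names, only two whitespace splitters are considered before falling back to single characters, the reattachment of non-whitespace splitters is omitted, and the token counter is assumed never to report fewer tokens for longer text.

```python
from typing import Callable

# Simplified, hypothetical splitter hierarchy: newlines first, then spaces.
# An empty splitter means "fall back to splitting into single characters".
SPLITTERS = ("\n", " ")


def sketch_chunk(text: str, chunk_size: int, token_counter: Callable[[str], int]) -> list[str]:
    # Split on the most meaningful splitter that occurs in the text.
    splitter = next((s for s in SPLITTERS if s in text), "")
    pieces = text.split(splitter) if splitter else list(text)

    # Recursively split any piece that is still over the chunk size.
    small: list[str] = []
    for piece in pieces:
        if not piece:
            continue  # Empty pieces come from consecutive splitters; drop them.
        if token_counter(piece) <= chunk_size or len(piece) == 1:
            small.append(piece)  # A single character cannot be split further.
        else:
            small.extend(sketch_chunk(piece, chunk_size, token_counter))

    # Greedily merge neighbouring pieces back together while they still fit.
    merged: list[str] = []
    for piece in small:
        candidate = merged[-1] + splitter + piece if merged else piece
        if merged and token_counter(candidate) <= chunk_size:
            merged[-1] = candidate
        else:
            merged.append(piece)
    return merged


count_words = lambda text: len(text.split())
text = 'The quick brown fox jumps over the lazy dog.'
print(sketch_chunk(text, 2, count_words))
# ['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
```

For this input the sketch reproduces the output shown in the usage example, although it may diverge from semchunk on text that contains punctuation splitters or runs of mixed whitespace.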

To ensure that chunks are as semantically meaningful as possible, semchunk uses the following splitters, in order of precedence:

1. The largest sequence of newlines (\n) and/or carriage returns (\r);
2. The largest sequence of tabs;
3. The largest sequence of whitespace characters (as defined by regex's \s character class);
4. Sentence terminators (., ?, ! and *);
5. Clause separators (;, ,, (, ), [, ], “, ”, ‘, ’, ', " and `);
6. Sentence interrupters (:, — and …);
7. Word joiners (/, \, –, & and -); and
8. All other characters.
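The order of precedence above could be expressed along the following lines. This is a sketch under assumptions rather than semchunk's own code: `pick_splitter` and `NON_WHITESPACE_SPLITTERS` are hypothetical names, and the ordering of characters within each group simply follows the list above, which the source does not state to be significant.

```python
import re

# Non-whitespace splitters in the order listed above (hypothetical constant name).
NON_WHITESPACE_SPLITTERS = (
    ".", "?", "!", "*",                                                # sentence terminators
    ";", ",", "(", ")", "[", "]", "“", "”", "‘", "’", "'", '"', "`",   # clause separators
    ":", "—", "…",                                                     # sentence interrupters
    "/", "\\", "–", "&", "-",                                          # word joiners
)


def pick_splitter(text: str) -> str:
    """Return the highest-precedence splitter found in text, or '' to split by character."""
    # Whitespace first: longest run of newlines/carriage returns, then tabs, then any whitespace.
    for pattern in (r"[\r\n]+", r"\t+", r"\s+"):
        runs = re.findall(pattern, text)
        if runs:
            return max(runs, key=len)
    # Then the first non-whitespace splitter, in order of precedence, that occurs in the text.
    for splitter in NON_WHITESPACE_SPLITTERS:
        if splitter in text:
            return splitter
    # Nothing matched: the caller falls back to splitting on every character.
    return ""


print(repr(pick_splitter("a\n\nb\nc")))       # '\n\n'
print(repr(pick_splitter("a b  c")))          # '  '
print(repr(pick_splitter("Hi,there.You")))    # '.'
print(repr(pick_splitter("abc")))             # ''
```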

Licence 📄

This library is licensed under the MIT License.
