semchunk

semchunk is a fast and lightweight pure Python library for splitting text into semantically meaningful chunks.

Installation 📦

semchunk may be installed with pip:

pip install semchunk

Usage 👩‍💻

The code snippet below demonstrates how text can be chunked with semchunk:

>>> import semchunk
>>> text = 'The quick brown fox jumps over the lazy dog.'
>>> # Count whitespace-separated words as tokens. If using `tiktoken`, you may replace this with
>>> # `token_counter = lambda text: len(tiktoken.encoding_for_model(model).encode(text))`.
>>> token_counter = lambda text: len(text.split())
>>> semchunk.chunk(text, chunk_size=2, token_counter=token_counter)
['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']

Chunk

def chunk(
    text: str,
    chunk_size: int,
    token_counter: callable,
) -> list[str]

chunk() splits text into semantically meaningful chunks of a specified size as determined by the provided token counter.

text is the text to be chunked.

chunk_size is the maximum number of tokens a chunk may contain.

token_counter is a callable that takes a string and returns the number of tokens in it.

This function returns a list of chunks, each at most chunk_size tokens long, with any whitespace used to split the text removed.
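
The size guarantee described above can be checked with a few lines of code. The sketch below is only an illustration: oversized_chunks is a hypothetical helper that is not part of semchunk, and the list of chunks is copied from the usage example rather than produced by calling the library.

```python
from typing import Callable, Iterable


def oversized_chunks(
    chunks: Iterable[str],
    chunk_size: int,
    token_counter: Callable[[str], int],
) -> list[str]:
    """Return the chunks whose token count exceeds chunk_size (hypothetical helper)."""
    return [chunk for chunk in chunks if token_counter(chunk) > chunk_size]


# The same whitespace-based counter as in the usage example.
token_counter = lambda text: len(text.split())

# Output copied from the usage example; in practice this would come from semchunk.chunk().
chunks = ['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']

print(oversized_chunks(chunks, 2, token_counter))  # []
```

An empty list means every chunk respects the limit under the same token counter that was used for chunking. Checking with a different counter may give a different answer.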

Methodology 🔬

semchunk works by recursively splitting text until every resulting chunk is no larger than the specified chunk size. In particular, it:

  1. Splits text using the most semantically meaningful splitter possible;
  2. Recursively splits the resulting chunks until every chunk is no larger than the specified chunk size;
  3. Merges any chunks that are under the chunk size back together until the chunk size is reached; and
  4. Reattaches each non-whitespace splitter to the end of the chunk that precedes it (the final chunk excepted) if doing so does not bring that chunk over the chunk size; otherwise, adds the splitter as its own chunk.
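
The first three steps can be sketched in a few lines of Python. This is a simplified illustration and not semchunk's actual implementation: sketch_chunk is a hypothetical name, only whitespace splitters are considered (collapsed into a single "longest whitespace run" rule) before falling back to individual characters, and step 4 is left out because no non-whitespace splitters are used.

```python
import re
from typing import Callable


def sketch_chunk(text: str, chunk_size: int, token_counter: Callable[[str], int]) -> list[str]:
    """Simplified illustration of steps 1-3; not semchunk's actual implementation."""
    # Step 1: pick a splitter. Here, the longest whitespace run, else single characters.
    runs = re.findall(r'\s+', text)
    splitter = max(runs, key=len) if runs else ''
    parts = text.split(splitter) if splitter else list(text)

    chunks: list[str] = []
    can_merge = False  # True only when the last chunk was built at this recursion level.
    for part in parts:
        if not part:
            continue  # Leading or trailing whitespace yields empty parts.
        if token_counter(part) > chunk_size and len(part) > 1:
            # Step 2: the part is still too large, so split it again.
            chunks.extend(sketch_chunk(part, chunk_size, token_counter))
            can_merge = False
        elif can_merge and token_counter(chunks[-1] + splitter + part) <= chunk_size:
            # Step 3: merge the part into the previous chunk while the result fits.
            chunks[-1] += splitter + part
        else:
            chunks.append(part)
            can_merge = True
    return chunks


text = 'The quick brown fox jumps over the lazy dog.'
print(sketch_chunk(text, 2, lambda t: len(t.split())))
# ['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
```

Parts are merged only with neighbours produced at the same recursion level, so a chunk is always rejoined with the splitter that originally separated its pieces.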

To ensure that chunks are as semantically meaningful as possible, semchunk uses the following splitters, in order of precedence:

  1. The largest sequence of newlines (\n) and/or carriage returns (\r);
  2. The largest sequence of tabs;
  3. The largest sequence of whitespace characters (as defined by regex's \s character class);
  4. Sentence terminators (., ?, ! and *);
  5. Clause separators (;, ,, (, ), [, ], “, ”, ‘, ’, ', " and `);
  6. Sentence interrupters (:, the em dash U+2014 and …);
  7. Word joiners (/, \, –, & and -); and
  8. All other characters.
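
The order of precedence above might be expressed in code roughly as follows. This is a sketch under stated assumptions, not the library's code: pick_splitter and NON_WHITESPACE_SPLITTERS are hypothetical names, and the table covers only the ASCII splitters from the list.

```python
import re

# Hypothetical precedence table: ASCII sentence terminators, clause separators,
# sentence interrupters and word joiners, in that order.
NON_WHITESPACE_SPLITTERS = (
    '.', '?', '!', '*',
    ';', ',', '(', ')', '[', ']', "'", '"', '`',
    ':',
    '/', '\\', '&', '-',
)


def pick_splitter(text: str) -> str:
    """Return the highest-precedence splitter found in text, or '' to split into characters."""
    # Rules 1-3: the longest run of newlines/carriage returns, then tabs, then any whitespace.
    for pattern in (r'[\r\n]+', r'\t+', r'\s+'):
        runs = re.findall(pattern, text)
        if runs:
            return max(runs, key=len)
    # Rules 4-7: the first non-whitespace splitter, in table order, that occurs in the text.
    for splitter in NON_WHITESPACE_SPLITTERS:
        if splitter in text:
            return splitter
    # Rule 8: nothing else matched, so fall back to individual characters.
    return ''


print(repr(pick_splitter('a\n\nb\nc')))  # '\n\n'
print(repr(pick_splitter('a,b-c')))      # ','
```

Returning the empty string for the final rule is a convention of this sketch. The caller would then split the text into individual characters.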

Licence 📄

This library is licensed under the MIT License.