All notable changes to semchunk will be documented here. This project adheres to Keep a Changelog and Semantic Versioning.
3.2.1 - 2025-03-27
- Fixed minor typos in the README and docstrings.
3.2.0 - 2025-03-20
- Significantly improved the quality of chunks produced when chunking with low chunk sizes or documents with minimal variation in levels of whitespace by adding a new rule to the semchunk algorithm that prioritizes splitting at the occurrence of single whitespace characters preceded by hierarchically meaningful non-whitespace characters over splitting at all single whitespace characters in general (#17).
3.1.3 - 2025-03-10
- Added mention of Isaacus to the README.
3.1.2 - 2025-03-06
- Changed test model from isaacus/emubert to isaacus/kanon-tokenizer.
3.1.1 - 2025-02-18
- Added a note to the quickstart section of the README advising users to deduct the number of special tokens automatically added by their tokenizer from their chunk size. This note had been removed in version 3.0.0 but has been readded as it is unlikely to be obvious to users.
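
As a rough illustration of the note above, the arithmetic might look like the following. The token limit and the special-token count are made-up values, not properties of any particular model or tokenizer.

```python
# Hypothetical values for illustration only; check your own model and tokenizer.
model_token_limit = 512  # assumed maximum sequence length of the model
special_tokens = 2       # assumed number of tokens the tokenizer adds automatically

# Deduct the special tokens so that each chunk still fits once they are added.
chunk_size = model_token_limit - special_tokens
print(chunk_size)
```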
3.1.0 - 2025-02-16
- Introduced a new cache_maxsize argument to chunkerify() and chunk() that specifies the maximum number of text-token count pairs that can be stored in a token counter's cache. The argument defaults to None, in which case the cache is unbounded.
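
The effect of a bounded cache can be sketched with the standard library's functools.lru_cache. This is only an illustration of the idea, not semchunk's implementation; make_cached_counter and count_words are hypothetical names.

```python
from functools import lru_cache


def make_cached_counter(token_counter, cache_maxsize=None):
    """Wrap a token counter in a cache of at most cache_maxsize entries.

    Hypothetical helper: a maxsize of None leaves the cache unbounded.
    """
    return lru_cache(maxsize=cache_maxsize)(token_counter)


def count_words(text: str) -> int:
    # Stand-in token counter that counts whitespace-separated words.
    return len(text.split())


bounded = make_cached_counter(count_words, cache_maxsize=2)
for text in ["a b", "c d e", "f", "a b"]:
    bounded(text)

# Only the two most recently used text-count pairs are retained.
info = bounded.cache_info()
print(info.currsize, info.maxsize)
```

A bounded cache trades some repeated counting for a fixed memory ceiling, which may matter when chunking very large corpora.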
3.0.4 - 2025-02-14
- Fixed a bug where attempting to chunk only whitespace characters would raise ValueError: not enough values to unpack (expected 2, got 0) (ScrapeGraphAI/Scrapegraph-ai#893).
3.0.3 - 2025-02-13
- Fixed isaacus/emubert mistakenly being set to isaacus-dev/emubert in the README and tests.
3.0.2 - 2025-02-13
- Significantly sped up chunking very long texts with little to no variation in levels of whitespace used (fixing #8) and, in the process, also slightly improved overall performance.
- Transferred semchunk to Isaacus.
- Began formatting with Ruff.
3.0.1 - 2025-01-10
- Fixed a bug where attempting to chunk an empty text would raise a ValueError.
3.0.0 - 2024-12-31
- Added an offsets argument to chunk() and Chunker.__call__() that specifies whether to return the start and end offsets of each chunk (#9). The argument defaults to False.
- Added an overlap argument to chunk() and Chunker.__call__() that specifies the proportion of the chunk size, or, if >=1, the number of tokens, by which chunks should overlap (#1). The argument defaults to None, in which case no overlapping occurs.
- Added an undocumented, private _make_chunk_function() method to the Chunker class that constructs chunking functions with call-level arguments passed.
- Added more unit tests for new features as well as for multiple token counters and for ensuring there are no chunks comprised entirely of whitespace characters.
- Began removing chunks comprised entirely of whitespace characters from the output of chunk().
- Updated semchunk's description from 'A fast and lightweight Python library for splitting text into semantically meaningful chunks.' to 'A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks.'.
- Fixed a typo in the docstring for the __call__() method of the Chunker class returned by chunkerify() where most of the documentation for the arguments was listed under the section for the method's returns.
- Removed undocumented, private chunk() method from the Chunker class returned by chunkerify().
- Removed undocumented, private _reattach_whitespace_splitters argument of chunk() that was introduced to experiment with potentially adding support for overlap ratios.
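
The overlap and offsets arguments described in this release can be pictured with the sketch below. It is not semchunk's code: resolve_overlap is a hypothetical helper, the rounding and capping choices are assumptions, and the offsets are assumed to be (start, end) pairs that slice each chunk back out of the original text.

```python
import math


def resolve_overlap(overlap, chunk_size):
    """Turn an overlap setting into a number of tokens (hypothetical helper)."""
    if overlap is None:
        return 0  # no overlapping
    if overlap < 1:
        # Values below 1 are read as a proportion of the chunk size.
        # Rounding down is an assumption made for this sketch.
        return math.floor(chunk_size * overlap)
    # Values of 1 or more are read as a token count.
    # Capping below the chunk size is also an assumption.
    return min(int(overlap), chunk_size - 1)


# Offsets, assumed here to satisfy text[start:end] == chunk.
text = "Hello world. Goodbye."
chunks = ["Hello world.", "Goodbye."]
offsets = []
cursor = 0
for chunk in chunks:
    start = text.index(chunk, cursor)
    end = start + len(chunk)
    offsets.append((start, end))
    cursor = end

print(resolve_overlap(0.5, 100), resolve_overlap(20, 100), offsets)
```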
2.2.2 - 2024-12-18
- Ensured hatch does not include irrelevant files in the distribution.
2.2.1 - 2024-12-17
- Started benchmarking semantic-text-splitter in parallel to ensure a fair comparison, courtesy of @benbrandt (#17).
2.2.0 - 2024-07-12
- Switched from having chunkerify() output a function to having it return an instance of the new Chunker() class, which should not alter functionality in any way but will allow for the preservation of type hints, fixing #7.
2.1.0 - 2024-06-20
- Ceased memoizing chunk() (but not token counters) because cached outputs of memoized functions are shallow rather than deep copies of the original outputs: if one were to chunk a text, chunk that same text again and then modify one of the chunks outputted by the first call, the chunks outputted by the second call would also be modified. This behaviour is not expected and therefore undesirable. The memoization of token counters is not impacted as they output immutable objects, namely, integers.
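
The problem described above can be reproduced with a toy example. toy_chunk is a hypothetical stand-in for a memoized chunker and functools.lru_cache stands in for whichever memoizer was actually used; the point is only that both calls hand back the same mutable list.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def toy_chunk(text: str) -> list:
    # Stand-in for a memoized chunker: splits on whitespace.
    return text.split()


first = toy_chunk("one two three")
second = toy_chunk("one two three")  # served from the cache

# Mutating the first result also changes the second,
# because the cache returned the very same list object.
first[0] = "CHANGED"
print(second[0])
print(first is second)
```

A memoized token counter does not suffer from this because the integers it returns cannot be mutated in place.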
2.0.0 - 2024-06-19
- Added support for multiprocessing through the processes argument passable to chunkers constructed by chunkerify().
- No longer guaranteed that semchunk is pure Python.
1.0.1 - 2024-06-02
- Documented the progress argument in the docstring for chunkerify() and its type hint in the README.
1.0.0 - 2024-06-02
- Added a progress argument to the chunker returned by chunkerify() that, when set to True and multiple texts are passed, displays a progress bar.
0.3.2 - 2024-06-01
- Fixed a bug where a ZeroDivisionError would be raised where a token counter returned zero tokens when called from merge_splits(), courtesy of @jcobol (#5) (7fd64eb), fixing #4.
0.3.1 - 2024-05-18
- Fixed typo in error messages in chunkerify() where it was referred to as make_chunker().
0.3.0 - 2024-05-18
- Introduced the chunkerify() function, which constructs a chunker from a tokenizer or token counter that can be reused and can also chunk multiple texts in a single call. The resulting chunker speeds up chunking by 40.4% thanks, in large part, to a token counter that avoids having to count the number of tokens in a text when the number of characters in the text exceeds a certain threshold, courtesy of @R0bk (#3) (337a186).
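
One way such a shortcut could work is sketched below. This is a guess at the general idea rather than the code contributed in #3: make_fast_counter, the max_chars_per_token bound and the choice to count only a prefix are all assumptions, and the sketch relies on a prefix of a text never having more tokens than the whole text.

```python
def make_fast_counter(token_counter, chunk_size, max_chars_per_token=10):
    """Wrap a token counter so very long texts are not counted in full.

    Hypothetical helper; max_chars_per_token is an assumed upper bound.
    """
    max_chars = chunk_size * max_chars_per_token

    def fast_counter(text: str) -> int:
        if len(text) > max_chars:
            # Count only a prefix. If the prefix is already over budget,
            # the whole text must be too, so report "too many" and stop.
            if token_counter(text[:max_chars]) > chunk_size:
                return chunk_size + 1
        return token_counter(text)

    return fast_counter


def count_words(text: str) -> int:
    # Stand-in token counter that counts whitespace-separated words.
    return len(text.split())


fast = make_fast_counter(count_words, chunk_size=3, max_chars_per_token=4)
print(fast("a b"), fast("a b c d e f g h i j"))
```

The returned value for an oversized text is not its true token count, only a number guaranteed to exceed the chunk size, which is all a chunker needs in order to decide that further splitting is required.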
0.2.4 - 2024-05-13
- Improved chunking performance with larger chunk sizes by switching from linear to binary search for the identification of optimal chunk boundaries, courtesy of @R0bk (#3) (337a186).
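
The switch from linear to binary search can be illustrated as follows. This is a minimal sketch and not the contributed implementation: largest_fitting_prefix is a hypothetical name, and it assumes that adding more splits never lowers the token count.

```python
def largest_fitting_prefix(splits, chunk_size, token_counter, joiner=" "):
    """Return the largest n such that joining splits[:n] fits in chunk_size tokens."""
    low, high = 0, len(splits)
    while low < high:
        # Bias the midpoint upwards so the loop always makes progress.
        mid = (low + high + 1) // 2
        if token_counter(joiner.join(splits[:mid])) <= chunk_size:
            low = mid  # the first mid splits fit, so try more
        else:
            high = mid - 1  # too many tokens, so try fewer
    return low


def count_words(text: str) -> int:
    # Stand-in token counter that counts whitespace-separated words.
    return len(text.split())


splits = ["a", "b", "c", "d", "e"]
print(largest_fitting_prefix(splits, 3, count_words))
```

A linear scan calls the token counter once per candidate boundary, whereas this search needs roughly log2(len(splits)) calls, which is why the gain grows with the chunk size.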
0.2.3 - 2024-03-11
- Ensured that memoization does not overwrite chunk()'s function signature.
0.2.2 - 2024-02-05
- Ensured that the memoize argument is passed back to chunk() in recursive calls.
0.2.1 - 2023-11-09
- Memoized chunk().
- Fixed typos in README.
0.2.0 - 2023-11-07
- Added the memoize argument to chunk(), which memoizes token counters by default to significantly improve performance.
- Improved chunking performance.
0.1.2 - 2023-11-07
- Fixed links in the README.
0.1.1 - 2023-11-07
- Added new test samples.
- Added benchmarks.
- Improved chunking performance.
- Improved test coverage.
0.1.0 - 2023-11-05
- Added the chunk() function, which splits text into semantically meaningful chunks of a specified size as determined by a provided token counter.