Releases: isaacus-dev/semchunk
v3.2.0
Changed
- Significantly improved the quality of chunks produced when chunking with low chunk sizes or when chunking documents with little variation in their levels of whitespace by adding a new rule to the `semchunk` algorithm that prioritizes splitting at single whitespace characters preceded by hierarchically meaningful non-whitespace characters over splitting at all single whitespace characters in general (#17).
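The rule above can be illustrated with a minimal sketch. This is not `semchunk`'s implementation: the function name `pick_split_positions` and the priority groups in `MEANINGFUL_PRECEDERS` are assumptions made up for illustration, and the release notes do not say which characters count as hierarchically meaningful.

```python
import re

# Hypothetical priority groups, from most to least meaningful.
# The actual characters and ordering used by semchunk are not stated in the release notes.
MEANINGFUL_PRECEDERS = (".?!", ";:", ",", ")]}")


def pick_split_positions(text: str) -> list[int]:
    """Return the indices of whitespace characters to split at, preferring
    whitespace that directly follows the highest-priority punctuation present."""
    for group in MEANINGFUL_PRECEDERS:
        # Match one whitespace character immediately preceded by a character in `group`.
        pattern = re.compile(rf"(?<=[{re.escape(group)}])\s")
        positions = [match.start() for match in pattern.finditer(text)]
        if positions:
            return positions
    # Fallback: no meaningful preceding character was found, so use every whitespace character.
    return [match.start() for match in re.finditer(r"\s", text)]


sentence_splits = pick_split_positions("One. Two, three four.")
comma_splits = pick_split_positions("a, b c")
fallback_splits = pick_split_positions("a b c")
```

With a rule of this kind, a short text such as `"One. Two, three four."` is first split after the full stop rather than at an arbitrary space, which tends to keep clauses intact when the chunk size is small.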
v3.1.3
v3.1.2
Changed
- Changed test model from `isaacus/emubert` to `isaacus/kanon-tokenizer`.
Full Changelog: v3.1.1...v3.1.2
v3.1.1
Added
- Added a note to the quickstart section of the README advising users to deduct the number of special tokens automatically added by their tokenizer from their chunk size. This note had been removed in version 3.0.0 but has been re-added as it is unlikely to be obvious to users.
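As a rough illustration of the advice in that note, the arithmetic looks like the sketch below. The helper `effective_chunk_size` is hypothetical and not part of `semchunk`, and the figures (a 512-token limit and two special tokens) are assumed values for the example.

```python
def effective_chunk_size(model_limit: int, num_special_tokens: int) -> int:
    """Return the chunk size to request so that the chunk's tokens plus the
    tokenizer's automatically added special tokens still fit within the limit."""
    if num_special_tokens >= model_limit:
        raise ValueError("Special tokens leave no room for content.")
    return model_limit - num_special_tokens


# Assumed example: a 512-token limit and a tokenizer that adds two special tokens.
chunk_size = effective_chunk_size(512, 2)
```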
v3.1.0
Added
- Introduced a new `cache_maxsize` argument to `chunkerify()` and `chunk()` that specifies the maximum number of text-token count pairs that can be stored in a token counter's cache. The argument defaults to `None`, in which case the cache is unbounded.
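A bounded cache of text-token count pairs behaves roughly like the sketch below, which wraps a token counter with the standard library's `functools.lru_cache`. Whether `semchunk` uses this exact mechanism internally is an assumption; `cached_token_counter` and `count_words` are hypothetical names used only for illustration.

```python
from functools import lru_cache
from typing import Callable, Optional


def cached_token_counter(
    token_counter: Callable[[str], int],
    cache_maxsize: Optional[int] = None,
) -> Callable[[str], int]:
    """Wrap `token_counter` in a cache holding at most `cache_maxsize`
    text-token count pairs. `None` means the cache is unbounded."""
    return lru_cache(maxsize=cache_maxsize)(token_counter)


def count_words(text: str) -> int:
    # Stand-in token counter: counts whitespace-separated words.
    return len(text.split())


counter = cached_token_counter(count_words, cache_maxsize=2)
counter("a b c")  # Computed and stored in the cache.
counter("a b c")  # Served from the cache.
```

Bounding the cache trades some repeated token counting for a cap on memory use, which matters mostly when chunking many large, distinct documents in one process.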
v3.0.4
Fixed
- Fixed a bug where attempting to chunk only whitespace characters would raise `ValueError: not enough values to unpack (expected 2, got 0)` (ScrapeGraphAI/Scrapegraph-ai#893).
v3.0.3
Fixed
- Fixed `isaacus/emubert` mistakenly being set to `isaacus-dev/emubert` in the README and tests.
v3.0.2
v3.0.1
Fixed
- Fixed a bug where attempting to chunk an empty text would raise a `ValueError`.
v3.0.0
Added
- Added an `offsets` argument to `chunk()` and `Chunker.__call__()` that specifies whether to return the start and end offsets of each chunk (#9). The argument defaults to `False`.
- Added an `overlap` argument to `chunk()` and `Chunker.__call__()` that specifies the proportion of the chunk size, or, if >=1, the number of tokens, by which chunks should overlap (#1). The argument defaults to `None`, in which case no overlapping occurs.
- Added an undocumented, private `_make_chunk_function()` method to the `Chunker` class that constructs chunking functions with call-level arguments passed.
- Added more unit tests for new features as well as for multiple token counters and for ensuring there are no chunks comprised entirely of whitespace characters.
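The semantics of `overlap` and `offsets` described above can be sketched as follows. This is an illustrative reading of the release notes, not `semchunk`'s code: `resolve_overlap` and `chunk_offsets` are hypothetical helpers, and rounding a proportion down and capping the overlap below the chunk size are assumptions.

```python
import math
from typing import Optional, Union


def resolve_overlap(overlap: Optional[Union[int, float]], chunk_size: int) -> int:
    """Convert an `overlap` value into a number of tokens.

    Values below 1 are treated as a proportion of `chunk_size`;
    values of 1 or more are treated as a token count.
    """
    if not overlap:
        return 0  # `None` (or 0) means no overlap.
    if overlap < 1:
        tokens = math.floor(chunk_size * overlap)  # Assumption: round down.
    else:
        tokens = int(overlap)
    # Assumption: the overlap must be smaller than the chunk size to make progress.
    return min(tokens, chunk_size - 1)


def chunk_offsets(text: str, chunks: list[str]) -> list[tuple[int, int]]:
    """Return (start, end) pairs such that text[start:end] == chunk.

    This simple version assumes the chunks appear in order and do not overlap.
    """
    offsets = []
    cursor = 0
    for chunk in chunks:
        start = text.index(chunk, cursor)
        end = start + len(chunk)
        offsets.append((start, end))
        cursor = end
    return offsets


text = "The quick brown fox"
offsets = chunk_offsets(text, ["The quick", "brown fox"])
```

Offsets of this form let a caller map each chunk back to its position in the source text, for example to highlight a retrieved passage.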
Changed
- Began removing chunks comprised entirely of whitespace characters from the output of `chunk()`.
- Updated `semchunk`'s description from 'A fast and lightweight Python library for splitting text into semantically meaningful chunks.' to 'A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks.'.
Fixed
- Fixed a typo in the docstring for the `__call__()` method of the `Chunker` class returned by `chunkerify()` where most of the documentation for the arguments was listed under the section for the method's returns.
Removed
- Removed undocumented, private `chunk()` method from the `Chunker` class returned by `chunkerify()`.
- Removed undocumented, private `_reattach_whitespace_splitters` argument of `chunk()` that was introduced to experiment with potentially adding support for overlap ratios.