Releases: isaacus-dev/semchunk
v3.2.2
v3.2.1
Fixed
- Fixed minor typos in the README and docstrings.
v3.2.0
Changed
- Significantly improved the quality of chunks produced when chunking with low chunk sizes or when chunking documents with few distinct levels of whitespace. This was done by adding a new rule to the `semchunk` algorithm that prioritizes splitting at single whitespace characters preceded by hierarchically meaningful non-whitespace characters over splitting at single whitespace characters in general (#17).
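
The rule described above can be pictured with a small sketch. This is not semchunk's implementation: the pattern list, its ordering, and the `split_once` helper are assumptions made for illustration. It only shows the idea of preferring whitespace that follows a meaningful character (such as a sentence terminator) before falling back to any whitespace.

```python
import re

# Hypothetical priority order; not semchunk's actual splitter list.
# Each pattern matches a single whitespace character preceded by a
# progressively less meaningful non-whitespace character.
PRIORITISED_SPLITTERS = [
    r"(?<=[.?!])\s",  # whitespace after a sentence terminator
    r"(?<=[;,:])\s",  # whitespace after a clause separator
    r"\s",            # any single whitespace character (fallback)
]


def split_once(text: str) -> list[str]:
    """Split text using the highest-priority pattern that matches it."""
    for pattern in PRIORITISED_SPLITTERS:
        if re.search(pattern, text):
            # Drop empty pieces produced by adjacent matches.
            return [part for part in re.split(pattern, text) if part]
    return [text]  # no whitespace at all, so nothing to split on


sentences = split_once("One two. Three four.")
clauses = split_once("a, b c")
words = split_once("a b")
```

With a fallback-only splitter, `"One two. Three four."` would break into four single words; preferring the whitespace after the full stop keeps each sentence together.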
v3.1.3
v3.1.2
Changed
- Changed test model from `isaacus/emubert` to `isaacus/kanon-tokenizer`.
Full Changelog: v3.1.1...v3.1.2
v3.1.1
Added
- Added a note to the quickstart section of the README advising users to deduct the number of special tokens automatically added by their tokenizer from their chunk size. This note had been removed in version 3.0.0 but has been re-added because the requirement is unlikely to be obvious to users.
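
As a rough illustration of that advice, the arithmetic looks like the following. The window size, the special-token count, and the `effective_chunk_size` helper are hypothetical values and names chosen for the example; check your own tokenizer for the real numbers.

```python
# Hypothetical values for illustration only.
MODEL_MAX_TOKENS = 512  # assumed context window of the model
NUM_SPECIAL_TOKENS = 2  # assumed: one start and one end token added by the tokenizer


def effective_chunk_size(max_tokens: int, num_special_tokens: int) -> int:
    """Return the chunk size to request so that chunk plus special tokens fits."""
    if num_special_tokens >= max_tokens:
        raise ValueError("special tokens leave no room for content")
    return max_tokens - num_special_tokens


chunk_size = effective_chunk_size(MODEL_MAX_TOKENS, NUM_SPECIAL_TOKENS)
```

Requesting the full 512 tokens here would let a chunk grow to 514 tokens once the tokenizer adds its special tokens, which is why the deduction matters.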
3.1.0
Added
- Introduced a new `cache_maxsize` argument to `chunkerify()` and `chunk()` that specifies the maximum number of text-token count pairs that can be stored in a token counter's cache. The argument defaults to `None`, in which case the cache is unbounded.
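
To show what a bounded versus unbounded token-count cache means in practice, here is a minimal sketch built on the standard library's `functools.lru_cache`. The `make_cached_counter` wrapper and the toy `word_counter` are hypothetical, and the release note does not say which eviction policy semchunk uses, so the least-recently-used behaviour below is an assumption of this sketch.

```python
from functools import lru_cache


def make_cached_counter(token_counter, cache_maxsize=None):
    """Wrap token_counter in a cache; cache_maxsize=None means unbounded."""
    return lru_cache(maxsize=cache_maxsize)(token_counter)


def word_counter(text: str) -> int:
    # Toy token counter: one token per whitespace-separated word.
    return len(text.split())


texts = ["a b", "c d e", "f", "a b"]

# Bounded cache: at most two text/token-count pairs are kept.
bounded = make_cached_counter(word_counter, cache_maxsize=2)
for text in texts:
    bounded(text)
bounded_info = bounded.cache_info()

# Unbounded cache: every distinct text stays cached.
unbounded = make_cached_counter(word_counter, cache_maxsize=None)
for text in texts:
    unbounded(text)
unbounded_info = unbounded.cache_info()
```

In the bounded run the first text has been evicted by the time it reappears, so it is counted again; in the unbounded run the repeat is served from the cache. The trade-off is memory use against repeated tokenization.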
v3.0.4
Fixed
- Fixed bug where attempting to chunk only whitespace characters would raise `ValueError: not enough values to unpack (expected 2, got 0)` (ScrapeGraphAI/Scrapegraph-ai#893).
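
For context, this message is the one Python produces whenever an empty sequence is unpacked into two names. The snippet below reproduces it generically and shows one common guard; it does not reflect semchunk's actual code path or its fix, and `first_pair` is a hypothetical helper.

```python
def first_pair(pairs):
    """Return the first (a, b) pair, or None when there is nothing to unpack."""
    if not pairs:  # guard: an empty input has no pair to return
        return None
    a, b = pairs[0]
    return a, b


try:
    a, b = []  # unpacking an empty sequence into two names
except ValueError as error:
    message = str(error)
```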
v3.0.3
Fixed
- Fixed `isaacus/emubert` mistakenly being set to `isaacus-dev/emubert` in the README and tests.