
Releases: isaacus-dev/semchunk

v3.2.0

20 Mar 04:46

Changed

  • Significantly improved the quality of chunks produced when chunking with low chunk sizes or when chunking documents with little variation in whitespace by adding a new rule to the semchunk algorithm that prioritizes splitting at single whitespace characters preceded by hierarchically meaningful non-whitespace characters over splitting at single whitespace characters in general (#17).
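As a rough illustration of the effect, here is a minimal sketch of chunking at a low chunk size; the word-counting token counter and sample text are assumptions for demonstration only, and the exact chunk boundaries depend on the input.

```python
import semchunk

# Illustrative token counter that counts whitespace-delimited words; a real
# tokenizer's token counter would normally be used instead.
word_counter = lambda text: len(text.split())

text = "semchunk is a chunker. It splits text quickly. It also splits it accurately."

# At a low chunk size, the new rule prefers splits at whitespace that follows
# hierarchically meaningful non-whitespace characters (such as sentence-ending
# punctuation) over splits at arbitrary single spaces.
chunks = semchunk.chunk(text, chunk_size=4, token_counter=word_counter)
print(chunks)
```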

v3.1.3

11 Mar 06:17

Changed

  • Added mention of Isaacus to the README.

Full Changelog: v3.1.2...v3.1.3

v3.1.2

06 Mar 11:16

Changed

  • Changed test model from isaacus/emubert to isaacus/kanon-tokenizer.

Full Changelog: v3.1.1...v3.1.2

v3.1.1

18 Feb 05:02

Added

  • Added a note to the quickstart section of the README advising users to deduct the number of special tokens automatically added by their tokenizer from their chunk size. This note had been removed in version 3.0.0 but has been re-added because the adjustment is unlikely to be obvious to users.
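As a hedged sketch of the adjustment the note describes, assuming a Hugging Face transformers tokenizer and an assumed 512-token maximum sequence length (the tokenizer name is taken from the project's tests and is only an example):

```python
import semchunk
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("isaacus/kanon-tokenizer")

# Deduct the special tokens (e.g. [CLS] and [SEP]) that the tokenizer adds
# automatically so that each chunk plus its special tokens still fits within
# the assumed 512-token maximum sequence length.
chunk_size = 512 - tokenizer.num_special_tokens_to_add()

chunker = semchunk.chunkerify(tokenizer, chunk_size)
chunks = chunker("The quick brown fox jumps over the lazy dog.")
```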

v3.1.0

16 Feb 10:14

Added

  • Introduced a new cache_maxsize argument to chunkerify() and chunk() that specifies the maximum number of text-token count pairs that can be stored in a token counter's cache. The argument defaults to None, in which case the cache is unbounded.
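A minimal usage sketch of cache_maxsize, assuming a simple word-counting token counter purely for illustration:

```python
import semchunk

# Illustrative token counter; any callable mapping text to a token count works.
token_counter = lambda text: len(text.split())

# Cap the token counter's memoization cache at 1,024 text-token count pairs.
# With the default of cache_maxsize=None, the cache grows without bound.
chunker = semchunk.chunkerify(token_counter, chunk_size=8, cache_maxsize=1024)

chunks = chunker("Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
```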

v3.0.4

13 Feb 23:02

Fixed

  • Fixed a bug where attempting to chunk text consisting only of whitespace characters would raise ValueError: not enough values to unpack (expected 2, got 0) (ScrapeGraphAI/Scrapegraph-ai#893).
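A minimal reproduction of the previously failing case, assuming a simple word-counting token counter:

```python
import semchunk

word_counter = lambda text: len(text.split())

# Before this release, chunking whitespace-only text raised
# "ValueError: not enough values to unpack (expected 2, got 0)"; it now
# returns normally (whitespace-only chunks are dropped as of v3.0.0).
chunks = semchunk.chunk(" \n\t  ", chunk_size=4, token_counter=word_counter)
print(chunks)
```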

v3.0.3

13 Feb 05:54

Fixed

  • Fixed isaacus/emubert mistakenly being set to isaacus-dev/emubert in the README and tests.

v3.0.2

13 Feb 05:47

This release was yanked due to a typo.

Fixed

  • Significantly sped up the chunking of very long texts with little to no variation in whitespace levels (fixing #8) and, in the process, slightly improved overall performance.

Changed

  • Transferred semchunk to Isaacus.
  • Began formatting with Ruff.

v3.0.1

10 Jan 02:01

Fixed

  • Fixed a bug where attempting to chunk an empty text would raise a ValueError.

v3.0.0

31 Dec 04:40

Added

  • Added an offsets argument to chunk() and Chunker.__call__() that specifies whether to return the start and end offsets of each chunk (#9). The argument defaults to False.
  • Added an overlap argument to chunk() and Chunker.__call__() that specifies the proportion of the chunk size or, if >= 1, the number of tokens by which chunks should overlap (#1). The argument defaults to None, in which case no overlapping occurs. Both arguments are illustrated in the sketch after this list.
  • Added an undocumented, private _make_chunk_function() method to the Chunker class that constructs chunking functions with call-level arguments passed.
  • Added more unit tests for new features as well as for multiple token counters and for ensuring that no chunks consist entirely of whitespace characters.
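A brief sketch of the new offsets and overlap arguments, assuming a simple word-counting token counter; the unpacking below assumes that offsets=True returns the chunks together with their (start, end) offsets.

```python
import semchunk

word_counter = lambda text: len(text.split())
chunker = semchunk.chunkerify(word_counter, chunk_size=6)

text = "The quick brown fox jumps over the lazy dog and then runs away again."

# offsets=True also returns the (start, end) character offsets of each chunk
# within the original text.
chunks, offsets = chunker(text, offsets=True)

# overlap < 1 is interpreted as a proportion of the chunk size; overlap >= 1
# is interpreted as an absolute number of tokens.
overlapping_chunks = chunker(text, overlap=0.5)
```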

Changed

  • Began removing chunks consisting entirely of whitespace characters from the output of chunk().
  • Updated semchunk's description from 'A fast and lightweight Python library for splitting text into semantically meaningful chunks.' to 'A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks.'.

Fixed

  • Fixed a typo in the docstring for the __call__() method of the Chunker class returned by chunkerify() where most of the documentation for the arguments was listed under the section for the method's returns.

Removed

  • Removed undocumented, private chunk() method from the Chunker class returned by chunkerify().
  • Removed undocumented, private _reattach_whitespace_splitters argument of chunk() that was introduced to experiment with potentially adding support for overlap ratios.