@DevinTDHa
Member

Description

This PR adds the DocumentTokenSplitter annotator. It takes a large body of text and splits it into chunks of a given number of tokens. Currently, tokens are created by splitting the text on whitespace, and the token count is used as the measure of text length. In the future, other tokenization techniques will be supported.
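
To illustrate the idea of whitespace-based token chunking, here is a minimal plain-Python sketch. It is not the annotator's implementation or API: the function name `split_by_token_count` and the parameter `num_tokens` are hypothetical, chosen only for this example, and the sketch ignores details the annotator may handle (such as preserving original whitespace or character offsets).

```python
from typing import List


def split_by_token_count(text: str, num_tokens: int) -> List[str]:
    """Split text into chunks of at most num_tokens whitespace-separated tokens.

    Hypothetical helper for illustration only; not part of Spark NLP.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    # str.split() with no arguments splits on any run of whitespace
    # and drops leading/trailing whitespace.
    tokens = text.split()
    # Walk the token list in steps of num_tokens and rejoin each slice
    # with single spaces (original spacing is not preserved).
    return [
        " ".join(tokens[i : i + num_tokens])
        for i in range(0, len(tokens), num_tokens)
    ]


chunks = split_by_token_count("one two three four five six seven", 3)
print(chunks)  # ['one two three', 'four five six', 'seven']
```

The last chunk may hold fewer than `num_tokens` tokens, since the text length is rarely an exact multiple of the chunk size.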

The PR also includes some minor fixes for DocumentCharacterTextSplitter.

Motivation and Context

This annotator makes it easy to split a large text into chunks that can be fed into language models.

How Has This Been Tested?

New and existing tests pass.

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@DevinTDHa DevinTDHa added the new-feature Introducing a new feature label Nov 4, 2023
@DevinTDHa DevinTDHa self-assigned this Nov 4, 2023
- Python Side
- Documentation
@DevinTDHa DevinTDHa force-pushed the feature/SPARKNLP-925-DocumentTokenTextSplitter branch from 69484b7 to e978a97 Compare December 2, 2023 17:32
@maziyarpanahi maziyarpanahi changed the base branch from master to release/520-release-candidate December 7, 2023 18:32
@maziyarpanahi maziyarpanahi merged commit 97a541b into JohnSnowLabs:release/520-release-candidate Dec 7, 2023
