SPARKNLP-925 DocumentTokenSplitter #14053

DevinTDHa · 2023-11-04T17:49:43Z

Description

This PR adds the DocumentTokenSplitter annotator. It takes a large body of text and splits it into chunks of a given number of tokens. Currently, tokens are created by splitting the text on whitespace, and the number of tokens is used as the measure of text length. Support for other tokenization techniques is planned for the future.
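As a rough illustration of the idea described above (not the annotator's actual implementation), whitespace-based splitting by token count can be sketched in plain Python. The function name `split_by_token_count` and the parameter `num_tokens` are hypothetical and are not taken from the Spark NLP API. The sketch also rejoins tokens with a single space, which the real annotator may not do.

```python
from typing import List


def split_by_token_count(text: str, num_tokens: int) -> List[str]:
    """Split text into chunks of at most num_tokens whitespace-separated tokens.

    Hypothetical helper for illustration only; not part of Spark NLP.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    # str.split() with no argument splits on any run of whitespace
    tokens = text.split()
    # Walk the token list in steps of num_tokens and rejoin each slice
    return [
        " ".join(tokens[i : i + num_tokens])
        for i in range(0, len(tokens), num_tokens)
    ]


chunks = split_by_token_count("one two three four five six seven", 3)
print(chunks)  # ['one two three', 'four five six', 'seven']
```

The last chunk may hold fewer tokens than requested, since the text length is rarely an exact multiple of the chunk size.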

The PR also includes some minor fixes for DocumentCharacterTextSplitter.

Motivation and Context

This annotator makes it easy to split a large text into chunks that can be fed into language models.

How Has This Been Tested?

New and existing tests pass.

Screenshots (if appropriate):

Types of changes

Bug fix (non-breaking change which fixes an issue)
Code improvements with no or little impact
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING page.
I have added tests to cover my changes.
All new and existing tests passed.

- Scala Side

- Python Side - Documentation

SPARKNLP-925: DocumentTokenSplitter

08a41a1

- Scala Side

DevinTDHa added the new-feature (Introducing a new feature) label Nov 4, 2023

DevinTDHa requested a review from maziyarpanahi November 4, 2023 17:49

DevinTDHa self-assigned this Nov 4, 2023

SPARKNLP-925: DocumentTokenSplitter

e978a97

- Python Side - Documentation

DevinTDHa force-pushed the feature/SPARKNLP-925-DocumentTokenTextSplitter branch from 69484b7 to e978a97 December 2, 2023 17:32

maziyarpanahi approved these changes Dec 7, 2023

maziyarpanahi changed the base branch from master to release/520-release-candidate December 7, 2023 18:32

maziyarpanahi merged commit 97a541b into JohnSnowLabs:release/520-release-candidate Dec 7, 2023

maziyarpanahi mentioned this pull request Dec 7, 2023

520-release-candidate #14084

Merged

2 participants
