Add new nvtext::normalize_characters API #17818

davidwendt · 2025-01-24T21:29:14Z

Description

Adds new normalizer APIs as part of the rework for the subword-tokenizer.
The new API is split into two parts. First, a normalizer object is created with the appropriate state: the lower-case setting and the special tokens. The normalizing tables are currently hardcoded inside libcudf; future versions may load these tables from some other source. The second API takes the input strings column and the normalizer object and returns a normalized strings column. The normalizer object can be reused on all subsequent normalize_characters calls.

The current nvtext::normalize_characters loads the normalizing tables on each call, which can add significant overhead. That API will be deprecated and replaced by the two new ones. Some utility functions from its implementation have been refactored so that both the old and new APIs can use them until the old one is removed.
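
To illustrate the create-once, reuse-many pattern described above, here is a minimal host-side sketch using only the C++ standard library. It is not the libcudf implementation and does not call libcudf. The names `Normalizer`, `create_normalizer`, `normalize`, and `g_table_loads` are hypothetical. The behavior (ASCII lower-casing with special tokens copied through unchanged) is a simplifying assumption made only for this example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the normalizer object; NOT the libcudf type.
struct Normalizer {
  bool do_lower_case;
  std::vector<std::string> special_tokens;
};

// Counts how often the (simulated) normalizing tables are built.
static int g_table_loads = 0;

// Part 1: build the normalizer once. The expensive setup happens only here.
Normalizer create_normalizer(bool do_lower_case, std::vector<std::string> special_tokens)
{
  for (auto const& tok : special_tokens) {
    assert(!tok.empty());  // assumption for this sketch: tokens are non-empty
  }
  ++g_table_loads;  // stands in for loading the normalizing tables
  return Normalizer{do_lower_case, std::move(special_tokens)};
}

// Part 2: normalize a batch of strings with an existing normalizer.
// Special tokens are copied verbatim; other characters are lower-cased
// when do_lower_case is set (ASCII only in this sketch).
std::vector<std::string> normalize(std::vector<std::string> const& input,
                                   Normalizer const& normalizer)
{
  std::vector<std::string> output;
  output.reserve(input.size());
  for (auto const& s : input) {
    std::string result;
    std::size_t i = 0;
    while (i < s.size()) {
      bool matched = false;
      for (auto const& tok : normalizer.special_tokens) {
        if (s.compare(i, tok.size(), tok) == 0) {
          result += tok;  // keep the special token unchanged
          i += tok.size();
          matched = true;
          break;
        }
      }
      if (matched) { continue; }
      auto const c = static_cast<unsigned char>(s[i]);
      result += normalizer.do_lower_case ? static_cast<char>(std::tolower(c)) : s[i];
      ++i;
    }
    output.push_back(std::move(result));
  }
  return output;
}
```

In this sketch, calling `create_normalizer(true, {"[CLS]", "[SEP]"})` once and passing the result to several `normalize` calls leaves `g_table_loads` at 1, which mirrors the overhead the new two-part API is meant to avoid. The real normalizer works on device-side strings columns and handles far more than ASCII case folding.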

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2025-01-24T21:29:18Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Add new nvtext::normalize_characters API

dd51eb3

davidwendt added labels on Jan 24, 2025: 2 - In Progress (currently a work in progress), libcudf (affects libcudf C++/CUDA code), strings (strings issues, C++ and Python), improvement (improvement / enhancement to an existing function), non-breaking (non-breaking change)

davidwendt self-assigned this Jan 24, 2025
