Add memory resources to all nvtext APIs #20119

vyasr · 2025-09-26T03:51:46Z

Description

Contributes to #15170

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

- nvtext/tokenize: 7 functions (tokenize_scalar, tokenize_column, count_tokens_scalar, count_tokens_column, character_tokenize, detokenize, tokenize_with_vocabulary)

All functions now accept optional DeviceMemoryResource parameter for GPU memory management. Updated .pxd, .pyx, and .pyi files with consistent signatures following established patterns. This module provides comprehensive text tokenization capabilities with fine-grained memory control.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Update 3 functions to accept DeviceMemoryResource parameter:
  - generate_ngrams, generate_character_ngrams, hash_character_ngrams
* Thread memory resource through Column.from_libcudf calls
* Update corresponding .pxd and .pyi files for API consistency
* Maintains backwards compatibility with optional mr parameter

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Update 2 functions to accept DeviceMemoryResource parameter:
  - edit_distance, edit_distance_matrix
* Thread memory resource through Column.from_libcudf calls
* Update corresponding .pxd and .pyi files for API consistency
* Add missing Column import to enable compilation
* Maintains backwards compatibility with optional mr parameter

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Update 2 functions to accept DeviceMemoryResource parameter:
  - replace_tokens, filter_tokens
* Thread memory resource through Column.from_libcudf calls
* Update corresponding .pxd and .pyi files for API consistency
* Maintains backwards compatibility with optional mr parameter

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Updated normalize.pyx and normalize.pxd to support DeviceMemoryResource for text normalization functions:

- normalize_spaces: Normalize whitespace in strings
- normalize_characters: Normalize characters for tokenization
- Added DeviceMemoryResource parameters to function signatures in .pxd file
- Updated functions to accept and process DeviceMemoryResource parameters
- Pass memory resource to Column.from_libcudf() for proper memory management

Updated byte_pair_encode.pyx and byte_pair_encode.pxd to support DeviceMemoryResource for byte-pair encoding:

- byte_pair_encoding: Encode strings using byte-pair encoding algorithm
- Added DeviceMemoryResource parameter to function signature in .pxd file
- Updated function to accept and process DeviceMemoryResource parameter
- Pass memory resource to Column.from_libcudf() for proper memory management

Updated ngrams_tokenize.pyx and ngrams_tokenize.pxd to support DeviceMemoryResource for n-gram tokenization:

- ngrams_tokenize: Generate n-grams from tokenized strings
- Added DeviceMemoryResource parameter to function signature in .pxd file
- Updated function to accept and process DeviceMemoryResource parameter
- Pass memory resource to Column.from_libcudf() for proper memory management

…cate, wordpiece_tokenize)

This commit finalizes DeviceMemoryResource support across all 12 nvtext modules in pylibcudf:

**Modules Updated (12/12):**
- byte_pair_encode, deduplicate, edit_distance, generate_ngrams
- jaccard, minhash, ngrams_tokenize, normalize, replace
- stemmer, tokenize, wordpiece_tokenize

**Changes Made:**
- Added device_memory_resource* parameters to all C++ function declarations in libcudf .pxd files
- Updated all .pyx files to include DeviceMemoryResource parameters and mr.get_mr() calls
- Added DeviceMemoryResource imports and parameters to all .pyi type stub files
- Fixed function signature alignment between .pxd and .pyx files
- Ensured proper memory resource handling for functions that support it in C++

**Key Fixes:**
- Resolved minhash compilation issue caused by misaligned .pxd/.pyx signatures
- Properly handled functions that don't support DeviceMemoryResource (e.g., stemmer.is_letter)
- Added comprehensive type annotations for better IDE support

**Build Status:** ✅ All modules compile successfully
**Test Status:** ✅ Full pylibcudf build passes

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
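Because the commit notes that some functions (e.g., stemmer.is_letter) deliberately have no memory resource parameter, a small audit helper is one way to confirm that every other public function exposes `mr`. The sketch below is an illustration only: `functions_missing_mr` is a hypothetical helper, and the two stand-in functions use simplified, assumed signatures rather than the real pylibcudf ones. It relies on `inspect.signature`, which may not work on compiled Cython functions unless they were built with signature information, so treat it as a pattern rather than a drop-in check.

```python
import inspect
from types import SimpleNamespace


def functions_missing_mr(namespace, exempt=frozenset()):
    """Return sorted names of public callables in `namespace` lacking an `mr` parameter."""
    missing = []
    for name, obj in vars(namespace).items():
        # Skip private names, non-callables, and explicitly exempted functions.
        if name.startswith("_") or not callable(obj) or name in exempt:
            continue
        if "mr" not in inspect.signature(obj).parameters:
            missing.append(name)
    return sorted(missing)


# Stand-ins with simplified, assumed signatures (not the real pylibcudf ones).
def edit_distance(input, targets, mr=None):
    return None


def is_letter(input, check_vowels, indices):
    return None


# A fake "module" holding the two stand-in functions.
fake_module = SimpleNamespace(edit_distance=edit_distance, is_letter=is_letter)

# Without an exemption list, the function that lacks `mr` is reported.
unexpected = functions_missing_mr(fake_module)

# Listing known exceptions keeps the audit quiet for intentional omissions.
remaining = functions_missing_mr(fake_module, exempt={"is_letter"})
```

Keeping the exemptions explicit means a newly added function without `mr` shows up in the report instead of being silently skipped.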

vyasr · 2025-09-26T18:03:18Z

/merge

Contributes to rapidsai#15170

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Matthew Roeschke (https://github.com/mroeschke)

URL: rapidsai#20119

vyasr and others added 11 commits September 26, 2025 02:19

Add DeviceMemoryResource support to nvtext/stemmer module

47648c3

🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

Add DeviceMemoryResource support to nvtext/jaccard module

9f935ff

🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

Add DeviceMemoryResource support to remaining nvtext modules (dedupli…

b38ca22

…cate, wordpiece_tokenize)

vyasr self-assigned this Sep 26, 2025

vyasr requested a review from a team as a code owner September 26, 2025 03:51

vyasr added the feature request New feature or request label Sep 26, 2025

vyasr requested a review from TomAugspurger September 26, 2025 03:51

vyasr added the non-breaking Non-breaking change label Sep 26, 2025

vyasr requested a review from brandon-b-miller September 26, 2025 03:51

github-actions bot added Python Affects Python cuDF API. pylibcudf Issues specific to the pylibcudf package labels Sep 26, 2025

github-project-automation bot added this to cuDF Python Sep 26, 2025

GPUtester moved this to In Progress in cuDF Python Sep 26, 2025

mroeschke approved these changes Sep 26, 2025

rapids-bot bot merged commit 52cc7f5 into rapidsai:branch-25.12 Sep 26, 2025
132 checks passed

github-project-automation bot moved this from In Progress to Done in cuDF Python Sep 26, 2025

vyasr deleted the feat/memory_resource_part9 branch September 26, 2025 18:03
