@vyasr vyasr commented Sep 26, 2025

Description

Contributes to #15170

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

vyasr and others added 11 commits September 26, 2025 02:19
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- nvtext/tokenize: 7 functions (tokenize_scalar, tokenize_column, count_tokens_scalar, count_tokens_column, character_tokenize, detokenize, tokenize_with_vocabulary)

All seven functions now accept an optional DeviceMemoryResource parameter for GPU memory management.
The .pxd, .pyx, and .pyi files are updated with consistent signatures that follow the established pattern.
Callers of the tokenize module can now choose the memory resource used for each call.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
* Update 3 functions to accept DeviceMemoryResource parameter:
  - generate_ngrams, generate_character_ngrams, hash_character_ngrams
* Thread memory resource through Column.from_libcudf calls
* Update corresponding .pxd and .pyi files for API consistency
* Maintains backwards compatibility with optional mr parameter

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
* Update 2 functions to accept DeviceMemoryResource parameter:
  - edit_distance, edit_distance_matrix
* Thread memory resource through Column.from_libcudf calls
* Update corresponding .pxd and .pyi files for API consistency
* Add missing Column import to enable compilation
* Maintains backwards compatibility with optional mr parameter

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
* Update 2 functions to accept DeviceMemoryResource parameter:
  - replace_tokens, filter_tokens
* Thread memory resource through Column.from_libcudf calls
* Update corresponding .pxd and .pyi files for API consistency
* Maintains backwards compatibility with optional mr parameter

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Updated normalize.pyx and normalize.pxd to support DeviceMemoryResource for text normalization functions:
- normalize_spaces: Normalize whitespace in strings
- normalize_characters: Normalize characters for tokenization
- Added DeviceMemoryResource parameters to function signatures in .pxd file
- Updated functions to accept and process DeviceMemoryResource parameters
- Pass memory resource to Column.from_libcudf() for proper memory management

Updated byte_pair_encode.pyx and byte_pair_encode.pxd to support DeviceMemoryResource for byte-pair encoding:
- byte_pair_encoding: Encode strings using byte-pair encoding algorithm
- Added DeviceMemoryResource parameter to function signature in .pxd file
- Updated function to accept and process DeviceMemoryResource parameter
- Pass memory resource to Column.from_libcudf() for proper memory management

Updated ngrams_tokenize.pyx and ngrams_tokenize.pxd to support DeviceMemoryResource for n-gram tokenization:
- ngrams_tokenize: Generate n-grams from tokenized strings
- Added DeviceMemoryResource parameter to function signature in .pxd file
- Updated function to accept and process DeviceMemoryResource parameter
- Pass memory resource to Column.from_libcudf() for proper memory management

This commit finalizes DeviceMemoryResource support across all 12 nvtext modules in pylibcudf:

**Modules Updated (12/12):**
- byte_pair_encode, deduplicate, edit_distance, generate_ngrams
- jaccard, minhash, ngrams_tokenize, normalize, replace
- stemmer, tokenize, wordpiece_tokenize

**Changes Made:**
- Added device_memory_resource* parameters to all C++ function declarations in libcudf .pxd files
- Updated all .pyx files to include DeviceMemoryResource parameters and mr.get_mr() calls
- Added DeviceMemoryResource imports and parameters to all .pyi type stub files
- Fixed function signature alignment between .pxd and .pyx files
- Ensured proper memory resource handling for functions that support it in C++

**Key Fixes:**
- Resolved minhash compilation issue caused by misaligned .pxd/.pyx signatures
- Properly handled functions that don't support DeviceMemoryResource (e.g., stemmer.is_letter)
- Added comprehensive type annotations for better IDE support
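The signature-alignment fix above concerns Cython .pxd/.pyx files, which cannot be checked from plain Python. As a rough Python-level analogue, the sketch below compares the parameter names of two callables with `inspect.signature` and reports any that appear in only one of them. The functions `example_impl` and `example_stub` and their parameters are hypothetical and do not reflect the real minhash signatures.

```python
import inspect


def example_impl(input, width, stream=None, mr=None):
    """Hypothetical implementation that already accepts `mr`."""


def example_stub(input, width, stream=None):
    """Hypothetical declaration that was not updated and lacks `mr`."""


def signature_mismatches(impl, stub):
    """Return parameter names present in one signature but not the other."""
    impl_params = list(inspect.signature(impl).parameters)
    stub_params = list(inspect.signature(stub).parameters)
    missing_in_stub = [p for p in impl_params if p not in stub_params]
    missing_in_impl = [p for p in stub_params if p not in impl_params]
    return missing_in_stub, missing_in_impl


missing_in_stub, missing_in_impl = signature_mismatches(example_impl, example_stub)
```

Here `missing_in_stub` names the parameter the declaration is missing. A check along these lines could run in a test suite, although for Cython sources the compiler itself reports the mismatch.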

**Build Status:** ✅ All modules compile successfully
**Test Status:** ✅ Full pylibcudf build passes

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@vyasr vyasr self-assigned this Sep 26, 2025
@vyasr vyasr requested a review from a team as a code owner September 26, 2025 03:51
@vyasr vyasr added the feature request (New feature or request) label Sep 26, 2025
@vyasr vyasr requested a review from TomAugspurger September 26, 2025 03:51
@vyasr vyasr added the non-breaking (Non-breaking change) label Sep 26, 2025
@github-actions github-actions bot added the Python (Affects Python cuDF API.) and pylibcudf (Issues specific to the pylibcudf package) labels Sep 26, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python Sep 26, 2025
vyasr commented Sep 26, 2025

/merge

@rapids-bot rapids-bot bot merged commit 52cc7f5 into rapidsai:branch-25.12 Sep 26, 2025
132 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF Python Sep 26, 2025
@vyasr vyasr deleted the feat/memory_resource_part9 branch September 26, 2025 18:03
TomAugspurger pushed a commit to TomAugspurger/pygdf that referenced this pull request Sep 26, 2025
Labels
feature request (New feature or request), non-breaking (Non-breaking change), pylibcudf (Issues specific to the pylibcudf package), Python (Affects Python cuDF API.)
Projects
Status: Done

2 participants