[FEA] Ability to control the amount of temporary memory used for regex expressions #10852

Open
@jlowe

Description

Is your feature request related to a problem? Please describe.
Regular expression processing can require a significant amount of temporary memory. The RAPIDS Accelerator for Apache Spark needs the ability to control how much GPU memory these operations use, in order to avoid excessive spilling or GPU out-of-memory errors when the user provides a particularly complicated regex pattern, large input data, or both.

Describe the solution you'd like
The libcudf regular expression APIs would accept an optional parameter specifying an upper bound on the amount of temporary GPU memory to use for regular expression processing. If the value is below the "natural" size for full concurrency, the algorithm would reduce its concurrency to fit within the memory bound. I would expect there to be a lower limit below which regex processing is not possible at all.
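To make the requested behavior concrete, here is a minimal sketch of the batching arithmetic, not libcudf code. It assumes temporary memory scales linearly with the number of strings processed concurrently; the function `plan_regex_batches` and its parameters `state_bytes_per_row` and `memory_limit_bytes` are hypothetical names used only for illustration.

```python
def plan_regex_batches(num_rows, state_bytes_per_row, memory_limit_bytes=None):
    """Return (rows_per_batch, num_batches) for a hypothetical regex kernel.

    Assumption: each row processed concurrently needs a fixed amount of
    temporary state, so the "natural" size is num_rows * state_bytes_per_row.
    """
    if num_rows <= 0 or state_bytes_per_row <= 0:
        raise ValueError("num_rows and state_bytes_per_row must be positive")

    natural_bytes = num_rows * state_bytes_per_row

    # No limit, or the limit already covers full concurrency: one pass.
    if memory_limit_bytes is None or memory_limit_bytes >= natural_bytes:
        return num_rows, 1

    # Reduce concurrency: only as many rows at once as the limit allows.
    rows_per_batch = memory_limit_bytes // state_bytes_per_row
    if rows_per_batch == 0:
        # Lower limit: the state for a single row does not fit.
        raise ValueError("memory limit is below the minimum needed for one row")

    num_batches = -(-num_rows // rows_per_batch)  # ceiling division
    return rows_per_batch, num_batches
```

Under these assumptions, a limit of one quarter of the natural size yields four passes over the data instead of one, which is the memory-for-performance tradeoff the caller would be opting into.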

Describe alternatives you've considered
Instead of APIs focused on limiting memory, there could be APIs that report how much memory will be used without the ability to control it, such as the one implemented in #10808. This type of API does not allow the caller to trade off GPU memory usage against GPU performance: the operation either fits in GPU memory or it doesn't. If the reported size is too big, the RAPIDS Accelerator would be forced to fall back to the CPU to perform the regex processing (with the requisite columnar-to-row data transform and back).

The RAPIDS Accelerator currently does not support falling back to the CPU after query planning has completed on the Spark driver (which does not have a GPU), and query planning does not have access to the string data to search (only the regex pattern to use). Even with a memory size reporting API, without the input data the API would have to return a worst-case estimate, which could cause an unnecessary fallback to the CPU.

Labels

feature request: New feature or request
libcudf: Affects libcudf (C++/CUDA) code.
strings: strings issues (C++ and Python)
