Skip to content

Commit c0cff01

Browse files
maartenbreddelskszucs
authored andcommitted
ARROW-9128: [C++] Implement string space trimming kernels: trim, ltrim, and rtrim
There is one obvious loose end in this PR, which is where to generate the `std::set` based on the `TrimOptions` (now in the ctor of UTF8TrimBase). I'm not sure what the lifetime guarantees are for this object (TrimOptions), where it makes sense to initialize this set, and when an (utf8 decoding) error occurs, how/where to report this. Although this is not a costly operation, assuming people don't pass in a billion characters to trim, I do wonder what the best approach here is in general. It does not make much sense to create the `std::set` at each `Exec` call, but that is what happens now. (This also seems to happen in `TransformMatchSubstring` for creating the `prefix_table` btw.) Maybe a good place to put per-kernel pre-compute results are the `*Options` objects, but I'm not sure if that makes sense in the current architecture. Another idea is to explore alternatives to the `std::set`. It seem that (based on the TrimManyAscii benchmark), `std::unordered_set` seemed a bit slower, and simply using a linear search: `std::find(options.characters.begin(), options.characters.end(), c) != options.characters.end()` in the predicate instead of the set doesn't seem to affect performance that much. In CPython, a bloom filter is used, I could explore to see if that makes sense, but the implementation in Arrow lives under the parquet namespace. Closes #8621 from maartenbreddels/ARROW-9128 Authored-by: Maarten A. Breddels <maartenbreddels@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
1 parent 5fa201d commit c0cff01

File tree

12 files changed

+731
-32
lines changed

12 files changed

+731
-32
lines changed

cpp/src/arrow/compute/api_scalar.h

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -92,6 +92,13 @@ struct ARROW_EXPORT StrptimeOptions : public FunctionOptions {
9292
TimeUnit::type unit;
9393
};
9494

95+
struct ARROW_EXPORT TrimOptions : public FunctionOptions {
96+
explicit TrimOptions(std::string characters) : characters(std::move(characters)) {}
97+
98+
/// The individual characters that can be trimmed from the string.
99+
std::string characters;
100+
};
101+
95102
enum CompareOperator : int8_t {
96103
EQUAL,
97104
NOT_EQUAL,

cpp/src/arrow/compute/kernels/codegen_internal.h

Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -126,6 +126,35 @@ struct OptionsWrapper : public KernelState {
126126
OptionsType options;
127127
};
128128

129+
/// KernelState adapter for when the state is an instance constructed with the
130+
/// KernelContext and the FunctionOptions as argument
131+
template <typename StateType, typename OptionsType>
132+
struct KernelStateFromFunctionOptions : public KernelState {
133+
explicit KernelStateFromFunctionOptions(KernelContext* ctx, OptionsType state)
134+
: state(StateType(ctx, std::move(state))) {}
135+
136+
static std::unique_ptr<KernelState> Init(KernelContext* ctx,
137+
const KernelInitArgs& args) {
138+
if (auto options = static_cast<const OptionsType*>(args.options)) {
139+
return ::arrow::internal::make_unique<KernelStateFromFunctionOptions>(ctx,
140+
*options);
141+
}
142+
143+
ctx->SetStatus(
144+
Status::Invalid("Attempted to initialize KernelState from null FunctionOptions"));
145+
return NULLPTR;
146+
}
147+
148+
static const StateType& Get(const KernelState& state) {
149+
return ::arrow::internal::checked_cast<const KernelStateFromFunctionOptions&>(state)
150+
.state;
151+
}
152+
153+
static const StateType& Get(KernelContext* ctx) { return Get(*ctx->state()); }
154+
155+
StateType state;
156+
};
157+
129158
// ----------------------------------------------------------------------
130159
// Input and output value type definitions
131160

0 commit comments

Comments
 (0)