### 🚀 The feature, motivation and pitch
Similarly to #8932, we should be able to conditionally compile portable ops to do some vectorization. I imagine this would look like either passing a second lambda to our util functions, or perhaps passing template lambdas that we could then use for both some scalar `T` and also `Vectorized<T>`. The second option would require us to get an std-workalike interface to `Vectorized` operations so that things like `exp` would work seamlessly, which would probably have a similar solution to pytorch/pytorch#144495.
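
As a quick feasibility sketch of the second option (all types here are toys: `Vec4` stands in for `Vectorized<float>` purely to keep the example self-contained and compilable), a single generic lambda can already be instantiated at both a scalar and a vector type:

```cpp
#include <cstddef>

// Toy 4-lane stand-in for Vectorized<float>; the real class lives in
// ATen/ExecuTorch, this exists only to make the sketch self-contained.
struct Vec4 {
  float lane[4];
  friend Vec4 operator+(Vec4 a, Vec4 b) {
    for (std::size_t i = 0; i < 4; ++i) a.lane[i] += b.lane[i];
    return a;
  }
  friend Vec4 operator*(float s, Vec4 v) {
    for (std::size_t i = 0; i < 4; ++i) v.lane[i] *= s;
    return v;
  }
};

int main() {
  const float alpha = 2.0f;
  // One generic lambda, two instantiations.
  auto op = [alpha](auto a, auto b) { return a + alpha * b; };
  float scalar_result = op(1.0f, 3.0f);                             // scalar path
  Vec4 vector_result = op(Vec4{{1, 1, 1, 1}}, Vec4{{3, 3, 3, 3}});  // vector path
  (void)scalar_result;
  (void)vector_result;
}
```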
### RFC
As a concrete example, `op_add` currently calls a util workhorse function with a lambda:
```cpp
utils::apply_bitensor_elementwise_fn<CTYPE_COMPUTE, op_name>(
    [val_alpha](const CTYPE_COMPUTE val_a, const CTYPE_COMPUTE val_b) {
      return val_a + val_alpha * val_b;
    },
    // ... remaining arguments elided
```
We could imagine instead making the call look like this, with a template lambda, so that we could seamlessly use the lambda with `Vectorized`:
```cpp
utils::apply_bitensor_elementwise_fn<CTYPE_COMPUTE, op_name>(
    [val_alpha](const auto val_a, const auto val_b) {
      return val_a + val_alpha * val_b;
    },
    // ... remaining arguments elided
```
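
A util function receiving such a generic lambda could then, hypothetically, drive a vectorized main loop and a scalar tail with the same op. Here is a rough self-contained sketch; `Vec4` and `binary_loop` are made-up stand-ins for `Vectorized<float>` and the inner loop of `apply_bitensor_elementwise_fn`, not real APIs:

```cpp
#include <cstddef>

// Toy stand-in for Vectorized<float> with just enough surface for the sketch.
struct Vec4 {
  float lane[4];
  static constexpr std::size_t size() { return 4; }
  static Vec4 loadu(const float* p) {
    Vec4 v;
    for (std::size_t i = 0; i < 4; ++i) v.lane[i] = p[i];
    return v;
  }
  void storeu(float* p) const {
    for (std::size_t i = 0; i < 4; ++i) p[i] = lane[i];
  }
  friend Vec4 operator+(Vec4 a, Vec4 b) {
    for (std::size_t i = 0; i < 4; ++i) a.lane[i] += b.lane[i];
    return a;
  }
  friend Vec4 operator*(float s, Vec4 v) {
    for (std::size_t i = 0; i < 4; ++i) v.lane[i] *= s;
    return v;
  }
};

// Hypothetical inner loop: the vector body and the scalar tail share one op.
template <typename Op>
void binary_loop(const float* a, const float* b, float* out, std::size_t n, Op op) {
  std::size_t i = 0;
  for (; i + Vec4::size() <= n; i += Vec4::size()) {
    op(Vec4::loadu(a + i), Vec4::loadu(b + i)).storeu(out + i);  // Vec4 instantiation
  }
  for (; i < n; ++i) {
    out[i] = op(a[i], b[i]);  // float instantiation handles the remainder
  }
}
```

Calling `binary_loop(a, b, out, n, [alpha](auto x, auto y) { return x + alpha * y; })` exercises both instantiations of the same lambda.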
A second, harder example is `op_exp`:
```cpp
Tensor& exp_out(KernelRuntimeContext& ctx, const Tensor& in, Tensor& out) {
  return internal::unary_ufunc_realhbbf16_to_floathbf16(std::exp, ctx, in, out);
}
```
I think ideally we would find a solution to the above-mentioned PyTorch issue and then write this as
```cpp
Tensor& exp_out(KernelRuntimeContext& ctx, const Tensor& in, Tensor& out) {
  return internal::unary_ufunc_realhbbf16_to_floathbf16_v2(
      [](auto x) { return c10::math::exp(x); }, ctx, in, out);
}
```
using a template lambda that could be instantiated with either a scalar or `Vectorized`, as outlined above.
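
To sketch what the std-workalike interface might look like (hypothetically; the namespace name, the trait-based overload set, and `FakeVec` below are all assumptions, not the actual resolution of pytorch/pytorch#144495), one name can dispatch to `std::exp` for scalars and to a member `exp()` for `Vectorized`-like types:

```cpp
#include <cmath>
#include <type_traits>

// Hypothetical "c10::math"-style shim: one name that resolves to std::exp
// for scalars and to the member exp() that Vectorized-like types expose.
namespace math_shim {

template <typename T, std::enable_if_t<std::is_floating_point_v<T>, int> = 0>
T exp(T x) {
  return std::exp(x);  // scalar path
}

template <typename V, std::enable_if_t<!std::is_floating_point_v<V>, int> = 0>
V exp(const V& x) {
  return x.exp();  // vector path, e.g. a Vectorized<float>-style member exp()
}

}  // namespace math_shim

// Toy vector type standing in for Vectorized<float> so the sketch compiles.
struct FakeVec {
  float lane[4];
  FakeVec exp() const {
    FakeVec r;
    for (int i = 0; i < 4; ++i) r.lane[i] = std::exp(lane[i]);
    return r;
  }
};

int main() {
  auto op = [](auto x) { return math_shim::exp(x); };  // the lambda from the RFC
  float s = op(1.0f);                                  // scalar overload
  FakeVec v = op(FakeVec{{0.5f, 1.0f, 1.5f, 2.0f}});   // vector overload
  (void)s;
  (void)v;
}
```

The lane-wise `FakeVec::exp()` is only for illustration; the point is that the `[](auto x) { return c10::math::exp(x); }` lambda above would instantiate cleanly for both kinds of argument.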