Skip to content

[libc++] Start implementing std::datapar::simd #139919

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions libcxx/include/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -714,6 +714,10 @@ set(files
__ranges/view_interface.h
__ranges/views.h
__ranges/zip_view.h
__simd/abi.h
__simd/basic_simd.h
__simd/basic_simd_mask.h
__simd/simd_flags.h
__split_buffer
__std_mbstate_t.h
__stop_token/atomic_unique_lock.h
Expand Down Expand Up @@ -863,6 +867,7 @@ set(files
__type_traits/maybe_const.h
__type_traits/nat.h
__type_traits/negation.h
__type_traits/pack_utils.h
__type_traits/promote.h
__type_traits/rank.h
__type_traits/remove_all_extents.h
Expand All @@ -875,6 +880,7 @@ set(files
__type_traits/remove_reference.h
__type_traits/remove_volatile.h
__type_traits/result_of.h
__type_traits/standard_types.h
__type_traits/strip_signature.h
__type_traits/type_identity.h
__type_traits/type_list.h
Expand Down Expand Up @@ -1029,6 +1035,7 @@ set(files
semaphore
set
shared_mutex
simd
source_location
span
sstream
Expand Down
165 changes: 165 additions & 0 deletions libcxx/include/__simd/abi.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,165 @@
//===----------------------------------------------------------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifndef _LIBCPP___SIMD_ABI_H
#define _LIBCPP___SIMD_ABI_H

#include <__concepts/convertible_to.h>
#include <__concepts/equality_comparable.h>
#include <__config>
#include <__cstddef/size_t.h>
#include <__type_traits/standard_types.h>
#include <__utility/integer_sequence.h>
#include <cstdint>

#if _LIBCPP_STD_VER >= 26

_LIBCPP_BEGIN_NAMESPACE_STD
namespace datapar {

template <class _Tp>
inline constexpr bool __is_vectorizable_type_v = __is_standard_integer_type_v<_Tp> || __is_character_type_v<_Tp>;

template <>
inline constexpr bool __is_vectorizable_type_v<float> = true;

template <>
inline constexpr bool __is_vectorizable_type_v<double> = true;

template <class _From, class _To>
concept __value_preserving_convertible = requires(_From __from) { _To{__from}; };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this at least requires:

requires (integral<_From> && integral<_To>) || (floating_point<_From> && floating_point<_To>)

Or something similar, maybe __is_vectorizable_type instead. But either way this needs a requirement to make it clear that this isn't a very general utility despite its name.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't correct. int32_t -> float64_t is value-preserving, as is int16_t -> float32_t. However e.g. int32_t -> uint32_t is not value-preserving.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you folks have an implementation in mind for the notion of being value-preserving? From https://eel.is/c++draft/simd#general-8:

The conversion from an arithmetic type U to a vectorizable type T is value-preserving if all possible values of U can be represented with type T.

Is there a clever implementation trick to do this, or do we need to list them manually, or is there something logical that we can do and we're not seeing?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IEEE floating point numbers of type T~ can exactly represent integers up to +/-(1 << std::numeric_limits::digits()`).

So something like

template <typename From, typename To>
concept value_preserving = std::integral<From> ? 
               (std::floating_point<To> ? std::numeric_limits<To>::digits() >= 
                                          8*sizeof(From) - std::signed_integral<From> :
           /* handle conversion to integer */     ) : 
           /* handle conversion from non-integer */

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this just numeric_limits<To>::digits >= numeric_limits<From>::digits?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a much better way. Just add an extra check to make sure that is_signed_v<To> || !is_signed_v<From>.

Are these supposed to work with <stdfloat>? If so, we also want To's exponent to be at least as large. In particular, float16 has more digits than bfloat16, but a smaller exponent, so neither can represent all values from the other.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know of any trick. This is what I implemented:

  template <typename _From, typename _To>
    concept __arithmetic_only_value_preserving_convertible_to
      = convertible_to<_From, _To> and __arithmetic<_From> and __arithmetic<_To>
          and not (is_signed_v<_From> and is_unsigned_v<_To>)
          and numeric_limits<_From>::digits <= numeric_limits<_To>::digits
          and numeric_limits<_From>::max() <= numeric_limits<_To>::max()
          and numeric_limits<_From>::lowest() >= numeric_limits<_To>::lowest();

  template <typename _From, typename _To>
    concept __value_preserving_convertible_to
      = __arithmetic_only_value_preserving_convertible_to<_From, _To>
          or (__complex_like<_To> and __arithmetic_only_value_preserving_convertible_to<
                                        _From, typename _To::value_type>);

I think bfloat16_t and float16_t need another check besides only for digits. bfloat16_t is not vectorizable yet, but it's arithmetic at least. The max and lowest compares are mixed-type comparisons but I so far I have convinced myself that even with implicit conversions this concept always gives the right answer.


template <class _Tp>
concept __constexpr_wrapper_like =
convertible_to<_Tp, decltype(_Tp::value)> && equality_comparable_with<_Tp, decltype(_Tp::value)> &&
bool_constant<_Tp() == _Tp::value>::value &&
bool_constant<static_cast<decltype(_Tp::value)>(_Tp()) == _Tp::value>::value;

// [simd.expos]
using __simd_size_type = int;

template <class _Tp>
struct __deduce_abi;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest forward-declaring __vector_size_abi and defining __deduce_abi right here instead of the other way around. Since __deduce_abi is a simple metafunction, it feels kinda weird to have a forward declaration for it and to require jumping around.


template <class _Tp, __simd_size_type _Np>
requires __is_vectorizable_type_v<_Tp> && (_Np <= 64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this 64 value should be a static constexpr with at least a comment explaining why we chose that value over something else. It also needs to be documented in the documentation for implementation-defined values.

If the rationale is as simple as "this is for MSVC compatibility", that's fine.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that simd<int, N> always is a complete type, independent of N. You can only turn it into a disabled specialization if N is "too large". (64 is the minimum required by the spec, you're free to support higher width)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@philnik777 I think we're missing the whole https://eel.is/c++draft/simd#overview-1 in the current state of the patch. IIUC, we'd need something like:

// disabled "specialization"
template <class T, class ABI>
class basic_simd {
  basic_simd() = delete;
  basic_simd(basic_simd const&) = delete;
  basic_simd& operator=(basic_simd const&) = delete;
  ~basic_simd() = delete;

  using value_type = T;
  using abi_type = ABI;
  using mask_type = ???;
};

// enabled specialization
template <class T, class ABI>
  requires __is_vectorizable_type<T> && (N-from-deduce-abi-t <= 64)
class basic_simd {
  // what you have right now
};

@mattkretz Is there an intent that users can specialize basic_simd for their own ABI types? If not, @philnik777 we should mark it as [[no_specializations]].

@philnik777 Either way, this would need a test.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no intent for other ABI tags than what the implementation defines. But I'm not sure why we should ban partial specializations of basic_simd for other non-impl-defined ABI tags. Seems like a potentially useful hack that shouldn't cost us any extra work, no?

using __deduce_abi_t = __deduce_abi<_Tp>::template __apply<_Np>;

template <class _Tp>
using __native_abi = __deduce_abi<_Tp>::template __apply<4>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this really be hardcoded to 4?


template <class _Tp, class _Abi>
inline constexpr __simd_size_type __simd_size_v = 0;

template <size_t>
struct __integer_from_impl;

template <>
struct __integer_from_impl<1> {
using type = uint8_t;
};

template <>
struct __integer_from_impl<2> {
using type = uint16_t;
};

template <>
struct __integer_from_impl<4> {
using type = uint32_t;
};

template <>
struct __integer_from_impl<8> {
using type = uint64_t;
};

template <size_t _Bytes>
using __integer_from = __integer_from_impl<_Bytes>::type;

// ABI Types

template <class _Tp, __simd_size_type _Np>
struct __vector_size_abi {
using _SimdT [[__gnu__::__vector_size__(_Np * sizeof(_Tp))]] = _Tp;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to use the functionality provided by the compiler-provided SIMD type (e.g. the operators), however this makes it unclear/implicit what API is required from a _SimdT. So for example, if we were to implement a specialization of __vector_size_abi for complex types which requires a custom struct _SimdT, it wouldn't be obvious what API that _SimdT should have. I think it would be acceptable to solve that problem by simply documenting the API that we're expecting and to ensure that we don't use any other properties beyond that from basic_simd.h & friends.

using _MaskT [[__gnu__::__vector_size__(_Np * sizeof(_Tp))]] = __integer_from<sizeof(_Tp)>;

_LIBCPP_ALWAYS_INLINE constexpr _SimdT __select(_MaskT __mask, _SimdT __true, _SimdT __false) {
return __mask ? __true : __false;
}

# ifdef _LIBCPP_COMPILER_CLANG_BASED
using _BoolVec __attribute__((__ext_vector_type__(_Np))) = bool;

static constexpr auto __int_size = _Np <= 8 ? 8 : _Np <= 16 ? 16 : _Np <= 32 ? 32 : 64;
static_assert(__int_size >= _Np);

using _IntSizeBoolVec __attribute__((__ext_vector_type__(__int_size))) = bool;

_LIBCPP_ALWAYS_INLINE static constexpr auto __mask_to_int(_BoolVec __mask) noexcept {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make whatever we can private to make it clear what's the intended API of __vector_size (i.e. what we expect basic_simd & friends to use.

return [&]<size_t... _Origs, size_t... _Fillers>(index_sequence<_Origs...>, index_sequence<_Fillers...>)
_LIBCPP_ALWAYS_INLINE {
auto __vec = __builtin_convertvector(
__builtin_shufflevector(__mask, _BoolVec{}, _Origs..., ((void)_Fillers, _Np)...), _IntSizeBoolVec);
if constexpr (_Np <= 8)
return __builtin_bit_cast(unsigned char, __vec);
else if constexpr (_Np <= 16)
return __builtin_bit_cast(unsigned short, __vec);
else if constexpr (_Np <= 32)
return __builtin_bit_cast(unsigned int, __vec);
else
return __builtin_bit_cast(unsigned long long, __vec);
}(make_index_sequence<_Np>{}, make_index_sequence<__int_size - _Np>{});
}

_LIBCPP_ALWAYS_INLINE static constexpr bool __any_of(_MaskT __mask) noexcept {
return __builtin_reduce_or(__builtin_convertvector(__mask, _BoolVec));
}

_LIBCPP_ALWAYS_INLINE static constexpr bool __all_of(_MaskT __mask) noexcept {
return __builtin_reduce_and(__builtin_convertvector(__mask, _BoolVec));
}

_LIBCPP_ALWAYS_INLINE static constexpr __simd_size_type __reduce_count(_MaskT __mask) noexcept {
return __builtin_reduce_add(__builtin_convertvector(__builtin_convertvector(__mask, _BoolVec), _MaskT));
}

_LIBCPP_ALWAYS_INLINE static constexpr __simd_size_type __reduce_min_index(_MaskT __mask) noexcept {
return __builtin_ctzg(__mask_to_int(__builtin_convertvector(__mask, _BoolVec)));
}

_LIBCPP_ALWAYS_INLINE static constexpr __simd_size_type __reduce_max_index(_MaskT __mask) noexcept {
return __int_size - 1 - __builtin_clzg(__mask_to_int(__builtin_convertvector(__mask, _BoolVec)));
}
# else
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to be implemented on GCC before the patch can land.

_LIBCPP_ALWAYS_INLINE constexpr bool __any_of(_MaskT __mask) noexcept {
for (size_t __i = 0; __i != _Np; ++__i) {
if (__mask[__i])
return true;
}
return false;
}
# endif
};

template <class _Tp>
requires __is_vectorizable_type_v<_Tp>
struct __deduce_abi<_Tp> {
template <__simd_size_type _Np>
using __apply = __vector_size_abi<_Tp, _Np>;
};
Comment on lines +150 to +155

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing another specialization, but this will lead to ABI troubles. Consider simd<int, 8>. As defined you'll get the ABI of a [[__gnu__::__vector_size__(8 * sizeof(int))]] type. But that changes ABI depending on AVX compiler flags (two XMM registers vs. one YMM register). The intent of ABI tags is to have a different basic_simd specialization for simd<int, 8> depending on -mavx2.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks for the heads up -- this is an interesting comment and something we missed. Nikolas was handling some of these calling convention differences via _LIBCPP_ALWAYS_INLINE, see this comment, but I think neither of us realized that they would impact any type that also contains one of these [[__gnu__::__vector_size__(8 * sizeof(int))]] types. So it looks like we need to somehow encode the ABI differences in this type, and I think we'll be able to drop all of the uses of _LIBCPP_ALWAYS_INLINE as a result.

I might suggest something like this:

enum class _SimdAbiAffectingFlags {
  _None = 1,
  _AVX2 = 1 << 2,
  _Whatever = 1 << 3,
  // etc...
};

consteval _SimdAbiAffectingFlags _CurrentFlags = _SimdAbiAffectingFlags::_None
#ifdef __AVX2__
  | _SimdAbiAffectingFlags::_AVX2
#endif
#ifdef __WHATEVER__
  | _SimdAbiAffectingFlags::_Whatever
#endif
;

and then we pass _CurrentFlags into __vector_size_abi. This is pseudo-code but I guess something like that would work.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The representation of masks affects ABI (vector masks vs. bit masks). Also consider AVX w/o AVX2: you might want to change the ABI of fp simd but not for integral simd.

I also chose the approach that at some point my ABI tags become public API. So even with -mavx2 the user can explicitly instantiate basic_simd with the SSE ABI. This can be useful for I/O or interfacing with shared libraries.

With Clang you have the simplification that __vector_size__ supports any multiple of the value type's sizeof. GCC only supports powers of 2. That informed my ABI choices. (I don't want to require 64 Bytes for a simd<int, 12> — with native 128-bit vectors.)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and the ODR issues are more fine-grained than the ABI issues. E.g. SSE2, SSE3, SSSE3, SSE4a, XOP, SSE4.1, SSE4.2 all use the same ABI, but code-gen for inline functions can be different. Most importantly running SSE4 code on an SSE3 systems is a SIGILL. Expect users to compile their TUs multiple times with different compiler flags, link them into one binary and then dispatch according to CPUID. 💣


template <class _Tp, __simd_size_type _Np>
inline constexpr __simd_size_type __simd_size_v<_Tp, __vector_size_abi<_Tp, _Np>> = _Np;

} // namespace datapar
_LIBCPP_END_NAMESPACE_STD

#endif // _LIBCPP_STD_VER >= 26

#endif // _LIBCPP___SIMD_ABI_H
Loading
Loading