[SYCL][CUDA] Implementation of matrix ext using new "unified" interface #7077

Merged
48 commits merged on Dec 12, 2022

Commits (48)
fdc4c42
Allow joint_matrix to be loaded from const.
JackAKirk Aug 5, 2022
68d3150
removed duplicates.
JackAKirk Aug 5, 2022
4949464
Layout accumulator is specified at load/store.
JackAKirk Aug 29, 2022
8c09910
joint_matrix_mad takes D matrix as argument.
JackAKirk Aug 30, 2022
e55e5f0
Add new mma cases enabled by joint_matrix_mad.
JackAKirk Sep 1, 2022
a881055
packed_a, packed_b -> packed
JackAKirk Sep 1, 2022
5b84434
Made interface compatible with intel backend.
Sep 14, 2022
75774f2
Merge branch 'sycl' into nvptx-matrix-const
Sep 21, 2022
5c03b3f
Merge branch 'nvptx-matrix-const' into update-matrix-interface
Oct 6, 2022
ccdb544
added unified header, moved nvptx specific impl.
JackAKirk Oct 7, 2022
331760a
Merge branch 'sycl' into update-matrix-interface
JackAKirk Oct 7, 2022
46e87a1
(very) draft updated interfaces.
Oct 7, 2022
766fd8c
cuda joint_matrix partial specializations in separate file.
JackAKirk Oct 10, 2022
32dafa3
Merge branch 'sycl' into unified-interface
JackAKirk Oct 14, 2022
24d3aa1
Refactoring and supporting loading from const.
JackAKirk Oct 14, 2022
b9a051f
Improve error msg and pass by ref.
JackAKirk Oct 17, 2022
3dbeadb
updated device code tests.
JackAKirk Oct 17, 2022
49147d3
Merge branch 'sycl' into unified-interface
JackAKirk Oct 17, 2022
ee1208e
format.
JackAKirk Oct 17, 2022
446c0a0
format
JackAKirk Oct 17, 2022
8da0aa7
fix failed tests.
JackAKirk Oct 18, 2022
36004a0
added access::decorated.
JackAKirk Oct 19, 2022
5f02a0c
Restrict Use to use::accumulator.
Oct 27, 2022
f64e861
Various changes for future Intel/AMD compatibility.
JackAKirk Nov 4, 2022
4b88d94
update joint_matrix constr in tests.
JackAKirk Nov 4, 2022
1f9a8d3
format
JackAKirk Nov 4, 2022
a52eb7c
Merge branch 'sycl' into unified-interface
JackAKirk Nov 4, 2022
4b83846
format
JackAKirk Nov 4, 2022
310fe1c
fix merge
JackAKirk Nov 4, 2022
65888e3
correct check.
JackAKirk Nov 4, 2022
08e4974
Removed group argument from joint_matrix.
JackAKirk Nov 7, 2022
5e8f8d7
Removed separate host get_wi_data def.
JackAKirk Nov 8, 2022
51cbf73
format
JackAKirk Nov 8, 2022
ad6621b
format
JackAKirk Nov 8, 2022
2c4898a
Move wi_data to unified header.
JackAKirk Nov 9, 2022
426e7b3
Add constructor taking Group arg.
JackAKirk Nov 10, 2022
57ddc6b
Removed unnecessary friend functs, moved wi_data.
JackAKirk Nov 16, 2022
d4607a8
format.
JackAKirk Nov 16, 2022
47b6714
Removed unused constructor, get_wi_marray.
JackAKirk Nov 22, 2022
0ca1223
Made get_wi_data free function, removed default group param.
JackAKirk Dec 2, 2022
cede44d
Merge branch 'sycl' into unified-interface
JackAKirk Dec 2, 2022
5abcca6
Take account of bfloat16 moving out of experimental.
JackAKirk Dec 2, 2022
8e46d78
format.
JackAKirk Dec 2, 2022
8bf5048
update device code checks.
JackAKirk Dec 2, 2022
eb51b53
format.
JackAKirk Dec 2, 2022
b9ca55c
get_wi_data no longer auto to host return removed.
JackAKirk Dec 5, 2022
bb6fc5e
Merge branch 'sycl' into unified-interface
JackAKirk Dec 9, 2022
68fcf9a
Merge branch 'sycl' into unified-interface
JackAKirk Dec 10, 2022
@@ -1,4 +1,4 @@
//===---- matrix-tensorcore.hpp - SYCL tensor cores matrix ----*- C++ -*---===//
//===-------------- matrix-tensorcores-legacy.hpp - -----------*- C++ -*---===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
639 changes: 639 additions & 0 deletions sycl/include/sycl/ext/oneapi/matrix/matrix-tensorcores.hpp

Large diffs are not rendered by default.

221 changes: 221 additions & 0 deletions sycl/include/sycl/ext/oneapi/matrix/matrix-unified.hpp
@@ -0,0 +1,221 @@
//===------- matrix-unified.hpp - SYCL matrix extension ----*- C++ -*------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
// ===--------------------------------------------------------------------=== //

#pragma once
#include <sycl/ext/oneapi/matrix/matrix-tensorcores.hpp>

namespace sycl {
__SYCL_INLINE_VER_NAMESPACE(_V1) {
namespace ext {
namespace oneapi {
namespace experimental {
namespace matrix {

template <typename Group, typename T, use Use, size_t Rows, size_t Cols,
layout Layout>
struct joint_matrix {

#if defined(__SYCL_DEVICE_ONLY__) && defined(__SPIR__)
Contributor @yubingex007-a11y commented on Dec 14, 2022:

@AerialMantis @JackAKirk
Sorry, I remember that previously it was:

#if defined(__SYCL_DEVICE_ONLY__)
#if defined(__NVPTX__)
  sycl::ext::oneapi::detail::joint_matrix_cuda<T, Use, Rows, Cols, Layout>
      cuda_impl;
#else
  __spv::__spirv_JointMatrixINTEL<
      T, Rows, Cols, spv_matrix_layout_traits<Layout>::value,
      spv_scope_traits<Group>::value, spv_matrix_use_traits<Use>::value> *spvm;
#endif // defined(__SYCL_DEVICE_ONLY__)
#endif

On the Intel side, we can't let host compilation use sycl::ext::oneapi::detail::joint_matrix_cuda, so I went back to the previous code and the CUDA test cases still pass.

// TODO: Intel case here: we use the ext_oneapi_cuda case also for the host,
// because the Intel SPIRV functions will not be host compilable.
#else
sycl::ext::oneapi::detail::joint_matrix_cuda<T, Use, Rows, Cols, Layout>
cuda_impl;
#endif // defined(__SYCL_DEVICE_ONLY__) && defined(__SPIR__)

joint_matrix() {
#ifndef __SYCL_DEVICE_ONLY__
throw runtime_error("joint matrix is not supported on host device.",
PI_ERROR_INVALID_DEVICE);
#endif
}
};

template <typename Group, typename T, use Use, size_t Rows, size_t Cols,
layout Layout>
inline __SYCL_ALWAYS_INLINE wi_data<Group, T, Use, Rows, Cols, Layout>
get_wi_data(Group sg, joint_matrix<Group, T, Use, Rows, Cols, Layout> &jm) {
#if defined(__SYCL_DEVICE_ONLY__)
#if defined(__NVPTX__)
std::ignore = sg;
return wi_data(jm);
#else
// TODO add Intel impl.
Contributor @yubingex007-a11y commented on Dec 14, 2022:

@AerialMantis @JackAKirk @dkhaldi
Since we can't provide wi_data in both the CUDA and Intel headers, I will make wi_data unified again and provide a host version of wi_data, so the return type should be "decltype(auto)".

#endif // defined(__NVPTX__)
#endif // defined(__SYCL_DEVICE_ONLY__)
}
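
On the CUDA path, get_wi_data gives each work item a view over its slice of the fragment. A minimal usage sketch (the sub-group sg, accumulator fragment sub_c, and alpha scale factor are illustrative names, not code from this PR; the wi_data element interface is assumed to allow read and assignment as below):

// Scale this work item's portion of the accumulator fragment.
auto wi_c = get_wi_data(sg, sub_c);
for (size_t i = 0; i < wi_c.length(); ++i)
  wi_c[i] = alpha * static_cast<float>(wi_c[i]);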

template <typename Group, typename T, size_t NumRows, size_t NumCols, use Use,
layout Layout, typename T2>
inline __SYCL_ALWAYS_INLINE void
joint_matrix_fill(Group sg,
Contributor:

I noticed you don't have any test with joint_matrix_fill. Is that on purpose?

Contributor Author:

We only have device code tests in intel/llvm for functions that call NVPTX builtins; joint_matrix_fill simply sets all elements to the given constant value.

joint_matrix<Group, T, Use, NumRows, NumCols, Layout> &res,
const T2 &v) {
#if defined(__SYCL_DEVICE_ONLY__)
#if defined(__NVPTX__)
std::ignore = sg;
res.cuda_impl.wi_marray = v;
#endif // defined(__NVPTX__)
#else
std::ignore = sg;
std::ignore = res;
std::ignore = v;
throw runtime_error(
"This version of the matrix extension is only currently supported on "
"Nvidia devices",
PI_ERROR_INVALID_DEVICE);
#endif // defined(__SYCL_DEVICE_ONLY__)
}
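
As the reply above notes, joint_matrix_fill just broadcasts a constant into every element of the fragment, so the usual pattern is to zero the accumulator before the k-loop. A sketch with assumed names (sg, M, N are illustrative):

joint_matrix<sycl::sub_group, float, use::accumulator, M, N, layout::dynamic>
    sub_c;
joint_matrix_fill(sg, sub_c, 0.0f);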

template <
typename Group, typename S, typename T, size_t NumRows, size_t NumCols,
access::address_space Space, access::decorated IsDecorated,
std::enable_if_t<std::is_same<S, std::remove_const_t<T>>::value, bool> =
true>
inline __SYCL_ALWAYS_INLINE void joint_matrix_load(
Group sg,
joint_matrix<Group, S, use::accumulator, NumRows, NumCols,
sycl::ext::oneapi::experimental::matrix::layout::dynamic> &res,
multi_ptr<T, Space, IsDecorated> src, size_t stride,
sycl::ext::oneapi::experimental::matrix::layout Layout) {
#if defined(__SYCL_DEVICE_ONLY__)
#if defined(__NVPTX__)
std::ignore = sg;
sycl::ext::oneapi::detail::load_accumulator_cuda(res.cuda_impl, src, stride,
Layout);
#endif // defined(__NVPTX__)
#else
std::ignore = sg;
std::ignore = res;
std::ignore = src;
std::ignore = stride;
throw runtime_error(
"This version of the matrix extension is only currently supported on "
"Nvidia devices",
PI_ERROR_INVALID_DEVICE);
#endif // defined(__SYCL_DEVICE_ONLY__)
}

template <
typename Group, typename S, typename T, use Use, size_t NumRows,
size_t NumCols, matrix::layout Layout, access::address_space Space,
access::decorated IsDecorated,
std::enable_if_t<std::is_same<S, std::remove_const_t<T>>::value ||
(std::is_same<S, precision::tf32>::value &&
std::is_same<std::remove_const_t<T>, float>::value),
bool> = true>
inline __SYCL_ALWAYS_INLINE void
joint_matrix_load(Group sg,
joint_matrix<Group, S, Use, NumRows, NumCols, Layout> &res,
multi_ptr<T, Space, IsDecorated> src, size_t stride) {
#if defined(__SYCL_DEVICE_ONLY__)
#if defined(__NVPTX__)
std::ignore = sg;
sycl::ext::oneapi::detail::load_multiplicand_cuda<S, T, NumRows, NumCols, Use,
Layout, Space>(
res.cuda_impl, src, stride);
#endif // defined(__NVPTX__)
#else
std::ignore = sg;
std::ignore = res;
std::ignore = src;
std::ignore = stride;
throw runtime_error(
"This version of the matrix extension is only currently supported on "
"Nvidia devices",
PI_ERROR_INVALID_DEVICE);
#endif // defined(__SYCL_DEVICE_ONLY__)
}

template <typename Group, typename T, size_t NumRows, size_t NumCols,
access::address_space Space, access::decorated IsDecorated>
inline __SYCL_ALWAYS_INLINE void joint_matrix_store(
Group sg,
joint_matrix<Group, T, use::accumulator, NumRows, NumCols,
sycl::ext::oneapi::experimental::matrix::layout::dynamic> &src,
multi_ptr<T, Space, IsDecorated> dst, size_t stride,
sycl::ext::oneapi::experimental::matrix::layout Layout) {
#if defined(__SYCL_DEVICE_ONLY__)
#if defined(__NVPTX__)
std::ignore = sg;
sycl::ext::oneapi::detail::joint_matrix_store_cuda<T, NumRows, NumCols,
Space>(src.cuda_impl, dst,
stride, Layout);
#endif // defined(__NVPTX__)
#else
std::ignore = sg;
std::ignore = src;
std::ignore = dst;
std::ignore = stride;
throw runtime_error(
"This version of the matrix extension is only currently supported on "
"Nvidia devices",
PI_ERROR_INVALID_DEVICE);
#endif // defined(__SYCL_DEVICE_ONLY__)
}

template <typename Group, typename Ta, typename Tb, typename Tc, std::size_t M,
std::size_t K, std::size_t N, layout LayoutA, layout LayoutB>
inline __SYCL_ALWAYS_INLINE
joint_matrix<Group, Tc, use::accumulator, M, N,
sycl::ext::oneapi::experimental::matrix::layout::dynamic>
joint_matrix_mad(
Group sg, joint_matrix<Group, Ta, use::a, M, K, LayoutA> &A,
joint_matrix<Group, Tb, use::b, K, N, LayoutB> &B,
joint_matrix<Group, Tc, use::accumulator, M, N,
sycl::ext::oneapi::experimental::matrix::layout::dynamic>
&C) {
#if defined(__SYCL_DEVICE_ONLY__)
#if defined(__NVPTX__)
std::ignore = sg;
if constexpr (std::is_same<Ta, Tb>::value) {
joint_matrix<Group, Tc, use::accumulator, M, N,
sycl::ext::oneapi::experimental::matrix::layout::dynamic>
D;
sycl::ext::oneapi::detail::joint_matrix_mad_cuda<Ta, Tc, M, K, N, LayoutA,
LayoutB>(
D.cuda_impl, A.cuda_impl, B.cuda_impl, C.cuda_impl);
return D;
} else {
assert(false && "Ta != Tb : In the CUDA backend joint_matrix_mad "
"requires that joint_matrix data types Ta and Tb match");
}
#endif // defined(__NVPTX__)
#else
std::ignore = sg;
std::ignore = A;
std::ignore = B;
std::ignore = C;
throw runtime_error(
"This version of the matrix extension is only currently supported on "
"Nvidia devices",
PI_ERROR_INVALID_DEVICE);
#endif // defined(__SYCL_DEVICE_ONLY__)
}

// This function rounds the bottom 13 bits up or down, and then zeros out the
// bottom bits
inline __SYCL_ALWAYS_INLINE float round_to_tf32(float &a) {
#if defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
int32_t tmp_int = __nvvm_f2tf32_rna(a);
return __nvvm_bitcast_i2f(tmp_int);
#else
uint32_t tmp_uint = reinterpret_cast<uint32_t &>(a);
tmp_uint += 0x1000u;
tmp_uint &= 0xFFFFE000u;
float ret = reinterpret_cast<float &>(tmp_uint);
return ret;
#endif // defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
}
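
The host fallback above adds 0x1000 (bit 12 of the mantissa) and then masks off the low 13 bits, which rounds to the 10 mantissa bits that tf32 keeps. A hand-checked illustration of that arithmetic (the device path uses __nvvm_f2tf32_rna instead, which may differ on ties):

float a = 1.00048828125f;     // 1 + 2^-11: mantissa bit 12 set
float b = 1.000244140625f;    // 1 + 2^-12: below the rounding threshold
float ra = round_to_tf32(a);  // rounds up to 1.0009765625 (1 + 2^-10)
float rb = round_to_tf32(b);  // rounds down to 1.0f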

} // namespace matrix
} // namespace experimental
} // namespace oneapi
} // namespace ext
} // __SYCL_INLINE_VER_NAMESPACE(_V1)
} // namespace sycl
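
Putting the pieces of matrix-unified.hpp together, a sub-group tile update with the unified interface on the CUDA backend would look roughly like the sketch below. Tile shapes, element types, pointer and stride names, and the surrounding kernel are illustrative assumptions (a device supporting the 16x16x16 half shape is assumed), not code from this PR:

using namespace sycl::ext::oneapi::experimental::matrix;

// Per sub-group tile update: C (MxN) += A (MxK) * B (KxN).
constexpr size_t M = 16, K = 16, N = 16;
sycl::sub_group sg = item.get_sub_group();

joint_matrix<sycl::sub_group, sycl::half, use::a, M, K, layout::row_major> sub_a;
joint_matrix<sycl::sub_group, sycl::half, use::b, K, N, layout::row_major> sub_b;
joint_matrix<sycl::sub_group, float, use::accumulator, M, N, layout::dynamic> sub_c;

joint_matrix_fill(sg, sub_c, 0.0f);
for (size_t k = 0; k < big_k; k += K) {
  joint_matrix_load(sg, sub_a, ptr_a + /* A tile offset */ 0, stride_a);
  joint_matrix_load(sg, sub_b, ptr_b + /* B tile offset */ 0, stride_b);
  sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
}
joint_matrix_store(sg, sub_c, ptr_c + /* C tile offset */ 0, stride_c,
                   layout::row_major);
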
5 changes: 4 additions & 1 deletion sycl/include/sycl/ext/oneapi/matrix/matrix.hpp
@@ -27,5 +27,8 @@
#include <sycl/ext/oneapi/matrix/static-query-use.hpp>
#endif // SYCL_EXT_ONEAPI_MATRIX_VERSION
#if (SYCL_EXT_ONEAPI_MATRIX_VERSION == 3)
#include <sycl/ext/oneapi/matrix/matrix-tensorcore.hpp>
#include <sycl/ext/oneapi/matrix/matrix-tensorcores-legacy.hpp>
#endif // SYCL_EXT_ONEAPI_MATRIX_VERSION
#if (SYCL_EXT_ONEAPI_MATRIX_VERSION == 4)
#include <sycl/ext/oneapi/matrix/matrix-unified.hpp>
#endif // SYCL_EXT_ONEAPI_MATRIX_VERSION