[SYCL] Move bfloat support from experimental to supported. #6524

Merged — 99 commits, Nov 28, 2022

Changes from 8 commits

Commits
6014cef
[SYCL] Move bfloat support from experimental to supported.
rdeodhar Aug 3, 2022
bdd88e5
Corrections to tests.
rdeodhar Aug 3, 2022
73ed541
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Aug 24, 2022
0fe1884
Moved another file out of experimental space.
rdeodhar Aug 24, 2022
feb9d5f
Responses to review comments.
rdeodhar Aug 25, 2022
129f53f
Removed unneeded sycl::half conversion and updated doc.
rdeodhar Aug 26, 2022
2115f09
Added conversion from sycl::half to bfloat16.
rdeodhar Aug 29, 2022
3c2eb80
Cleanup of documentation.
rdeodhar Aug 31, 2022
74aa175
Hooked up bfloat16 aspect within OpenCL plugin.
rdeodhar Sep 2, 2022
bd05711
Support for bfloat16 aspect, and native or fallback support.
rdeodhar Sep 8, 2022
f8e894c
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Sep 8, 2022
2ad68f6
Formatting changes.
rdeodhar Sep 8, 2022
4b78c03
Formatting changes.
rdeodhar Sep 8, 2022
0fce16d
Update to documentation.
rdeodhar Sep 8, 2022
4bcb383
Deprecate bfloat16 aspect.
rdeodhar Sep 8, 2022
35308f8
Fixes for ESIMD.
rdeodhar Sep 9, 2022
fa045e2
Reinstated to_float and from_float, used by NVidia, updated doc.
rdeodhar Sep 9, 2022
3322d6a
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Sep 12, 2022
b12fd94
Update to doc.
rdeodhar Sep 12, 2022
87b0f09
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Sep 14, 2022
f217eb4
Corrections to headers.
rdeodhar Sep 14, 2022
a908b11
Formatting change.
rdeodhar Sep 14, 2022
aab4c78
bfloat16 class supports all sm_xx devices.
Sep 15, 2022
a2568ba
Merge pull request #1 from JackAKirk/bfloat16-cuda-allarch
rdeodhar Sep 15, 2022
4d7a22b
Changes to keep bfloat math functions experimental for now.
rdeodhar Sep 16, 2022
38e5ad4
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Sep 16, 2022
b9accad
Cleanup of bfloat16_math extension.
rdeodhar Sep 16, 2022
ca7880a
Document updates and minor changes.
rdeodhar Sep 19, 2022
dc3b2b5
Fixes for long lines in doc, a different way to check for NaN.
rdeodhar Sep 19, 2022
c955d36
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Sep 20, 2022
1aa6ad3
Broke long lines into multiple lines.
rdeodhar Sep 20, 2022
ff04ce1
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Sep 21, 2022
802f502
Changed library order on Windows.
rdeodhar Sep 21, 2022
8d7f46a
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Sep 22, 2022
190f2a3
Fix for AOT compilation and correction to new headers.
rdeodhar Sep 22, 2022
84c50f3
Noted AOT limitation in doc.
rdeodhar Sep 23, 2022
df058ba
Adjustment for AOT compilation.
rdeodhar Sep 24, 2022
fed4d1d
Fixes for AOT builds.
rdeodhar Sep 26, 2022
28259d0
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Sep 26, 2022
c11115b
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Sep 26, 2022
6b05a2a
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Sep 27, 2022
a82d73a
Fixes for AOT multiple devices.
rdeodhar Sep 27, 2022
3fc8885
Updated documentation.
rdeodhar Sep 27, 2022
1ec6838
Added back missing Status section in documentation.
rdeodhar Sep 27, 2022
105094b
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Sep 27, 2022
432e775
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Sep 29, 2022
c135643
Added tests, corrected aspect check.
rdeodhar Oct 1, 2022
4eca414
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Oct 1, 2022
8876ac8
Added missing newlines.
rdeodhar Oct 3, 2022
f0f2727
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Oct 3, 2022
17673bf
Corrections to tests and macros, added host code emulation.
rdeodhar Oct 4, 2022
1094b8c
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Oct 4, 2022
8d40228
Small corrections.
rdeodhar Oct 4, 2022
c5a85cf
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Oct 4, 2022
cf8f6e0
Fixes for AOT.
rdeodhar Oct 4, 2022
5e50646
Formatting change.
rdeodhar Oct 4, 2022
45d3e70
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Oct 5, 2022
a7be718
Renamed the bfloat aspects.
rdeodhar Oct 5, 2022
cac1c18
Fixes for generic JIT compilation.
rdeodhar Oct 6, 2022
208c09a
Changes for AOT sycl-targets switch.
rdeodhar Oct 6, 2022
46f406d
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Oct 6, 2022
6830857
Corrected aspects queries.
rdeodhar Oct 6, 2022
46e5278
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Oct 6, 2022
10fc9a3
Change in the way fallback/native libs are selected.
rdeodhar Oct 8, 2022
6195545
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Oct 8, 2022
437e34a
Changed type of string.
rdeodhar Oct 10, 2022
09dc4c5
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Oct 12, 2022
386353e
Replaced bfloat16 aspect with bfloat16_math_functions aspect.
rdeodhar Oct 12, 2022
0f93586
Improved devices check in clang driver.
rdeodhar Oct 13, 2022
48f3cac
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Oct 13, 2022
d33cb10
Enhanced test for improved bfloat16 target detection.
rdeodhar Oct 13, 2022
28992c2
Updated bfloat16 driver test for windows.
rdeodhar Oct 13, 2022
ec28c8b
Use STL for parsing devices.
rdeodhar Oct 13, 2022
b958fc7
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Oct 24, 2022
ec70b20
Allow spir64 target to be JIT even when combined with AOT targets.
rdeodhar Oct 24, 2022
1b86012
Updated documentation.
rdeodhar Oct 24, 2022
3e1e681
Modifications for mixed JIT and AOT compilations, added tests.
rdeodhar Oct 25, 2022
8c633d3
Corrections to comments.
rdeodhar Oct 25, 2022
1a59e03
Update to documentation.
rdeodhar Oct 25, 2022
b2fd6cc
Updated doc.
rdeodhar Oct 25, 2022
fab2e54
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Oct 26, 2022
35b8910
Adjustments to tests.
rdeodhar Oct 27, 2022
a05c872
Test cleanup.
rdeodhar Oct 27, 2022
ac5f603
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Oct 27, 2022
6d45ed1
Adjustments to more tests.
rdeodhar Oct 27, 2022
077d0fe
Change to tests to ensure AOT components are available.
rdeodhar Oct 28, 2022
2ff6a9d
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Nov 7, 2022
d7c80ee
Adjustment to test for new bfloat16 header.
rdeodhar Nov 7, 2022
20d13df
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Nov 8, 2022
cd1d0a2
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Nov 15, 2022
4bf60b9
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Nov 18, 2022
45c32f7
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Nov 21, 2022
5de1bf7
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Nov 22, 2022
6ec2bb9
Changes for indirect accesses.
rdeodhar Nov 22, 2022
49e9cd1
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Nov 22, 2022
2065060
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Nov 23, 2022
e24e57b
Fixed conflicts.
rdeodhar Nov 23, 2022
41098ab
Merge branch 'sycl' of https://github.com/intel/llvm into bfloat16
rdeodhar Nov 25, 2022
37b05f0
Correction to library list.
rdeodhar Nov 25, 2022
@@ -22,53 +22,43 @@

== Notice

IMPORTANT: This specification is a draft.
[%hardbreaks]
Copyright (C) 2022-2022 Intel Corporation. All rights reserved.

Copyright (c) 2021-2022 Intel Corporation. All rights reserved.
Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks
of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by
permission by Khronos.

NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are
trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc.
used by permission by Khronos.

== Dependencies
== Contact

This extension is written against the SYCL 2020 specification, Revision 4.
To report problems with this extension, please open a new issue at:

== Status
https://github.com/intel/llvm/issues

Draft

This is a preview extension specification, intended to provide early access to
a feature for review and community feedback. When the feature matures, this
specification may be released as a formal extension.
== Dependencies

Because the interfaces defined by this specification are not final and are
subject to change they are not intended to be used by shipping software
products.
This extension is written against the SYCL 2020 specification, Revision 5.

== Version
== Status

Revision: 5
This extension is implemented and fully supported by DPC++.
[NOTE]
====
This extension is currently implemented in `dpcpp` only for GPU devices that support `bfloat16` natively. Attempting to use this extension in
kernels that run on other devices may result in undefined behavior.
Be aware that the compiler is not able to issue a diagnostic to warn you if this happens.
====

== Introduction
== Overview

This extension adds functionality to convert value of single-precision
floating-point type(`float`) to `bfloat16` type and vice versa. The extension
doesn't add support for `bfloat16` type as such, instead it uses 16-bit integer
type(`uint16_t`) as a storage for `bfloat16` values.
This extension adds support for a 16-bit floating point type `bfloat16`. This type occupies 16 bits of storage space as does the `sycl::half` type. However, `bfloat16` allots 8 bits to the exponent instead of the 5 bits used by `sycl::half`, and 7 bits to the significand versus 10 bits used by `sycl::half`. Thus, `bfloat16` has the same dynamic range as a 32-bit `float` but with reduced precision. This type is useful when the memory required to store the values must be reduced, and when the calculations require high dynamic range but can tolerate lower precision. Some implementations may still perform operations on this type using 32-bit math. For example, they may convert the `bfloat16` value to `float`, and then perform the operation on the 32-bit `float`.
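For illustration only — this is not necessarily how any implementation performs
the conversion — the relationship between the two formats can be sketched in
plain C++, since `bfloat16` corresponds to the upper 16 bits of an IEEE 754
`binary32` value:

[source,c++]
----
#include <cstdint>
#include <cstring>

// Hypothetical helpers, not part of the extension: bfloat16 keeps the sign
// bit, the full 8-bit float exponent, and the top 7 mantissa bits of a
// binary32 value. NaN special-casing is omitted for brevity.
inline std::uint16_t float_to_bf16_rte(float f) {
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  // Round to nearest even: add 0x7FFF plus the lowest kept mantissa bit.
  bits += 0x7FFF + ((bits >> 16) & 1u);
  return static_cast<std::uint16_t>(bits >> 16);
}

inline float bf16_to_float(std::uint16_t b) {
  std::uint32_t bits = static_cast<std::uint32_t>(b) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f; // widening is exact: the dropped mantissa bits are zero
}
----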
Contributor: Some of the new text you added has very long lines. Please
respect the 80-column limit as documented in the template.

Contributor Author: OK, done.


The purpose of conversion from float to bfloat16 is to reduce the amount of memory
required to store floating-point numbers. Computations are expected to be done with
32-bit floating-point values.

This extension is an optional kernel feature as described in
https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:optional-kernel-features[section 5.7]
of the SYCL 2020 spec. Therefore, attempting to submit a kernel using this
feature to a device that does not support it should cause a synchronous
`errc::kernel_not_supported` exception to be thrown from the kernel invocation
command (e.g. from `parallel_for`).
== Specification

== Feature test macro
=== Feature test macro

This extension provides a feature-test macro as described in the core SYCL
specification section 6.3.3 "Feature test macros". Therefore, an implementation
@@ -84,7 +74,7 @@ the implementation supports this feature, or applications can test the macro’s
|1 |Initial extension version. Base features are supported.
|===
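As a quick sketch of the intended usage — assuming the macro is named
`SYCL_EXT_ONEAPI_BFLOAT16` following the extension name (verify the exact
spelling against the implementation's headers):

[source,c++]
----
// Guard bfloat16-specific code on the feature-test macro. The macro name
// here is assumed from the extension name, not confirmed by this document.
#ifdef SYCL_EXT_ONEAPI_BFLOAT16
using elem_t = sycl::ext::oneapi::bfloat16;
#else
using elem_t = float; // portable fallback type
#endif
----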

== Extension to `enum class aspect`
=== Extension to `enum class aspect`

[source]
----
@@ -99,49 +89,47 @@ enum class aspect {
If a SYCL device has the `ext_oneapi_bfloat16` aspect, then it natively
supports conversion of values of `float` type to `bfloat16` and back.

If the device doesn't have the aspect, objects of `bfloat16` class must not be
used in the device code.
This extension is an optional kernel feature as described in section 5.7 of the SYCL 2020 spec, with the associated aspect `ext_oneapi_bfloat16`. Applications can query whether the device has this aspect to determine if it supports kernels that use `bfloat16`. Attempting to submit a kernel using `bfloat16` to a device that does not support it causes a synchronous `errc::kernel_not_supported` exception to be thrown from the kernel invocation command (e.g. from `parallel_for`).

**NOTE**: The `ext_oneapi_bfloat16` aspect is not yet supported. The
`bfloat16` class is currently supported only on Xe HP GPU and Nvidia GPUs with Compute Capability >= SM80.
[NOTE]
====
. DPC++ does not currently implement the `errc::kernel_not_supported` exception in this case. Attempting to submit a kernel using `bfloat16` to a device that does not have the `ext_oneapi_bfloat16` aspect results in undefined behavior.
. The `bfloat16` class is currently supported only on Xe HP GPUs and Nvidia GPUs with Compute Capability >= SM80.
====
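A minimal sketch of the guarded-submission pattern implied above (keeping in
mind the note that DPC++ does not yet throw `errc::kernel_not_supported` in
this situation, so the aspect query is the reliable guard):

[source,c++]
----
// Sketch: only submit a bfloat16 kernel when the device reports the aspect.
sycl::queue q;
if (q.get_device().has(sycl::aspect::ext_oneapi_bfloat16)) {
  q.single_task([=] {
    // ... kernel code that uses bfloat16 ...
  });
} else {
  // Fall back to a float-based kernel on devices without the aspect.
}
----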

== New `bfloat16` class

The `bfloat16` class below provides the conversion functionality. Conversion
from `float` to `bfloat16` is done with round to nearest even(RTE) rounding
mode.
=== New `bfloat16` class

The `bfloat16` type represents a 16-bit floating point value. Conversions from `float` to `bfloat16` are done with round to nearest even (RTE) rounding mode.

[source]
----
namespace sycl {
namespace ext {
namespace oneapi {
namespace experimental {

class bfloat16 {
using storage_t = uint16_t;
storage_t value;

public:
bfloat16() = default;
bfloat16(const bfloat16 &) = default;
~bfloat16() = default;

// Explicit conversion functions
static storage_t from_float(const float &a);
static float to_float(const storage_t &a);

// Convert from float to bfloat16
bfloat16(const float &a);
bfloat16 &operator=(const float &a);

// Convert from bfloat16 to float
// Convert bfloat16 to float
operator float() const;

// Convert from sycl::half to bfloat16
bfloat16(const sycl::half &a);
bfloat16 &operator=(const sycl::half &a);

// Get bfloat16 as uint16.
operator storage_t() const;
// Convert bfloat16 to sycl::half
operator sycl::half() const;
Contributor: I like this conversion to sycl::half. However, we should also add the opposite conversion from sycl::half to bfloat16:

bfloat16(const sycl::half &a);
bfloat16 &operator=(const sycl::half &a);

Do we also need conversion to / from double?

Contributor Author: This PR is intended to move the current bfloat16 support out of experimental space. Any changes to the level of bfloat16 support can be done in future PRs.

Contributor Author: On Intel platforms the bfloat16 to/from float conversion is done using the __spirv_ConvertBF16ToFINTEL operator. I suspect a double version of that does not exist. Float to double conversion can be made in the usual C++ way more efficiently in hardware. A direct version of bfloat16 to double conversion in software would involve more bit twiddling than the float conversion, where only trailing 0 bits of the fraction need to be inserted.

Contributor Author: The sycl::half class includes conversions to/from float. Those kick in when bfloat16 is used with sycl::half, so conversions between bfloat16 and sycl::half are not needed.

Contributor: Are you saying that we should remove this conversion from bfloat16 to sycl::half?

Contributor Author: Yes, it's not needed.

Contributor Author: This item was revisited and it turns out that sycl::half <-> bfloat16 conversions are needed. They have been added.

@MrSidims (Contributor, Sep 20, 2022): Sorry for joining the discussion late. Maybe it's a nitpick, but should we state that the half <-> bfloat16 conversion follows the IEEE 754 float <-> half conversion? In other words, what happens if a bfloat16 value overflows the half range? Also, are the last 3 fraction bits filled stochastically, or are they guaranteed to be zero (or is that an implementation detail)?


// Convert to bool type
// Convert bfloat16 to bool type
explicit operator bool();

friend bfloat16 operator-(bfloat16 &bf) { /* ... */ }
@@ -170,7 +158,6 @@ public:
friend bool operatorOP(const T &lhs, const bfloat16 &rhs) { /* ... */ }
};

} // namespace experimental
} // namespace oneapi
} // namespace ext
} // namespace sycl
@@ -180,12 +167,6 @@ Table 1. Member functions of `bfloat16` class.
|===
| Member Function | Description

| `static storage_t from_float(const float &a);`
| Explicitly convert from `float` to `bfloat16`.

| `static float to_float(const storage_t &a);`
| Interpret `a` as `bfloat16` and explicitly convert it to `float`.

| `bfloat16(const float& a);`
| Construct `bfloat16` from `float`. Converts `float` to `bfloat16`.

@@ -195,11 +176,17 @@ Table 1. Member functions of `bfloat16` class.
| `operator float() const;`
| Return `bfloat16` value converted to `float`.

| `operator storage_t() const;`
| Return `uint16_t` value, whose bits represent `bfloat16` value.
| `bfloat16(const sycl::half& a);`
| Construct `bfloat16` from `sycl::half`. Converts `sycl::half` to `bfloat16`.

| `bfloat16 &operator=(const sycl::half &a);`
| Replace the value with `a` converted to `bfloat16`.

| `operator sycl::half() const;`
| Return `bfloat16` value converted to `sycl::half`.

| `explicit operator bool() { /* ... */ }`
| Convert `bfloat16` to `bool` type. Return `false` if the value equals to
| Convert `bfloat16` to `bool` type. Return `false` if the `value` equals
zero, return `true` otherwise.

| `friend bfloat16 operator-(bfloat16 &bf) { /* ... */ }`
@@ -253,85 +240,87 @@ Table 1. Member functions of `bfloat16` class.
| Perform comparison operation OP between `lhs` `bfloat16` and `rhs` `bfloat16`
values and return the result as a boolean value.

OP is `==, !=, <, >, <=, >=`
OP is `+==, !=, <, >, <=, >=+`

| `template <typename T>
friend bool operatorOP(const bfloat16 &lhs, const T &rhs) { /* ... */ }`
| Perform comparison operation OP between `lhs` `bfloat16` and `rhs` of
template type `T` and return the result as a boolean value. Type `T` must be
convertible to `float`.

OP is `==, !=, <, >, <=, >=`
OP is `+==, !=, <, >, <=, >=+`

| `template <typename T>
friend bool operatorOP(const T &lhs, const bfloat16 &rhs) { /* ... */ }`
| Perform comparison operation OP between `lhs` of template type `T` and `rhs`
`bfloat16` value and return the result as a boolean value. Type `T` must be
convertible to `float`.

OP is `==, !=, <, >, <=, >=`
OP is `+==, !=, <, >, <=, >=+`
|===
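A brief sketch of how the conversion members listed in Table 1 compose — the
`sycl::half` conversions added in this revision interoperate with the existing
`float` conversions:

[source,c++]
----
// Sketch: exercising the Table 1 conversion members.
sycl::half h{0.5f};
sycl::ext::oneapi::bfloat16 b{h};    // construct from sycl::half
b = sycl::half{1.25f};               // assign from sycl::half
sycl::half h2 = b;                   // implicit operator sycl::half()
float f = b;                         // implicit operator float()
bool nonzero = static_cast<bool>(b); // explicit operator bool()
----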

== Example

[source]
----
#include <sycl/sycl.hpp>
#include <sycl/ext/oneapi/experimental/bfloat16.hpp>

using sycl::ext::oneapi::experimental::bfloat16;

bfloat16 operator+(const bfloat16 &lhs, const bfloat16 &rhs) {
return static_cast<float>(lhs) + static_cast<float>(rhs);
}
using namespace sycl;
using sycl::ext::oneapi::bfloat16;

float foo(float a, float b) {
// Convert from float to bfloat16.
bfloat16 A {a};
bfloat16 B {b};
bfloat16 A{a};
bfloat16 B{b};

// Convert A and B from bfloat16 to float, do addition on floating-pointer
// Convert A and B from bfloat16 to float, do addition on floating-point
// numbers, then convert the result to bfloat16 and store it in C.
bfloat16 C = A + B;

// Return the result converted from bfloat16 to float.
return C;
}

int main (int argc, char *argv[]) {
int main(int argc, char *argv[]) {
float data[3] = {7.0, 8.1, 0.0};
sycl::device dev;
sycl::queue deviceQueue{dev};
sycl::buffer<float, 1> buf {data, sycl::range<1> {3}};

if (dev.has(sycl::aspect::ext_oneapi_bfloat16)) {
deviceQueue.submit ([&] (sycl::handler& cgh) {
auto numbers = buf.get_access<sycl::access::mode::read_write> (cgh);
cgh.single_task<class simple_kernel> ([=] () {
numbers[2] = foo(numbers[0], numbers[1]);
});
device dev{gpu_selector()};
queue deviceQueue{dev};
buffer<float, 1> buf{data, 3};

if (dev.has(aspect::ext_oneapi_bfloat16)) {
deviceQueue.submit([&](handler &cgh) {
accessor numbers{buf, cgh, read_write};
cgh.single_task([=]() { numbers[2] = foo(numbers[0], numbers[1]); });
});
} else {
std::cout << "No bfloat16 support\n";
return 1;
}
host_accessor hostOutAcc{buf, read_only};
std::cout << "Result = " << hostOutAcc[2] << std::endl;
return 0;
}
----
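Note that `foo` returns `C` through the implicit `operator float()`, and the
result is read back on the host through `hostOutAcc`. With a DPC++ toolchain
this example would typically be built with something like
`clang++ -fsycl example.cpp`, though the exact driver name and flags vary by
distribution.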

== New bfloat16 math functions

Many applications will require dedicated functions that take parameters of type `bfloat16`. This extension adds `bfloat16` support to the `fma`, `fmin`, `fmax` and `fabs` SYCL floating point math functions. These functions can be used as element wise operations on matrices, supplementing the `bfloat16` support in the sycl_ext_oneapi_matrix extension.
Many applications will require dedicated functions that take parameters of type `bfloat16`. This extension adds `bfloat16` support to the `fma`, `fmin`, `fmax` and `fabs` SYCL floating point math functions. These functions can be used as element-wise operations on matrices, supplementing the `bfloat16` support in the `sycl_ext_oneapi_matrix` extension.

The descriptions of the `fma`, `fmin`, `fmax` and `fabs` SYCL floating point math functions can be found in the SYCL specification: https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#_math_functions.

The following functions are only available when `T` is `bfloat16` or `sycl::marray<bfloat16, {N}>`, where `{N}` means any positive value of `size_t` type.
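Before the individual declarations, a hedged usage sketch — assuming the scalar
and `marray` overloads declared in the sections below compile as shown:

```c++
// Sketch: element-wise bfloat16 math, using the overloads declared below.
// The enclosing kernel/queue setup is omitted; N == 3 is arbitrary.
using sycl::ext::oneapi::bfloat16;
namespace eo = sycl::ext::oneapi;

bfloat16 a{1.0f}, b{2.0f}, c{-3.0f};
bfloat16 r = eo::fma(a, b, c);         // 1*2 + (-3) == 0
bfloat16 m = eo::fmax(eo::fabs(c), b); // max(|-3|, 2) == 3

sycl::marray<bfloat16, 3> va{a, a, a}, vb{b, b, b}, vc{c, c, c};
sycl::marray<bfloat16, 3> vr = eo::fma(va, vb, vc); // element-wise fma
```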


=== fma

```c++
namespace sycl::ext::oneapi::experimental {
namespace sycl::ext::oneapi {

template <typename T>
T fma(T a, T b, T c);
} // namespace sycl::ext::oneapi::experimental
bfloat16 fma(bfloat16 a, bfloat16 b, bfloat16 c);

template<size_t N>
marray<bfloat16, N> fma(marray<bfloat16, N> a, marray<bfloat16, N> b, marray<bfloat16, N> c);

} // namespace sycl::ext::oneapi
```

==== Description
@@ -342,10 +331,14 @@ Rounding of intermediate products shall not occur. The mantissa LSB rounds to th
=== fmax

```c++
namespace sycl::ext::oneapi::experimental {
template <typename T>
T fmax(T x, T y);
} // namespace sycl::ext::oneapi::experimental
namespace sycl::ext::oneapi {

bfloat16 fmax(bfloat16 x, bfloat16 y);

template<size_t N>
marray<bfloat16, N> fmax(marray<bfloat16, N> x, marray<bfloat16, N> y);

} // namespace sycl::ext::oneapi
```

==== Description
@@ -360,28 +353,34 @@ NaNs, `fmax()` returns a NaN.
=== fmin

```c++
namespace sycl::ext::oneapi::experimental {
template <typename T>
T fmin(T x, T y);
} // namespace sycl::ext::oneapi::experimental
namespace sycl::ext::oneapi {

bfloat16 fmin(bfloat16 a, bfloat16 b);

template<size_t N>
marray<bfloat16, N> fmin(marray<bfloat16, N> a, marray<bfloat16, N> b);

} // namespace sycl::ext::oneapi
```

==== Description

Returns `y` if
`y < x`, otherwise it
returns `x`. If one argument is a
NaN, `fmax()` returns the other
NaN, `fmin()` returns the other
argument. If both arguments are
NaNs, `fmax()` returns a NaN.
NaNs, `fmin()` returns a NaN.

=== fabs

```c++
namespace sycl::ext::oneapi::experimental {
namespace sycl::ext::oneapi {

template <typename T>
T fabs(T x);
} // namespace sycl::ext::oneapi::experimental

} // namespace sycl::ext::oneapi
```

==== Description
@@ -408,4 +407,5 @@ Compute absolute value of a `bfloat16`.
|3|2021-08-18|Alexey Sotkin |Remove `uint16_t` constructor
|4|2022-03-07|Aidan Belton and Jack Kirk |Switch from Intel vendor specific to oneapi
|5|2022-04-05|Jack Kirk | Added section for bfloat16 math builtins
|6|2022-08-24|Rajiv Deodhar |Move bfloat16 from experimental to supported
|========================================