portable DP4A function #16848

### Is your feature request related to a problem? Please describe

https://github.com/intel/llvm/discussions/3682

### Describe the solution you would like

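A portable `sycl::dp4a(a, b, c)` function that returns `c` plus the dot product of the four packed signed 8-bit lanes of `a` and `b`.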
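
To make the intended semantics concrete, here is a minimal host-side sketch in plain C++. It assumes the signed 8-bit variant and no saturation, matching the fallback path shown under the alternatives. The names `dp4a_ref` and `pack4` are hypothetical and are not part of SYCL or any existing library.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not a SYCL API): packs four signed bytes into one
// 32-bit word in memory order, so tests do not depend on endianness.
inline std::int32_t pack4(std::int8_t x0, std::int8_t x1,
                          std::int8_t x2, std::int8_t x3) {
    const std::int8_t lanes[4] = {x0, x1, x2, x3};
    std::int32_t word;
    std::memcpy(&word, lanes, sizeof(word));
    return word;
}

// Hypothetical reference semantics for the requested function: treat a and b
// as four packed signed 8-bit lanes, multiply lane-wise, and add the four
// products to the 32-bit accumulator c.
inline std::int32_t dp4a_ref(std::int32_t a, std::int32_t b, std::int32_t c) {
    std::int8_t va[4];
    std::int8_t vb[4];
    std::memcpy(va, &a, sizeof(va));  // byte reinterpretation without aliasing issues
    std::memcpy(vb, &b, sizeof(vb));
    for (int i = 0; i < 4; ++i) {
        c += static_cast<std::int32_t>(va[i]) * static_cast<std::int32_t>(vb[i]);
    }
    return c;
}

// Usage example: 1*5 + 2*6 + 3*7 + 4*8 = 70, plus the accumulator 10.
inline void dp4a_ref_example() {
    assert(dp4a_ref(pack4(1, 2, 3, 4), pack4(5, 6, 7, 8), 10) == 80);
}
```

A real implementation would presumably dispatch to a hardware instruction where one exists and fall back to a loop like this elsewhere. Whether unsigned and mixed-sign variants are also needed is left open here.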

### Describe alternatives you have considered

```
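// HIP-style workaround for AMD GPUs. int8x4_t is not a standard type, so it is
// declared here as a Clang extended vector of four int8_t.
#include <cstdint>
typedef int8_t int8x4_t __attribute__((ext_vector_type(4)));

static __device__ __forceinline__ int __dp4a(const int a, const int b, int c) {
#if __has_builtin(__builtin_amdgcn_sdot4)
    // Hardware signed 4x8-bit dot product with accumulate; `false` disables clamping.
    c = __builtin_amdgcn_sdot4(a, b, c, false);
#else
    // Generic path: view each 32-bit int as four signed bytes and accumulate the products.
    const int8x4_t va = reinterpret_cast<const int8x4_t&>(a);
    const int8x4_t vb = reinterpret_cast<const int8x4_t&>(b);
    c += va[0] * vb[0] + va[1] * vb[1] + va[2] * vb[2] + va[3] * vb[3];
#endif
    return c;
}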
```

### Additional context

https://developer.nvidia.com/blog/mixed-precision-programming-cuda-8/
