Unit tests and benchmark for subgroup2 and workgroup2 stuff #192

Open: wants to merge 76 commits into base: master

76 commits
8090a2d
initial benchmark example copy
keptsecret Mar 27, 2025
3a2ff14
test subgroup2 funcs correct
keptsecret Mar 31, 2025
dd021a0
fix test
keptsecret Mar 31, 2025
ca21941
benchmarking shader + pipeline working
keptsecret Apr 1, 2025
0bb41db
begin adding fake frames for nsight profiler
keptsecret Apr 2, 2025
24a93bb
merge master, fix conflicts
keptsecret Apr 7, 2025
17dda8e
re-numbered example to avoid duplicate
keptsecret Apr 7, 2025
3d4e0f2
fake frames for nsight
keptsecret Apr 8, 2025
0192999
use correct shader, spirv line dbinfo for nsight
keptsecret Apr 8, 2025
8c9d55e
support for 1 item per invoc
keptsecret Apr 8, 2025
07d6980
handle when items per invoc =1
keptsecret Apr 9, 2025
be756d5
minor fixes
keptsecret Apr 10, 2025
1963b51
changes in Param, Config usage
keptsecret Apr 10, 2025
99cf5d8
coalesced load/store data
keptsecret Apr 21, 2025
1d5e433
Merge branch 'master' into scan_perf_bench
keptsecret Apr 21, 2025
a3bb526
fixed some bugs
keptsecret Apr 21, 2025
355c605
disable test by default
keptsecret Apr 21, 2025
6b57674
refactor to load data as vectors, consecutive uints
keptsecret Apr 25, 2025
7da1bec
initial wg scan test
keptsecret Apr 28, 2025
750b3d2
working? test for workgroup2 reduce
keptsecret Apr 28, 2025
f11b3df
fixes to test
keptsecret Apr 29, 2025
9f690ee
tests with multiple items per invoc
keptsecret Apr 29, 2025
755f89a
inclusive scan test
keptsecret Apr 29, 2025
b8415ad
exclusive scan test, remove comments
keptsecret Apr 30, 2025
474281d
benchmark shader, new common header
keptsecret May 1, 2025
7d06332
test smaller workgroup sizes
keptsecret May 2, 2025
874557c
expanded scratch proxy funcs
keptsecret May 2, 2025
28ea75f
simplify scratch,proxy to just scalar types
keptsecret May 5, 2025
e8c2831
move all tests into new example
keptsecret May 7, 2025
93b4d0b
Merge branch 'master' into new_wg_scan_test
keptsecret May 7, 2025
2ba2b82
workgroup scan benchmark, renamed examples
keptsecret May 7, 2025
d567e71
removed obsolete files
keptsecret May 7, 2025
54acf2a
replaced old ex 23 unit test with new tests
keptsecret May 7, 2025
030d622
minor fixes
keptsecret May 7, 2025
ca71a39
minor fixes to workgroup benchmark
keptsecret May 8, 2025
6018e9a
more minor fixes
keptsecret May 8, 2025
3a9758c
some fixes to using config vars
keptsecret May 9, 2025
e496e98
fixes to test mem errors
keptsecret May 12, 2025
20011f5
config struct changes
keptsecret May 12, 2025
4a951b3
more test case coverage
keptsecret May 14, 2025
a42a742
Merge branch 'master' into new_wg_scan_test
keptsecret May 15, 2025
908abd1
refactor name changes
keptsecret May 15, 2025
81238ad
minor refactor
keptsecret May 15, 2025
749658f
manage workgroup in example
keptsecret May 15, 2025
1de31dd
moved benchmark to ex 29
keptsecret May 15, 2025
e828dc4
fit accessors to concept
keptsecret May 16, 2025
086c21e
use bda in unit test
keptsecret May 20, 2025
f4af3ed
benchmarks use bda
keptsecret May 20, 2025
a394f22
use data accessor with preload data in reg
keptsecret May 20, 2025
44c34a8
use store with data type because it works now
keptsecret May 20, 2025
0ccd26f
save reduction returns to storage
keptsecret May 21, 2025
2a991a9
combined headers between subgroup, workgroup stuff, restored spirv ca…
keptsecret May 22, 2025
e4735a4
simplified test,benchmark function template params
keptsecret May 22, 2025
13ae89f
revert test to default params
keptsecret May 22, 2025
a8774db
use preloaded data in benchmark
keptsecret May 22, 2025
bb3a901
Merge branch 'master' into new_wg_scan_test
keptsecret May 26, 2025
2a85f4e
refactor config member name
keptsecret May 27, 2025
99f6dfe
fit new accessor concepts
keptsecret May 27, 2025
3d89894
fix template accessors
keptsecret May 27, 2025
3d63ed7
add accessor index template type
keptsecret May 27, 2025
1100876
limit workgroup count
keptsecret May 28, 2025
f202ef5
utility func to get items per wg
keptsecret May 29, 2025
93b7810
added check for vk spec requirement
keptsecret May 30, 2025
3a3aaa9
removed maxComputeWorkgroupSubgroups*subgroupsize check
keptsecret Jun 2, 2025
6581ed4
Merge branch 'master' into new_wg_scan_test
keptsecret Jun 2, 2025
90ba926
various minor adjustments to unit tests
keptsecret Jun 5, 2025
19d7fe0
simplified data accessors
keptsecret Jun 5, 2025
fdace31
tests for native and emulated subgroup op
keptsecret Jun 5, 2025
d6680f2
removed redundant stuff
keptsecret Jun 5, 2025
bafad3e
bind swapchain image directly, explicit surface format swapchain
keptsecret Jun 6, 2025
32dc78f
shared data accessor header between test and bench, same shader adjus…
keptsecret Jun 6, 2025
2aef6d3
generate benchmark inputs with xoroshiro
keptsecret Jun 6, 2025
149a237
only have to benchmark plus op
keptsecret Jun 6, 2025
00ed9be
benchmark all reduce/scan in one run (lots of shaders)
keptsecret Jun 6, 2025
a5a21fd
minor changes to passing subgroup size and items per wg
keptsecret Jun 9, 2025
1710b69
push constant stores array of output addresses directly because stati…
keptsecret Jun 9, 2025
@@ -1,15 +1,14 @@
#include "nbl/builtin/hlsl/cpp_compat.hlsl"
#include "nbl/builtin/hlsl/functional.hlsl"

template<uint32_t kScanElementCount=1024*1024>
struct Output
struct PushConstantData
{
NBL_CONSTEXPR_STATIC_INLINE uint32_t ScanElementCount = kScanElementCount;

uint32_t subgroupSize;
uint32_t data[ScanElementCount];
uint64_t pInputBuf;
uint64_t pOutputBuf[8];
};

namespace arithmetic
{
// Thanks to our unified HLSL/C++ STD lib we're able to remove a whole load of code
template<typename T>
struct bit_and : nbl::hlsl::bit_and<T>
@@ -92,5 +91,6 @@ struct ballot : nbl::hlsl::plus<T>
static inline constexpr const char* name = "bitcount";
#endif
};
}

#include "nbl/builtin/hlsl/subgroup/basic.hlsl"
#include "nbl/builtin/hlsl/glsl_compat/subgroup_basic.hlsl"
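
Note on the new `PushConstantData`: it now carries only buffer device addresses, one for the input and eight for the outputs. The following is a minimal host-side sketch in C++ of what a matching struct could look like. The field names come from the diff; `makePushConstants` is a hypothetical helper and not part of the PR. At 72 bytes the struct stays under the 128-byte minimum that Vulkan guarantees for `maxPushConstantsSize`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side mirror of the shader's PushConstantData.
// Field names are taken from the diff; the rest is illustrative.
struct PushConstantData
{
    std::uint64_t pInputBuf;      // device address of the input buffer
    std::uint64_t pOutputBuf[8];  // one device address per output slot
};

// Nine 64-bit addresses with no padding: 72 bytes in total.
static_assert(sizeof(PushConstantData) == 72, "unexpected padding");
static_assert(offsetof(PushConstantData, pOutputBuf) == 8, "unexpected layout");

// Hypothetical helper (an assumption, not from the PR): fills the struct
// from an input address and one output address per binding index.
inline PushConstantData makePushConstants(std::uint64_t input,
                                          const std::uint64_t (&outputs)[8])
{
    PushConstantData pc{};
    pc.pInputBuf = input;
    for (std::size_t i = 0; i < 8; ++i)
    {
        assert(outputs[i] != 0 && "null device address");
        pc.pOutputBuf[i] = outputs[i];
    }
    return pc;
}
```

The shaders index `pOutputBuf` with `Binop::BindingIndex`, so the order of the addresses passed in has to match the binding indices defined in the shared header.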
30 changes: 30 additions & 0 deletions 23_Arithmetic2UnitTest/app_resources/shaderCommon.hlsl
@@ -0,0 +1,30 @@
#include "common.hlsl"

using namespace nbl;
using namespace hlsl;

// https://github.com/microsoft/DirectXShaderCompiler/issues/6144
uint32_t3 nbl::hlsl::glsl::gl_WorkGroupSize() {return uint32_t3(WORKGROUP_SIZE,1,1);}

#ifndef ITEMS_PER_INVOCATION
#error "Define ITEMS_PER_INVOCATION!"
#endif

[[vk::push_constant]] PushConstantData pc;

struct device_capabilities
{
#ifdef TEST_NATIVE
NBL_CONSTEXPR_STATIC_INLINE bool shaderSubgroupArithmetic = true;
#else
NBL_CONSTEXPR_STATIC_INLINE bool shaderSubgroupArithmetic = false;
#endif
};

#ifndef OPERATION
#error "Define OPERATION!"
#endif

#ifndef SUBGROUP_SIZE_LOG2
#error "Define SUBGROUP_SIZE_LOG2!"
#endif
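
Note on `device_capabilities`: the `TEST_NATIVE` define flips a compile-time trait, so the same test source can exercise both the native and the emulated arithmetic path. The C++ sketch below illustrates that pattern only. `native_caps`, `emulated_caps` and `reduceAdd` are hypothetical names, and the emulated branch is a generic pairwise tree reduction, not the library's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Same shape as the shader's device_capabilities, one struct per variant.
struct native_caps   { static constexpr bool shaderSubgroupArithmetic = true;  };
struct emulated_caps { static constexpr bool shaderSubgroupArithmetic = false; };

// Hypothetical stand-in for a reduction that picks its path from the trait.
template<class Caps>
std::uint32_t reduceAdd(const std::vector<std::uint32_t>& values)
{
    if (Caps::shaderSubgroupArithmetic)
    {
        // Stands in for the single native subgroup instruction.
        return std::accumulate(values.begin(), values.end(), std::uint32_t(0));
    }
    // Emulated path: pairwise tree reduction with doubling strides.
    std::vector<std::uint32_t> tmp(values);
    for (std::size_t stride = 1; stride < tmp.size(); stride *= 2)
        for (std::size_t i = 0; i + stride < tmp.size(); i += 2 * stride)
            tmp[i] += tmp[i + stride];
    return tmp.empty() ? 0u : tmp[0];
}
```

Both instantiations should agree on every input, which is the property a native-versus-emulated unit test relies on.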
55 changes: 55 additions & 0 deletions 23_Arithmetic2UnitTest/app_resources/testSubgroup.comp.hlsl
@@ -0,0 +1,55 @@
#pragma shader_stage(compute)

#define operation_t nbl::hlsl::OPERATION

#include "nbl/builtin/hlsl/glsl_compat/core.hlsl"
#include "nbl/builtin/hlsl/glsl_compat/subgroup_basic.hlsl"
#include "nbl/builtin/hlsl/subgroup2/arithmetic_portability.hlsl"

#include "shaderCommon.hlsl"
#include "nbl/builtin/hlsl/workgroup/basic.hlsl"

typedef vector<uint32_t, ITEMS_PER_INVOCATION> type_t;

uint32_t globalIndex()
{
return glsl::gl_WorkGroupID().x*WORKGROUP_SIZE+workgroup::SubgroupContiguousIndex();
}

template<class Binop, uint32_t N>
static void subtest(NBL_CONST_REF_ARG(type_t) sourceVal)
{
using config_t = subgroup2::Configuration<SUBGROUP_SIZE_LOG2>;
using params_t = subgroup2::ArithmeticParams<config_t, typename Binop::base_t, N, device_capabilities>;

const uint64_t outputBufAddr = pc.pOutputBuf[Binop::BindingIndex];

if (glsl::gl_SubgroupSize()!=1u<<SUBGROUP_SIZE_LOG2)
vk::RawBufferStore<uint32_t>(outputBufAddr, glsl::gl_SubgroupSize());
Comment on lines +27 to +28

again


operation_t<params_t> func;
type_t val = func(sourceVal);

vk::RawBufferStore<type_t>(outputBufAddr + sizeof(uint32_t) + sizeof(type_t) * globalIndex(), val, sizeof(uint32_t));
}

type_t test()
{
const uint32_t idx = globalIndex();
type_t sourceVal = vk::RawBufferLoad<type_t>(pc.pInputBuf + idx * sizeof(type_t));

subtest<arithmetic::bit_and<uint32_t>, ITEMS_PER_INVOCATION>(sourceVal);
subtest<arithmetic::bit_xor<uint32_t>, ITEMS_PER_INVOCATION>(sourceVal);
subtest<arithmetic::bit_or<uint32_t>, ITEMS_PER_INVOCATION>(sourceVal);
subtest<arithmetic::plus<uint32_t>, ITEMS_PER_INVOCATION>(sourceVal);
subtest<arithmetic::multiplies<uint32_t>, ITEMS_PER_INVOCATION>(sourceVal);
subtest<arithmetic::minimum<uint32_t>, ITEMS_PER_INVOCATION>(sourceVal);
subtest<arithmetic::maximum<uint32_t>, ITEMS_PER_INVOCATION>(sourceVal);
return sourceVal;
}

[numthreads(WORKGROUP_SIZE,1,1)]
void main()
{
test();
}
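
Note on the output layout used by `subtest` above: each output buffer starts with one `uint32_t` slot (written with `gl_SubgroupSize()` when the size does not match `1u<<SUBGROUP_SIZE_LOG2`), followed by one `type_t` vector per invocation at `sizeof(uint32_t) + sizeof(type_t) * globalIndex()`. A minimal C++ sketch of the matching readback arithmetic follows. `outputByteOffset` and `readResult` are hypothetical helpers, and the sketch assumes the output buffer has already been copied into host memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Mirrors globalIndex() in the shader: workgroups are laid out back to back.
constexpr std::uint32_t globalIndex(std::uint32_t workgroupID,
                                    std::uint32_t workgroupSize,
                                    std::uint32_t indexInWorkgroup)
{
    return workgroupID * workgroupSize + indexInWorkgroup;
}

// Byte offset of one invocation's result vector. The first 4 bytes of the
// buffer are the slot the shader writes gl_SubgroupSize() into on a mismatch.
constexpr std::uint64_t outputByteOffset(std::uint32_t globalIdx,
                                         std::uint32_t itemsPerInvocation)
{
    return sizeof(std::uint32_t) +
           static_cast<std::uint64_t>(sizeof(std::uint32_t)) * itemsPerInvocation * globalIdx;
}

// Hypothetical readback helper (an assumption, not from the PR): copies one
// invocation's items out of a host-side copy of the output buffer.
inline std::vector<std::uint32_t> readResult(const std::vector<std::uint8_t>& buffer,
                                             std::uint32_t globalIdx,
                                             std::uint32_t itemsPerInvocation)
{
    const std::uint64_t offset = outputByteOffset(globalIdx, itemsPerInvocation);
    const std::size_t bytes = sizeof(std::uint32_t) * itemsPerInvocation;
    assert(offset + bytes <= buffer.size() && "read past end of buffer");
    std::vector<std::uint32_t> items(itemsPerInvocation);
    std::memcpy(items.data(), buffer.data() + offset, bytes);
    return items;
}
```

Because of the 4-byte header, result vectors wider than one item are only 4-byte aligned, which is consistent with the shader passing `sizeof(uint32_t)` as the last argument of `vk::RawBufferStore`.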
76 changes: 76 additions & 0 deletions 23_Arithmetic2UnitTest/app_resources/testWorkgroup.comp.hlsl
@@ -0,0 +1,76 @@
#pragma shader_stage(compute)

#include "nbl/builtin/hlsl/glsl_compat/core.hlsl"
#include "nbl/builtin/hlsl/glsl_compat/subgroup_basic.hlsl"
#include "nbl/builtin/hlsl/subgroup2/arithmetic_portability.hlsl"
#include "nbl/builtin/hlsl/workgroup2/arithmetic.hlsl"

static const uint32_t WORKGROUP_SIZE = 1u << WORKGROUP_SIZE_LOG2;

#include "shaderCommon.hlsl"

using config_t = workgroup2::ArithmeticConfiguration<WORKGROUP_SIZE_LOG2, SUBGROUP_SIZE_LOG2, ITEMS_PER_INVOCATION>;

typedef vector<uint32_t, config_t::ItemsPerInvocation_0> type_t;

// final (level 1/2) scan needs to fit in one subgroup exactly
groupshared uint32_t scratch[mpl::max_v<int16_t,config_t::SharedScratchElementCount,1>];

#include "../../common/include/WorkgroupDataAccessors.hlsl"

static ScratchProxy arithmeticAccessor;

template<class Binop, class device_capabilities>
struct operation_t
{
using binop_base_t = typename Binop::base_t;
using otype_t = typename Binop::type_t;

// workgroup reduction returns the value of the reduction
// workgroup scans do not return anything, but use the data accessor to do the storing directly
void operator()()
{
PreloadedDataProxy<config_t,Binop> dataAccessor = PreloadedDataProxy<config_t,Binop>::create();
dataAccessor.preload();
#if IS_REDUCTION
otype_t value =
#endif
OPERATION<config_t,binop_base_t,device_capabilities>::template __call<PreloadedDataProxy<config_t,Binop>, ScratchProxy>(dataAccessor,arithmeticAccessor);
// we barrier before because we alias the accessors for Binop
arithmeticAccessor.workgroupExecutionAndMemoryBarrier();
#if IS_REDUCTION
[unroll]
for (uint32_t i = 0; i < PreloadedDataProxy<config_t,Binop>::PreloadedDataCount; i++)
dataAccessor.preloaded[i] = value;
#endif
dataAccessor.unload();
}
};


template<class Binop>
static void subtest()
{
if (glsl::gl_SubgroupSize()!=1u<<SUBGROUP_SIZE_LOG2)
vk::RawBufferStore<uint32_t>(pc.pOutputBuf[Binop::BindingIndex], glsl::gl_SubgroupSize());
Comment on lines +54 to +55

just assert, have you seen our assert() in HLSL?


operation_t<Binop,device_capabilities> func;
func();
}

void test()
{
subtest<arithmetic::bit_and<uint32_t> >();
subtest<arithmetic::bit_xor<uint32_t> >();
subtest<arithmetic::bit_or<uint32_t> >();
subtest<arithmetic::plus<uint32_t> >();
subtest<arithmetic::multiplies<uint32_t> >();
subtest<arithmetic::minimum<uint32_t> >();
subtest<arithmetic::maximum<uint32_t> >();
}

[numthreads(WORKGROUP_SIZE,1,1)]
void main()
{
test();
}
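
Note on what the workgroup test is expected to produce: under `IS_REDUCTION` the shader copies the reduced value into every preloaded slot before `unload()`, so every element belonging to a workgroup should end up holding that workgroup's reduction, while scans store their running values through the accessor. Below is a minimal CPU reference in C++ for comparing against. It assumes each workgroup owns a contiguous run of items in the buffer and that results land at the same indices as the inputs; `WorkgroupDataAccessors.hlsl` is not shown in this diff, so that layout is an assumption. `Mode` and `referenceArithmetic` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Mode { Reduction, InclusiveScan, ExclusiveScan };

// CPU reference: processes the input in runs of itemsPerWorkgroup elements.
// `identity` must be the neutral element of `op` (0 for plus, 1 for
// multiplies, all ones for bit_and, and so on).
template<class Op>
std::vector<std::uint32_t> referenceArithmetic(const std::vector<std::uint32_t>& input,
                                               std::size_t itemsPerWorkgroup,
                                               Mode mode, Op op, std::uint32_t identity)
{
    assert(itemsPerWorkgroup > 0);
    std::vector<std::uint32_t> output(input.size());
    for (std::size_t begin = 0; begin < input.size(); begin += itemsPerWorkgroup)
    {
        const std::size_t end = std::min(begin + itemsPerWorkgroup, input.size());
        std::uint32_t acc = identity;
        for (std::size_t i = begin; i < end; ++i)
        {
            if (mode == Mode::ExclusiveScan)
                output[i] = acc;      // running value before this element
            acc = op(acc, input[i]);
            if (mode == Mode::InclusiveScan)
                output[i] = acc;      // running value including this element
        }
        // A reduction broadcasts the final value to the whole run.
        if (mode == Mode::Reduction)
            std::fill(output.begin() + begin, output.begin() + end, acc);
    }
    return output;
}
```

With `itemsPerWorkgroup` set to the workgroup size times the items per invocation, the returned vector can be compared element by element against the data read back from the GPU.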