Skip to content

Subgroup2 Benchmark #190

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 19 commits into
base: master
Choose a base branch
from
Open

Subgroup2 Benchmark #190

wants to merge 19 commits into from

Conversation

devshgraphicsprogramming
Copy link
Member

No description provided.

Comment on lines +13 to +93
template<typename T>
struct bit_and : nbl::hlsl::bit_and<T>
{
using base_t = nbl::hlsl::bit_and<T>;

NBL_CONSTEXPR_STATIC_INLINE uint16_t BindingIndex = 0;
#ifndef __HLSL_VERSION
static inline constexpr const char* name = "bit_and";
#endif
};
template<typename T>
struct bit_or : nbl::hlsl::bit_or<T>
{
using base_t = nbl::hlsl::bit_or<T>;

NBL_CONSTEXPR_STATIC_INLINE uint16_t BindingIndex = 1;
#ifndef __HLSL_VERSION
static inline constexpr const char* name = "bit_xor";
#endif
};
template<typename T>
struct bit_xor : nbl::hlsl::bit_xor<T>
{
using base_t = nbl::hlsl::bit_xor<T>;

NBL_CONSTEXPR_STATIC_INLINE uint16_t BindingIndex = 2;
#ifndef __HLSL_VERSION
static inline constexpr const char* name = "bit_or";
#endif
};
template<typename T>
struct plus : nbl::hlsl::plus<T>
{
using base_t = nbl::hlsl::plus<T>;

NBL_CONSTEXPR_STATIC_INLINE uint16_t BindingIndex = 3;
#ifndef __HLSL_VERSION
static inline constexpr const char* name = "plus";
#endif
};
template<typename T>
struct multiplies : nbl::hlsl::multiplies<T>
{
using base_t = nbl::hlsl::multiplies<T>;

NBL_CONSTEXPR_STATIC_INLINE uint16_t BindingIndex = 4;
#ifndef __HLSL_VERSION
static inline constexpr const char* name = "multiplies";
#endif
};
template<typename T>
struct minimum : nbl::hlsl::minimum<T>
{
using base_t = nbl::hlsl::minimum<T>;

NBL_CONSTEXPR_STATIC_INLINE uint16_t BindingIndex = 5;
#ifndef __HLSL_VERSION
static inline constexpr const char* name = "minimum";
#endif
};
template<typename T>
struct maximum : nbl::hlsl::maximum<T>
{
using base_t = nbl::hlsl::maximum<T>;

NBL_CONSTEXPR_STATIC_INLINE uint16_t BindingIndex = 6;
#ifndef __HLSL_VERSION
static inline constexpr const char* name = "maximum";
#endif
};

template<typename T>
struct ballot : nbl::hlsl::plus<T>
{
using base_t = nbl::hlsl::plus<T>;

NBL_CONSTEXPR_STATIC_INLINE uint16_t BindingIndex = 7;
#ifndef __HLSL_VERSION
static inline constexpr const char* name = "bitcount";
#endif
};
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the benchmark lets only benchmark plus and ballot

static void subbench(NBL_CONST_REF_ARG(type_t) sourceVal)
{
using config_t = nbl::hlsl::subgroup::Configuration<SUBGROUP_SIZE_LOG2>;
using params_t = nbl::hlsl::subgroup2::ArithmeticParams<config_t, typename binop<T>::base_t, N, nbl::hlsl::jit::device_capabilities>;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make the "use-native" come from a define too

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cause you want 2 different pipelines and 2 different test runs for each

[unroll]
for (uint32_t i = 0; i < ITEMS_PER_INVOCATION; i++)
{
sourceVal[i] = inputValue[idx + i];
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eeh you've left perf on the table, your subgroup is now doing a heavily strided load

you need to load differently:

sourceVal[i] = nbl::hlsl::glsl::gl_WorkGroupID().x*WORKGROUP_SIZE*ITEMS_PER_INVOCATION+((SubgroupID*ITEMS_PER_INVOCATION+i)<<SubgroupSizeLog2)+SubgroupInvocationIndex;

so consecutive invocations load consecutive memory locations during a lock-step

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or make the inputValue a ByteBuffer and do a templated Load from it of a whole type_t

for (uint32_t i = 0; i < NUM_LOOPS; i++)
value = func(value);

output[binop<T>::BindingIndex].template Store<type_t>(sizeof(uint32_t) + sizeof(type_t) * globalIndex(), value);
Copy link
Member Author

@devshgraphicsprogramming devshgraphicsprogramming Apr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're storing to the wrong place, compute the index same way you compute for inputValue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and dont store whole vector store a scalar in aan unrolled loop same way you load input

@devshgraphicsprogramming devshgraphicsprogramming changed the title Subgroup2 Unit Test Subgroup2 Benchmark Apr 9, 2025
Comment on lines +66 to +72
subbench<bit_and, uint32_t, ITEMS_PER_INVOCATION>(sourceVal);
subbench<bit_xor, uint32_t, ITEMS_PER_INVOCATION>(sourceVal);
subbench<bit_or, uint32_t, ITEMS_PER_INVOCATION>(sourceVal);
subbench<plus, uint32_t, ITEMS_PER_INVOCATION>(sourceVal);
subbench<multiplies, uint32_t, ITEMS_PER_INVOCATION>(sourceVal);
subbench<minimum, uint32_t, ITEMS_PER_INVOCATION>(sourceVal);
subbench<maximum, uint32_t, ITEMS_PER_INVOCATION>(sourceVal);
Copy link
Member Author

@devshgraphicsprogramming devshgraphicsprogramming Apr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just bench the plus

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump

Comment on lines +18 to +19
// because subgroups don't match `gl_LocalInvocationIndex` snake curve addressing, we also can't load inputs that way
uint32_t globalIndex();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to globalFirstItemIndex to mean that its the first index to access by the invocation, and add a comment that to get next item one does not scroll by 1 but by SubgroupSize

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump

Comment on lines +164 to +178
// create dummy image
dummyImg = m_device->createImage({
{
.type = IGPUImage::ET_2D,
.samples = asset::ICPUImage::ESCF_1_BIT,
.format = asset::EF_R16G16B16A16_SFLOAT,
.extent = {WIN_W, WIN_H, 1},
.mipLevels = 1,
.arrayLayers = 1,
.flags = IImage::ECF_NONE,
.usage = core::bitflag(asset::IImage::EUF_STORAGE_BIT) | asset::IImage::EUF_TRANSFER_SRC_BIT
}
});
if (!dummyImg || !m_device->allocate(dummyImg->getMemoryReqs(), dummyImg.get()).isValid())
return logFail("Could not create HDR Image");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to create a dummy image to write into descriptor set, write the swapchain image instead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump


virtual bool onAppTerminated() override
{
delete[] inputData;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you don't do testing, you can discard the input data as soon as input buffer is made, OR even better yet

Use a Hash from your path tracer / BxDF Math PR and skip having an input buffer entirely!

const uint32_t workgroupCount = elementCount / (set.workgroupSize * set.itemsPerInvocation);

cmdbuf->bindComputePipeline(set.pipeline.get());
cmdbuf->bindDescriptorSets(EPBP_COMPUTE, set.pipeline->getLayout(), 0u, 1u, &benchDs.get());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the set you can bind once at start of commandbuffer if you never change the layout

template<template<class> class Arithmetic>
BenchmarkSet createBenchmarkPipelines(const smart_refctd_ptr<const ICPUShader>&source, const IGPUPipelineLayout* layout, const uint32_t elementCount, const uint8_t subgroupSizeLog2, const uint32_t workgroupSize, uint32_t itemsPerInvoc = 1u, uint32_t numLoops = 8u)
{
std::string arith_name = Arithmetic<bit_xor<uint32_t>>::name; // TODO all operations
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only plus

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump, plus and ballot are only ones needed, the benchmark wont change much for all other binops

Comment on lines 690 to 696
options.spirvOptimizer = nullptr;
//#ifndef _NBL_DEBUG
// ISPIRVOptimizer::E_OPTIMIZER_PASS optPasses = ISPIRVOptimizer::EOP_STRIP_DEBUG_INFO;
// auto opt = make_smart_refctd_ptr<ISPIRVOptimizer>(std::span<ISPIRVOptimizer::E_OPTIMIZER_PASS>(&optPasses, 1));
// options.spirvOptimizer = opt.get();
//#endif
options.debugInfoFlags |= IShaderCompiler::E_DEBUG_INFO_FLAGS::EDIF_LINE_BIT;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you must use zero debug flags and a SPIR-V optimizer to get representable perf, just ask @Fletterio about his FFT examples

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's in the createShader method in main.cpp of example 28, but to get a standard optimizer + strip debug info you just provide an optimizer with a single strip debug info flag to the compiler.

I don't remember exactly whether this invokes other optimizations or just the strip debug

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should invoke the standard passes specified in the SPIRV compiler repo when you run with -O, but in that regard I think the intent of the optimizer is busted (providing a custom optimizer likely is intended to disable all other passes by default)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@keptsecret you correctly optimize and strip bedug info in 23 the test, but not ex 29 the benchmark?

Comment on lines +399 to +407
// bind dummy image
IGPUImageView::SCreationParams viewParams = {
.flags = IGPUImageView::ECF_NONE,
.subUsages = IGPUImage::E_USAGE_FLAGS::EUF_STORAGE_BIT,
.image = dummyImg,
.viewType = IGPUImageView::ET_2D,
.format = dummyImg->getCreationParameters().format
};
auto dummyImgView = m_device->createImageView(std::move(viewParams));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not create an image view every frame

Comment on lines +374 to +397
// barrier transition to GENERAL
{
IGPUCommandBuffer::SPipelineBarrierDependencyInfo::image_barrier_t imageBarriers[1];
imageBarriers[0].barrier = {
.dep = {
.srcStageMask = PIPELINE_STAGE_FLAGS::NONE,
.srcAccessMask = ACCESS_FLAGS::NONE,
.dstStageMask = PIPELINE_STAGE_FLAGS::COMPUTE_SHADER_BIT,
.dstAccessMask = ACCESS_FLAGS::SHADER_WRITE_BITS
}
};
imageBarriers[0].image = dummyImg.get();
imageBarriers[0].subresourceRange = {
.aspectMask = IImage::EAF_COLOR_BIT,
.baseMipLevel = 0u,
.levelCount = 1u,
.baseArrayLayer = 0u,
.layerCount = 1u
};
imageBarriers[0].oldLayout = IImage::LAYOUT::UNDEFINED;
imageBarriers[0].newLayout = IImage::LAYOUT::GENERAL;

cmdbuf->pipelineBarrier(E_DEPENDENCY_FLAGS::EDF_NONE, { .imgBarriers = imageBarriers });
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you don't actually touch the image, you don't need to transition it (you may need to transition right after creation / first frame so validation layer doesn't complain about it being in UNDEFINED layout)

Comment on lines +409 to +423
video::IGPUDescriptorSet::SDescriptorInfo dsInfo;
dsInfo.info.image.imageLayout = IImage::LAYOUT::GENERAL;
dsInfo.desc = dummyImgView;

IGPUDescriptorSet::SWriteDescriptorSet dsWrites[1u] =
{
{
.dstSet = benchDs.get(),
.binding = 2u,
.arrayElement = 0u,
.count = 1u,
.info = &dsInfo,
}
};
m_device->updateDescriptorSets(1u, dsWrites, 0u, nullptr);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't write descriptor set every frame, and write the swapchain images instead

Comment on lines 432 to 434
passed = runBenchmark<emulatedScanInclusive>(cmdbuf, benchSets[0], elementCount, SubgroupSizeLog2);
passed = runBenchmark<emulatedScanInclusive>(cmdbuf, benchSets[1], elementCount, SubgroupSizeLog2);
passed = runBenchmark<emulatedScanInclusive>(cmdbuf, benchSets[2], elementCount, SubgroupSizeLog2);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need the CPU validator emulatedScanInclusive for anything, untemplate the method

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump

Comment on lines +437 to +500
// blit
{
IGPUCommandBuffer::SPipelineBarrierDependencyInfo::image_barrier_t imageBarriers[2];
imageBarriers[0].barrier = {
.dep = {
.srcStageMask = PIPELINE_STAGE_FLAGS::COMPUTE_SHADER_BIT,
.srcAccessMask = ACCESS_FLAGS::SHADER_WRITE_BITS,
.dstStageMask = PIPELINE_STAGE_FLAGS::BLIT_BIT,
.dstAccessMask = ACCESS_FLAGS::TRANSFER_WRITE_BIT
}
};
imageBarriers[0].image = dummyImg.get();
imageBarriers[0].subresourceRange = {
.aspectMask = IImage::EAF_COLOR_BIT,
.baseMipLevel = 0u,
.levelCount = 1u,
.baseArrayLayer = 0u,
.layerCount = 1u
};
imageBarriers[0].oldLayout = IImage::LAYOUT::UNDEFINED;
imageBarriers[0].newLayout = IImage::LAYOUT::TRANSFER_SRC_OPTIMAL;

imageBarriers[1].barrier = {
.dep = {
.srcStageMask = PIPELINE_STAGE_FLAGS::NONE,
.srcAccessMask = ACCESS_FLAGS::NONE,
.dstStageMask = PIPELINE_STAGE_FLAGS::BLIT_BIT,
.dstAccessMask = ACCESS_FLAGS::TRANSFER_WRITE_BIT
}
};
imageBarriers[1].image = m_surface->getSwapchainResources()->getImage(m_currentImageAcquire.imageIndex);
imageBarriers[1].subresourceRange = {
.aspectMask = IImage::EAF_COLOR_BIT,
.baseMipLevel = 0u,
.levelCount = 1u,
.baseArrayLayer = 0u,
.layerCount = 1u
};
imageBarriers[1].oldLayout = IImage::LAYOUT::UNDEFINED;
imageBarriers[1].newLayout = IImage::LAYOUT::TRANSFER_DST_OPTIMAL;

cmdbuf->pipelineBarrier(E_DEPENDENCY_FLAGS::EDF_NONE, { .imgBarriers = imageBarriers });
}

{
IGPUCommandBuffer::SImageBlit regions[] = { {
.srcMinCoord = {0,0,0},
.srcMaxCoord = {WIN_W,WIN_H,1},
.dstMinCoord = {0,0,0},
.dstMaxCoord = {WIN_W,WIN_H,1},
.layerCount = 1,
.srcBaseLayer = 0,
.dstBaseLayer = 0,
.srcMipLevel = 0,
.dstMipLevel = 0,
.aspectMask = IGPUImage::E_ASPECT_FLAGS::EAF_COLOR_BIT
} };

auto srcImg = dummyImg.get();
auto scRes = static_cast<CDefaultSwapchainFramebuffers*>(m_surface->getSwapchainResources());
auto dstImg = scRes->getImage(m_currentImageAcquire.imageIndex);

cmdbuf->blitImage(srcImg, IImage::LAYOUT::TRANSFER_SRC_OPTIMAL, dstImg, IImage::LAYOUT::TRANSFER_DST_OPTIMAL, regions, ISampler::ETF_NEAREST);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blit is not needed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you have the swapchain images bound and you layout transition them, Nsight should still consider them used, please remove the blit

Comment on lines +502 to +525
// barrier transition to PRESENT
{
IGPUCommandBuffer::SPipelineBarrierDependencyInfo::image_barrier_t imageBarriers[1];
imageBarriers[0].barrier = {
.dep = {
.srcStageMask = PIPELINE_STAGE_FLAGS::COMPUTE_SHADER_BIT,
.srcAccessMask = ACCESS_FLAGS::SHADER_WRITE_BITS,
.dstStageMask = PIPELINE_STAGE_FLAGS::NONE,
.dstAccessMask = ACCESS_FLAGS::NONE
}
};
imageBarriers[0].image = m_surface->getSwapchainResources()->getImage(m_currentImageAcquire.imageIndex);
imageBarriers[0].subresourceRange = {
.aspectMask = IImage::EAF_COLOR_BIT,
.baseMipLevel = 0u,
.levelCount = 1u,
.baseArrayLayer = 0u,
.layerCount = 1u
};
imageBarriers[0].oldLayout = IImage::LAYOUT::TRANSFER_DST_OPTIMAL;
imageBarriers[0].newLayout = IImage::LAYOUT::PRESENT_SRC;

cmdbuf->pipelineBarrier(E_DEPENDENCY_FLAGS::EDF_NONE, { .imgBarriers = imageBarriers });
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

transition once to PRESENT and never touch later on

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me know if nsight needs some barrier usage of the swapchain in order to capture, I don't think it does

@keptsecret
Copy link
Contributor

Should close, replace with #192

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants