Conversation

@NorthmanPKU (Collaborator) commented Jul 24, 2025

Description of changes:
Refactor the norm_linear kernel into a struct-functor style, split into three pieces (a rough skeleton follows the list):

  • NormLinearKernelSpec: maintains the compile-time configuration and constants
  • ProcessAtomFunctor: core computation logic for processing a single OUTPUT_ATOM_SIZE output tile
  • NormLinearHandler: top-level control flow and memory management; can later be split into more fine-grained functions
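
For illustration, a minimal skeleton of this layout could look roughly like the following; everything beyond the three struct names above (members, signatures, qualifiers) is a hypothetical sketch, not the PR's actual code:

template <typename T, int BATCH_SIZE, int OUTPUT_SIZE, int REDUCTION_SIZE,
          int O_STRIDE, int K_PIPE_MAX>
struct NormLinearKernelSpec {
  using value_type = T;
  static constexpr int batch_size = BATCH_SIZE;
  static constexpr int output_size = OUTPUT_SIZE;
  static constexpr int reduction_size = REDUCTION_SIZE;
  static constexpr int o_stride = O_STRIDE;
  static constexpr int k_pipe_max = K_PIPE_MAX;
  // ... derived compile-time constants (tile shapes, smem sizes, etc.)
};

template <typename Spec>
struct ProcessAtomFunctor {
  // Core computation: produce one OUTPUT_ATOM_SIZE-wide output tile.
  __device__ void operator()(int atom_idx /*, smem tiles, accumulators */);
};

template <typename Spec>
struct NormLinearHandler {
  __device__ NormLinearHandler(typename Spec::value_type const *input_ptr,
                               typename Spec::value_type const *norm_weight_ptr,
                               typename Spec::value_type const *weight_ptr,
                               float eps,
                               typename Spec::value_type *output_ptr);
  __device__ void run(); // top-level control flow: load, normalize, matmul, store
};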

The norm_linear kernel is now invoked like this:

using KernelSpec = NormLinearKernelSpec<T, BATCH_SIZE, OUTPUT_SIZE, REDUCTION_SIZE, O_STRIDE, K_PIPE_MAX>;
NormLinearHandler<KernelSpec> handler(input_ptr, norm_weight_ptr, weight_ptr, eps, output_ptr);
handler.run();

The refactored code has exactly the same register usage as the old one (123 registers), and no performance difference has been observed:
ptxas info : Used 123 registers, used 1 barriers, 392 bytes cmem[0]
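
(These statistics come from nvcc's verbose ptxas output, i.e. compiling with -Xptxas -v.)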

A sketch of what we could do in the future:

using KernelSpec = NormLinearKernelSpec<T, BATCH_SIZE, OUTPUT_SIZE, REDUCTION_SIZE, O_STRIDE, K_PIPE_MAX>;
NormLinearHandler<KernelSpec> handler(input_ptr, norm_weight_ptr, weight_ptr, eps, output_ptr);
handler.load_independent_data(); // inter-layer overlap
/* some sync logic */
handler.main_logic();
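
Splitting load_independent_data() out of run() would let loads that do not depend on the previous layer's output be issued early and overlapped with that layer's computation; that is the inter-layer overlap the comment above refers to.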

Related Issues:

Linked Issues:

  • Issue #

Issues closed by this PR:

  • Closes #

Member commented:

Maybe this is because __CUDA_ARCH__ wasn't defined when this header file was compiled.

@NorthmanPKU (Collaborator, Author) commented Jul 24, 2025

It seems __CUDA_ARCH__ can only be used in the implementations of GPU functions.
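
For example (a minimal, hypothetical illustration; the arch guard and byte counts are illustrative values, not this kernel's real configuration):

// __CUDA_ARCH__ is defined only while nvcc compiles device code, so testing
// it at namespace scope in a header that the host pass also sees takes the
// undefined branch during host compilation. Inside a __device__ function
// body it behaves as expected.
__device__ int max_dynamic_smem_bytes() {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
  return 163 * 1024; // illustrative opt-in limit for sm_80
#else
  return 48 * 1024;  // illustrative default static limit
#endif
}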

@NorthmanPKU (Collaborator, Author) commented Jul 24, 2025

I think we could use cudaGetDeviceProperties on the host side to query and set the maximum shared memory size, and use __CUDA_ARCH__ on the device side to do the same.
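
A minimal host-side sketch of that idea (the kernel symbol norm_linear_kernel and the launch configuration are hypothetical):

// Query the device's opt-in shared memory limit, raise the kernel's dynamic
// shared memory cap to match, then launch with that much dynamic smem.
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, /*device=*/0);
size_t max_smem = prop.sharedMemPerBlockOptin;
cudaFuncSetAttribute(norm_linear_kernel, // hypothetical kernel symbol
                     cudaFuncAttributeMaxDynamicSharedMemorySize,
                     static_cast<int>(max_smem));
dim3 grid_dim(1), block_dim(128); // hypothetical launch configuration
norm_linear_kernel<<<grid_dim, block_dim, max_smem>>>(/* args */);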

@jiazhihao (Member) commented:

@NorthmanPKU Is this PR ready for review?

@NorthmanPKU (Collaborator, Author) commented:

> @NorthmanPKU Is this PR ready for review?

Yes

@jiazhihao (Member) commented:

@NorthmanPKU Do we still want to merge this?
