Conversation

@NorthmanPKU (Collaborator) commented Jul 24, 2025

Description of changes:
Refactor the norm_linear kernel into a struct-functor style, split into three pieces (a rough skeleton follows the list):

  • NormLinearKernelSpec: maintains the compile-time configuration and constants
  • ProcessAtomFunctor: core computation logic for processing a single OUTPUT_ATOM_SIZE output tile
  • NormLinearHandler: top-level control flow and memory management; can later be split into more fine-grained functions
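
For illustration, a minimal skeleton of this layout could look roughly like the following; everything beyond the three struct names above (members, signatures, qualifiers) is a hypothetical sketch, not the PR's actual code:

template <typename T, int BATCH_SIZE, int OUTPUT_SIZE, int REDUCTION_SIZE,
          int O_STRIDE, int K_PIPE_MAX>
struct NormLinearKernelSpec {
  using value_type = T;
  static constexpr int batch_size = BATCH_SIZE;
  static constexpr int output_size = OUTPUT_SIZE;
  static constexpr int reduction_size = REDUCTION_SIZE;
  static constexpr int o_stride = O_STRIDE;
  static constexpr int k_pipe_max = K_PIPE_MAX;
  // ... derived compile-time constants (tile shapes, smem sizes, etc.)
};

template <typename Spec>
struct ProcessAtomFunctor {
  // Core computation: produce one OUTPUT_ATOM_SIZE-wide output tile.
  __device__ void operator()(int atom_idx /*, smem tiles, accumulators */);
};

template <typename Spec>
struct NormLinearHandler {
  __device__ NormLinearHandler(typename Spec::value_type const *input_ptr,
                               typename Spec::value_type const *norm_weight_ptr,
                               typename Spec::value_type const *weight_ptr,
                               float eps,
                               typename Spec::value_type *output_ptr);
  __device__ void run(); // top-level control flow: load, normalize, matmul, store
};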

The norm_linear kernel is now invoked like this:

using KernelSpec = NormLinearKernelSpec<T, BATCH_SIZE, OUTPUT_SIZE, REDUCTION_SIZE, O_STRIDE, K_PIPE_MAX>;
NormLinearHandler<KernelSpec> handler(input_ptr, norm_weight_ptr, weight_ptr, eps, output_ptr);
handler.run();

The refactored code has exactly the same register usage as the old one (123 registers), and no performance difference has been observed:
ptxas info : Used 123 registers, used 1 barriers, 392 bytes cmem[0]
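
(These statistics come from nvcc's verbose ptxas output, i.e. compiling with -Xptxas -v.)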

A sketch of what we could do in the future:

using KernelSpec = NormLinearKernelSpec<T, BATCH_SIZE, OUTPUT_SIZE, REDUCTION_SIZE, O_STRIDE, K_PIPE_MAX>;
NormLinearHandler<KernelSpec> handler(input_ptr, norm_weight_ptr, weight_ptr, eps, output_ptr);
handler.load_independent_data(); // inter-layer overlap
/* some sync logic */
handler.main_logic();
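
Splitting load_independent_data() out of run() would let loads that do not depend on the previous layer's output be issued early and overlapped with that layer's computation; that is the inter-layer overlap the comment above refers to.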

Related Issues:

Linked Issues:

  • Issue #

Issues closed by this PR:

  • Closes #

Member commented:

Maybe this is because __CUDA_ARCH__ wasn't defined when this header file was compiled.

@NorthmanPKU (Collaborator, Author) commented Jul 24, 2025

It seems __CUDA_ARCH__ can only be used in the implementations of GPU functions.
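
For example (a minimal, hypothetical illustration; the arch guard and byte counts are illustrative values, not this kernel's real configuration):

// __CUDA_ARCH__ is defined only while nvcc compiles device code, so testing
// it at namespace scope in a header that the host pass also sees takes the
// undefined branch during host compilation. Inside a __device__ function
// body it behaves as expected.
__device__ int max_dynamic_smem_bytes() {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
  return 163 * 1024; // illustrative opt-in limit for sm_80
#else
  return 48 * 1024;  // illustrative default static limit
#endif
}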

@NorthmanPKU (Collaborator, Author) commented Jul 24, 2025

I think we could use cudaGetDeviceProperties on the host side to query and set the maximum shared memory size, and use __CUDA_ARCH__ on the device side to do the same.
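
A minimal host-side sketch of that idea (the kernel symbol norm_linear_kernel and the launch configuration are hypothetical):

// Query the device's opt-in shared memory limit, raise the kernel's dynamic
// shared memory cap to match, then launch with that much dynamic smem.
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, /*device=*/0);
size_t max_smem = prop.sharedMemPerBlockOptin;
cudaFuncSetAttribute(norm_linear_kernel, // hypothetical kernel symbol
                     cudaFuncAttributeMaxDynamicSharedMemorySize,
                     static_cast<int>(max_smem));
dim3 grid_dim(1), block_dim(128); // hypothetical launch configuration
norm_linear_kernel<<<grid_dim, block_dim, max_smem>>>(/* args */);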

@jiazhihao (Member) commented:

@NorthmanPKU Is this PR ready for review?

@NorthmanPKU (Collaborator, Author) commented:

> @NorthmanPKU Is this PR ready for review?

Yes

@jiazhihao (Member) commented:

@NorthmanPKU Do we still want to merge this?
