

[Operator] Add vstack op [MooreThreads] #175

Merged
1 commit merged into FlagOpen:master from dev_vstack on Sep 20, 2024

Conversation

@yjl0101 (Contributor) commented on Aug 22, 2024

Add the vstack operator.
Performance of some cases on an NVIDIA A100:

benchmark/test_special_perf.py Operator vstack Performance Test (torch.float16)
Size        Torch Latency (ms)   Gems Latency (ms)
--------------------------------------------------
1024                  0.017408            0.011264
6144                  0.057344            0.032768
11264                  0.09728            0.053248
16384                 0.136192            0.070656
21504                 0.177152            0.090112
26624                 0.217088             0.10752
31744                 0.257024            0.125952
36864                 0.297984            0.144384
41984                 0.338944            0.162816
47104                 0.379904            0.181248
52224                 0.420864             0.19968
57344                 0.461824            0.218112
62464                 0.502784            0.236544
67584                 0.544768            0.254976
72704                 0.585728            0.273408
77824                 0.627712             0.29184
Operator vstack Performance Test (torch.float32)
Size        Torch Latency (ms)   Gems Latency (ms)
--------------------------------------------------
1024                  0.016384             0.01536
6144                  0.059392             0.05632
11264                 0.098304            0.093184
16384                 0.137216            0.130048
21504                 0.177152            0.166912
26624                 0.217088            0.203776
31744                 0.258048             0.24064
36864                 0.297984            0.277504
41984                 0.336896            0.314368
47104                 0.376832            0.351232
52224                 0.417792             0.38912
57344                 0.459776            0.425984
62464                 0.500736            0.462848
67584                 0.539648            0.499712
72704                 0.580608            0.535552
77824                  0.61952             0.57344
Operator vstack Performance Test (torch.bfloat16)
Size        Torch Latency (ms)   Gems Latency (ms)
--------------------------------------------------
1024                  0.017408            0.012288
6144                  0.058368            0.033792
11264                  0.09728            0.053248
16384                 0.137216             0.07168
21504                 0.177152            0.090112
26624                 0.217088            0.108544
31744                 0.257024            0.125952
36864                 0.297984            0.144384
41984                 0.338944            0.162816
47104                 0.379904            0.181248
52224                 0.420864             0.19968
57344                 0.461824            0.218112
62464                 0.503808            0.236544
67584                 0.544768            0.254976
72704                 0.585728            0.273408
77824                 0.627712             0.29184
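
For reference, a minimal usage sketch of exercising the new op through FlagGems (assuming the usual flag_gems.use_gems() patch context; the shapes and dtype here are illustrative, not from the benchmark above):

import torch
import flag_gems

a = torch.randn(1024, 64, device="cuda", dtype=torch.float16)
b = torch.randn(2048, 64, device="cuda", dtype=torch.float16)

ref = torch.vstack((a, b))      # eager ATen reference
with flag_gems.use_gems():      # assumed: FlagGems context that routes torch.vstack to the Triton kernel
    out = torch.vstack((a, b))
torch.testing.assert_close(out, ref)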

@yjl0101 force-pushed the dev_vstack branch 2 times, most recently from dddc647 to 5abf8de on September 2, 2024 at 03:23
@iclementine self-assigned this on Sep 10, 2024
@yjl0101 force-pushed the dev_vstack branch 5 times, most recently from eb04e1d to 5ad2b7c on September 18, 2024 at 06:49
@iclementine (Collaborator) left a comment


LGTM

Comment on lines +116 to +119
grid = lambda META: (
triton.cdiv(max_tile_elems, META["BLOCK_SIZE"]),
scheduled_num_tensors,
)
Collaborator:

When the 4 tensors concatenated in one iteration have very different numbers of rows, this grid may launch many CTAs that do nothing. Do you have any performance tests for this case? Maybe we could sort the tensors by their number of rows.

Also, maybe this strategy only pays off when the number of tensors to vstack is large enough? Still, taking 4 tensors at a time is a good idea compared to a naive one-by-one strategy.
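
To illustrate the sorting idea, a rough sketch (hypothetical helper, not part of this PR) that reorders the inputs by row count before grouping them into batches of 4, so tensors within a batch have similar sizes and fewer CTAs sit idle:

import torch

def group_by_rows(tensors, group_size=4):
    # Hypothetical helper: sort by row count so each group of 4 holds tensors of
    # similar size, reducing idle CTAs in the per-group grid. The output row
    # offsets would still have to follow the original input order; sorting only
    # changes how work is scheduled, not where results are written.
    ordered = sorted(tensors, key=lambda t: t.shape[0])
    return [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]

# Example: row counts vary wildly; after sorting, each batch is balanced.
inputs = [torch.randn(n, 64, device="cuda") for n in (1, 5000, 8, 4096, 16, 2048)]
for group in group_by_rows(inputs):
    print([t.shape[0] for t in group])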

@iclementine merged commit f4b2495 into FlagOpen:master on Sep 20, 2024
4 checks passed
DuanYaQi pushed a commit that referenced this pull request Oct 8, 2024