

[Operator] Add vstack op [MooreThreads] #175

Merged
1 commit merged into FlagOpen:master from dev_vstack on Sep 20, 2024

Conversation

@yjl0101 (Contributor) commented on Aug 22, 2024

Add the vstack operator.
Performance of some cases on an NVIDIA A100:

benchmark/test_special_perf.py Operator vstack Performance Test (torch.float16)
Size        Torch Latency (ms)   Gems Latency (ms)
--------------------------------------------------
1024                  0.017408            0.011264
6144                  0.057344            0.032768
11264                  0.09728            0.053248
16384                 0.136192            0.070656
21504                 0.177152            0.090112
26624                 0.217088             0.10752
31744                 0.257024            0.125952
36864                 0.297984            0.144384
41984                 0.338944            0.162816
47104                 0.379904            0.181248
52224                 0.420864             0.19968
57344                 0.461824            0.218112
62464                 0.502784            0.236544
67584                 0.544768            0.254976
72704                 0.585728            0.273408
77824                 0.627712             0.29184
Operator vstack Performance Test (torch.float32)
Size        Torch Latency (ms)   Gems Latency (ms)
--------------------------------------------------
1024                  0.016384             0.01536
6144                  0.059392             0.05632
11264                 0.098304            0.093184
16384                 0.137216            0.130048
21504                 0.177152            0.166912
26624                 0.217088            0.203776
31744                 0.258048             0.24064
36864                 0.297984            0.277504
41984                 0.336896            0.314368
47104                 0.376832            0.351232
52224                 0.417792             0.38912
57344                 0.459776            0.425984
62464                 0.500736            0.462848
67584                 0.539648            0.499712
72704                 0.580608            0.535552
77824                  0.61952             0.57344
Operator vstack Performance Test (torch.bfloat16)
Size        Torch Latency (ms)   Gems Latency (ms)
--------------------------------------------------
1024                  0.017408            0.012288
6144                  0.058368            0.033792
11264                  0.09728            0.053248
16384                 0.137216             0.07168
21504                 0.177152            0.090112
26624                 0.217088            0.108544
31744                 0.257024            0.125952
36864                 0.297984            0.144384
41984                 0.338944            0.162816
47104                 0.379904            0.181248
52224                 0.420864             0.19968
57344                 0.461824            0.218112
62464                 0.503808            0.236544
67584                 0.544768            0.254976
72704                 0.585728            0.273408
77824                 0.627712             0.29184
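
For reference, a minimal usage sketch of exercising the new op through FlagGems (assuming the usual flag_gems.use_gems() patch context; the shapes and dtype here are illustrative, not from the benchmark above):

import torch
import flag_gems

a = torch.randn(1024, 64, device="cuda", dtype=torch.float16)
b = torch.randn(2048, 64, device="cuda", dtype=torch.float16)

ref = torch.vstack((a, b))      # eager ATen reference
with flag_gems.use_gems():      # assumed: FlagGems context that routes torch.vstack to the Triton kernel
    out = torch.vstack((a, b))
torch.testing.assert_close(out, ref)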

@yjl0101 force-pushed the dev_vstack branch 2 times, most recently from dddc647 to 5abf8de on September 2, 2024 at 03:23
@iclementine self-assigned this on Sep 10, 2024
@yjl0101 force-pushed the dev_vstack branch 5 times, most recently from eb04e1d to 5ad2b7c on September 18, 2024 at 06:49
@iclementine (Collaborator) left a comment


LGTM

Comment on lines +116 to +119
grid = lambda META: (
triton.cdiv(max_tile_elems, META["BLOCK_SIZE"]),
scheduled_num_tensors,
)
Collaborator:

When the 4 tensors concatenated in one iteration have very different numbers of rows, this grid may launch many CTAs that do nothing. Do you have any performance tests for this case? Maybe we could sort the tensors by their number of rows.

Also, maybe this strategy only pays off when the number of tensors to vstack is large enough? Still, taking 4 tensors at a time is a good idea compared to a naive one-by-one strategy.
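
To illustrate the sorting idea, a rough sketch (hypothetical helper, not part of this PR) that reorders the inputs by row count before grouping them into batches of 4, so tensors within a batch have similar sizes and fewer CTAs sit idle:

import torch

def group_by_rows(tensors, group_size=4):
    # Hypothetical helper: sort by row count so each group of 4 holds tensors of
    # similar size, reducing idle CTAs in the per-group grid. The output row
    # offsets would still have to follow the original input order; sorting only
    # changes how work is scheduled, not where results are written.
    ordered = sorted(tensors, key=lambda t: t.shape[0])
    return [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]

# Example: row counts vary wildly; after sorting, each batch is balanced.
inputs = [torch.randn(n, 64, device="cuda") for n in (1, 5000, 8, 4096, 16, 2048)]
for group in group_by_rows(inputs):
    print([t.shape[0] for t in group])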

@iclementine merged commit f4b2495 into FlagOpen:master on Sep 20, 2024
4 checks passed
DuanYaQi pushed a commit that referenced this pull request Oct 8, 2024