Will update the profiling results in this PR.

## BS=8, input_len=32, output_len=128

```
OPT-13B
TP 1: 3.5404738585154214 seconds
TP 2: 4.742188215255737 seconds
TP 4: 4.907034238179524 seconds

OPT-30B
TP 1: OOM
TP 2: 5.9848620891571045 seconds
TP 4: 5.943212985992432 seconds
```