[V1] [Performance Benchmark] Benchmark the performance of Speculative Decoding

1. Let's start with ngram: can you collect both latency and throughput numbers on the ShareGPT dataset, on an H100 and one low-end GPU?
2. If the numbers from step 1 are not as expected, could you run some profiling to identify the performance bottleneck?
3. Get more performance numbers on other datasets.
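
To keep the latency and throughput numbers comparable across GPUs and datasets, it may help to agree on how they are derived from raw per-request timings. Below is a minimal sketch, not taken from any existing benchmark script: `RequestRecord` and `summarize` are hypothetical names, and it assumes each request is logged with a start time, an end time, and its output token count.

```python
import statistics
from dataclasses import dataclass


@dataclass
class RequestRecord:
    # Hypothetical per-request log entry; field names are assumptions.
    start_s: float       # wall-clock time the request was sent
    end_s: float         # wall-clock time the last token was received
    output_tokens: int   # number of generated tokens


def summarize(records):
    """Compute end-to-end latency and throughput from per-request records."""
    if not records:
        raise ValueError("no records to summarize")

    latencies = [r.end_s - r.start_s for r in records]
    # Throughput is measured over the whole run, first send to last receive.
    wall_s = max(r.end_s for r in records) - min(r.start_s for r in records)
    if wall_s <= 0:
        raise ValueError("run duration must be positive")

    total_tokens = sum(r.output_tokens for r in records)
    return {
        "mean_latency_s": statistics.fmean(latencies),
        "median_latency_s": statistics.median(latencies),
        "max_latency_s": max(latencies),
        "requests_per_s": len(records) / wall_s,
        "output_tokens_per_s": total_tokens / wall_s,
    }


if __name__ == "__main__":
    demo = [RequestRecord(0.0, 2.0, 100), RequestRecord(1.0, 4.0, 300)]
    for name, value in summarize(demo).items():
        print(f"{name}: {value:.3f}")
```

Running the same summary with speculative decoding on and off, at the same request rate, gives a like-for-like comparison. Latency here is end-to-end per request; time-to-first-token or inter-token latency would need extra timestamps that this sketch does not record.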