Consider optimizing the API server #580

Closed
@imoneoi

Description

Consider optimizing the FastAPI/OpenAI API server in vLLM, as the server is widely used and appears to have significant overhead. On a single A100 with Llama 13B, the LLM class reaches 90~100% GPU utilization, while the API server can only reach about 50%.

Related: #459
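For reference, a minimal sketch of how the two paths could be benchmarked side by side, assuming vLLM is installed and the OpenAI-compatible server has been launched separately (the model id, prompt set, concurrency level, and port below are illustrative, not taken from this issue):

```python
# Sketch: compare offline LLM-class throughput against the API server.
# Wall time is a rough proxy; GPU utilization can be watched with `nvidia-smi`.
# In practice the two paths should run in separate sessions so the LLM class
# and the server are not holding GPU memory at the same time.
import time
from concurrent.futures import ThreadPoolExecutor

import requests
from vllm import LLM, SamplingParams

MODEL = "huggyllama/llama-13b"  # illustrative checkpoint id
PROMPTS = ["Hello, my name is"] * 256
PARAMS = SamplingParams(temperature=0.8, max_tokens=128)

# Path 1: direct offline inference through the LLM class.
llm = LLM(model=MODEL)
start = time.time()
llm.generate(PROMPTS, PARAMS)
print(f"LLM class:  {time.time() - start:.1f}s")

# Path 2: the same workload through the OpenAI-compatible server,
# assumed to be running already, e.g. via
#   python -m vllm.entrypoints.openai.api_server --model huggyllama/llama-13b
def send(prompt: str) -> None:
    requests.post(
        "http://localhost:8000/v1/completions",
        json={"model": MODEL, "prompt": prompt,
              "temperature": 0.8, "max_tokens": 128},
        timeout=600,
    )

start = time.time()
with ThreadPoolExecutor(max_workers=64) as pool:
    list(pool.map(send, PROMPTS))
print(f"API server: {time.time() - start:.1f}s")
```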

Metadata

Labels: performance (Performance-related issues)