Adding OpenAI Compatible RESTful API #317
Conversation
Hi! First of all, thank you for such a feature; it will be extremely useful for deploying models.
|
Looks like it would be better to implement the transformers chat template instead of FastChat's chat templates. Thanks for the suggestion; I forgot about the transformers chat templates 😅 I will update it to use them instead |
@PawanOsman I had to do a bit more work on this since it was mentioned, and there is a newer PR that was merged yesterday into vLLM adding HF chat template support: vllm-project/vllm#1756. Just mentioning it because there were some decisions that had to be made to support it, and it would be best if inference servers remained as compatible as they can. |
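For context, a chat template is what turns the list of `{"role", "content"}` messages into the single prompt string the model was trained on. Below is a rough, pure-Python illustration; the Mistral-style `[INST]` format here is an assumption for illustration only, and in practice `tokenizer.apply_chat_template()` in transformers renders the model's real template:

```python
# Illustration only: a hand-written, Mistral-style chat template.
# The real format comes from the model's tokenizer config; use
# transformers' tokenizer.apply_chat_template() in production code.
def render_mistral_style(messages):
    prompt = "<s>"
    for msg in messages:
        if msg["role"] == "user":
            prompt += f"[INST] {msg['content']} [/INST]"
        elif msg["role"] == "assistant":
            prompt += f"{msg['content']}</s>"
    return prompt

messages = [{"role": "user", "content": "Say this is a test!"}]
print(render_mistral_style(messages))
```

This is why reusing the template shipped with each model is preferable to maintaining a separate template registry: every model family formats roles and special tokens differently.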
Hi @PawanOsman, thank you for this amazing contribution! Can you let me know when it is ready for review, and I will work with you to get it merged? (It is currently still marked as a "draft") |
Thanks for mentioning that 🙏 |
Been really tied up lately, but I'm pushing to get this sorted out as quickly as possible. Will update you soon on the progress |
Fix: previous values preventing the model from generating text
Fix: API keys not being passed to the app_settings
Fix: Counting prompt tokens
Does it support tensor parallelism? I am trying to load a 70B Llama model and the server crashes because it runs out of memory. |
Thanks, I just added it. You can set the tensor parallel size like below:

```shell
python -m mii.entrypoints.openai_api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0 \
    --tensor-parallel 2
```
|
Hi @mrwyattii, this PR is ready for review |
Can you please give an example of inference using the API server URL? I am trying to deploy in Kubernetes and want to do inference from a different application. |
This is an OpenAI-compatible API server, so you can run it with:

```shell
python -m mii.entrypoints.openai_api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0
```

then use it with the OpenAI client libraries or directly via HTTP requests. Example:

```shell
curl http://ip:port/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.1",
        "messages": [{"role": "user", "content": "Say this is a test!"}],
        "temperature": 0.7
    }'
```

or using the Python library:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://ip:port/v1",
    api_key="",
)
completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[
        {
            "role": "user",
            "content": "How do I output all files in a directory using Python?",
        },
    ],
)
print(completion.choices[0].message.content)
```

It supports both text and chat completion requests. |
@PawanOsman I'll review this today/tomorrow and share any feedback. Thank you for the contribution! |
Can you please provide one example of using the text generation API server for inference? |
You can run the Text Generation RESTful API server using this command:

```shell
python -m mii.entrypoints.api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0
```

then you can use it by sending an HTTP request. Client usage example:

```shell
curl http://ip:port/generate \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "Deepspeed is ",
        "max_tokens": 256,
        "temperature": 0.7
    }'
```
|
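For programmatic use, here is a minimal sketch of the same `/generate` call built with only the Python standard library. The host and port are placeholders, and the actual send is left commented out since it requires a running server:

```python
import json
from urllib import request

# Build the same /generate request as the curl example above.
# "localhost:3000" is a placeholder for wherever the server runs.
payload = {
    "prompt": "Deepspeed is ",
    "max_tokens": 256,
    "temperature": 0.7,
}
req = request.Request(
    "http://localhost:3000/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is up:
# with request.urlopen(req) as resp:
#     print(json.load(resp))
```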
Thank you for all this work @PawanOsman! I left a few comments/suggestions.
It would be great if we could add unit tests and documentation. Documentation can go on the repo landing page or in a mii/entrypoints/README.md
that we link from the landing page. I can take care of that if you do not have the time (you have already done a lot!)
I think this is a great addition to DeepSpeed-MII and I would like to replace the existing RESTful API with the implementation you provide here and better integrate it with the rest of the MII code. However, I don't want to delay merging this. I can work on some refactoring and replacing the other RESTful API in a future PR.
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
Thanks for the review and your feedback! I'm currently short on time and don't want to delay things. If you could take on the unit tests and documentation, that would be great. |
@PawanOsman Can you please run formatting on your branch and then I can merge this? Sorry for the delay here!
|
Getting the below error while running this.
Also, while running with the option --load-balance "0.0.0.0:50050", the API starts, but I get an Internal Server Error when running a curl command.
Please help |
|
I think |
Hey everyone,
I just pushed a draft PR where I've added an OpenAI-compatible RESTful API to DeepSpeed-MII. This update is about making our tool more flexible and user-friendly, especially for those looking to integrate with OpenAI's ecosystem.
Also, I added another API for normal text generation. This new API is super user-friendly and supports streaming responses.
I'm still working on it, so it's not final yet. It may contain bugs and errors.
Any thoughts or feedback are welcome!
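Since the text generation API supports streaming, here is a hedged sketch of how a client might consume a streamed response, assuming the server follows OpenAI's server-sent-events convention of `data: {json}` chunks terminated by `data: [DONE]` (the sample lines below are illustrative, not captured output):

```python
import json

# Illustrative SSE lines in the OpenAI chunk format (an assumption
# about this server's wire format, not verified against it).
sample_stream = [
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    "data: [DONE]",
]

text = ""
for line in sample_stream:
    chunk = line[len("data: "):]
    if chunk == "[DONE]":          # sentinel marking end of stream
        break
    delta = json.loads(chunk)["choices"][0]["delta"]
    text += delta.get("content", "")  # deltas may omit "content"

print(text)  # Hello world
```

In a real client the lines would come from iterating over the HTTP response body rather than a hard-coded list.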
Fixes: #316
This pull request introduces two RESTful API servers:
OpenAI Compatible RESTful API
This server provides an OpenAI-compatible API for text and chat completions.
Running the Server
```shell
python -m mii.entrypoints.openai_api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0
```
Key Features and Arguments
- `--chat-template`: Sets the chat template (can be a file path or the file content)
- `--response-role`: Defines the role of the responder (e.g., "assistant"), only for requests where `add_generation_prompt` is true
- `--api-keys`: Enables API key authentication for security; can be a comma-separated list of keys
- `--ssl`: Enables SSL for secure communication

Text Generation RESTful API
A simpler API focused on text generation, both in streaming and non-streaming formats. Suitable for applications that require straightforward text generation capabilities.
Running the Server
```shell
python -m mii.entrypoints.api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0
```
Common Features for Both APIs
- `--load-balancer`: When you run the MII instance separately, you can set the load balancer host and port (e.g., "localhost:50050")

Separately Running MII Instance
You can start the MII instance separately and then connect the servers to the MII instance load balancer.
Running MII Instance
Connecting to the Load Balancer
For the OpenAI Compatible Server:
For the Text Generation Server:
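A sketch of how these steps might look; the use of `mii.serve` to start the persistent deployment and the default load-balancer port 50050 are assumptions, and the `--load-balancer` flag is the one described above:

```shell
# Start the MII instance separately (assumption: mii.serve starts the
# persistent deployment whose load balancer listens on port 50050).
python -c 'import mii; mii.serve("mistralai/Mistral-7B-Instruct-v0.1")'

# For the OpenAI Compatible Server, connect it to the load balancer:
python -m mii.entrypoints.openai_api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 --host 0.0.0.0 \
    --load-balancer "localhost:50050"

# For the Text Generation Server:
python -m mii.entrypoints.api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3001 --host 0.0.0.0 \
    --load-balancer "localhost:50050"
```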