Document vLLM integration as a first class citizen in Triton #88
Conversation
Force-pushed from 1abfded to ec8faf4 (Compare)
examples/README.md (Outdated)
@@ -54,6 +54,14 @@ Management](https://github.com/triton-inference-server/server/blob/main/docs/use
which allows backends to behave in a stateless manner and leave the
state management to Triton.

Triton also provides an option to create python based backends. These
backends should implement a model architecture agnostic
Suggested change: "backends should implement a model architecture agnostic" → "backends should implement the TritonPythonModel interface as ..."
corrected
Co-authored-by: Neelay Shah <neelays@nvidia.com>
examples/README.md (Outdated)
you may find it helpful to enhance your implementation by adding `initialize`,
`finalize`, and any other helper functions. For examples, please refer to
the [vLLM backend](https://github.com/triton-inference-server/vllm_backend),
which provides a common python script to serve models supported by vLLM.
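For readers unfamiliar with the interface referenced above, here is a minimal sketch of what a `model.py` for a python based backend could look like. This is an illustrative outline, not the actual vLLM backend script: the tensor names (`INPUT0`/`OUTPUT0`) and the pass-through logic are placeholders.

```python
import json

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Skeleton model.py that a python based backend could re-use for its models."""

    def initialize(self, args):
        # args["model_config"] holds the model's config.pbtxt as a JSON string.
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        # One response must be returned per request; this sketch simply echoes
        # the (hypothetical) INPUT0 tensor back as OUTPUT0.
        responses = []
        for request in requests:
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            out0 = pb_utils.Tensor("OUTPUT0", in0.as_numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[out0]))
        return responses

    def finalize(self):
        # Optional cleanup hook, called when the model is unloaded.
        pass
```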
Is this the source of truth for docs on "Python based backends"?
If so, I think we should describe how they work in more detail wherever the source of truth is. Below are some rough ideas I think we should highlight if we don't already document them somewhere:

- To implement a python-based backend, Triton expects to find a folder in the backends directory (ex: `/opt/tritonserver/backends`) with the corresponding backend name, containing a `model.py` that follows the [PythonBackend model format](link_to_pb_docs). For example, for a python-based backend called `my_python_based_backend` using the default backend directory, Triton would expect to find the full path `/opt/tritonserver/backends/my_python_based_backend/model.py`, and any model of this backend would define the following `config.pbtxt`:

  ```
  # config.pbtxt
  backend: "my_python_based_backend"
  # any necessary I/O or other config settings
  ...
  ```

- Python based backends simply load the python backend under the hood and re-use the `model.py` definition for any models that specify this backend.
- You can use the python model's `auto_complete_config` (link here) function to define the I/O upfront, so any `config.pbtxt` using this python-based backend isn't required to specify these settings.
- ...
Example tritonserver output loading a model from a python-based backend:
I1013 21:52:45.756456 18668 server.cc:619]
+-------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability" |
| | | :"6.000000","default-max-batch-size":"4"}} |
| my_python_based_backend | /opt/tritonserver/backends/my_python_based_backend/model.py | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability" |
| | | :"6.000000","default-max-batch-size":"4"}} |
+-------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+
For an example of this, see the <vllm backend docs>.
I would put this level of detail in the main backend README, perhaps as a new section, instead of in the examples. At the beginning we can list C/C++/Python (with Python hyperlinked to the new section).
The new section "Python based backends" would cover all the details.
Yes, I'm working roughly in this direction for the next revision. I put a new md file under docs/ and describe it there, to keep the main README concise.
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
to enhance your implementation by adding `initialize`, `finalize`,
and any other helper functions. Users are also encouraged to make use of the
[`auto_complete_config`](https://github.com/triton-inference-server/python_backend#auto_complete_config)
function to define standardized input and output properties upfront.
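To make that recommendation concrete, here is a rough sketch of an `auto_complete_config` implementation. The input/output names, types, and shapes below are hypothetical placeholders, not the actual I/O of the vLLM backend:

```python
class TritonPythonModel:
    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        # Declare the I/O once in the backend so that models using this
        # python based backend don't have to repeat it in config.pbtxt.
        # Names, dtypes, and dims here are placeholders for illustration.
        auto_complete_model_config.add_input(
            {"name": "text_input", "data_type": "TYPE_STRING", "dims": [1]}
        )
        auto_complete_model_config.add_output(
            {"name": "text_output", "data_type": "TYPE_STRING", "dims": [-1]}
        )
        auto_complete_model_config.set_max_batch_size(0)
        return auto_complete_model_config
```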
Would be good to mention which Python interpreter is going to be used and add a link to the Custom Execution environments doc in Python backend so that the user knows how to customize that.
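For reference, the Python backend's custom execution environment mechanism is the usual way to swap the interpreter environment; a model (or python based backend) would typically point at a packed conda environment from `config.pbtxt` roughly like this (the path below is a placeholder):

```
# config.pbtxt
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "/path/to/custom_python_env.tar.gz"}
}
```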
Mentioned earlier on L52
`/opt/tritonserver/backends/my_python_based_backend/` or
`/opt/tritonserver/backends/python/`.
* Specify `my_python_based_backend` as the backend in `config.pbtxt`
  for any model that should use this backend.
Would be good to mention how priority works when a `triton_python_backend_stub` is present in the Python-based backend's folder (i.e., it would take priority over the stub present in the Python backend folder). Additionally, please mention the use case where the user might want to do this (i.e., use a different Python version than what is shipped by default in the container).
Thanks for pointing this out! I'll certainly add this
Added a Customization section later in the file.
Can you also open a PR against the vLLM backend README that links Python-based backends to this doc?
---------

Co-authored-by: Neelay Shah <neelays@nvidia.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Adding entries on the vLLM python based backend to our docs.
I also thought about adding an example to the examples/backend folder, but based on the pytorch python based backend, I think our former pytorch platform handler is more generic and makes more sense to use as an example.
Open to suggestions.