Conversation
HTTP/2 clients (e.g. Java HttpClient with HTTP_2 version) often omit the Content-Length header since HTTP/2 uses DATA frames for body framing. When DMR's reverse proxy forwards such requests to the backend via HTTP/1.1, it uses Transfer-Encoding: chunked (ContentLength == -1), which vLLM's Python/uvicorn server fails to parse — resulting in an empty body and a 422 Unprocessable Entity response. Fix by explicitly setting ContentLength = len(body) on the upstream request after replacing the body with the already-buffered bytes. This ensures a Content-Length header is always sent, consistent with how the Ollama and Anthropic handlers already handle this. llama.cpp was unaffected because its C/C++ HTTP server handles chunked encoding gracefully. Signed-off-by: Eric Curtin <eric.curtin@docker.com>
Code Review
This pull request addresses an issue where HTTP/2 requests forwarded to backends like vLLM would fail due to the absence of a Content-Length header. The change explicitly sets the ContentLength on the forwarded request after buffering the body. This is a correct and direct fix for the problem, ensuring better compatibility with HTTP/1.1 backends that do not handle chunked transfer encoding well.
@ericcurtin - Tested on macOS (Apple Silicon) with the patched binary (v1.1.8-2-g771b9b0a) running standalone on port 13434. The standalone model-runner binary doesn't appear to support h2c, so I wasn't able to exercise the HTTP/2 code path directly: curl --http2 sends an HTTP/1.1 request with an Upgrade: h2c header, the server doesn't upgrade, and the request still reaches vLLM as HTTP/1.1 (422). Is there a way to swap the patched binary into Docker Desktop's model-runner? Is there a plan to support streaming as well?
Streaming in what way? (Sometimes that word is used in different ways in this space.)