
[Performance]: Transformers 4.45.1 slows down outlines guided decoding #9032

Open

joerunde opened this issue Oct 2, 2024 · 2 comments
Labels
performance Performance-related issues

Comments


joerunde commented Oct 2, 2024

Report of performance regression

I noticed that guided decoding was a bit slower on newer builds of vLLM, but I couldn't track down a commit that caused a performance regression. Instead, it looks like upgrading transformers from 4.44.2 to 4.45.1 causes the issue.

I ran a small artillery test with requests using guided decoding, using the code from commit 4f1ba0844. This is the last commit before mllama support was added, so it's the last point where vLLM works with both transformers 4.44.2 and 4.45.1. vLLM was run on one A100 GPU, with the model mistralai/Mistral-7B-Instruct-v0.2.

The results with 4.44.2 installed:

http.codes.200: ................................................................ 240
http.downloaded_bytes: ......................................................... 91928
http.request_rate: ............................................................. 3/sec
http.requests: ................................................................. 240
http.response_time:
  min: ......................................................................... 105
  max: ......................................................................... 16348
  mean: ........................................................................ 6655.3
  median: ...................................................................... 3905.8
  p95: ......................................................................... 15526
  p99: ......................................................................... 16159.7
http.responses: ................................................................ 240
vusers.completed: .............................................................. 60
vusers.created: ................................................................ 60
vusers.created_by_name.Test completions: ....................................... 60
vusers.failed: ................................................................. 0
vusers.session_length:
  min: ......................................................................... 15318.1
  max: ......................................................................... 38021.7
  mean: ........................................................................ 26628.2
  median: ...................................................................... 27730.6
  p95: ......................................................................... 33199.7
  p99: ......................................................................... 35964.9

and with 4.45.1 installed:

http.codes.200: ................................................................ 240
http.downloaded_bytes: ......................................................... 92209
http.request_rate: ............................................................. 3/sec
http.requests: ................................................................. 240
http.response_time:
  min: ......................................................................... 100
  max: ......................................................................... 27083
  mean: ........................................................................ 10279.2
  median: ...................................................................... 5065.6
  p95: ......................................................................... 26115.6
  p99: ......................................................................... 27181.5
http.responses: ................................................................ 240
vusers.completed: .............................................................. 60
vusers.created: ................................................................ 60
vusers.created_by_name.Test completions: ....................................... 60
vusers.failed: ................................................................. 0
vusers.session_length:
  min: ......................................................................... 19387.6
  max: ......................................................................... 55055.6
  mean: ........................................................................ 41123.7
  median: ...................................................................... 43928
  p95: ......................................................................... 51550.2
  p99: ......................................................................... 53654.1

The slowdown looks pretty significant to me 🐌🐌🐌
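Putting rough numbers on it (my own arithmetic on the artillery figures above, taking the timings as milliseconds):

```python
# Ratios of the 4.45.1 run over the 4.44.2 run, from the artillery output above.
mean_response = 10279.2 / 6655.3   # mean http.response_time, ~1.5x slower
p95_response = 26115.6 / 15526     # p95 http.response_time, ~1.7x slower
mean_session = 41123.7 / 26628.2   # mean vusers.session_length, ~1.5x slower

for name, ratio in [("mean response time", mean_response),
                    ("p95 response time", p95_response),
                    ("mean session length", mean_session)]:
    print(f"{name}: {ratio:.2f}x")
```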

I wasn't able to get the vLLM profiling to work to dig in further; unfortunately, it kept crashing with encoding errors whenever I ran any requests with guided decoding. So I don't know whether this is a problem with vLLM, with outlines, or with transformers. But given that outlines hasn't been updated in quite a while and sglang went and forked it, I'm not sure if this is worth investigating as-is or if it'll be overcome by events.

Anybody have ideas about what could be going wrong?

The scripts I ran:

artillery.yaml

config:
  timeout: 100
  target: http://rundemc-dev-service:8000
  phases:
    - duration: 180
      arrivalRate: 1
      name: Load test

  payload:
    # path is relative to the location of the test script
    path: 'payloads.csv'
    fields:
      - prompt
    name: unused

  variables:
    model_id:
      - "mistralai/Mistral-7B-Instruct-v0.2"
    backend:
      - "lm-format-enforcer"


scenarios:
  - name: Test completions
    flow:
      - post:
          url: "/v1/completions"
          json:
            model: "{{ model_id }}"
            prompt: "{{ prompt }}"
            max_tokens: 40
      - post:
          url: "/v1/completions"
          json:
            model: "{{ model_id }}"
            prompt: "{{ prompt }}"
            max_tokens: 40
            guided_decoding_backend: "{{ backend }}"
            guided_choice:
              - "foo"
              - "bar"
              - "baz"
              - "buzz"
      - post:
          url: "/v1/completions"
          json:
            model: "{{ model_id }}"
            prompt: "{{ prompt }}"
            max_tokens: 40
            guided_decoding_backend: "{{ backend }}"
            response_format:
              type: "json_object"
      - post:
          url: "/v1/completions"
          json:
            model: "{{ model_id }}"
            prompt: "{{ prompt }}"
            max_tokens: 40
            guided_decoding_backend: "{{ backend }}"
            guided_json:
              type: "object"
              properties:
                name:
                  type: string
                age:
                  type: integer

payloads.csv

"hello world this is jesus"
"Lorem ipsum dolor"
"Write a function that sums two numbers together"

(obviously very scientific 😉 )
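For anyone without artillery handy, the guided_choice step from the flow above can be reproduced with a plain HTTP POST. A minimal sketch, assuming a vLLM OpenAI-compatible server at a placeholder base URL (the backend value mirrors the config above):

```python
import json
import urllib.request


def build_payload(prompt: str, backend: str = "lm-format-enforcer") -> dict:
    """Mirror the guided_choice request from the artillery flow above."""
    return {
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "prompt": prompt,
        "max_tokens": 40,
        "guided_decoding_backend": backend,
        "guided_choice": ["foo", "bar", "baz", "buzz"],
    }


def send(base_url: str, payload: dict) -> dict:
    """POST the payload to the /v1/completions endpoint and return the JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=100) as resp:
        return json.loads(resp.read())


# Example usage (requires a running server; base URL is a placeholder):
# print(send("http://localhost:8000", build_payload("Lorem ipsum dolor")))
```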

@joerunde joerunde added the performance Performance-related issues label Oct 2, 2024
DarkLight1337 (Member) commented:

cc @RonanKMcGovern since you've been running into this issue as well.

@joerunde joerunde changed the title [Performance]: Transformer 4.45.1 slows down outlines guided decoding [Performance]: Transformers 4.45.1 slows down outlines guided decoding Oct 3, 2024

mgoin commented Oct 8, 2024

Apparently the latest release of outlines includes a lot of performance enhancements: https://github.com/dottxt-ai/outlines/releases/tag/0.1.0
