
[Performance]: Transformers 4.45.1 slows down outlines guided decoding #9032

Open

joerunde opened this issue Oct 2, 2024 · 2 comments
Labels
performance Performance-related issues

Comments


joerunde commented Oct 2, 2024

Report of performance regression

I noticed that guided decoding was a bit slower on newer builds of vLLM, but I couldn't track down a commit that caused a performance regression. Instead, it looks like upgrading transformers from 4.44.2 to 4.45.1 causes the issue.

I ran a small artillery test with requests using guided decoding, using the code from commit 4f1ba0844. This is the last commit before mllama support was added, so it's the last point where vLLM works with both transformers 4.44.2 and 4.45.1. vLLM was run on one A100 GPU, with the model mistralai/Mistral-7B-Instruct-v0.2.

The results with 4.44.2 installed:

http.codes.200: ................................................................ 240
http.downloaded_bytes: ......................................................... 91928
http.request_rate: ............................................................. 3/sec
http.requests: ................................................................. 240
http.response_time:
  min: ......................................................................... 105
  max: ......................................................................... 16348
  mean: ........................................................................ 6655.3
  median: ...................................................................... 3905.8
  p95: ......................................................................... 15526
  p99: ......................................................................... 16159.7
http.responses: ................................................................ 240
vusers.completed: .............................................................. 60
vusers.created: ................................................................ 60
vusers.created_by_name.Test completions: ....................................... 60
vusers.failed: ................................................................. 0
vusers.session_length:
  min: ......................................................................... 15318.1
  max: ......................................................................... 38021.7
  mean: ........................................................................ 26628.2
  median: ...................................................................... 27730.6
  p95: ......................................................................... 33199.7
  p99: ......................................................................... 35964.9

and with 4.45.1 installed:

http.codes.200: ................................................................ 240
http.downloaded_bytes: ......................................................... 92209
http.request_rate: ............................................................. 3/sec
http.requests: ................................................................. 240
http.response_time:
  min: ......................................................................... 100
  max: ......................................................................... 27083
  mean: ........................................................................ 10279.2
  median: ...................................................................... 5065.6
  p95: ......................................................................... 26115.6
  p99: ......................................................................... 27181.5
http.responses: ................................................................ 240
vusers.completed: .............................................................. 60
vusers.created: ................................................................ 60
vusers.created_by_name.Test completions: ....................................... 60
vusers.failed: ................................................................. 0
vusers.session_length:
  min: ......................................................................... 19387.6
  max: ......................................................................... 55055.6
  mean: ........................................................................ 41123.7
  median: ...................................................................... 43928
  p95: ......................................................................... 51550.2
  p99: ......................................................................... 53654.1

The slowdown looks pretty significant to me 🐌🐌🐌
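Putting rough numbers on it (my own arithmetic on the artillery figures above, taking the timings as milliseconds):

```python
# Ratios of the 4.45.1 run over the 4.44.2 run, from the artillery output above.
mean_response = 10279.2 / 6655.3   # mean http.response_time, ~1.5x slower
p95_response = 26115.6 / 15526     # p95 http.response_time, ~1.7x slower
mean_session = 41123.7 / 26628.2   # mean vusers.session_length, ~1.5x slower

for name, ratio in [("mean response time", mean_response),
                    ("p95 response time", p95_response),
                    ("mean session length", mean_session)]:
    print(f"{name}: {ratio:.2f}x")
```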

I wasn't able to get the vLLM profiling to work to dig in further; unfortunately, it kept crashing with encoding errors whenever I ran any requests with guided decoding. So I don't know whether this is a problem with vLLM, with outlines, or with transformers. But given that outlines hasn't been updated in quite a while and sglang went and forked it, I'm not sure if this is worth investigating as-is or if it'll be overcome by events.

Anybody have ideas about what could be going wrong?

The scripts I ran:

artillery.yaml

config:
  timeout: 100
  target: http://rundemc-dev-service:8000
  phases:
    - duration: 180
      arrivalRate: 1
      name: Load test

  payload:
    # path is relative to the location of the test script
    path: 'payloads.csv'
    fields:
      - prompt
    name: unused

  variables:
    model_id:
      - "mistralai/Mistral-7B-Instruct-v0.2"
    backend:
      - "lm-format-enforcer"


scenarios:
  - name: Test completions
    flow:
      - post:
          url: "/v1/completions"
          json:
            model: "{{ model_id }}"
            prompt: "{{ prompt }}"
            max_tokens: 40
      - post:
          url: "/v1/completions"
          json:
            model: "{{ model_id }}"
            prompt: "{{ prompt }}"
            max_tokens: 40
            guided_decoding_backend: "{{ backend }}"
            guided_choice:
              - "foo"
              - "bar"
              - "baz"
              - "buzz"
      - post:
          url: "/v1/completions"
          json:
            model: "{{ model_id }}"
            prompt: "{{ prompt }}"
            max_tokens: 40
            guided_decoding_backend: "{{ backend }}"
            response_format:
              type: "json_object"
      - post:
          url: "/v1/completions"
          json:
            model: "{{ model_id }}"
            prompt: "{{ prompt }}"
            max_tokens: 40
            guided_decoding_backend: "{{ backend }}"
            guided_json:
              type: "object"
              properties:
                name:
                  type: string
                age:
                  type: integer

payloads.csv

"hello world this is jesus"
"Lorem ipsum dolor"
"Write a function that sums two numbers together"

(obviously very scientific 😉 )
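For anyone without artillery handy, the guided_choice step from the flow above can be reproduced with a plain HTTP POST. A minimal sketch, assuming a vLLM OpenAI-compatible server at a placeholder base URL (the backend value mirrors the config above):

```python
import json
import urllib.request


def build_payload(prompt: str, backend: str = "lm-format-enforcer") -> dict:
    """Mirror the guided_choice request from the artillery flow above."""
    return {
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "prompt": prompt,
        "max_tokens": 40,
        "guided_decoding_backend": backend,
        "guided_choice": ["foo", "bar", "baz", "buzz"],
    }


def send(base_url: str, payload: dict) -> dict:
    """POST the payload to the /v1/completions endpoint and return the JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=100) as resp:
        return json.loads(resp.read())


# Example usage (requires a running server; base URL is a placeholder):
# print(send("http://localhost:8000", build_payload("Lorem ipsum dolor")))
```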

@joerunde joerunde added the performance Performance-related issues label Oct 2, 2024
DarkLight1337 (Member) commented:

cc @RonanKMcGovern since you've been running into this issue as well.

@joerunde joerunde changed the title [Performance]: Transformer 4.45.1 slows down outlines guided decoding [Performance]: Transformers 4.45.1 slows down outlines guided decoding Oct 3, 2024

mgoin commented Oct 8, 2024

Apparently the latest release of outlines includes a lot of performance enhancements: https://github.com/dottxt-ai/outlines/releases/tag/0.1.0
