
Continue.dev autocomplete ignores maxTokens. Tested with vLLM serving a Qwen2.5-Coder-7B W8A8 dynamic quant. #6003

@dzigald


Before submitting your bug report

Relevant environment info

- OS: Windows 10
- Continue version: 1.0.11
- IDE version: VSCode 1.99.1
- Model: Qwen2.5-Coder-7B-W8A8-dynamic quant
- config:
  
{
  "models": [
    {
      "title": "AML model",
      "apiBase": "https://api.com/v1",
      "apiKey": "",
      "model": "dzigald/qwen2.5-coder-7B-W8A8-DYNAMIC",
      "provider": "openai",
      "contextLength": 2048,
      "completionOptions": {
        "maxTokens": 128
      },
      "systemMessage": "You are a helpful assistant."
    }
  ],
  "tabAutocompleteModel": {
    "title": "Tab Autocomplete Model",
    "apiBase": "https://api.com/v1",
    "apiKey": "",
    "model": "dzigald/qwen2.5-coder-7B-W8A8-DYNAMIC",
    "provider": "openai",
    "contextLength": 2048,
    "completionOptions": {
      "maxTokens": 128
    },
    "template": "<|fim_prefix|>{{{ prefix }}}<|fim_suffix|>{{{ suffix }}}<|fim_middle|>",
    "useRecentlyEdited": false
  },
  "customCommands": [
    {
      "name": "test",
      "prompt": "{{{ input }}}\n\nWrite a comprehensive set of unit tests for the selected code. It should setup, run tests that check for correctness including important edge cases, and teardown. Ensure that the tests are complete and sophisticated. Give the tests just as chat output, don't edit any file.",
      "description": "Write unit tests for highlighted code"
    }
  ],
  "contextProviders": [
    {
      "name": "code",
      "params": {}
    },
    {
      "name": "docs",
      "params": {}
    },
    {
      "name": "diff",
      "params": {}
    },
    {
      "name": "terminal",
      "params": {}
    },
    {
      "name": "problems",
      "params": {}
    },
    {
      "name": "folder",
      "params": {}
    },
    {
      "name": "codebase",
      "params": {}
    }
  ],
  "slashCommands": [
    {
      "name": "edit",
      "description": "Edit selected code"
    },
    {
      "name": "comment",
      "description": "Write comments for the selected code"
    },
    {
      "name": "share",
      "description": "Export the current chat session to markdown"
    },
    {
      "name": "cmd",
      "description": "Generate a shell command"
    },
    {
      "name": "commit",
      "description": "Generate a git commit message"
    }
  ],
  // reference: https://docs.continue.dev/json-reference#tabautocompleteoptions
  // schema: https://github.com/continuedev/continue/blob/main/extensions/vscode/config_schema.json
  "tabAutocompleteOptions": {
    "useCopyBuffer": true,
    "maxPromptTokens": 2048,
    "prefixPercentage": 0.5,
    "multilineCompletions": "auto",
    "experimental_includeClipboard": true,
    "experimental_includeRecentlyVisitedRanges": true,
    "experimental_includeRecentlyEditedRanges": true,
    "experimental_includeDiff": true
  }
}
  
  OR link to assistant in Continue hub:

Description

I've deployed the Qwen2.5-Coder-7B model with a W8A8 dynamic quant, pulled from my Hugging Face account onto a T4 deployment on Azure Machine Learning. The model is served via vLLM.

The bug is that the model returns more tokens than the maxTokens limit I've configured (128). maxTokens is set to 128 in completionOptions of both tabAutocompleteModel and the chat model, and contextLength is set to 2048 everywhere applicable.
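
A quick way to tell whether the cap is being dropped by Continue or by the server is to send the same kind of request straight to the vLLM OpenAI-compatible endpoint, with Continue out of the loop. A minimal sketch, assuming the placeholder apiBase and the model id from my config above (the prompt string is just an example, and the empty bearer token mirrors the blank apiKey):

import requests

API_BASE = "https://api.com/v1"  # placeholder apiBase from the config above
MODEL = "dzigald/qwen2.5-coder-7B-W8A8-DYNAMIC"

# Raw /v1/completions request with the same 128-token cap Continue should be sending.
resp = requests.post(
    f"{API_BASE}/completions",
    headers={"Authorization": "Bearer "},  # key left blank, as in my config
    json={
        "model": MODEL,
        "prompt": "def fibonacci(n):",
        "max_tokens": 128,
        "temperature": 0.0,
    },
    timeout=60,
)
data = resp.json()
print(data["usage"])                        # vLLM reports prompt/completion token counts here
print(data["choices"][0]["finish_reason"])  # "length" means the 128-token cap was hit

If the raw request stays at or under 128 generated tokens but Continue's autocomplete requests don't, the cap is being lost somewhere on the Continue side.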

A friend of mine reproduces this with the same config in .yaml format, so it's not a .json vs .yaml bug.

Apart from autocomplete dropping the <|fim_prefix|> from the outgoing prompt whenever ANY experimental_include* option is set to true (see #3372), and the same context appearing twice (see #5336), this additional bug is really straining the usability of code completion.

I know the server itself will refuse a request if prompt plus completion exceeds 2048 tokens, because the vLLM config for serving the model is as follows:
VLLM_ARGS: "--enable-chunked-prefill --generation-config vllm --quantization compressed-tensors --load-format auto --max-num-batched-tokens 2048 --max-num-seqs 16 --max-model-len 2048 --dtype=half --gpu-memory-utilization 0.98"
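
(If anyone wants to double-check what limit the deployed server actually enforces, independent of my config: newer vLLM builds report it through the OpenAI-compatible /v1/models endpoint. A minimal sketch, assuming the same placeholder apiBase; the max_model_len field may be missing on older vLLM versions.)

import requests

API_BASE = "https://api.com/v1"  # placeholder apiBase from the config above

# Ask the vLLM OpenAI-compatible server which models it serves and what
# context limit it enforces for each (expect 2048 with my VLLM_ARGS).
models = requests.get(f"{API_BASE}/models", timeout=30).json()
for m in models.get("data", []):
    print(m["id"], m.get("max_model_len"))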

BUT the Continue console shows the following dump:
Type: Complete
Result: Success
Prompt Tokens: 1880
Generated Tokens: 201
ThinkingTokens: 0
Total Time: 6.50s
To First Token: 0.95s
Tokens/s: 36.2

As far as I can tell, 1880 + 201 = 2081, which is more than 2048, and 201 > 128, which is maxTokens. First, how does this happen at all, and second, how does it get past the vLLM arguments for the model? (I know the model natively handles more context than 2048.)
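
For reference, here is the clamp I would expect Continue (or any client) to apply before sending an autocomplete request. A minimal sketch, assuming the public Qwen/Qwen2.5-Coder-7B tokenizer from Hugging Face counts tokens closely enough to my quantized deployment:

from transformers import AutoTokenizer

MAX_MODEL_LEN = 2048  # vLLM --max-model-len
MAX_TOKENS = 128      # completionOptions.maxTokens

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B")

def clamp_max_tokens(prompt: str) -> int:
    """The completion budget can never exceed the server's remaining headroom."""
    prompt_tokens = len(tokenizer.encode(prompt))
    return max(0, min(MAX_TOKENS, MAX_MODEL_LEN - prompt_tokens))

# With the 1880-token prompt from the console dump, the budget would be
# min(128, 2048 - 1880) = 128, yet 201 tokens came back: more than the
# 128-token cap and more than the 168 tokens of headroom the server had.

Something in the autocomplete path is evidently not applying either bound.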

I get the feeling that maxTokens is treated as a loose hint for how many tokens to return rather than a hard cap, and that its handling is really fuzzy. I also feel that the bugs in #3372 and #5336 are COMPLETELY FEATURE BREAKING, and together they make autocomplete a much worse feature than it could be.

Hopefully bumping those bugs and opening this new one helps someone; I'm happy to contribute as well if someone can point me in the right direction.

To reproduce

  1. Deploy qwen2.5-coder-7B with a W8A8-DYNAMIC quant for code completion
  2. Serve the model via vLLM with my vLLM args and limit the model to 2048 tokens
  3. Apply my settings to your own Continue.dev
  4. Enable the Continue console and inspect the tokens sent and received for larger completions.

Log output

When vLLM rejects the request (contextLength in Continue is set higher than vLLM's max-model-len):
[Extension Host] Error: HTTP 424 Failed Dependency from https://api.com/v1/completions

[Extension Host] Error: HTTP 424 Failed Dependency from https://api.com/v1/completions

{"object":"error","message":"This model's maximum context length is 1024 tokens. However, you requested 1624 tokens (1496 in the messages, 128 in the completion). Please reduce the length of the messages or completion.","type":"BadRequestError","param":null,"code":400}
	at customFetch2 (c:\Users\dzigald\.vscode\extensions\continue.continue-1.0.11-win32-x64\out\extension.js:140060:21)
	at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
	at async withExponentialBackoff (c:\Users\dzigald\.vscode\extensions\continue.continue-1.0.11-win32-x64\out\extension.js:135818:27)
	at async OpenAI2._legacystreamComplete (c:\Users\dzigald\.vscode\extensions\continue.continue-1.0.11-win32-x64\out\extension.js:140953:26)
	at async OpenAI2._streamChat (c:\Users\dzigald\.vscode\extensions\continue.continue-1.0.11-win32-x64\out\extension.js:140971:28)
	at async OpenAI2._streamComplete (c:\Users\dzigald\.vscode\extensions\continue.continue-1.0.11-win32-x64\out\extension.js:140912:26)
	at async OpenAI2.streamComplete (c:\Users\dzigald\.vscode\extensions\continue.continue-1.0.11-win32-x64\out\extension.js:140280:30)
	at async stopAfterMaxProcessingTime (c:\Users\dzigald\.vscode\extensions\continue.continue-1.0.11-win32-x64\out\extension.js:145898:20)
	at async ListenableGenerator._start (c:\Users\dzigald\.vscode\extensions\continue.continue-1.0.11-win32-x64\out\extension.js:145781:28)


Otherwise you get the result I've already described above.

Labels

area:autocomplete (Relates to the autocomplete feature), ide:vscode (Relates specifically to the VS Code extension), kind:bug (Indicates an unexpected problem or unintended behavior)
