Add qwen2 #2495

Merged
merged 4 commits into vllm-project:main on Jan 22, 2024

Conversation

@JustinLin610
Contributor

Recently I pushed the new Qwen2 code to Hugging Face Transformers, and I would love to contribute the new model to vLLM as well.

In this PR, I have provided the implementation of the Qwen2 model and added some notes on it.

@simon-mo
Collaborator

Thank you for the contribution! Do you know where I can get the weights for Qwen/Qwen2-7B-beta? I cannot find it on Hugging Face (there's only V1?).

@JustinLin610
Contributor Author

Thank you for the contribution! Do you know where I can get the weights for Qwen/Qwen2-7B-beta? I cannot find it on Hugging Face (there's only V1?).

Yes, it is not released yet. Would you mind joining our HF org to access our new models, which are temporarily private? https://huggingface.co/Qwen

BTW, you can contact me by email (junyang.ljy@alibaba-inc.com), or I can join your Slack channel for further discussion (my Slack email: justinlin930319@gmail.com).

@simon-mo
Collaborator

Thanks. Emailed.

@esmeetu
Collaborator

esmeetu commented Jan 19, 2024

What is the difference between the Qwen2 and Llama2 architectures? They look the same. If that is the case, would it be better to extend LlamaForCausalLM, as DeciLM does?

@JustinLin610
Contributor Author

What is the difference between the Qwen2 and Llama2 architectures? They look the same. If that is the case, would it be better to extend LlamaForCausalLM, as DeciLM does?

The Qwen2 code supports the previous Qwen models as well as the next-generation Qwen2. In comparison with Llama or Mistral, we have QKV bias and a mixture of sliding-window attention and full attention, which is controlled by the argument max_window_layers.
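For illustration only, here is a rough sketch of the per-layer choice such an argument implies; which side of the max_window_layers cutoff uses sliding-window attention is an assumption here, not taken from the Qwen2 code:

from typing import Optional

def effective_sliding_window(layer_idx: int,
                             use_sliding_window: bool,
                             sliding_window: int,
                             max_window_layers: int) -> Optional[int]:
    # Return the sliding-window size for this layer, or None for full attention.
    # Assumption: the first max_window_layers layers use sliding-window attention,
    # and the remaining layers fall back to full attention.
    if use_sliding_window and layer_idx < max_window_layers:
        return sliding_window
    return None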

@simon-mo
Collaborator

Thanks for the access. I'm testing it with python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-1_8B-Chat-beta-temp but hitting:

  File "/home/xmo/vllm/vllm/engine/async_llm_engine.py", line 548, in from_engine_args
    engine_configs = engine_args.create_engine_configs()
  File "/home/xmo/vllm/vllm/engine/arg_utils.py", line 218, in create_engine_configs
    model_config = ModelConfig(self.model, self.tokenizer,
  File "/home/xmo/vllm/vllm/config.py", line 101, in __init__
    self.hf_config = get_config(self.model, trust_remote_code, revision)
  File "/home/xmo/vllm/vllm/transformers_utils/config.py", line 23, in get_config
    config = AutoConfig.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1098, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 795, in __getitem__
    raise KeyError(key)
KeyError: 'qwen2'

Can you add a Qwen2 config similar to the QwenConfig and add it to the registry?
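For reference, a minimal sketch of what such a registry entry could look like, modeled on the existing QWenConfig; the field names follow the Hugging Face Qwen2 port, but the defaults and the exact registration call are illustrative, not the code from this PR:

from transformers import PretrainedConfig

class Qwen2Config(PretrainedConfig):
    model_type = "qwen2"

    def __init__(self,
                 vocab_size=151936,
                 hidden_size=4096,
                 num_hidden_layers=32,
                 num_attention_heads=32,
                 use_sliding_window=False,
                 sliding_window=4096,
                 max_window_layers=28,
                 **kwargs):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.use_sliding_window = use_sliding_window
        self.sliding_window = sliding_window
        self.max_window_layers = max_window_layers
        super().__init__(**kwargs)

# The new class would then be registered next to the existing "qwen" entry,
# e.g. in vllm/transformers_utils/config.py: _CONFIG_REGISTRY["qwen2"] = Qwen2Config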

@JustinLin610
Contributor Author

Thanks for the access. I'm testing it with python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-1_8B-Chat-beta-temp but hitting:

KeyError: 'qwen2'

Can you add a Qwen2 config similar to the QwenConfig and add it to the registry?

Did you git clone the latest transformers and install it with pip install -e .? I was able to run the command successfully. Since Qwen2 is already merged into HF Transformers, there is no need to write a separate config.
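As a quick sanity check (a sketch, not part of this PR), you can confirm that the locally installed transformers build already knows the qwen2 model type before launching vLLM:

import transformers
from transformers.models.auto.configuration_auto import CONFIG_MAPPING

print(transformers.__version__)   # should be the source install that contains the Qwen2 port
print("qwen2" in CONFIG_MAPPING)  # True once Qwen2 is registered; False reproduces the KeyError above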

@simon-mo
Collaborator

I saw the Hugging Face PR. This is pretty tricky because we don't know when Hugging Face will release a new version that includes Qwen2. Can you still include it in vLLM so we can release it without people having to wait for the Hugging Face nightly/latest release?

@JustinLin610
Contributor Author

I saw the Hugging Face PR. This is pretty tricky because we don't know when Hugging Face will release a new version that includes Qwen2. Can you still include it in vLLM so we can release it without people having to wait for the Hugging Face nightly/latest release?

Hi, HF Transformers just released a new version with Qwen2 included. You can give it a try and see if it works for you.

@simon-mo
Collaborator

(base) xmo@simon-devbox:~/vllm-work-trees/awq-doc/.buildkite$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2-1_8B-Chat-beta-temp",
        "prompt": "San Francisco is a",
        "max_tokens": 7
    }'


curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "Qwen/Qwen2-1_8B-Chat-beta-temp",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.7
   }'

{"id":"cmpl-808bbb1b24f04232b5421f28d4fefd54","object":"text_completion","created":350809,"model":"Qwen/Qwen2-1_8B-Chat-beta-temp","choices":[{"index":0,"text":" ______ to me, so I went","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":4,"total_tokens":11,"completion_tokens":7}}
{"id":"cmpl-24e3af97038d41d5a53f5e6783cb94f8","object":"chat.completion","created":350809,"model":"Qwen/Qwen2-1_8B-Chat-beta-temp","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! Is there anything specific you would like to test or ask me? I'm here to help with any questions or information you need.<|im_end|>\n"},"finish_reason":"stop"}],"usage":{"prompt_tokens":14,"total_tokens":45,"completion_tokens":31}}

Verified. Pushing a commit to pin the transformers version and will merge.
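For illustration, a runtime check along these lines would express the same constraint; the actual commit presumably just pins the requirement, and the 4.37.0 threshold is an assumption about the first transformers release that ships Qwen2:

from packaging import version
import transformers

MIN_TRANSFORMERS = "4.37.0"  # assumed minimum version with the Qwen2 port

if version.parse(transformers.__version__) < version.parse(MIN_TRANSFORMERS):
    raise ImportError(
        f"transformers>={MIN_TRANSFORMERS} is required for Qwen2 support, "
        f"found {transformers.__version__}.")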

@simon-mo simon-mo merged commit 94b5ede into vllm-project:main Jan 22, 2024
16 checks passed