Allow passing hf config args with openai server #2547

Open
Aakash-kaushik opened this issue Jan 22, 2024 · 11 comments · May be fixed by #5836

Comments

@Aakash-kaushik

Aakash-kaushik commented Jan 22, 2024

Hi,

Is there a specific reason why we can't allow passing args from the OpenAI server through to the HF config class? There are very reasonable use cases where I would want to dynamically override the existing args in a config while running the model through the server.

reference line

Simply allowing extra keyword args in the OpenAI server that are passed through to this while loading the model should be enough; I believe there are internal checks that fail if anything is configured incorrectly anyway.

Supporting documentation from the transformers library:

        >>> # Change some config attributes when loading a pretrained config.
        >>> config = AutoConfig.from_pretrained("bert-base-uncased", output_attentions=True, foo=False)
        >>> config.output_attentions
        True
@simon-mo
Collaborator

I believe there's no fundamental reason for this. Contributions welcome! I would say you can add this to the ModelConfig class and pass it through EngineArgs.
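As a rough sketch of that suggestion (not vLLM's actual implementation; the hf_config_overrides parameter and load_hf_config helper below are hypothetical names), the idea is that extra keyword overrides collected by the server would be forwarded to AutoConfig.from_pretrained, which already accepts attribute overrides as shown in the transformers docs quoted above:

from typing import Optional

from transformers import AutoConfig

def load_hf_config(model: str, hf_config_overrides: Optional[dict] = None):
    # Keyword overrides that match config attributes replace the values
    # loaded from config.json (see the transformers docs quoted above).
    overrides = hf_config_overrides or {}
    return AutoConfig.from_pretrained(model, **overrides)

# e.g. load_hf_config("bert-base-uncased", {"output_attentions": True})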

@simon-mo simon-mo added the good first issue label Jan 23, 2024
@KrishnaM251

I will take a look at this

@mrPsycox

mrPsycox commented Feb 7, 2024

Does anyone have news on this? I want to use --dtype, but it doesn't work.

@Aakash-kaushik
Author

@mrPsycox --dtype is supported in vLLM; please take a look at the engine args in the vLLM docs.

@mrPsycox

mrPsycox commented Feb 8, 2024

Thanks @Aakash-kaushik, I found the issue. --dtype needs to be passed among the first args of the command, not at the end.

This works for me:

 run: |
   conda activate vllm
   python -m vllm.entrypoints.openai.api_server \
     --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
     --dtype half \
     --host 0.0.0.0 --port 8080 \
     --model <model_name>

@timbmg

timbmg commented Apr 30, 2024

Just as a workaround, I am currently doing something like this:

import os
from contextlib import contextmanager

from vllm import LLM

@contextmanager
def swap_files(file1, file2):
    """Temporarily swap two files on disk, restoring them on exit."""
    temp_file1 = file1 + '.temp'
    temp_file2 = file2 + '.temp'
    try:
        print("Renaming files.")
        # Swap: file1 -> temp, file2 -> file1's path, temp -> file2's path.
        os.rename(file1, temp_file1)
        os.rename(file2, file1)
        os.rename(temp_file1, file2)

        yield

    finally:
        # Swap back so the original config is restored even if loading fails.
        print("Restoring files.")
        os.rename(file2, temp_file2)
        os.rename(file1, file2)
        os.rename(temp_file2, file1)

file1 = '/path/to/original/config.json'
file2 = '/path/to/modified/config.json'

# The modified config.json sits at the original path while the model loads.
with swap_files(file1, file2):
    llm = LLM(...)

@K-Mistele
Contributor

I would love to see this as well

@KrishnaM251

@Aakash-kaushik @mrPsycox @timbmg @K-Mistele

Please take a look at my PR and let me know if it serves your purpose.

As @DarkLight1337 noted in my PR (#5836), what exactly do you want to accomplish using this feature that cannot otherwise be done via vLLM args? (If we don't have any situation that results in different vLLM output, what is the point of enabling this?)

Once you get back to me, I'll write a test that covers that case.


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Oct 30, 2024
@K-Mistele
Contributor

Hi guys, just bumping this in case it's still relevant. Maybe not so much passing hf config.json args at request-time, but being able to set them for the OpenAI compatible server without having to dig into the model's cache directory would be super useful.

Some examples of where this would be applicable include configuring RoPE scaling for Qwen and Llama models:

Processing Long Texts
The current config.json is set for context length up to 32,768 tokens. To handle extensive inputs exceeding 32,768 tokens, we utilize [YaRN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.

For supported frameworks, you could add the following to config.json to enable YaRN:

{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}

For deployment, we recommend using vLLM. Please refer to our [Documentation](https://qwen.readthedocs.io/en/latest/deployment/vllm.html) for usage if you are not familiar with vLLM. Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling configuration only when processing long contexts is required.

Maybe this is already implemented somewhere else?
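For reference, the override described in the quoted Qwen README amounts to the following at the transformers level, since AutoConfig.from_pretrained accepts keyword overrides; the feature requested in this issue would expose something equivalent through the OpenAI-compatible server. This is an illustrative sketch only, and the model id is just an example:

from transformers import AutoConfig

# Apply the YaRN rope_scaling override without editing the cached config.json.
config = AutoConfig.from_pretrained(
    "Qwen/Qwen2-7B-Instruct",
    rope_scaling={
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
        "type": "yarn",
    },
)
print(config.rope_scaling)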

@DarkLight1337
Member

DarkLight1337 commented Oct 30, 2024

I proposed a similar feature in #5205, still looking for someone to implement it.
