Question regarding Punica integration #107

Open
psych0v0yager opened this issue Dec 6, 2023 · 4 comments
Labels
question Further information is requested

Comments

@psych0v0yager

The acknowledgements of this project mention the SGMV kernels created by the Punica project. Is there a way to run multiple adapters simultaneously with LoRAX, similar to what the Punica example shows? Can this be done via the AsyncClient?

@tgaddair
Contributor

tgaddair commented Dec 6, 2023

Hi @psych0v0yager, yes, there are a few ways you can achieve running multiple adapters in a single batch:

  1. Multiple clients making requests at the same time (this is the most common situation we see in production)
  2. Making multiple requests using AsyncClient and then awaiting at the end (batch request submission)
  3. Using another concurrency system like threading and making an HTTP request directly to the endpoint

We have a very simple example of (3) here, but I'll make a note to add more examples of how to do this using AsyncClient.
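
As a rough illustration of option (2), here is a minimal sketch of the "submit everything, then await at the end" pattern using only `asyncio`. The `generate` coroutine is a hypothetical stand-in that fakes latency; it is not the LoRAX API. With the real AsyncClient you would swap it for the client's own generate call, whose exact signature is not shown in this thread, so please check the client docs.

```python
import asyncio


async def generate(prompt: str, adapter_id: str) -> str:
    """Hypothetical stand-in for an async request to the server.

    Assumption: the real call takes a prompt and an adapter identifier.
    Here we only simulate latency and echo the inputs back.
    """
    await asyncio.sleep(0.01)  # pretend network + inference time
    return f"[{adapter_id}] {prompt}"


async def generate_with_adapters(prompt: str, adapter_ids: list[str]) -> dict[str, str]:
    # Build every coroutine first, then await them together so all requests
    # are in flight at once and can be batched on the server side.
    pending = [generate(prompt, adapter_id) for adapter_id in adapter_ids]
    outputs = await asyncio.gather(*pending)  # results keep the input order
    return dict(zip(adapter_ids, outputs))


if __name__ == "__main__":
    results = asyncio.run(
        generate_with_adapters("Hello", ["adapter-a", "adapter-b"])
    )
    for adapter_id, text in results.items():
        print(adapter_id, "->", text)
```

Because `asyncio.gather` returns results in the order the coroutines were passed, zipping them back with the adapter ids is safe even if the requests finish out of order.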

Hope that answers your question!

tgaddair added the question label Dec 6, 2023

@psych0v0yager
Author

Thanks for the fast reply and multiple solutions! I'll be sure to check out your example for number 3, and I look forward to seeing more documentation on the AsyncClient. IMO the AsyncClient seems like the most convenient option for an MoE-type situation.

@tgaddair
Contributor

tgaddair commented Dec 6, 2023

Awesome, @psych0v0yager. To help me understand your use case a little better: for the MoE situation you're describing, are you interested in generating a different sequence for each adapter and then combining them, or in mixing multiple adapters for the same request and generating a single sequence? It sounds like the first one (generating a different sequence for each adapter), but I wanted to confirm, as both are use cases we want to support.

@psych0v0yager
Author

@tgaddair thanks for the reply! I was interested in the first one (generating a different sequence for each adapter).

Specifically, I was imagining running 5 adapters concurrently, each generating a different sequence. Once the batch of 5 is done, I want to feed all 5 sequences to a 6th adapter that is fine-tuned to select the best sequence.
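
To make the idea concrete, here is a rough sketch of that two-stage flow: fan out to several generator adapters, wait for the whole batch, then pass the candidates to a selector adapter. Everything here is an assumption for illustration. `generate` is a hypothetical stub rather than a LoRAX call, and the adapter names and selector prompt format are made up.

```python
import asyncio


async def generate(prompt: str, adapter_id: str) -> str:
    """Hypothetical stub for an async generation request (not a real API)."""
    await asyncio.sleep(0.01)  # simulate request latency
    return f"<{adapter_id}> reply to: {prompt}"


async def best_of(prompt: str, generator_ids: list[str], selector_id: str):
    # Stage 1: run all generator adapters concurrently and wait for the batch.
    candidates = await asyncio.gather(
        *(generate(prompt, adapter_id) for adapter_id in generator_ids)
    )

    # Stage 2: show the numbered candidates to the selector adapter.
    # The prompt layout below is an assumed format, chosen only for the sketch.
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(candidates, start=1))
    selector_prompt = (
        f"Task: {prompt}\nCandidates:\n{numbered}\nPick the best candidate."
    )
    choice = await generate(selector_prompt, selector_id)
    return list(candidates), choice


if __name__ == "__main__":
    generators = [f"gen-{i}" for i in range(1, 6)]  # 5 hypothetical adapters
    candidates, choice = asyncio.run(best_of("Summarize X", generators, "selector"))
    print(len(candidates), "candidates")
    print(choice)
```

The second stage only starts after `gather` returns, so the selector always sees the complete batch of 5.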

2 participants