server : add VSCode's Github Copilot Chat support #12896
Conversation
Sounds like someone just got Edison'd 🤡
There are a lot of tools like this that work but don't explicitly mention llama.cpp; open-webui is another one (ramalama serve is just vanilla llama-server, but we try to make it easier to use and easier to pull accelerator runtimes and models): https://github.com/open-webui/docs/pull/455/files In RamaLama we are going to create a proxy that forks llama-server processes to mimic Ollama, to make everyday llama-server even easier to use. With most tools, if you select a generic OpenAI endpoint, llama-server works.
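For illustration, a minimal sketch of the generic OpenAI-endpoint route against a local `llama-server` (assuming the port 11434 setup from this PR's usage example; the model name and prompt are placeholders):

```sh
# llama-server exposes an OpenAI-compatible chat completions endpoint,
# so any client that accepts a custom OpenAI base URL can talk to it.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "placeholder-model-name",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```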
* server : add VSCode's Github Copilot Chat support
* cont : update handler name
@ggerganov, it seems the GET /api/tags API is missing. At least, my vscode-insiders with github.copilot version 1.308.1532 (updated …
It's probably some new logic - should be easy to add support. Feel free to open a PR if you are interested.
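For reference, a quick way to check which listing endpoints the server exposes (a sketch assuming the default port 11434; `/api/tags` is the Ollama-style route reported missing above, while `/v1/models` is the OpenAI-compatible listing that llama-server already provides):

```sh
# Ollama-style model listing (the endpoint reported missing above)
curl http://localhost:11434/api/tags

# OpenAI-compatible model listing served by llama-server
curl http://localhost:11434/v1/models
```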
This seems to be broken now. When I open the model selection dialog it shows no models with the following error in the logs:
I used the same command mentioned initially:
Overview
VSCode recently added support to use local models with Github Copilot Chat:
https://code.visualstudio.com/updates/v1_99#_bring-your-own-key-byok-preview
This PR adds compatibility of `llama-server` with this feature.

Usage

Start a `llama-server` on port 11434 with an instruct model of your choice. For example, using `Qwen 2.5 Coder Instruct 3B`:

```sh
# downloads ~3GB of data
llama-server \
    -hf ggml-org/Qwen2.5-Coder-3B-Instruct-Q8_0-GGUF \
    --port 11434 -fa -ngl 99 -c 0
```
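Once the model has loaded, a quick sanity check (a sketch assuming llama-server's default `/health` endpoint and the port used above):

```sh
# returns an OK status once the model is loaded and the server is ready
curl http://localhost:11434/health
```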
In VSCode -> Chat -> Manage models -> select "Ollama" (not sure why it is called like this):
Select the available model from the list and click "OK":
Enjoy local AI assistance using vanilla `llama.cpp`:

Advanced context reuse for faster prompt reprocessing can be enabled by adding `--cache-reuse 256` to the `llama-server` command.

Speculative decoding is also supported. Simply start the `llama-server` like this, for example:
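A minimal sketch of such a command, assuming llama-server's speculative-decoding options (`-md`/`--model-draft`, `--draft-max`, `--draft-min`) and a hypothetical local path for a small draft model:

```sh
# main model as before, plus a small draft model for speculative decoding
# (the draft model path is a placeholder - any compatible smaller model works)
llama-server \
    -hf ggml-org/Qwen2.5-Coder-3B-Instruct-Q8_0-GGUF \
    -md /path/to/qwen2.5-coder-0.5b-instruct-q8_0.gguf \
    --draft-max 16 --draft-min 4 \
    --port 11434 -fa -ngl 99 -c 0 --cache-reuse 256
```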