Releases: c0sogi/LLMChat
v1.1.3.4.1
Hotfix
- Fixed an error when loading `LlamaTokenizer`
- Added a script that automatically builds the cuBLAS DLL (Windows) when importing `llama_cpp` from the llama-cpp-python repository
v1.1.3.4
Exllama support
A standalone Python/C++/CUDA implementation of Llama for use with 4-bit GPTQ weights, designed to be fast and memory-efficient on modern GPUs. It uses PyTorch and SentencePiece to run the model.
It is assumed to work only in a local environment, and at least one NVIDIA CUDA GPU is required. You have to download the tokenizer, config, and GPTQ weight files from Hugging Face and put them in the `llama_models/gptq/YOUR_MODEL_FOLDER` folder.
Define an LLMModel in `app/models/llms.py`. There are a few examples there, so you can easily define your own model. Refer to the exllama repository for more detailed information: https://github.com/turboderp/exllama
Important!
NVIDIA GPU only. To use an exllama model, you have to install PyTorch and SentencePiece manually and define an `ExllamaModel` in `llms.py`. A minimal loading sketch is shown below.
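For orientation, here is a rough sketch of how exllama loads a 4-bit GPTQ model and generates text, following the basic example in the exllama repository. This is not LLMChat's actual code: it assumes you run it from a checkout of the exllama repository (so its `model`, `tokenizer`, and `generator` modules are importable), and the model folder name is only illustrative.

```python
# Hedged sketch following exllama's basic example; not LLMChat's actual code.
# Assumes exllama's model.py / tokenizer.py / generator.py are on the import path
# and the GPTQ files live under llama_models/gptq/YOUR_MODEL_FOLDER.
import glob
import os

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_dir = "llama_models/gptq/YOUR_MODEL_FOLDER"  # illustrative folder name
config = ExLlamaConfig(os.path.join(model_dir, "config.json"))
config.model_path = glob.glob(os.path.join(model_dir, "*.safetensors"))[0]

model = ExLlama(config)                      # loads the 4-bit GPTQ weights onto the GPU
tokenizer = ExLlamaTokenizer(os.path.join(model_dir, "tokenizer.model"))
cache = ExLlamaCache(model)                  # KV cache used during generation
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("Hello, my name is", max_new_tokens=32))
```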
v1.1.3.3
- Automatically monitors the underlying Llama.cpp API server process that drives the local LLM model. The IPC approach using Queue and Event in the existing process pool has been replaced with a more flexible communication method over the network.
- Local embedding via a Llama.cpp model or a Hugging Face embedding model. For the former, set the `embedding=True` option when defining a `LlamaCppModel`. For the latter, install PyTorch additionally and set a Hugging Face repository such as `intfloat/e5-large-v2` as the value of `LOCAL_EMBEDDING_MODEL` in the `.env` file. A sketch of the Hugging Face path is shown after this list.
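As a rough illustration of the Hugging Face embedding path, the sketch below embeds text locally with `intfloat/e5-large-v2` (the example value for `LOCAL_EMBEDDING_MODEL`) using mean pooling. It is not the repository's actual implementation; the helper name and pooling details are assumptions.

```python
# Hedged sketch of local embedding with a Hugging Face model; not LLMChat's actual code.
# Requires `pip install torch transformers`, as the release notes mention PyTorch.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "intfloat/e5-large-v2"  # value of LOCAL_EMBEDDING_MODEL in .env

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed(texts: list[str]) -> torch.Tensor:
    # e5 models expect a "query: " or "passage: " prefix on the input text.
    batch = tokenizer([f"query: {t}" for t in texts],
                      padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state           # (batch, tokens, 1024)
    mask = batch["attention_mask"].unsqueeze(-1)            # (batch, tokens, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling
    return torch.nn.functional.normalize(pooled, dim=-1)

print(embed(["hello world"]).shape)  # torch.Size([1, 1024])
```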
v1.1.3.2
Set the default web browsing mode to Full browsing.
- Full browsing: Clicks links and scrolls through webpages based on the provided query. This consumes a lot of tokens.
- Light browsing: Composes an answer based on snippets from the search engine for the provided query. This consumes fewer tokens.
v1.1.3.1
v1.1.3.0
- The chat message list is now loaded from Redis lazily instead of eagerly. All of the user's chat profiles are loaded first, and the messages are loaded when the user enters a chat. This dramatically reduces the initial loading time if you already have a large list of messages.
- You can set a User role, an AI role, and a System role for each LLM. For OpenAI's ChatGPT, `user`, `assistant`, and `system` are used by default. For other LLaMA models, you can set different role names, which can help the LLM recognize each conversation role.
- Auto summarization is now applied. By default, when you send or receive a long message of 512 tokens or more, a summarization background task runs for that message and, when it finishes, the result is quietly saved to the message list. The summarized content is invisible to the user, but when messages are sent to the LLM, the summarized version is passed along, which can be a huge saving in token usage (and cost).
- To overcome the performance limitations of the Redis vectorstore (single-threaded) and replace the inaccurate KNN similarity search with cosine similarity search, we introduced a Qdrant vectorstore. It enables fast asynchronous vector queries in microseconds via its low-level gRPC API; a minimal search sketch is shown after this list.
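As a rough illustration of cosine-similarity search over gRPC, the sketch below uses `AsyncQdrantClient` from recent versions of qdrant-client. The collection name and vectors are illustrative, and this is not the repository's actual code.

```python
# Hedged sketch: cosine-similarity search against Qdrant over gRPC; not LLMChat's actual code.
import asyncio

from qdrant_client import AsyncQdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

async def main() -> None:
    # prefer_grpc=True routes requests through Qdrant's gRPC port.
    client = AsyncQdrantClient(host="localhost", grpc_port=6334, prefer_grpc=True)

    await client.recreate_collection(
        collection_name="demo",
        vectors_config=VectorParams(size=4, distance=Distance.COSINE),
    )
    await client.upsert(
        collection_name="demo",
        points=[
            PointStruct(id=1, vector=[0.1, 0.9, 0.1, 0.0], payload={"text": "hello"}),
            PointStruct(id=2, vector=[0.9, 0.1, 0.0, 0.1], payload={"text": "world"}),
        ],
    )
    hits = await client.search(
        collection_name="demo",
        query_vector=[0.1, 0.8, 0.2, 0.0],
        limit=1,
    )
    print(hits[0].payload, hits[0].score)  # closest point by cosine similarity

asyncio.run(main())
```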
v1.1.2.1
v1.1.1
- Supports dropdown chat model selection.
- Added admin console endpoint `/admin`.
- The vectorstore is no longer shared across all accounts. Every account has its own vectorstore, but all accounts share a public database, which can be embedded into with the `/share` command.
- Added a token status box to the frontend.
- LLaMA supports GPU offloading when using cuBLAS (see the sketch after this list).
- The `/query` command no longer puts the queried texts into the chat context. The queried data is only used for generating the current response.
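For the cuBLAS GPU offloading item above, here is a minimal llama-cpp-python sketch. The model path is illustrative, and `n_gpu_layers` only takes effect when llama-cpp-python is built with cuBLAS.

```python
# Hedged sketch of GPU offloading with llama-cpp-python; the model path is illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama_models/ggml/your-model.bin",  # illustrative GGML model path
    n_gpu_layers=30,  # number of layers to offload to the GPU (requires a cuBLAS build)
    n_ctx=2048,
)

out = llm("Q: What is the capital of France? A:", max_tokens=16, stop=["\n"])
print(out["choices"][0]["text"])
```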