Releases: c0sogi/LLMChat
v1.1.3.4.1
Hotfix
- Fixed an error when loading `LlamaTokenizer`
- Added a script that automatically builds the cuBLAS DLL (Windows) when importing `llama_cpp` from the llama-cpp-python repository
v1.1.3.4
Exllama support
A standalone Python/C++/CUDA implementation of Llama for use with 4-bit GPTQ weights, designed to be fast and memory-efficient on modern GPUs. It uses PyTorch and SentencePiece to run the model.
It is assumed to work only in a local environment, and at least one NVIDIA CUDA GPU is required. You have to download the tokenizer, config, and GPTQ weight files from Hugging Face and put them in the `llama_models/gptq/YOUR_MODEL_FOLDER` folder.
Define an LLMModel in `app/models/llms.py`. There are a few examples there, so you can easily define your own model. Refer to the exllama repository for more detailed information: https://github.com/turboderp/exllama
Important!
NVIDIA GPU only. To use an exllama model, you have to install PyTorch and SentencePiece manually and define an `ExllamaModel` in `llms.py`. A minimal loading sketch is shown below.
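For orientation, here is a rough sketch of how exllama loads a 4-bit GPTQ model and generates text, following the basic example in the exllama repository. This is not LLMChat's actual code: it assumes you run it from a checkout of the exllama repository (so its `model`, `tokenizer`, and `generator` modules are importable), and the model folder name is only illustrative.

```python
# Hedged sketch following exllama's basic example; not LLMChat's actual code.
# Assumes exllama's model.py / tokenizer.py / generator.py are on the import path
# and the GPTQ files live under llama_models/gptq/YOUR_MODEL_FOLDER.
import glob
import os

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_dir = "llama_models/gptq/YOUR_MODEL_FOLDER"  # illustrative folder name
config = ExLlamaConfig(os.path.join(model_dir, "config.json"))
config.model_path = glob.glob(os.path.join(model_dir, "*.safetensors"))[0]

model = ExLlama(config)                      # loads the 4-bit GPTQ weights onto the GPU
tokenizer = ExLlamaTokenizer(os.path.join(model_dir, "tokenizer.model"))
cache = ExLlamaCache(model)                  # KV cache used during generation
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("Hello, my name is", max_new_tokens=32))
```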
v1.1.3.3
- Automatically monitors the underlying Llama.cpp API server process that drives the local LLM model. The IPC approach using Queue and Event in the existing process pool has been replaced with a more flexible communication method over the network.
- Local embedding via a Llama.cpp model or a Hugging Face embedding model. For the former, set the `embedding=True` option when defining a `LlamaCppModel`. For the latter, install PyTorch additionally and set a Hugging Face repository such as `intfloat/e5-large-v2` as the value of `LOCAL_EMBEDDING_MODEL` in the `.env` file. A sketch of the Hugging Face path is shown after this list.
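As a rough illustration of the Hugging Face embedding path, the sketch below embeds text locally with `intfloat/e5-large-v2` (the example value for `LOCAL_EMBEDDING_MODEL`) using mean pooling. It is not the repository's actual implementation; the helper name and pooling details are assumptions.

```python
# Hedged sketch of local embedding with a Hugging Face model; not LLMChat's actual code.
# Requires `pip install torch transformers`, as the release notes mention PyTorch.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "intfloat/e5-large-v2"  # value of LOCAL_EMBEDDING_MODEL in .env

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed(texts: list[str]) -> torch.Tensor:
    # e5 models expect a "query: " or "passage: " prefix on the input text.
    batch = tokenizer([f"query: {t}" for t in texts],
                      padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state           # (batch, tokens, 1024)
    mask = batch["attention_mask"].unsqueeze(-1)            # (batch, tokens, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling
    return torch.nn.functional.normalize(pooled, dim=-1)

print(embed(["hello world"]).shape)  # torch.Size([1, 1024])
```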
v1.1.3.2
Set the default web browsing mode to Full browsing.
- Full browsing: Clicks links and scrolls through webpages based on the provided query. This consumes a lot of tokens.
- Light browsing: Composes an answer based on snippets from the search engine for the provided query. This consumes fewer tokens.
v1.1.3.1
v1.1.3.0
- The chat message list is now loaded from Redis lazily instead of eagerly. All of the user's chat profiles are loaded first, and the messages are loaded when the user enters a chat. This dramatically reduces the initial loading time if you already have a large list of messages.
- You can set a User role, an AI role, and a System role for each LLM. For OpenAI's ChatGPT, `user`, `assistant`, and `system` are used by default. For other LLaMA models, you can set different role names, which can help the LLM recognize each conversation role.
- Auto summarization is now applied. By default, when you send or receive a long message of 512 tokens or more, a summarization background task runs for that message and, when it finishes, the result is quietly saved to the message list. The summarized content is invisible to the user, but when messages are sent to the LLM, the summarized version is passed along, which can be a huge saving in token usage (and cost).
- To overcome the performance limitations of the Redis vectorstore (single-threaded) and replace the inaccurate KNN similarity search with cosine similarity search, we introduced a Qdrant vectorstore. It enables fast asynchronous vector queries in microseconds via its low-level gRPC API; a minimal search sketch is shown after this list.
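As a rough illustration of cosine-similarity search over gRPC, the sketch below uses `AsyncQdrantClient` from recent versions of qdrant-client. The collection name and vectors are illustrative, and this is not the repository's actual code.

```python
# Hedged sketch: cosine-similarity search against Qdrant over gRPC; not LLMChat's actual code.
import asyncio

from qdrant_client import AsyncQdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

async def main() -> None:
    # prefer_grpc=True routes requests through Qdrant's gRPC port.
    client = AsyncQdrantClient(host="localhost", grpc_port=6334, prefer_grpc=True)

    await client.recreate_collection(
        collection_name="demo",
        vectors_config=VectorParams(size=4, distance=Distance.COSINE),
    )
    await client.upsert(
        collection_name="demo",
        points=[
            PointStruct(id=1, vector=[0.1, 0.9, 0.1, 0.0], payload={"text": "hello"}),
            PointStruct(id=2, vector=[0.9, 0.1, 0.0, 0.1], payload={"text": "world"}),
        ],
    )
    hits = await client.search(
        collection_name="demo",
        query_vector=[0.1, 0.8, 0.2, 0.0],
        limit=1,
    )
    print(hits[0].payload, hits[0].score)  # closest point by cosine similarity

asyncio.run(main())
```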
v1.1.2.1
v1.1.1
- Supports dropdown chat model selection.
- Added admin console endpoint `/admin`.
- The vectorstore is no longer shared across all accounts. Every account has its own vectorstore, but all accounts share a public database, which can be embedded into with the `/share` command.
- Added a token status box to the frontend.
- LLaMA supports GPU offloading when using cuBLAS (see the sketch after this list).
- The `/query` command no longer puts the queried texts into the chat context. The queried data is only used for generating the current response.
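For the cuBLAS GPU offloading item above, here is a minimal llama-cpp-python sketch. The model path is illustrative, and `n_gpu_layers` only takes effect when llama-cpp-python is built with cuBLAS.

```python
# Hedged sketch of GPU offloading with llama-cpp-python; the model path is illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama_models/ggml/your-model.bin",  # illustrative GGML model path
    n_gpu_layers=30,  # number of layers to offload to the GPU (requires a cuBLAS build)
    n_ctx=2048,
)

out = llm("Q: What is the capital of France? A:", max_tokens=16, stop=["\n"])
print(out["choices"][0]["text"])
```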