From 35979c01feb2ac3a81d5fd9a283d2fa9c9ef8681 Mon Sep 17 00:00:00 2001
From: kerthcet
Date: Sun, 18 Aug 2024 12:45:06 +0800
Subject: [PATCH] Add llmaz to Inference

Signed-off-by: kerthcet
---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index c35b656..792623a 100644
--- a/README.md
+++ b/README.md
@@ -49,6 +49,7 @@
 | ---- | ---- | ---- | ---- | ---- | ---- |
 | **[DeepSpeed-MII](https://github.com/microsoft/DeepSpeed-MII)** | ![Stars](https://img.shields.io/github/stars/microsoft/deepspeed-mii.svg) | ![Release](https://img.shields.io/github/release/microsoft/deepspeed-mii) | ![Contributors](https://img.shields.io/github/contributors/microsoft/deepspeed-mii) | MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. | |
 | **[ipex-llm](https://github.com/intel-analytics/ipex-llm)** | ![Stars](https://img.shields.io/github/stars/intel-analytics/ipex-llm.svg) | ![Release](https://img.shields.io/github/release/intel-analytics/ipex-llm) | ![Contributors](https://img.shields.io/github/contributors/intel-analytics/ipex-llm) | Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc. | edge |
+| **[llmaz](https://github.com/InftyAI/llmaz)** | ![Stars](https://img.shields.io/github/stars/inftyai/llmaz.svg) | ![Release](https://img.shields.io/github/release/inftyai/llmaz) | ![Contributors](https://img.shields.io/github/contributors/inftyai/llmaz) | ☸️ Effortlessly serve state-of-the-art LLMs on Kubernetes. | |
 | **[LMDeploy](https://github.com/InternLM/lmdeploy)** | ![Stars](https://img.shields.io/github/stars/internlm/lmdeploy.svg) | ![Release](https://img.shields.io/github/release/internlm/lmdeploy) | ![Contributors](https://img.shields.io/github/contributors/internlm/lmdeploy) | LMDeploy is a toolkit for compressing, deploying, and serving LLMs. | |
 | **[llama.cpp](https://github.com/ggerganov/llama.cpp)** | ![Stars](https://img.shields.io/github/stars/ggerganov/llama.cpp.svg) | ![Release](https://img.shields.io/github/release/ggerganov/llama.cpp) | ![Contributors](https://img.shields.io/github/contributors/ggerganov/llama.cpp) | LLM inference in C/C++ | edge |
 | **[MInference](https://github.com/microsoft/minference)** | ![Stars](https://img.shields.io/github/stars/microsoft/minference.svg) | ![Release](https://img.shields.io/github/release/microsoft/minference) | ![Contributors](https://img.shields.io/github/contributors/microsoft/minference) | To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention, which reduces inference latency by up to 10x for pre-filling on an A100 while maintaining accuracy. | |