This repository has been archived by the owner on Oct 25, 2024. It is now read-only.

Update README.md #518

Closed · wants to merge 38 commits · Changes from 1 commit

38 commits:
* f04d0fd [CPP Graph] Opt qbits dequant (#465) (zhewang1-intc, Oct 19, 2023)
* 4adacf1 use INC 2.3.1 (VincyZhang, Oct 19, 2023)
* d962f58 use INC 2.3.1 (#500) (VincyZhang, Oct 19, 2023)
* 66238a5 [RUNTIME] Enabing streaming llm for Runtime (#501) (zhenwei-intel, Oct 19, 2023)
* ea112e7 Merge branch 'main' of https://github.com/intel/intel-extension-for-t… (VincyZhang, Oct 19, 2023)
* 51485c6 Reduce the UT evaluation time (#498) (changwangss, Oct 19, 2023)
* ff4abb8 Merge branch 'main' of https://github.com/intel/intel-extension-for-t… (VincyZhang, Oct 19, 2023)
* 9bdc764 Minor fix (#507) (VincyZhang, Oct 19, 2023)
* ea720c2 Fix ChatGLM2 model loading issue (#510) (lvliang-intel, Oct 19, 2023)
* 02523e9 Update README.md (hshen14, Oct 19, 2023)
* 0cff05a Remove OneDNN env setint for BF16 inference (#509) (lvliang-intel, Oct 20, 2023)
* ea69f9a support Avx2 (#493) (yuchengliu1, Oct 20, 2023)
* f7d0d97 add neuralchat ut for audio util (#466) (Liangyx2, Oct 20, 2023)
* b9155ef reduce ut time consumption (#499) (xin3he, Oct 20, 2023)
* 5f4175a update python api readme (#504) (zhenwei-intel, Oct 20, 2023)
* a8873ea Add docker setup session for neuralchat finetuning sample (#496) (louie-tsai, Oct 20, 2023)
* 22fe7ad Update README.md (hshen14, Oct 20, 2023)
* b38241d Update README.md (hshen14, Oct 20, 2023)
* 1d91245 Update README.md (hshen14, Oct 20, 2023)
* 18d9c57 Update README.md (hshen14, Oct 20, 2023)
* f98d72a Update README.md (hshen14, Oct 20, 2023)
* 0f6aee6 Update README.md (hshen14, Oct 20, 2023)
* a8db98f Update README.md for fast token issue (#515) (louie-tsai, Oct 21, 2023)
* 52717e4 Fix typo in README.md (#516) (eltociear, Oct 21, 2023)
* 3cf68ee Update README.md (hshen14, Oct 21, 2023)
* 7fb944a Update README.md (hshen14, Oct 21, 2023)
* 7fed478 Update README.md (hshen14, Oct 21, 2023)
* dc81e4c Update README.md (hshen14, Oct 21, 2023)
* dcfbcfd improve Avx2 (#511) (yuchengliu1, Oct 21, 2023)
* a615905 Merge branch 'main' of https://github.com/intel/intel-extension-for-t… (VincyZhang, Oct 21, 2023)
* 61993cc Revert "update python api readme (#504)" (VincyZhang, Oct 21, 2023)
* 7ccddbd Update README.md (Shivam250702, Oct 22, 2023)
* 62d58f0 Merge branch 'main' into main (VincyZhang, Oct 23, 2023)
* 943ebd5 Merge branch 'main' into main (Shivam250702, Oct 23, 2023)
* 8894504 Merge branch 'main' into main (Shivam250702, Oct 24, 2023)
* 67640df Merge branch 'main' into main (hshen14, Oct 25, 2023)
* 549d820 Merge branch 'main' into main (kevinintel, Oct 26, 2023)
* d2a5e9b Merge branch 'main' into main (Shivam250702, Oct 26, 2023)
This view shows changes from a single commit:

commit 62d58f073ac735aa83d9285aacd0fb2183f339b1
Merge branch 'main' into main
VincyZhang authored Oct 23, 2023
Signed-off-by: VincyZhang <wenxin.zhang@intel.com>
README.md: 17 changes (9 additions & 8 deletions)

```diff
@@ -11,11 +11,11 @@ Intel® Extension for Transformers
 </div>
 
 ## 🚀Latest News
-* [2023/10] LLM runtime, an Intel-optimized [GGML](https://github.com/ggerganov/ggml) compatiable runtime, demonstrates **up to 15x performance gain in 1st token generation and 1.5x in other token generation** over the default [llama.cpp](https://github.com/ggerganov/llama.cpp).
-* [2023/10] LLM runtime now supports LLM inference with **infinite-length inputs up to 4 million tokens**, inspired by [StreamingLLM](https://arxiv.org/abs/2309.17453).
+* [2023/10] LLM runtime, an Intel-optimized [GGML](https://github.com/ggerganov/ggml) compatible runtime, demonstrates **up to 15x performance gain in 1st token generation and 1.5x in other token generation** over the default [llama.cpp](https://github.com/ggerganov/llama.cpp).
+* [2023/10] LLM runtime now supports LLM inference with **infinite-length inputs up to 4 million tokens**, inspired from [StreamingLLM](https://arxiv.org/abs/2309.17453).
 * [2023/09] NeuralChat has been showcased in [**Intel Innovation’23 Keynote**](https://www.youtube.com/watch?v=RbKRELWP9y8&t=2954s) and [Google Cloud Next'23](https://cloud.google.com/blog/topics/google-cloud-next/welcome-to-google-cloud-next-23) to demonstrate GenAI/LLM capabilities on Intel Xeon Scalable Processors.
 * [2023/08] NeuralChat supports **custom chatbot development and deployment within minutes** on broad Intel HWs such as Xeon Scalable Processors, Gaudi2, Xeon CPU Max Series, Data Center GPU Max Series, Arc Series, and Core Processors. Check out [Notebooks](./intel_extension_for_transformers/neural_chat/docs/full_notebooks.md).
-* [2023/07] LLM runtime extends the Hugging Face Transformers API to provide seamless low precision inference for popular LLMs, supporting mainstream low precision data types such as INT3/INT4/FP4/NF4/INT5/INT8/FP8.
+* [2023/07] LLM runtime extends Hugging Face Transformers API to provide seamless low precision inference for popular LLMs, supporting low precision data types such as INT3/INT4/FP4/NF4/INT5/INT8/FP8.
 
 ---
 <div align="left">
```
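The [2023/07] item in the hunk above describes extending the Hugging Face Transformers API for low-precision inference. For reference only (this code is not part of the PR's diff), a minimal sketch of weight-only INT4 generation with that API might look like the following; the model id is a placeholder, and the `load_in_4bit` flag is assumed from the project's documentation of this period:

```python
# Illustrative sketch only, not code from this PR: weight-only INT4 inference
# through the Transformers-style API described in the [2023/07] news item.
# The model id is a placeholder and load_in_4bit is assumed from the
# project's documentation of this period.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v1-1"  # placeholder model id
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# load_in_4bit requests weight-only INT4 quantization at model load time.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The integration surface is just the swapped `AutoModelForCausalLM` import; the tokenizer and `generate` flow stay standard transformers usage.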
````diff
@@ -28,21 +28,21 @@ pip install intel-extension-for-transformers
 > For more installation methods, please refer to [Installation Page](./docs/installation.md)
 
 ## 🌟Introduction
-Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms, in particular effective on 4th Intel Xeon Scalable processor Sapphire Rapids (codenamed [Sapphire Rapids](https://www.intel.com/content/www/us/en/products/docs/processors/xeon-accelerated/4th-gen-xeon-scalable-processors.html)). The toolkit provides the following key features and examples:
+Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms, in particular, effective on 4th Intel Xeon Scalable processor Sapphire Rapids (codenamed [Sapphire Rapids](https://www.intel.com/content/www/us/en/products/docs/processors/xeon-accelerated/4th-gen-xeon-scalable-processors.html)). The toolkit provides the below key features and examples:
 
-* Seamless user experience of model compressions on Transformer-based models by extending [Hugging Face transformer](https://github.com/huggingface/transformers) APIs and leveraging [Intel® Neural Compressor](https://github.com/intel/neural-compressor)
+* Seamless user experience of model compressions on Transformer-based models by extending [Hugging Face transformers](https://github.com/huggingface/transformers) APIs and leveraging [Intel® Neural Compressor](https://github.com/intel/neural-compressor)
 
 * Advanced software optimizations and unique compression-aware runtime (released with NeurIPS 2022's paper [Fast Distilbert on CPUs](https://arxiv.org/abs/2211.07715) and [QuaLA-MiniLM: a Quantized Length Adaptive MiniLM](https://arxiv.org/abs/2210.17114), and NeurIPS 2021's paper [Prune Once for All: Sparse Pre-Trained Language Models](https://arxiv.org/abs/2111.05754))
 
-* Optimized Transformer-based model packages such as [Stable Diffusion](examples/huggingface/pytorch/text-to-image/deployment/stable_diffusion), [GPT-J-6B](examples/huggingface/pytorch/text-generation/deployment), [GPT-NEOX](examples/huggingface/pytorch/language-modeling/quantization#2-validated-model-list), [BLOOM-176B](examples/huggingface/pytorch/language-modeling/inference#BLOOM-176B), [T5](examples/huggingface/pytorch/summarization/quantization#2-validated-model-list), [Flan-T5](examples/huggingface/pytorch/summarization/quantization#2-validated-model-list) and end-to-end workflows such as [SetFit-based text classification](docs/tutorials/pytorch/text-classification/SetFit_model_compression_AGNews.ipynb) and [document level sentiment analysis (DLSA)](workflows/dlsa)
+* Optimized Transformer-based model packages such as [Stable Diffusion](examples/huggingface/pytorch/text-to-image/deployment/stable_diffusion), [GPT-J-6B](examples/huggingface/pytorch/text-generation/deployment), [GPT-NEOX](examples/huggingface/pytorch/language-modeling/quantization#2-validated-model-list), [BLOOM-176B](examples/huggingface/pytorch/language-modeling/inference#BLOOM-176B), [T5](examples/huggingface/pytorch/summarization/quantization#2-validated-model-list), [Flan-T5](examples/huggingface/pytorch/summarization/quantization#2-validated-model-list), and end-to-end workflows such as [SetFit-based text classification](docs/tutorials/pytorch/text-classification/SetFit_model_compression_AGNews.ipynb) and [document level sentiment analysis (DLSA)](workflows/dlsa)
 
 * [NeuralChat](intel_extension_for_transformers/neural_chat), a customizable chatbot framework to create your own chatbot within minutes by leveraging a rich set of plugins [Knowledge Retrieval](./intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md), [Speech Interaction](./intel_extension_for_transformers/neural_chat/pipeline/plugins/audio/README.md), [Query Caching](./intel_extension_for_transformers/neural_chat/pipeline/plugins/caching/README.md), [Security Guardrail](./intel_extension_for_transformers/neural_chat/pipeline/plugins/security/README.md).
 
 * [Inference](intel_extension_for_transformers/llm/runtime/graph) of Large Language Model (LLM) in pure C/C++ with weight-only quantization kernels, supporting [GPT-NEOX](intel_extension_for_transformers/llm/runtime/graph/models/gptneox), [LLAMA](intel_extension_for_transformers/llm/runtime/graph/models/llama), [MPT](intel_extension_for_transformers/llm/runtime/graph/models/mpt), [FALCON](intel_extension_for_transformers/llm/runtime/graph/models/falcon), [BLOOM-7B](intel_extension_for_transformers/llm/runtime/graph/models/bloom), [OPT](intel_extension_for_transformers/llm/runtime/graph/models/opt), [ChatGLM2-6B](intel_extension_for_transformers/llm/runtime/graph/models/chatglm), [GPT-J-6B](intel_extension_for_transformers/llm/runtime/graph/models/gptj) and [Dolly-v2-3B](intel_extension_for_transformers/llm/runtime/graph/models/gptneox)
 
 
 ## 🌱Getting Started
-Below is the sample code to enable chatbot. See more [examples](intel_extension_for_transformers/neural_chat/docs/full_notebooks.md).
+Below is the sample code to enable the chatbot. See more [examples](intel_extension_for_transformers/neural_chat/docs/full_notebooks.md).
 
 ### Chatbot
 ```python
````
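The body of the `### Chatbot` Python sample is collapsed in this condensed diff view. For orientation, here is a minimal sketch of what a NeuralChat quick-start looks like, assuming the `build_chatbot`/`predict` entry points from the NeuralChat docs (not recovered from this diff):

```python
# Minimal sketch of the collapsed Chatbot sample; build_chatbot and predict
# are assumed from the NeuralChat documentation, not recovered from this diff.
from intel_extension_for_transformers.neural_chat import build_chatbot

chatbot = build_chatbot()  # default config loads a general-purpose chat model
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
print(response)
```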
```diff
@@ -211,4 +211,5 @@ Find other models like ChatGLM, ChatGLM2, StarCoder... in [LLM Runtime](./intel_
 
 ## 💁Collaborations
 
-Welcome to raise any interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach out to [us](mailto:itrex.maintainers@intel.com) and look forward to our collaborations on Intel Extension for Transformers!
+Welcome to raise any interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach [us](mailto:itrex.maintainers@intel.com), and we look forward to our collaborations on Intel Extension for Transformers!
+
```
You are viewing a condensed version of this merge commit.