January 14, 2024:
NOTE: if you use Windows and want to try TinyLlama 1.1B and Mistral 7B with OnnxStream, you can download the EXE file from the Releases of this repo. The EXE file will automatically download the selected model's parameters from Hugging Face on the first execution (no need to git clone). For the other supported platforms, you need to build the application following the instructions in the main README, specifying the "-DOS_LLM=ON" option in cmake, and, optionally, the "-DOS_CUDA=ON" option to enable GPU acceleration.
TinyLlama 1.1B on my 2018 laptop PC (CPU-only), precision: FP16:
tinyllama.mp4
Mistral 7B on Google Colab Free (GPU: Nvidia T4 16GB), precision: FP16:
mistral7b.mp4
OnnxStream can run Stable Diffusion XL Base and Turbo on a Raspberry PI Zero 2. It was designed to run large models on devices with little RAM, thanks to its peculiar architecture where the inference engine is decoupled from the component that provides the weights and which allows the "streaming" of the parameters during inference. The initial idea was to try to run TinyLlama on the Raspberry PI Zero 2, but the performance was in the order of 10 minutes per token, thus configuring no real use case for this solution. So I decided to implement a series of optimizations to run LLMs on desktop computers and servers at acceptable (ie "interactive") speeds with OnnxStream. The biggest difference compared to running on the RPI Zero 2 is the use of the RamWeightsProvider
, which loads and caches all model parameters into the system's RAM and/or VRAM during the first model inference. All of these optimizations are optional when using the library, so OnnxStream can be used in both scenarios, both on devices with little RAM, and on computers with enough RAM and/or VRAM to load the entire model into memory.
The advantage of the ONNX standard and format is the ability of supporting new models easily: simply export the ONNX file from the original PyTorch model, run ONNX Simplifier on it (optional but recommended) and finally run onnx2txt to convert the ONNX file into a format that can be run by OnnxStream. TinyLlama and Mistral7B's example chat application uses OnnxStream models that have been exported directly from the LlamaForCausalLM and MistralForCausalLM classes from the HF Transformers library respectively.
This version of OnnxStream features initial support for GPU acceleration, via cuBLAS for Nvidia cards. Through the "--cuda" parameter it is possible to specify the number of GBs of the model to offload to the GPU. This effectively allows to use all the system memory (RAM+VRAM) and allows a certain level of parallelism between the CPU and the GPU. When partial GPU offloading is enabled, only "static matmuls" are offloaded to the GPU (at least in this version of OnnxStream) for a number of GBs of VRAM that can be specified with the "--cuda" option. Static matmuls are those matmuls where the second operand is static and, in this case, these are the torch.nn.Linear
layers in PyTorch. The learned weights of all the Linear layers represent almost all of the parameters of a transformer-based LLM and are therefore the most memory bandwidth intensive operations.
Although OnnxStream already supports 8-bit static and dynamic quantization and calibration, only FP16 and FP32 inference is currently supported for LLMs. 8-bit quantization for LLMs will be implemented in the next release (hopefully).
Click to expand
import transformers
import torch
import torch.nn as nn
import onnx
pipeline = transformers.pipeline(
"text-generation",
model="PY007/TinyLlama-1.1B-Chat-v0.3",
torch_dtype=torch.float32,
device_map="auto",
)
class LlamaModel(nn.Module):
def __init__(self, model):
super(LlamaModel, self).__init__()
self.model = model
def forward(self, input_ids, attention_mask, position_ids,
pkv0, pkv1, pkv2, pkv3, pkv4, pkv5, pkv6, pkv7, pkv8, pkv9, pkv10,
pkv11, pkv12, pkv13, pkv14, pkv15, pkv16, pkv17, pkv18, pkv19, pkv20,
pkv21, pkv22, pkv23, pkv24, pkv25, pkv26, pkv27, pkv28, pkv29, pkv30,
pkv31, pkv32, pkv33, pkv34, pkv35, pkv36, pkv37, pkv38, pkv39, pkv40,
pkv41, pkv42, pkv43):
past_key_values = (
(pkv0, pkv1), (pkv2, pkv3), (pkv4, pkv5), (pkv6, pkv7), (pkv8, pkv9), (pkv10,
pkv11), (pkv12, pkv13), (pkv14, pkv15), (pkv16, pkv17), (pkv18, pkv19), (pkv20,
pkv21), (pkv22, pkv23), (pkv24, pkv25), (pkv26, pkv27), (pkv28, pkv29), (pkv30,
pkv31), (pkv32, pkv33), (pkv34, pkv35), (pkv36, pkv37), (pkv38, pkv39), (pkv40,
pkv41), (pkv42, pkv43))
o = self.model(use_cache=True, return_dict=True,
input_ids=input_ids, attention_mask=attention_mask, position_ids=position_ids, past_key_values=past_key_values)
pkv = o.past_key_values
return [o.logits,
pkv[0][0], pkv[0][1], pkv[1][0], pkv[1][1], pkv[2][0], pkv[2][1], pkv[3][0], pkv[3][1], pkv[4][0], pkv[4][1],
pkv[5][0], pkv[5][1], pkv[6][0], pkv[6][1], pkv[7][0], pkv[7][1], pkv[8][0], pkv[8][1], pkv[9][0], pkv[9][1],
pkv[10][0], pkv[10][1], pkv[11][0], pkv[11][1], pkv[12][0], pkv[12][1], pkv[13][0], pkv[13][1], pkv[14][0], pkv[14][1],
pkv[15][0], pkv[15][1], pkv[16][0], pkv[16][1], pkv[17][0], pkv[17][1], pkv[18][0], pkv[18][1], pkv[19][0], pkv[19][1],
pkv[20][0], pkv[20][1], pkv[21][0], pkv[21][1] ]
with torch.no_grad():
dummy_input = (torch.tensor([[1, 2, 3]], dtype=torch.int64),
torch.tensor([[1, 1, 1, 1, 1, 1]], dtype=torch.int64),
torch.tensor([[3, 4, 5]], dtype=torch.int64))
for i in range(44):
dummy_input += (torch.randn(1, 4, 3, 64),)
input_names = [ "input_ids", "attention_mask", "position_ids",
"pkv0", "pkv1", "pkv2", "pkv3", "pkv4", "pkv5", "pkv6", "pkv7", "pkv8", "pkv9", "pkv10",
"pkv11", "pkv12", "pkv13", "pkv14", "pkv15", "pkv16", "pkv17", "pkv18", "pkv19", "pkv20",
"pkv21", "pkv22", "pkv23", "pkv24", "pkv25", "pkv26", "pkv27", "pkv28", "pkv29", "pkv30",
"pkv31", "pkv32", "pkv33", "pkv34", "pkv35", "pkv36", "pkv37", "pkv38", "pkv39", "pkv40",
"pkv41", "pkv42", "pkv43" ]
output_names = [ "logits",
"opkv0", "opkv1", "opkv2", "opkv3", "opkv4", "opkv5", "opkv6", "opkv7", "opkv8", "opkv9", "opkv10",
"opkv11", "opkv12", "opkv13", "opkv14", "opkv15", "opkv16", "opkv17", "opkv18", "opkv19", "opkv20",
"opkv21", "opkv22", "opkv23", "opkv24", "opkv25", "opkv26", "opkv27", "opkv28", "opkv29", "opkv30",
"opkv31", "opkv32", "opkv33", "opkv34", "opkv35", "opkv36", "opkv37", "opkv38", "opkv39", "opkv40",
"opkv41", "opkv42", "opkv43" ]
torch.onnx.export(LlamaModel(pipeline.model), dummy_input, "/media/vito/new_disk/Downloads_temp3/test3/TinyLlama.onnx", verbose=False,
input_names=input_names, output_names=output_names,
opset_version=14, do_constant_folding=True, export_params=True,
dynamic_axes={
'input_ids': {1: 'dim0'},
'attention_mask': {1: 'dim1'},
'position_ids': {1: 'dim2'},
'pkv0': {2: 'dim3'}, 'pkv1': {2: 'dim3'}, 'pkv2': {2: 'dim3'}, 'pkv3': {2: 'dim3'}, 'pkv4': {2: 'dim3'},
'pkv5': {2: 'dim3'}, 'pkv6': {2: 'dim3'}, 'pkv7': {2: 'dim3'}, 'pkv8': {2: 'dim3'}, 'pkv9': {2: 'dim3'},
'pkv10': {2: 'dim3'}, 'pkv11': {2: 'dim3'}, 'pkv12': {2: 'dim3'}, 'pkv13': {2: 'dim3'}, 'pkv14': {2: 'dim3'},
'pkv15': {2: 'dim3'}, 'pkv16': {2: 'dim3'}, 'pkv17': {2: 'dim3'}, 'pkv18': {2: 'dim3'}, 'pkv19': {2: 'dim3'},
'pkv20': {2: 'dim3'}, 'pkv21': {2: 'dim3'}, 'pkv22': {2: 'dim3'}, 'pkv23': {2: 'dim3'}, 'pkv24': {2: 'dim3'},
'pkv25': {2: 'dim3'}, 'pkv26': {2: 'dim3'}, 'pkv27': {2: 'dim3'}, 'pkv28': {2: 'dim3'}, 'pkv29': {2: 'dim3'},
'pkv30': {2: 'dim3'}, 'pkv31': {2: 'dim3'}, 'pkv32': {2: 'dim3'}, 'pkv33': {2: 'dim3'}, 'pkv34': {2: 'dim3'},
'pkv35': {2: 'dim3'}, 'pkv36': {2: 'dim3'}, 'pkv37': {2: 'dim3'}, 'pkv38': {2: 'dim3'}, 'pkv39': {2: 'dim3'},
'pkv40': {2: 'dim3'}, 'pkv41': {2: 'dim3'}, 'pkv42': {2: 'dim3'}, 'pkv43': {2: 'dim3'},
})
Click to expand
import transformers
import torch
import torch.nn as nn
import onnx
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
class LlamaModel(nn.Module):
def __init__(self, model):
super(LlamaModel, self).__init__()
self.model = model
def forward(self, input_ids, attention_mask, position_ids,
pkv0, pkv1, pkv2, pkv3, pkv4, pkv5, pkv6, pkv7, pkv8, pkv9, pkv10,
pkv11, pkv12, pkv13, pkv14, pkv15, pkv16, pkv17, pkv18, pkv19, pkv20,
pkv21, pkv22, pkv23, pkv24, pkv25, pkv26, pkv27, pkv28, pkv29, pkv30,
pkv31, pkv32, pkv33, pkv34, pkv35, pkv36, pkv37, pkv38, pkv39, pkv40,
pkv41, pkv42, pkv43, pkv44, pkv45, pkv46, pkv47, pkv48, pkv49, pkv50,
pkv51, pkv52, pkv53, pkv54, pkv55, pkv56, pkv57, pkv58, pkv59, pkv60,
pkv61, pkv62, pkv63):
past_key_values = (
(pkv0, pkv1), (pkv2, pkv3), (pkv4, pkv5), (pkv6, pkv7), (pkv8, pkv9), (pkv10,
pkv11), (pkv12, pkv13), (pkv14, pkv15), (pkv16, pkv17), (pkv18, pkv19), (pkv20,
pkv21), (pkv22, pkv23), (pkv24, pkv25), (pkv26, pkv27), (pkv28, pkv29), (pkv30,
pkv31), (pkv32, pkv33), (pkv34, pkv35), (pkv36, pkv37), (pkv38, pkv39), (pkv40,
pkv41), (pkv42, pkv43), (pkv44, pkv45), (pkv46, pkv47), (pkv48, pkv49), (pkv50,
pkv51), (pkv52, pkv53), (pkv54, pkv55), (pkv56, pkv57), (pkv58, pkv59), (pkv60,
pkv61), (pkv62, pkv63))
o = self.model(use_cache=True, return_dict=True,
input_ids=input_ids, attention_mask=attention_mask, position_ids=position_ids, past_key_values=past_key_values)
pkv = o.past_key_values
return [o.logits,
pkv[0][0], pkv[0][1], pkv[1][0], pkv[1][1], pkv[2][0], pkv[2][1], pkv[3][0], pkv[3][1], pkv[4][0], pkv[4][1],
pkv[5][0], pkv[5][1], pkv[6][0], pkv[6][1], pkv[7][0], pkv[7][1], pkv[8][0], pkv[8][1], pkv[9][0], pkv[9][1],
pkv[10][0], pkv[10][1], pkv[11][0], pkv[11][1], pkv[12][0], pkv[12][1], pkv[13][0], pkv[13][1], pkv[14][0], pkv[14][1],
pkv[15][0], pkv[15][1], pkv[16][0], pkv[16][1], pkv[17][0], pkv[17][1], pkv[18][0], pkv[18][1], pkv[19][0], pkv[19][1],
pkv[20][0], pkv[20][1], pkv[21][0], pkv[21][1], pkv[22][0], pkv[22][1], pkv[23][0], pkv[23][1], pkv[24][0], pkv[24][1],
pkv[25][0], pkv[25][1], pkv[26][0], pkv[26][1], pkv[27][0], pkv[27][1], pkv[28][0], pkv[28][1], pkv[29][0], pkv[29][1],
pkv[30][0], pkv[30][1], pkv[31][0], pkv[31][1] ]
with torch.no_grad():
dummy_input = (torch.tensor([[1, 2, 3]], dtype=torch.int64),
torch.tensor([[1, 1, 1, 1, 1, 1]], dtype=torch.int64),
torch.tensor([[3, 4, 5]], dtype=torch.int64))
for i in range(64):
dummy_input += (torch.randn(1, 8, 3, 128),)
input_names = [ "input_ids", "attention_mask", "position_ids",
"pkv0", "pkv1", "pkv2", "pkv3", "pkv4", "pkv5", "pkv6", "pkv7", "pkv8", "pkv9", "pkv10",
"pkv11", "pkv12", "pkv13", "pkv14", "pkv15", "pkv16", "pkv17", "pkv18", "pkv19", "pkv20",
"pkv21", "pkv22", "pkv23", "pkv24", "pkv25", "pkv26", "pkv27", "pkv28", "pkv29", "pkv30",
"pkv31", "pkv32", "pkv33", "pkv34", "pkv35", "pkv36", "pkv37", "pkv38", "pkv39", "pkv40",
"pkv41", "pkv42", "pkv43", "pkv44", "pkv45", "pkv46", "pkv47", "pkv48", "pkv49", "pkv50",
"pkv51", "pkv52", "pkv53", "pkv54", "pkv55", "pkv56", "pkv57", "pkv58", "pkv59", "pkv60",
"pkv61", "pkv62", "pkv63" ]
output_names = [ "logits",
"opkv0", "opkv1", "opkv2", "opkv3", "opkv4", "opkv5", "opkv6", "opkv7", "opkv8", "opkv9", "opkv10",
"opkv11", "opkv12", "opkv13", "opkv14", "opkv15", "opkv16", "opkv17", "opkv18", "opkv19", "opkv20",
"opkv21", "opkv22", "opkv23", "opkv24", "opkv25", "opkv26", "opkv27", "opkv28", "opkv29", "opkv30",
"opkv31", "opkv32", "opkv33", "opkv34", "opkv35", "opkv36", "opkv37", "opkv38", "opkv39", "opkv40",
"opkv41", "opkv42", "opkv43", "opkv44", "opkv45", "opkv46", "opkv47", "opkv48", "opkv49", "opkv50",
"opkv51", "opkv52", "opkv53", "opkv54", "opkv55", "opkv56", "opkv57", "opkv58", "opkv59", "opkv60",
"opkv61", "opkv62", "opkv63" ]
torch.onnx.export(LlamaModel(model), dummy_input, "/Users/Vito/Desktop/Mistral7BInst/model.onnx", verbose=False,
input_names=input_names, output_names=output_names,
opset_version=14, do_constant_folding=True, export_params=True,
dynamic_axes={
'input_ids': {1: 'dim0'},
'attention_mask': {1: 'dim1'},
'position_ids': {1: 'dim2'},
'pkv0': {2: 'dim3'}, 'pkv1': {2: 'dim3'}, 'pkv2': {2: 'dim3'}, 'pkv3': {2: 'dim3'}, 'pkv4': {2: 'dim3'},
'pkv5': {2: 'dim3'}, 'pkv6': {2: 'dim3'}, 'pkv7': {2: 'dim3'}, 'pkv8': {2: 'dim3'}, 'pkv9': {2: 'dim3'},
'pkv10': {2: 'dim3'}, 'pkv11': {2: 'dim3'}, 'pkv12': {2: 'dim3'}, 'pkv13': {2: 'dim3'}, 'pkv14': {2: 'dim3'},
'pkv15': {2: 'dim3'}, 'pkv16': {2: 'dim3'}, 'pkv17': {2: 'dim3'}, 'pkv18': {2: 'dim3'}, 'pkv19': {2: 'dim3'},
'pkv20': {2: 'dim3'}, 'pkv21': {2: 'dim3'}, 'pkv22': {2: 'dim3'}, 'pkv23': {2: 'dim3'}, 'pkv24': {2: 'dim3'},
'pkv25': {2: 'dim3'}, 'pkv26': {2: 'dim3'}, 'pkv27': {2: 'dim3'}, 'pkv28': {2: 'dim3'}, 'pkv29': {2: 'dim3'},
'pkv30': {2: 'dim3'}, 'pkv31': {2: 'dim3'}, 'pkv32': {2: 'dim3'}, 'pkv33': {2: 'dim3'}, 'pkv34': {2: 'dim3'},
'pkv35': {2: 'dim3'}, 'pkv36': {2: 'dim3'}, 'pkv37': {2: 'dim3'}, 'pkv38': {2: 'dim3'}, 'pkv39': {2: 'dim3'},
'pkv40': {2: 'dim3'}, 'pkv41': {2: 'dim3'}, 'pkv42': {2: 'dim3'}, 'pkv43': {2: 'dim3'}, 'pkv44': {2: 'dim3'},
'pkv45': {2: 'dim3'}, 'pkv46': {2: 'dim3'}, 'pkv47': {2: 'dim3'}, 'pkv48': {2: 'dim3'}, 'pkv49': {2: 'dim3'},
'pkv50': {2: 'dim3'}, 'pkv51': {2: 'dim3'}, 'pkv52': {2: 'dim3'}, 'pkv53': {2: 'dim3'}, 'pkv54': {2: 'dim3'},
'pkv55': {2: 'dim3'}, 'pkv56': {2: 'dim3'}, 'pkv57': {2: 'dim3'}, 'pkv58': {2: 'dim3'}, 'pkv59': {2: 'dim3'},
'pkv60': {2: 'dim3'}, 'pkv61': {2: 'dim3'}, 'pkv62': {2: 'dim3'}, 'pkv63': {2: 'dim3'},
})
Click to expand
from sentencepiece import SentencePieceProcessor
model_path = "/Users/Vito/Downloads/tokenizer.model"
sp=SentencePieceProcessor(model_file=model_path)
c = ""
for i in range(sp.vocab_size()):
s = sp.get_score(i)
si = int(s)
assert float(si) == s
c += str(si) + "," + str(sp.id_to_piece(i).replace('▁', ' ').encode('utf-8'))[2:-1] + "\n"
c = c[:-1]
with open('/Users/Vito/Downloads/vocab.txt', 'w') as f:
f.write(c)