This repository was archived by the owner on Jun 3, 2025. It is now read-only.
Merged
Commits
109 commits
48ac0ac
initial commit
dbogunowicz Jun 5, 2023
cf7f2b9
Update src/deepsparse/license.py
dbogunowicz Jun 5, 2023
832630a
Merge branch 'main' into feature/damian/do_not_save_to_tmp
dbogunowicz Jun 6, 2023
9958c83
Merge branch 'main' into feature/damian/do_not_save_to_tmp
dbogunowicz Jun 7, 2023
e6d2b03
limit to 150mb
dbogunowicz Jun 7, 2023
7f9935b
ready to review
dbogunowicz Jun 7, 2023
b1cf01b
initial commit
dbogunowicz Mar 2, 2023
0a3f48d
[Codegen][ORT][Static Seq Length] TextGenerationPipeline (#946)
dbogunowicz Mar 16, 2023
add4625
[CodeGen][Documentation] (#956)
dbogunowicz Mar 23, 2023
22d2746
reimplementation for generative pipelines
markurtz May 8, 2023
7f1651d
restore text generation from examples
dbogunowicz May 8, 2023
b85746d
[CodeGen] ONNX model loading to support >2Gb models / two engines (#991)
dbogunowicz May 8, 2023
aadc608
refactor successful
dbogunowicz May 10, 2023
58bc2b0
Pipeline fully refactored, time to test engine support. Note: Sliding…
dbogunowicz May 11, 2023
d538444
First iteration with Sage
dbogunowicz May 11, 2023
e19676b
Apply suggestions from code review
dbogunowicz May 11, 2023
7908b74
ORT agrees with the Engine. But they both give not entirely correct r…
dbogunowicz May 11, 2023
4bc3472
dynamic ORT vs static DS
dbogunowicz May 12, 2023
c07f7ed
pipeline handles OPT multitoken pass
dbogunowicz May 16, 2023
fb77838
fixes to get static pipeline a little further along
May 16, 2023
2097463
adjust shapes and slicing to enable static autoregressive pass - ISSU…
May 17, 2023
5eb10a9
migrate from cache_length to positions input
May 18, 2023
9213f29
got if working for multitoken + single token scenario
dbogunowicz May 18, 2023
d9af004
cleanup the pipeline
dbogunowicz May 19, 2023
476f25d
further cleanup post merge
dbogunowicz May 19, 2023
fab44e4
Pipeline working for single-token inference only
dbogunowicz May 19, 2023
d454e2f
do not load the onnx model with external files twice
dbogunowicz May 19, 2023
1613e25
pipeline never redundantly saves the external data + more robust toke…
dbogunowicz May 19, 2023
b61055c
Stop saving tmp files, otherwise the engine looks for external files …
dbogunowicz May 19, 2023
6ee25fc
Left pad support
May 19, 2023
5d3004b
cleanup
dbogunowicz May 22, 2023
ace6fa5
cleanup2
dbogunowicz May 22, 2023
388586d
Add in pipeline timing
markurtz May 24, 2023
afd0139
add in force tokens logic
markurtz May 24, 2023
30eeda7
remove input validation for text generation pipelines
markurtz May 24, 2023
5882b56
remove multitoken support for now
markurtz May 24, 2023
4bbe33d
remove kv cache engine and other fixes
markurtz May 25, 2023
afa5746
nest input shape override
markurtz May 25, 2023
e2bb78c
comment out input shape override
markurtz May 25, 2023
2299009
add non batch override for ORT
markurtz May 25, 2023
2935b77
clean up generation pipeline
markurtz Jun 9, 2023
b89b156
Merge branch 'main' into feature/damian/do_not_save_to_tmp
dbogunowicz Jun 11, 2023
dc3d61b
initial commit
dbogunowicz Jun 5, 2023
a294265
Update src/deepsparse/license.py
dbogunowicz Jun 5, 2023
af97f2b
limit to 150mb
dbogunowicz Jun 7, 2023
c117788
ready to review
dbogunowicz Jun 7, 2023
4ad5f49
fix the erroneous Makefile
dbogunowicz Jun 13, 2023
9e816bb
Merge branch 'feature/damian/do_not_save_to_tmp' of https://github.co…
dbogunowicz Jun 13, 2023
f97467f
perhaps fixed GHA
dbogunowicz Jun 13, 2023
6be8d87
take into consideration that GHA creates four files
dbogunowicz Jun 13, 2023
e2f088d
initial commit
dbogunowicz Jun 13, 2023
9fc6c64
Merge remote-tracking branch 'origin/feature/damian/do_not_save_to_tm…
dbogunowicz Jun 13, 2023
a610faf
tested with actual model
dbogunowicz Jun 13, 2023
347d1fb
remove val_inp argument
dbogunowicz Jun 13, 2023
e11027c
Update README.md
dbogunowicz Jun 13, 2023
a950910
Apply suggestions from code review
dbogunowicz Jun 13, 2023
c1d02dc
Update README.md
dbogunowicz Jun 13, 2023
711cdfb
Merge branch 'main' into feature/damian/codegen_pipeline_clean
dbogunowicz Jun 13, 2023
e602662
Merge branch 'main' into feature/damian/codegen_pipeline_clean
dbogunowicz Jun 14, 2023
2085c37
[BugFix] Update deepsparse dockerfile (#1069)
rahul-tuli Jun 14, 2023
2f7bc95
initial implementation
dbogunowicz Jun 15, 2023
e18fab7
working implementation for pipeline input
dbogunowicz Jun 16, 2023
0358d87
[Fix] Fix CLI benchmark errors (#1071)
dbogunowicz Jun 15, 2023
06b5246
Merge branch 'main' into feature/damian/codegen_pipeline_clean
dbogunowicz Jun 16, 2023
2cab681
Merge branch 'feature/damian/codegen_pipeline_clean' into feature/dam…
dbogunowicz Jun 16, 2023
63b116b
Clean a typo in the pipeline code
dbogunowicz Jun 16, 2023
cde08b9
initial commit
dbogunowicz Jun 21, 2023
99d125c
Merge branch 'main' into feature/damian/fb_kv_cache
dbogunowicz Jun 22, 2023
67ffe47
Merge branch 'main' into feature/damian/fb_kv_cache
dbogunowicz Jun 26, 2023
9937686
Merge branch 'main' into feature/damian/fb_kv_cache
dbogunowicz Jun 27, 2023
0d6a423
[KV Cache Interface] DecoderKVCache (#1084)
dbogunowicz Jun 28, 2023
0809aea
[WiP] [KV Cache Interface] Text Generation & Decoder Engine Implement…
dbogunowicz Jun 28, 2023
7001a6e
working implementation, time to cleanup
dbogunowicz Jun 29, 2023
c1bf5b7
now kv cache decoder holds information about the num of tokens prepro…
dbogunowicz Jun 29, 2023
79251e6
cleanup the old files
dbogunowicz Jun 29, 2023
9efbdb6
Update src/deepsparse/transformers/engines/nl_decoder_engine.py
dbogunowicz Jun 29, 2023
da5e93e
ready for review
dbogunowicz Jun 29, 2023
a680dac
ready for testing
dbogunowicz Jun 29, 2023
7099994
managed to get first logits right
dbogunowicz Jun 29, 2023
1d4d96d
Delete example
dbogunowicz Jun 29, 2023
08e5421
cleanup before sharing with Ben and Sage
dbogunowicz Jun 29, 2023
bfaa072
Merge branch 'feature/damian/pipeline_engine_support' of https://gith…
dbogunowicz Jun 29, 2023
fbeeb4a
Update src/deepsparse/transformers/engines/nl_decoder_engine.py
dbogunowicz Jun 29, 2023
f83dcab
assert proper padding on pipeline init
dbogunowicz Jul 3, 2023
e659c33
now also supporting kv cache perplexity. time for cleanup
dbogunowicz Jul 3, 2023
cf74ad7
ready for review
dbogunowicz Jul 3, 2023
853f876
correctly print engine info
dbogunowicz Jul 3, 2023
e8da07e
work with left padding of the tokenizer
dbogunowicz Jul 3, 2023
58b12c8
quality
dbogunowicz Jul 3, 2023
eecd232
fix the multitoken inference
dbogunowicz Jul 5, 2023
10c804a
Perplexity Eval for Text Generation Models (#1073)
dbogunowicz Jul 5, 2023
7bd23d6
Merge branch 'main' into feature/damian/fb_kv_cache
dbogunowicz Jul 5, 2023
10ba82e
[Text Generation] Run deepsparse engine without the LIB.kv_cache obje…
dbogunowicz Jul 7, 2023
e81c327
added few improvements that turned out to be useful post manual testing
dbogunowicz Jul 7, 2023
b737f77
Update src/deepsparse/transformers/engines/nl_decoder_engine.py
dbogunowicz Jul 7, 2023
042cb79
fixed the logic to assert correct multibatch inference
dbogunowicz Jul 7, 2023
bf4eac3
Merge branch 'feature/damian/fb_kv_cache' of https://github.com/neura…
dbogunowicz Jul 7, 2023
c8a1f93
fix integration tests
dbogunowicz Jul 7, 2023
d2d3dc1
initial implementation
dbogunowicz Jul 10, 2023
6ce1ca4
perplexity working, so as batched inference for different sized inputs
dbogunowicz Jul 10, 2023
47dc986
Merge branch 'main' into feature/damian/fb_kv_cache
dbogunowicz Jul 10, 2023
ef77d91
fix the integration test
dbogunowicz Jul 10, 2023
f0d74b0
Merge branch 'feature/damian/fb_kv_cache' of https://github.com/neura…
dbogunowicz Jul 10, 2023
186c80c
better solution for fixing the issues caused by this PR in GHA
dbogunowicz Jul 10, 2023
09993e7
revert changes to yolo pipeline
dbogunowicz Jul 10, 2023
ba8c126
Merge branch 'main' into feature/damian/fb_kv_cache
dbogunowicz Jul 11, 2023
37e8a02
Update src/deepsparse/transformers/engines/nl_decoder_engine.py
dbogunowicz Jul 11, 2023
0d308b9
response to Rahuls comments
dbogunowicz Jul 11, 2023
41e9306
Merge remote-tracking branch 'origin/main' into feature/damian/fb_kv_…
dbogunowicz Jul 12, 2023
48 changes: 0 additions & 48 deletions src/deepsparse/engine.py
@@ -28,7 +28,6 @@
from deepsparse.benchmark import BenchmarkResults
from deepsparse.utils import (
generate_random_inputs,
get_output_names,
join_engine_outputs,
model_to_path,
override_onnx_input_shapes,
@@ -56,7 +55,6 @@
"Scheduler",
"Context",
"MultiModelEngine",
"KVCacheEngine",
"BaseEngine",
]

@@ -867,52 +865,6 @@ def __init__(
)


class KVCacheEngine(Engine):
"""
Engine that can do kv caching.
"""

def __init__(
self,
model: Union[str, "Model", "File"],
batch_size: int = 1,
num_cores: int = None,
num_streams: int = None,
scheduler: Scheduler = None,
input_shapes: List[List[int]] = None,
kv_cache_bools: List[bool] = None,
prev_cache_length: int = 0,
):
BaseEngine.construct(
self, model, batch_size, num_cores, num_streams, scheduler, input_shapes
)

if kv_cache_bools is None:
# If no list was provided, then we assume all outputs except for the first are KV caches
# Note: In the future we can look at the names of outputs to be more sure
#
# Create a boolean list of every output of the model
output_names = get_output_names(self._model_path)
kv_cache_bools = [True for i in range(len(output_names))]
# Assume first input is logits and logits ought not to be cached
kv_cache_bools[0] = False

num_streams = _validate_num_streams(num_streams, self._num_cores)
if self._input_shapes:
raise NotImplementedError("Don't do this yet :)")
else:
self._eng_net = LIB.deepsparse_engine(
self._model_path,
self._batch_size,
self._num_cores,
num_streams,
self._scheduler.value,
None,
kv_cache_bools,
prev_cache_length,
)


def compile_model(
model: Union[str, "Model", "File"],
batch_size: int = 1,
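The default the removed `KVCacheEngine` applied when no `kv_cache_bools` list was provided can be sketched in isolation. This is a minimal stand-in, not the deepsparse implementation: the `output_names` list below is hypothetical, standing in for what `get_output_names` would return for a decoder-style model.

```python
def build_kv_cache_bools(output_names):
    """Mark every output except the first (the logits) as a KV-cache output,
    mirroring the assumption the removed KVCacheEngine made."""
    kv_cache_bools = [True] * len(output_names)
    kv_cache_bools[0] = False  # logits ought not to be cached
    return kv_cache_bools

# hypothetical output names for a decoder-style ONNX model
print(build_kv_cache_bools(["logits", "present.0.key", "present.0.value"]))
# -> [False, True, True]
```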
62 changes: 42 additions & 20 deletions src/deepsparse/pipeline.py
@@ -59,6 +59,7 @@
"yolo_pipeline",
"Bucketable",
"BucketingPipeline",
"create_engine",
]

DEEPSPARSE_ENGINE = "deepsparse"
@@ -753,26 +754,10 @@ def log_inference_times(self, timer: StagedTimer):
category=MetricCategories.SYSTEM,
)

def _initialize_engine(self) -> Union[Engine, ORTEngine]:
engine_type = self.engine_type.lower()

if engine_type == DEEPSPARSE_ENGINE:
if self.context is not None and isinstance(self.context, Context):
self._engine_args.pop("num_cores", None)
self._engine_args.pop("scheduler", None)
self._engine_args["context"] = self.context
return MultiModelEngine(
model=self.onnx_file_path,
**self._engine_args,
)
return Engine(self.onnx_file_path, **self._engine_args)
elif engine_type == ORT_ENGINE:
return ORTEngine(self.onnx_file_path, **self._engine_args)
else:
raise ValueError(
f"Unknown engine_type {self.engine_type}. Supported values include: "
f"{SUPPORTED_PIPELINE_ENGINES}"
)
def _initialize_engine(self) -> Union[Engine, MultiModelEngine, ORTEngine]:
return create_engine(
self.onnx_file_path, self.engine_type, self._engine_args, self.context
)

def _identifier(self):
# get pipeline identifier; used in the context of logging
@@ -950,6 +935,43 @@ def route_input_to_bucket(
pass


def create_engine(
onnx_file_path: str,
engine_type: str,
engine_args: Dict,
context: Optional[Context] = None,
) -> Union[Engine, MultiModelEngine, ORTEngine]:
"""
Create an inference engine for a given ONNX model

:param onnx_file_path: path to ONNX model file
:param engine_type: type of engine to create.
:param engine_args: arguments to pass to engine constructor
:param context: context to use for engine
:return: inference engine
"""
engine_type = engine_type.lower()

if engine_type == DEEPSPARSE_ENGINE:
if context is not None and isinstance(context, Context):
engine_args.pop("num_cores", None)
engine_args.pop("scheduler", None)
engine_args["context"] = context
return MultiModelEngine(
model=onnx_file_path,
**engine_args,
)
return Engine(onnx_file_path, **engine_args)

if engine_type == ORT_ENGINE:
return ORTEngine(onnx_file_path, **engine_args)

raise ValueError(
f"Unknown engine_type {engine_type}. Supported values include: "
f"{SUPPORTED_PIPELINE_ENGINES}"
)


def _initialize_executor_and_workers(
batch_size: Optional[int],
workers_or_executor: Optional[Union[int, ThreadPoolExecutor]],
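The branching in the new `create_engine` helper can be sketched standalone. This is a simplified mirror for illustration only: the engine classes are replaced by string stand-ins, and the `"onnxruntime"` value is an assumption about what `ORT_ENGINE` holds in `pipeline.py`.

```python
DEEPSPARSE_ENGINE = "deepsparse"
ORT_ENGINE = "onnxruntime"  # assumed value of pipeline.py's ORT_ENGINE constant

def select_engine_class(engine_type: str, context=None) -> str:
    """Mirror create_engine's dispatch, returning the class name it would pick."""
    engine_type = engine_type.lower()
    if engine_type == DEEPSPARSE_ENGINE:
        # a shared Context routes compilation through MultiModelEngine
        return "MultiModelEngine" if context is not None else "Engine"
    if engine_type == ORT_ENGINE:
        return "ORTEngine"
    raise ValueError(f"Unknown engine_type {engine_type}")

print(select_engine_class("DeepSparse"))  # -> Engine
```

Factoring this logic out of the `Pipeline` method lets other callers (for example, the text-generation pipeline's decoder engines) construct engines with the same rules.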
21 changes: 21 additions & 0 deletions src/deepsparse/tasks.py
@@ -95,6 +95,12 @@ class SupportedTasks:
),
)

text_generation = namedtuple("text_generation", ["opt", "codegen", "bloom"])(
codegen=AliasedTask("codegen", []),
opt=AliasedTask("opt", []),
bloom=AliasedTask("bloom", []),
)

image_classification = namedtuple("image_classification", ["image_classification"])(
image_classification=AliasedTask(
"image_classification",
@@ -150,6 +156,9 @@ def check_register_task(
# custom task, register the CustomPipeline
import deepsparse.pipelines.custom_pipeline # noqa: F401

elif cls.is_text_generation(task):
import deepsparse.transformers.pipelines.text_generation # noqa: F401

elif cls.is_nlp(task):
# trigger transformers pipelines to register with Pipeline.register
import deepsparse.transformers.pipelines # noqa: F401
@@ -193,6 +202,18 @@
f"{list(all_tasks)}"
)

@classmethod
def is_text_generation(cls, task: str) -> bool:
"""
:param task: the name of the task to check whether it is a text generation task
such as codegen
:return: True if it is a text generation task, False otherwise
"""
return any(
text_generation_task.matches(task)
for text_generation_task in cls.text_generation
)

@classmethod
def is_nlp(cls, task: str) -> bool:
"""
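The matching that `is_text_generation` relies on can be sketched with a minimal stand-in for `AliasedTask`. This mock only mimics the name/alias check (the real class also normalizes dashes vs. underscores), and the alias lists are empty as in the diff above.

```python
class AliasedTask:
    """Minimal stand-in for deepsparse's AliasedTask: a task string matches
    the canonical name or any registered alias."""

    def __init__(self, name, aliases):
        self._name = name
        self._aliases = aliases

    def matches(self, task: str) -> bool:
        return task == self._name or task in self._aliases

# mirrors the namedtuple fields added to SupportedTasks
text_generation = [
    AliasedTask("codegen", []),
    AliasedTask("opt", []),
    AliasedTask("bloom", []),
]

def is_text_generation(task: str) -> bool:
    return any(t.matches(task) for t in text_generation)

print(is_text_generation("opt"))  # -> True
print(is_text_generation("qa"))   # -> False
```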
48 changes: 45 additions & 3 deletions src/deepsparse/transformers/README.md
@@ -10,6 +10,7 @@ methods such as [pruning](https://neuralmagic.com/blog/pruning-overview/) and [q
These techniques result in significantly more performant and smaller models with limited to no effect on the baseline metrics.

This integration currently supports several fundamental NLP tasks:
- **Text Generation** - given the input prompt, generate an output text sequence (e.g. to fill in incomplete text, summarize or paraphrase a text paragraph)
- **Question Answering** - posing questions about a document
- **Sentiment Analysis** - assigning a sentiment to a piece of text
- **Text Classification** - assigning a label or class to a piece of text (e.g duplicate question pairing)
@@ -32,9 +33,9 @@ This grants the engine the flexibility to serve any model in a framework-agnosti

The DeepSparse pipelines require the following files within a folder on the local server to properly load a Transformers model:
- `model.onnx`: The exported Transformers model in the [ONNX format](https://github.com/onnx/onnx).
- `tokenizer.json`: The [HuggingFace compatible tokenizer configuration](https://huggingface.co/docs/transformers/fast_tokenizers) used with the model.
- `config.json`: The [HuggingFace compatible configuration file](https://huggingface.co/docs/transformers/main_classes/configuration) used with the model.

- `tokenizer_config.json`: The [HuggingFace compatible tokenizer configuration](https://huggingface.co/docs/transformers/fast_tokenizers) used with the model.
- `tokenizer.json`, `special_tokens_map.json`, `vocab.json`, `merges.txt` (optional): Other files that may be required by a tokenizer
Below we describe two possibilities to obtain the required structure.

#### SparseML Export
@@ -48,7 +49,7 @@ sparseml.transformers.export_onnx --task question-answering --model_path model_p
```

This creates `model.onnx` file, in the directory of your `model_path`(e.g. `/trained_model/model.onnx`).
The `tokenizer.json` and `config.json` are stored under the `model_path` folder as well, so a DeepSparse pipeline ca be directly instantiated by using that folder after export (e.g. `/trained_model/`).
Any additional required files, such as `tokenizer.json` or `config.json`, are stored under the `model_path` folder as well, so a DeepSparse pipeline can be directly instantiated using that folder after export (e.g. `/trained_model/`).

#### SparseZoo Stub
Alternatively, you can skip the process of the ONNX model export by using Neural Magic's [SparseZoo](https://sparsezoo.neuralmagic.com/). The SparseZoo contains pre-sparsified models and SparseZoo stubs enable you to reference any model on the SparseZoo in a convenient and predictable way.
@@ -138,6 +139,47 @@
>> '{"score":0.9534820914268494,"start":8,"end":14,"answer":"batman"}'
```

### Text Generation
The text generation task generates a sequence of tokens given a prompt. Popular text generation LLMs (Large Language Models) are used
for chatbots (instruction-tuned models), code generation, text summarization, and filling in missing text. The following example uses a sparsified text generation
OPT model to complete the prompt.

[List of available SparseZoo Text Generation Models](
https://sparsezoo.neuralmagic.com/?useCase=text_generation)

#### Python Pipeline
```python
from deepsparse import Pipeline

opt_pipeline = Pipeline.create(task="opt")

inference = opt_pipeline("Who is the president of the United States?")

>> 'The president of the United States is the head of the executive branch of government...'
```

#### HTTP Server
Spinning up:
```bash
deepsparse.server \
task text-generation \
--model_path # TODO: Pending until text generation models get uploaded to SparseZoo
```

Making a request:
```python
import requests

url = "http://localhost:5543/predict" # Server's port defaults to 5543

obj = {"sequence": "Who is the president of the United States?"}

response = requests.post(url, json=obj)
response.text

>> 'The president of the United States is the head of the executive branch of government...'
```

### Sentiment Analysis
The sentiment analysis task takes in a sentence and classifies its sentiment. The following example
uses a pruned and quantized text sentiment analysis BERT model trained on the `sst2` dataset downloaded
15 changes: 15 additions & 0 deletions src/deepsparse/transformers/engines/__init__.py
@@ -0,0 +1,15 @@
# Copyright (c) 2021 - present / Neuralmagic, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# flake8: noqa
from .nl_decoder_engine import *