Commit

Deployed with sha 0ac709b
Unknown committed Jul 1, 2024
0 parents commit 5bbc690
Showing 184 changed files with 16,966 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: af35db413f479da911bb54dc5c091c9e
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file added .doctrees/adding_operations.doctree
Binary file not shown.
Binary file added .doctrees/cpp_reference.doctree
Binary file not shown.
Binary file added .doctrees/developer.doctree
Binary file not shown.
Binary file added .doctrees/environment.pickle
Binary file not shown.
Binary file added .doctrees/index.doctree
Binary file not shown.
Binary file added .doctrees/llm.doctree
Binary file not shown.
Binary file added .doctrees/llm_performance.doctree
Binary file not shown.
Binary file added .doctrees/npu.doctree
Binary file not shown.
Binary file added .doctrees/python/modules.doctree
Binary file not shown.
Binary file added .doctrees/setup.doctree
Binary file not shown.
Binary file added .doctrees/usage.doctree
Binary file not shown.
Empty file added .nojekyll
Empty file.
Binary file added _images/llm_perf.png
Binary file added _images/npu_arch.png
91 changes: 91 additions & 0 deletions _sources/adding_operations.md
@@ -0,0 +1,91 @@
# Adding New Operations in the Library

This document outlines the process for integrating a new operation into the existing code library. The integration involves several key steps: defining the operation's interface, implementing the operation in a way that is compatible with the library's architecture, and providing tests that validate the operation.

An example of implementing new operations can be found here: [Implementing reduce operations](https://github.com/intel/intel-npu-acceleration-library/commit/4f17015a75c146fe8d569ac71a2e2a0a960fc652)

## Step 1: Defining the OpenVINO interface

The first step is to define the call to the new operation's OpenVINO method through the OpenVINO Runtime C++ API. This is done in the `nn_factory.h` header, where the new operation is created by interfacing with the corresponding OpenVINO operation. This includes specifying the input and output parameters and the data types of the operation's interface, then calling and returning the OpenVINO method. The interface should align with the library's existing design patterns and naming conventions.

A simple example of defining a new operation:
```cpp
ov::op::Op* new_operation(ov::op::Op* input) {
    // Create the OpenVINO node for the new operation
    auto new_operation = std::make_shared<ov::opset1::NewOp>(input->output(0));
    // Keep the node alive by storing it in the factory's operation list
    operations.push_back(new_operation);
    return new_operation.get();
}
```
## Step 2: Defining the C++ bindings

The next step is to define the C++ binding in the `binding.cpp` source file. This is the method that will be called from Python. It takes the operation's input node as a parameter, and any additional arguments of the operation are defined in the method signature.

An example of defining the binding:
```cpp
intel_npu_acceleration_library_DLL_API ov::op::Op* new_operation(intel_npu_acceleration_library::ModelFactory* factory, ov::op::Op* input) {
    return factory->new_operation(input);
}
```

## Step 3: Adding the new operation to the list of supported operations

The new operation is added to the list of supported NPU operations in the `ops.py` script.
The following information must be provided for the new operation (see the sketch after this list):
- the operation name
- the number of inputs
- the types of any optional parameters
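
As a purely hypothetical illustration (the exact structure used in `ops.py` may differ), such an entry could look like:
```python
from dataclasses import dataclass, field
from typing import Any, Sequence


@dataclass(frozen=True)
class SupportedOp:
    """Describes an NPU operation exposed by the C++ backend (illustrative only)."""

    name: str  # operation name, e.g. "new_operation"
    inputs: int  # number of input nodes
    parameters: Sequence[Any] = field(default_factory=list)  # optional parameter types


# Register the new operation alongside the existing ones
supported_ops = [
    SupportedOp(name="new_operation", inputs=1),
]
```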

## Step 4: Adding extra functionality to the operation's function
Ctypes is used to interface between C++ and Python (documentation can be found here: [Python Ctypes](https://docs.python.org/3/library/ctypes.html)).

If there is additional logic that you want to add to the function, this can be done by defining a Python function in the `factory.py` file that calls the C++ method. Otherwise, if you call the C++ function directly, you do not need to define a separate Python function.
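
For illustration only, a Python-side wrapper along these lines could be added in `factory.py`. The library name, the attribute names, and the `ModelFactory` stand-in below are assumptions, not the library's exact API:
```python
import ctypes

# Illustrative only: load the compiled backend library (the real name/path may differ).
backend_lib = ctypes.CDLL("intel_npu_acceleration_library_backend.dll")

# Declare the signature of the Step 2 binding so ctypes marshals the opaque pointers.
backend_lib.new_operation.argtypes = [ctypes.c_void_p, ctypes.c_void_p]
backend_lib.new_operation.restype = ctypes.c_void_p


class ModelFactory:  # simplified stand-in for the factory class in factory.py
    def __init__(self, factory_handle):
        self._factory = factory_handle

    def new_operation(self, input_node):
        """Add the new operation to the graph, with room for extra Python-side logic."""
        # Any additional validation or argument preprocessing can be done here
        return backend_lib.new_operation(self._factory, input_node)
```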

## Step 5: Adding a PyTorch wrapper for the new operation
A wrapper that mirrors the PyTorch native function can be implemented in the `functional.py` file. In this step, a function with the same name as the PyTorch equivalent is created and used in place of the PyTorch implementation of the operation.
Any additional logic needed to interface with the new operation can also be added in this function.

It is common for the new operation to have the same name as its PyTorch equivalent, but this is not always the case. To make clear which operation is being referred to, the newly implemented operation is called `new_operation` and the PyTorch operation `operation`.

The basic structure of a PyTorch wrapper for a PyTorch operation, referred to as `torch.operation`, that returns the output of the implemented `new_operation`:
```python
@implements(torch.operation)
def operation(x: Tensor) -> Tensor:
    """Return the output tensor of the operation.

    Args:
        x (Tensor): The input tensor.

    Returns:
        Tensor: Output tensor.
    """
    return generate_op(x, "new_operation")
```
## Step 6: Building the library
To update the library, run the command:
```bash
pip install .
```

## Step 7: Adding tests for the new operation
A test for the new operation can be added to the `test_op.py` script. The new operation should be compared against a reference implementation to ensure it is correct.

The following is a basic structure to use the new operation:
```python
X = torch.rand((16, 128)).to(torch.float16)  # defining the input tensor
model = NNFactory()
input = model.parameter(X.shape)  # creating the input node
_ = model.new_operation(input)  # _ = torch.operation(input) is equivalent if using the PyTorch wrapper
model.compile()
out = model.run(X.numpy())
```
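
Building on the structure above, the reference comparison could be wrapped in a pytest test along the lines of the following sketch. The import path for `NNFactory`, the parametrised shapes, the tolerance, and the `torch.operation` reference are assumptions or placeholders for the real operation:
```python
import numpy as np
import pytest
import torch

from intel_npu_acceleration_library.backend import NNFactory  # assumed import path


@pytest.mark.parametrize("shape", [(16, 128), (1, 256)])
def test_new_operation(shape):
    x = torch.rand(shape).to(torch.float16)
    reference = torch.operation(x)  # placeholder for the PyTorch reference op

    # Build and run the same graph structure shown above
    model = NNFactory()
    par = model.parameter(x.shape)
    _ = model.new_operation(par)
    model.compile()
    out = model.run(x.numpy())

    assert out.shape == reference.shape
    assert np.allclose(out, reference.numpy(), atol=1e-3)  # tolerance is illustrative
```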

Using pytest to run all of the tests in the file:
```bash
pytest <name of the file>
```

Using pytest to run a single test in the file:
```bash
pytest <name of the file>::<name of the test>
```
5 changes: 5 additions & 0 deletions _sources/cpp_reference.rst
@@ -0,0 +1,5 @@
C++ API Reference
=================

.. doxygenindex::
:project: Intel® NPU Acceleration Library
87 changes: 87 additions & 0 deletions _sources/developer.md
@@ -0,0 +1,87 @@
# Developer Guide

Install developer packages by typing

```bash
pip install .[dev]
```

It is suggested to install the package in editable mode by using `pip install -e .[dev]`.

## Git hooks

All developers should install the git hooks that are tracked in the `.githooks` directory. We use the pre-commit framework for hook management; the recommended way of installing it is with pip. Once pre-commit is installed, enable the hooks with:

```bash
pre-commit install
```

If you want to manually run all pre-commit hooks on a repository, run `pre-commit run --all-files`. To run individual hooks use `pre-commit run <hook_id>`.

Uninstalling the hooks can be done using

```bash
pre-commit uninstall
```

## Testing the library

### Python test

The Python tests use the `pytest` library. Type

```bash
cd test/python && pytest
```

to run the full test suite.

## Build the documentation

This project uses `sphinx` to build and deploy the documentation. To serve the documentation locally, type

```bash
mkdocs serve
```

To deploy it to GitHub Pages, type

```bash
cd docs
python build_doc.py gh-deploy
```

## Generate Python packages

On Windows:

```bat
python setup.py sdist
set CIBW_BUILD=cp*
cibuildwheel --platform windows --output-dir dist
```


## Publishing packages

Install twine:
```bat
python3 -m pip install --upgrade twine
```

Then check that the built sdist and wheel are properly formatted (all files should return a green `PASSED`):

```bat
twine check dist/*
```

Upload the packages to `testpypi`:

```bat
twine upload --repository testpypi dist/*
```

To upload them to the real index (**verify first with testpypi**):
```bat
twine upload dist/*
```
114 changes: 114 additions & 0 deletions _sources/index.rst
@@ -0,0 +1,114 @@
.. Intel® NPU Acceleration Library documentation master file, created by
   sphinx-quickstart on Wed Feb 7 11:48:32 2024.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to Intel® NPU Acceleration Library's documentation!
============================================================

The Intel® NPU Acceleration Library is a Python library designed to boost the efficiency of your applications by leveraging the power of the Intel Neural Processing Unit (NPU) to perform high-speed computations on compatible hardware.

Installation
-------------

Check that your system has an available NPU (`how-to <https://www.intel.com/content/www/us/en/support/articles/000097597/processors.html>`_).

You can install the package on your machine with

.. code-block:: bash

   pip install intel-npu-acceleration-library

Run a LLaMA model on the NPU
----------------------------

To run LLM models you need to install the `transformers` library


.. code-block:: bash

   pip install transformers

You are now up and running! You can create a simple script like the following one to run an LLM on the NPU:


.. code-block:: python
   :emphasize-lines: 2, 7

   from transformers import AutoTokenizer, TextStreamer
   from intel_npu_acceleration_library import NPUModelForCausalLM
   import torch

   model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

   model = NPUModelForCausalLM.from_pretrained(model_id, use_cache=True, dtype=torch.int8).eval()
   tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
   tokenizer.pad_token_id = tokenizer.eos_token_id
   streamer = TextStreamer(tokenizer, skip_special_tokens=True)

   query = input("Ask something: ")
   prefix = tokenizer(query, return_tensors="pt")["input_ids"]

   generation_kwargs = dict(
       input_ids=prefix,
       streamer=streamer,
       do_sample=True,
       top_k=50,
       top_p=0.9,
       max_new_tokens=512,
   )

   print("Run inference")
   _ = model.generate(**generation_kwargs)

Take note that you only need to use `intel_npu_acceleration_library.compile` to offload the heavy computation to the NPU.

Feel free to check the `Usage <usage.html>`_ and `LLM <llm.html>`_ pages and the `examples <https://github.com/intel/intel-npu-acceleration-library/tree/main/examples>`_ folder for additional use cases and examples.



Site map
----------------------------

.. toctree::
   :maxdepth: 1
   :caption: Library overview:

   Quickstart <self>
   NPU overview <npu.md>
   usage.md
   setup.md


.. toctree::
   :maxdepth: 1
   :caption: Applications:

   llm.md
   llm_performance.md


.. toctree::
   :maxdepth: 1
   :caption: Development guide:

   developer.md
   adding_operations.md


.. toctree::
   :maxdepth: 1
   :caption: API Reference:

   Python API Reference <python/intel_npu_acceleration_library.rst>
   cpp_reference.rst




Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
53 changes: 53 additions & 0 deletions _sources/llm.md
@@ -0,0 +1,53 @@
# Large Language Models


## Run an LLM on the NPU

You can use your existing LLM inference script on the NPU with a simple line of code

```python
# First import the library
import intel_npu_acceleration_library

# Call the compile function to offload kernels to the NPU.
model = intel_npu_acceleration_library.compile(model)
```

Here is a full example:

```python
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
import intel_npu_acceleration_library
import torch

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
streamer = TextStreamer(tokenizer, skip_special_tokens=True)


print("Compile model for the NPU")
model = intel_npu_acceleration_library.compile(model)

query = "What is the meaning of life?"
prefix = tokenizer(query, return_tensors="pt")["input_ids"]


generation_kwargs = dict(
input_ids=prefix,
streamer=streamer,
do_sample=True,
top_k=50,
top_p=0.9,
)

print("Run inference")
_ = model.generate(**generation_kwargs)

```
