Commit

Deployed with sha 0ac709b
Unknown committed Jul 1, 2024
0 parents commit 5bbc690
Showing 184 changed files with 16,966 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: af35db413f479da911bb54dc5c091c9e
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file added .doctrees/adding_operations.doctree
Binary file not shown.
Binary file added .doctrees/cpp_reference.doctree
Binary file not shown.
Binary file added .doctrees/developer.doctree
Binary file not shown.
Binary file added .doctrees/environment.pickle
Binary file not shown.
Binary file added .doctrees/index.doctree
Binary file not shown.
Binary file added .doctrees/llm.doctree
Binary file not shown.
Binary file added .doctrees/llm_performance.doctree
Binary file not shown.
Binary file added .doctrees/npu.doctree
Binary file not shown.
Binary file added .doctrees/python/modules.doctree
Binary file not shown.
Binary file added .doctrees/setup.doctree
Binary file not shown.
Binary file added .doctrees/usage.doctree
Binary file not shown.
Empty file added .nojekyll
Empty file.
Binary file added _images/llm_perf.png
Binary file added _images/npu_arch.png
91 changes: 91 additions & 0 deletions _sources/adding_operations.md
@@ -0,0 +1,91 @@
# Adding New Operations in the Library

This document outlines the process for integrating a new operation into the existing code library. The integration involves several key steps: defining the operation's interface, implementing the operation in a way that is compatible with the library's architecture, and providing tests that validate the operation.

An example of implementing new operations can be found here: [Implementing reduce operations](https://github.com/intel/intel-npu-acceleration-library/commit/4f17015a75c146fe8d569ac71a2e2a0a960fc652)

## Step 1: Defining the OpenVINO interface

The first step is to define the call to the new operation's OpenVINO method through the OpenVINO Runtime C++ API. This is done in the `nn_factory.h` header, where the new operation is created by interfacing with the corresponding OpenVINO operation. This includes specifying the input and output parameters and the data types of the operation's interface, then calling and returning the OpenVINO method. The interface should align with the library's existing design patterns and naming conventions.

A simple example of defining a new operation:
```cpp
ov::op::Op* new_operation(ov::op::Op* input) {
    // Create the OpenVINO node for the new operation
    auto new_operation = std::make_shared<ov::opset1::NewOp>(input->output(0));
    // Keep the node alive by storing it in the factory's operation list
    operations.push_back(new_operation);
    return new_operation.get();
}
```
## Step 2: Defining the C++ bindings

The next step is to define the C++ binding in the `binding.cpp` source file. This is the method that will be called from Python. It takes the operation's input node as a parameter, and any additional arguments of the operation are defined in the method signature.

An example of defining the binding:
```cpp
intel_npu_acceleration_library_DLL_API ov::op::Op* new_operation(intel_npu_acceleration_library::ModelFactory* factory, ov::op::Op* input) {
    return factory->new_operation(input);
}
```

## Step 3: Adding the new operation to the list of supported operations

The new operation is added to the list of supported NPU operations in the `ops.py` script.
The following information must be provided for the new operation (see the sketch after this list):
- the operation name
- the number of inputs
- the types of any optional parameters
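
As a purely hypothetical illustration (the exact structure used in `ops.py` may differ), such an entry could look like:
```python
from dataclasses import dataclass, field
from typing import Any, Sequence


@dataclass(frozen=True)
class SupportedOp:
    """Describes an NPU operation exposed by the C++ backend (illustrative only)."""

    name: str  # operation name, e.g. "new_operation"
    inputs: int  # number of input nodes
    parameters: Sequence[Any] = field(default_factory=list)  # optional parameter types


# Register the new operation alongside the existing ones
supported_ops = [
    SupportedOp(name="new_operation", inputs=1),
]
```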

## Step 4: Adding extra functionality to the operation's function
Ctypes is used to interface between C++ and Python (documentation can be found here: [Python Ctypes](https://docs.python.org/3/library/ctypes.html)).

If there is additional logic that you want to add to the function, this can be done by defining a Python function in the `factory.py` file that calls the C++ method. Otherwise, if you call the C++ function directly, you do not need to define a separate Python function.
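
For illustration only, a Python-side wrapper along these lines could be added in `factory.py`. The library name, the attribute names, and the `ModelFactory` stand-in below are assumptions, not the library's exact API:
```python
import ctypes

# Illustrative only: load the compiled backend library (the real name/path may differ).
backend_lib = ctypes.CDLL("intel_npu_acceleration_library_backend.dll")

# Declare the signature of the Step 2 binding so ctypes marshals the opaque pointers.
backend_lib.new_operation.argtypes = [ctypes.c_void_p, ctypes.c_void_p]
backend_lib.new_operation.restype = ctypes.c_void_p


class ModelFactory:  # simplified stand-in for the factory class in factory.py
    def __init__(self, factory_handle):
        self._factory = factory_handle

    def new_operation(self, input_node):
        """Add the new operation to the graph, with room for extra Python-side logic."""
        # Any additional validation or argument preprocessing can be done here
        return backend_lib.new_operation(self._factory, input_node)
```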

## Step 5: Adding a PyTorch wrapper for the new operation
A wrapper that mirrors the PyTorch native function can be implemented in the `functional.py` file. In this step, a function with the same name as the PyTorch equivalent is created and used in place of the PyTorch implementation of the operation.
Any additional logic needed to interface with the new operation can also be added in this function.

It is common for the new operation to have the same name as its PyTorch equivalent, but this is not always the case. To make clear which operation is being referred to, the newly implemented operation is called `new_operation` and the PyTorch operation `operation`.

The basic structure of a PyTorch wrapper for a PyTorch operation, referred to as `torch.operation`, that returns the output of the implemented `new_operation`:
```python
@implements(torch.operation)
def operation(x: Tensor) -> Tensor:
    """Return the output tensor of the operation.

    Args:
        x (Tensor): The input tensor.

    Returns:
        Tensor: Output tensor.
    """
    return generate_op(x, "new_operation")
```
## Step 6: Building the library
To update the library, run the command:
```bash
pip install .
```

## Step 7: Adding tests for the new operation
A test for the new operation can be added to the `test_op.py` script. The new operation should be compared against a reference implementation to ensure it is correct.

The following is a basic structure to use the new operation:
```python
X = torch.rand((16, 128)).to(torch.float16)  # defining the input tensor
model = NNFactory()
input = model.parameter(X.shape)  # creating the input node
_ = model.new_operation(input)  # _ = torch.operation(input) is equivalent if using the PyTorch wrapper
model.compile()
out = model.run(X.numpy())
```
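
Building on the structure above, the reference comparison could be wrapped in a pytest test along the lines of the following sketch. The import path for `NNFactory`, the parametrised shapes, the tolerance, and the `torch.operation` reference are assumptions or placeholders for the real operation:
```python
import numpy as np
import pytest
import torch

from intel_npu_acceleration_library.backend import NNFactory  # assumed import path


@pytest.mark.parametrize("shape", [(16, 128), (1, 256)])
def test_new_operation(shape):
    x = torch.rand(shape).to(torch.float16)
    reference = torch.operation(x)  # placeholder for the PyTorch reference op

    # Build and run the same graph structure shown above
    model = NNFactory()
    par = model.parameter(x.shape)
    _ = model.new_operation(par)
    model.compile()
    out = model.run(x.numpy())

    assert out.shape == reference.shape
    assert np.allclose(out, reference.numpy(), atol=1e-3)  # tolerance is illustrative
```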

Using pytest to run all of the tests in the file:
```bash
pytest <name of the file>
```

Using pytest to run a single test in the file:
```bash
pytest <name of the file>::<name of the test>
```
5 changes: 5 additions & 0 deletions _sources/cpp_reference.rst
@@ -0,0 +1,5 @@
C++ API Reference
=================

.. doxygenindex::
:project: Intel® NPU Acceleration Library
87 changes: 87 additions & 0 deletions _sources/developer.md
@@ -0,0 +1,87 @@
# Developer Guide

Install developer packages by typing

```bash
pip install .[dev]
```

It is suggested to install the package in editable mode by using `pip install -e .[dev]`.

## Git hooks

All developers should install the git hooks that are tracked in the `.githooks` directory. We use the pre-commit framework for hook management; the recommended way of installing it is with pip. Once pre-commit is installed, enable the hooks with:

```bash
pre-commit install
```

If you want to manually run all pre-commit hooks on a repository, run `pre-commit run --all-files`. To run individual hooks use `pre-commit run <hook_id>`.

Uninstalling the hooks can be done using

```bash
pre-commit uninstall
```

## Testing the library

### Python test

The Python tests use the `pytest` library. Type

```bash
cd test/python && pytest
```

to run the full test suite.

## Build the documentation

This project uses `sphinx` to build and deploy the documentation. To serve the documentation locally, type

```bash
mkdocs serve
```

To deploy it to GitHub Pages, type

```bash
cd docs
python build_doc.py gh-deploy
```

## Generate Python packages

On Windows:

```bat
python setup.py sdist
set CIBW_BUILD=cp*
cibuildwheel --platform windows --output-dir dist
```


## Publishing packages

Install twine:
```bat
python3 -m pip install --upgrade twine
```

Then check that the built sdist and wheel are properly formatted (all files should return a green `PASSED`):

```bat
twine check dist/*
```

Upload the packages to `testpypi`:

```bat
twine upload --repository testpypi dist/*
```

To upload them to the real index (**verify first with testpypi**):
```bat
twine upload dist/*
```
114 changes: 114 additions & 0 deletions _sources/index.rst
@@ -0,0 +1,114 @@
.. Intel® NPU Acceleration Library documentation master file, created by
   sphinx-quickstart on Wed Feb 7 11:48:32 2024.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to Intel® NPU Acceleration Library's documentation!
============================================================

The Intel® NPU Acceleration Library is a Python library designed to boost the efficiency of your applications by leveraging the power of the Intel Neural Processing Unit (NPU) to perform high-speed computations on compatible hardware.

Installation
-------------

Check that your system has an available NPU (`how-to <https://www.intel.com/content/www/us/en/support/articles/000097597/processors.html>`_).

You can install the package on your machine with

.. code-block:: bash

   pip install intel-npu-acceleration-library

Run a LLaMA model on the NPU
----------------------------

To run LLM models you need to install the `transformers` library


.. code-block:: bash

   pip install transformers

You are now up and running! You can create a simple script like the following one to run an LLM on the NPU:


.. code-block:: python
   :emphasize-lines: 2, 7

   from transformers import AutoTokenizer, TextStreamer
   from intel_npu_acceleration_library import NPUModelForCausalLM
   import torch

   model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

   model = NPUModelForCausalLM.from_pretrained(model_id, use_cache=True, dtype=torch.int8).eval()
   tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
   tokenizer.pad_token_id = tokenizer.eos_token_id
   streamer = TextStreamer(tokenizer, skip_special_tokens=True)

   query = input("Ask something: ")
   prefix = tokenizer(query, return_tensors="pt")["input_ids"]

   generation_kwargs = dict(
       input_ids=prefix,
       streamer=streamer,
       do_sample=True,
       top_k=50,
       top_p=0.9,
       max_new_tokens=512,
   )

   print("Run inference")
   _ = model.generate(**generation_kwargs)

Take note that you only need to use `intel_npu_acceleration_library.compile` to offload the heavy computation to the NPU.

Feel free to check the `Usage <usage.html>`_ and `LLM <llm.html>`_ pages and the `examples <https://github.com/intel/intel-npu-acceleration-library/tree/main/examples>`_ folder for additional use cases and examples.



Site map
----------------------------

.. toctree::
   :maxdepth: 1
   :caption: Library overview:

   Quickstart <self>
   NPU overview <npu.md>
   usage.md
   setup.md


.. toctree::
   :maxdepth: 1
   :caption: Applications:

   llm.md
   llm_performance.md


.. toctree::
   :maxdepth: 1
   :caption: Development guide:

   developer.md
   adding_operations.md


.. toctree::
   :maxdepth: 1
   :caption: API Reference:

   Python API Reference <python/intel_npu_acceleration_library.rst>
   cpp_reference.rst




Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
53 changes: 53 additions & 0 deletions _sources/llm.md
@@ -0,0 +1,53 @@
# Large Language Models


## Run an LLM on the NPU

You can use your existing LLM inference script on the NPU with a simple line of code

```python
# First import the library
import intel_npu_acceleration_library

# Call the compile function to offload kernels to the NPU.
model = intel_npu_acceleration_library.compile(model)
```

Here is a full example:

```python
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
import intel_npu_acceleration_library
import torch

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
streamer = TextStreamer(tokenizer, skip_special_tokens=True)


print("Compile model for the NPU")
model = intel_npu_acceleration_library.compile(model)

query = "What is the meaning of life?"
prefix = tokenizer(query, return_tensors="pt")["input_ids"]


generation_kwargs = dict(
input_ids=prefix,
streamer=streamer,
do_sample=True,
top_k=50,
top_p=0.9,
)

print("Run inference")
_ = model.generate(**generation_kwargs)

```
