
Qwen2-VL-7B NPU


User: <image>Describe the image.

Answer: The image depicts an astronaut lounging on the moon, enjoying a drink from a green bottle while sitting in a chair. The astronaut is wearing a white spacesuit and has a helmet with a visor. The background features a view of Earth from the moon's surface, with stars visible in the sky. The scene is set against a black and white backdrop, giving it a surreal and futuristic feel.


Qwen2-VL-7B VLM for the RK3588 NPU (Rock 5, Orange Pi 5).


Paper: Qwen2 Technical Report

Hugging face: https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct


Introduction

LLMs (Large Language Models) are neural networks trained on large text datasets to understand and generate language.
VLMs (Vision-Language Models) add a visual encoder so the model can process images and text together.
A combined VLM+LLM system is often referred to as a multimodal model.

These models can be large—hundreds of millions to billions of parameters—which impacts accuracy, memory use, and runtime speed.
On edge devices like the RK3588, available RAM and compute are limited, and even the NPU has strict constraints on supported operations.
Because of this, models typically need to be quantised or simplified to fit.

Performance is usually expressed in tokens (roughly word fragments) per second.
Once converted to the RKNN/RKLLM format, parts of the model can run on the NPU, improving speed.
Despite these limits, models like Qwen2-VL-7B run well on the RK3588 because the NPU efficiently accelerates the heavy matrix math, and the vision encoder can be optimised. This makes advanced multimodal AI feasible on small, power-efficient devices.


Model performance benchmark (tokens/s)

All models, with C++ examples, can be found on the Q-engineering GitHub.

All LLM models are quantized to w8a8 (8-bit weights and activations), while the VLM vision encoders run in fp16.

model           RAM (GB)¹  LLM cold (s)²  LLM warm (s)³  VLM cold (s)²  VLM warm (s)³  Resolution  Tokens/s
Qwen2-7B        8.7        86.6           34.5           37.1           20.7           392 x 392   3.7
Qwen2-2.2B      3.3        29.1           2.5            17.1           1.7            392 x 392   12.5
InternVL3-1B    1.3        6.8            1.1            7.8            0.75           448 x 448   30
SmolVLM2-2.2B   3.4        21.2           2.6            10.5           0.9            384 x 384   11
SmolVLM2-500M   0.8        4.8            0.7            2.5            0.25           384 x 384   31
SmolVLM2-256M   0.5        1.1            0.4            2.5            0.25           384 x 384   54

¹ Total memory used: the LLM plus the VLM.
² A cold start is the first time an LLM/VLM model is loaded from your disk into RAM or onto the NPU.
The duration depends on your OS, I/O transfer rate, and memory mapping.
³ Subsequent loads (warm starts) take advantage of the data already mapped in RAM. Mostly, only a few pointers need to be restored.
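
You can reproduce the difference yourself: a warm start is simply the second run of the app. To force a true cold start without rebooting, drop the kernel's page cache first (a standard Linux mechanism, nothing RK3588-specific):

$ sync
$ echo 3 | sudo tee /proc/sys/vm/drop_caches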



Dependencies.

To run the application, you need:

  • OpenCV (64-bit) installed.
  • The rkllm runtime library.
  • The rknn runtime library.
  • Optional: Code::Blocks. ($ sudo apt-get install codeblocks)

Installing the dependencies.

Start with the usual

$ sudo apt-get update 
$ sudo apt-get upgrade
$ sudo apt-get install cmake wget curl

OpenCV

To install OpenCV on your SBC, follow the Raspberry Pi 4 guide.

Or, if you have no intention of writing code yourself:

$ sudo apt-get install libopencv-dev 
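
Either way, you can verify the installation afterwards (assuming the opencv4 pkg-config file is present, which it is on recent Debian/Ubuntu images):

$ pkg-config --modversion opencv4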

Installing the app.

$ git clone https://github.com/Qengineering/Qwen2-VL-7B-NPU

RKLLM, RKNN

To run Qwen2-VL-7B, you need to have the rkllm-runtime library version 1.2.2 (or higher) installed, as well as the rknpu driver version 0.9.8.
If you don't have these on your machine, or if you have a lower version, you need to install them.
We have provided the correct versions in the repo.

$ cd ./Qwen2-VL-7B-NPU/aarch64/library
$ sudo cp ./*.so /usr/local/lib
$ cd ../include
$ sudo cp ./*.h /usr/local/include
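
You can also check which rknpu driver your kernel ships. On most Rockchip images, the driver exposes its version under debugfs, which should print something like "RKNPU driver: v0.9.8":

$ sudo cat /sys/kernel/debug/rknpu/version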

Download the LLM and VLM model.

The next step is downloading the models.
Download the two needed files (8.9 GB) from our Sync.com server:
Qwen2-VL-7B_llm_w8a8_rk3588.rkllm and Qwen2_VL_7B_vision_rk3588.rknn
Copy both into the models folder of this repo.
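
The example command line further down assumes both files live in a ./models folder next to the executable. If your browser saved them in ~/Downloads (just an assumption; adjust the path to your situation), move them like this:

$ mkdir -p models
$ mv ~/Downloads/Qwen2-VL-7B_llm_w8a8_rk3588.rkllm ./models/
$ mv ~/Downloads/Qwen2_VL_7B_vision_rk3588.rknn ./models/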

Building the app.

Once you have the two models, it is time to build your application.
You can use Code::Blocks.

  • Load the project file *.cbp in Code::Blocks.
  • Select Release, not Debug.
  • Compile and run with F9.
  • You can alter the command line arguments with Project -> Set programs' arguments...

Or use CMake.

$ mkdir build
$ cd build
$ cmake ..
$ make -j4

Running the app.

The app has the following arguments:

VLM_NPU Picture RKNN_model RKLLM_model NewTokens ContextLength

Argument        Comment
Picture         The image. Provide a dummy if you don't want to use an image.
RKNN_model      The vision encoder model (VLM).
RKLLM_model     The large language model (LLM).
NewTokens       The maximum number of new tokens. Optional, default 2048.
ContextLength   The maximum total number of tokens the model can process. Optional, default 4096.


In the context of the Rockchip RK3588 LLM (Large Language Model) library, the parameters NewTokens and ContextLength control different limits for text generation; both are typical of LLM workflows.

NewTokens
This sets the maximum number of tokens (pieces of text, typically sub-word units) that the model is allowed to generate in response to a prompt during a single inference round. For example, if set to 300, the model will return no more than 300 tokens as output, regardless of the prompt length. It is important for controlling generation length, helping manage resource use and output size.

ContextLength
This specifies the maximum total number of tokens the model can process in one go, which includes both the prompt (input) tokens and all generated tokens. For example, if set to 2048 and your prompt already uses 500 tokens, the model can generate at most 2048 - 500 = 1548 new tokens. This is a constraint set during model conversion and deployment, as the context window cannot exceed the model's design limit (for instance, 4096 or 8192 tokens, depending on the model variant).
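
For those curious how these two arguments reach the hardware: the RK35llm class hands them to the rkllm C API. Below is a minimal sketch, assuming the rkllm.h header from rkllm-runtime 1.2.x; the actual wrapper sets more parameters than shown here, and the nullptr callback is a simplification.

#include "rkllm.h"   //from the rkllm-runtime library

//minimal sketch: map NewTokens and ContextLength onto the rkllm C API
LLMHandle init_llm(const char* model_path, int new_tokens, int context_len)
{
    RKLLMParam param = rkllm_createDefaultParam();
    param.model_path      = model_path;
    param.max_new_tokens  = new_tokens;    //NewTokens: cap on generated tokens per answer
    param.max_context_len = context_len;   //ContextLength: prompt + generated tokens combined

    LLMHandle handle = nullptr;
    rkllm_init(&handle, &param, nullptr);  //the real app registers a streaming callback here
    return handle;
}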

A typical command line can be:

VLM_NPU ./Moon.jpg ./models/Qwen2_VL_7B_vision_rk3588.rknn ./models/Qwen2-VL-7B_llm_w8a8_rk3588.rkllm 2048 4096

The NewTokens (2048) and ContextLength (4096) are optional and can be omitted.

Using the app.

Using the application is simple. Once you provide the image and the models, you can ask anything you want.
Remember, we are on a bare Rock 5C, so don't expect the same quality of answers that ChatGPT provides.
On the other hand, as the examples below show, the app performs amazingly well.

If you want to talk about the picture, you need to include the token <image> in your prompt once.
The app remembers the dialogue until you give the token <clear>.
With <exit>, you leave the application.
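
A hypothetical session, showing all three tokens (answers abbreviated):

User: <image>What do you see?
Answer: The image shows ...
User: <clear>
User: Write a haiku about autumn.
Answer: ...
User: <exit>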

C++ code.

Below you find the surprisingly small main.cpp.

#include "RK35llm.h"

int main(int argc, char** argv)
{
    std::string input_str;
    std::string output_str;
    RK35llm RKLLM;

    RKLLM.SetInfo(true);            //yes, you may give me additional model information
    RKLLM.SetSilence(false);        //you may print the incremental text chunks on the terminal

    if     (argc< 4) {std::cerr << "Usage: " << argv[0] << " image vlm_model llm_model [option]NewTokens [option]ContextLength\n"; return -1;}
    else if(argc==4) RKLLM.LoadModel(argv[2],argv[3]);
    else if(argc==5) RKLLM.LoadModel(argv[2],argv[3],std::atoi(argv[4]));
    else if(argc> 5) RKLLM.LoadModel(argv[2],argv[3],std::atoi(argv[4]),std::atoi(argv[5]));

    cv::Mat Pic = cv::imread(argv[1]);
    RKLLM.LoadImage(Pic);

    while(true) {
        printf("\n");
        printf("User: ");

        std::getline(std::cin, input_str);
        if (input_str == "exit") break;

        output_str = RKLLM.Ask(input_str);
//        std::cout << "\nLLM Reply: " << output_str << std::endl;
    }

    return 0;
}

Most code speaks for itself. One remark:
The LLM generates the answer to your request in little pieces of text. You see them as if the LLM were typing on your terminal.
RKLLM.SetSilence controls this behaviour. When set to true, it suppresses this output. RKLLM.Ask() also returns the complete answer.
To print that complete answer, uncomment this line in main.cpp:

std::cout << "\nLLM Answer: " << output_str << std::endl;
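
Under the hood, the rkllm runtime delivers these chunks through a result callback. How RK35llm implements it is internal to the class, but a minimal sketch against the rkllm C API could look like this (the silence flag is hypothetical and mirrors SetSilence):

#include <cstdio>
#include <string>
#include "rkllm.h"

static bool silence = false;        //hypothetical: what SetSilence(true/false) toggles
static std::string full_answer;     //collected so Ask() can return the complete text

void on_llm_result(RKLLMResult* result, void* /*userdata*/, LLMCallState state)
{
    if(state == RKLLM_RUN_NORMAL) {
        full_answer += result->text;        //always collect the chunk
        if(!silence) {
            printf("%s", result->text);     //stream it live to the terminal
            fflush(stdout);
        }
    }
}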

VLM examples.

Image: Singapore
User: <image>Which city is this?
Answer: This image depicts the skyline of Singapore, featuring the Marina Bay Sands and the Singapore Flyer Ferris wheel. The Marina Bay Sands is one of the largest hotels in the world, while the Singapore Flyer is a popular tourist attraction.

Image: The Great Wall of China
User: <image>Tell me something about the history of this place.
Answer: The Great Wall of China is one of the most iconic landmarks in the world, stretching over 13,000 miles (21,000 kilometers) across northern China and parts of Mongolia. It was built to protect against invasions from nomadic tribes and has a rich history dating back several centuries.

The construction of the Great Wall began around the 5th century BC during the Warring States period when various Chinese states were building defensive structures along their borders. The first emperor of the Qin dynasty, Qin Shi Huangdi, unified China in 221 BC and ordered the construction of a continuous wall to protect his empire from invasions.

The Great Wall was built by millions of workers over several centuries, with different sections being constructed or renovated at various times throughout history. It is estimated that around 700,000 people died during its construction, including soldiers, laborers, and prisoners who were forced to work on the wall.

Over time, the Great Wall became a symbol of Chinese culture and identity, and it has been designated as a UNESCO World Heritage Site in recognition of its historical significance. Today, the Great Wall is one of the most popular tourist destinations in China and attracts millions of visitors each year.

Image: Pizza
User: <image>What are the ingredients I need to make this?
Answer: The pizza in the picture appears to be a Margherita, which is made with fresh mozzarella cheese and fresh basil leaves. The ingredients you would need for this type of pizza include:
1. Fresh Mozzarella Cheese (cheese)
2. Tomato Sauce (sauce)
3. Fresh Basil Leaves (herbs)

You may also want to have some additional toppings like tomato sauce, if desired.

To make the pizza, you will need a pizza stone or baking sheet that can withstand high temperatures. You'll also need a pizza cutter and a rolling pin for shaping the dough.

User: What is the first step?
Answer: The first step in making this Margherita pizza would be to preheat your oven to the desired temperature, typically around 450°F (230°C), and prepare the pizza stone or baking sheet.

If you're using a pizza stone, make sure it's hot by placing it directly on the rack of the oven. If you're using a baking sheet, place it in the oven and preheat for about 10 minutes before adding your toppings.

Next, you'll need to prepare the dough if you haven't already. You can either use store-bought pizza dough or make your own from scratch by following a recipe that includes flour, yeast, salt, and water.

Once the dough is ready, it's time to add the toppings. In this case, you would spread tomato sauce over the entire surface of the dough and then sprinkle fresh mozzarella cheese and fresh basil leaves on top. You can also add additional toppings like tomato sauce if desired.

Finally, place the pizza in the oven and bake until the crust is golden brown and the cheese is melted and bubbly.

Image: Titanic
User: <image>Which movie is this?
Answer: The image appears to be from the movie "Titanic."

User: Can you tell me the plot?
Answer: Sure, "Titanic" is a romantic drama film set on the ill-fated maiden voyage of the RMS Titanic. The story follows the love affair between Jack Dawson (played by Leonardo DiCaprio) and Rose DeWitt Bukater (played by Kate Winslet). The movie explores themes of class, romance, and tragedy as the ship sinks in the North Atlantic Ocean.

LLM example.

Notice the nice Markdown formatting of the answer!

User: Can you make me a C++ call that embeds the Pythagorean theorem?

Answer: Sure, here's an example of how to use the Pythagorean theorem in C++:

#include <cmath>
#include <iostream>

int main() {
    double side1 = 3.0;
    double side2 = 4.0;
    double hypotenuse = sqrt(side1 * side1 + side2 * side2);
    std::cout << "The length of the hypotenuse is: " << hypotenuse << std::endl;
    return 0;
}

This code uses the sqrt function from the <cmath> library to calculate the square root of the sum of squares of two sides (side1 and side2) using the Pythagorean theorem. The result is then printed to the console. Here's how it works:

double hypotenuse = sqrt(side1 * side1 + side2 * side2);

This line calculates the square root of the sum of squares of two sides (side1 and side2) using the Pythagorean theorem. The result is then stored in the variable hypotenuse.

