[NPUW] L0 allocation improvements #27011
Conversation
@@ -372,7 +373,7 @@ void ov::npuw::JustInferRequest::bind_global_parameters(std::size_t idx) {
     auto& comp_model_desc = m_npuw_model->m_compiled_submodels[idx];
     const auto real_idx = comp_model_desc.replaced_by.value_or(idx);

-    const bool do_copy = needs_copy(idx);
+    const bool do_copy = !m_alloc_required && needs_copy(idx);
Not sure if it's correct
Here we check global parameters. The idea is that global parameter allocation depends solely on m_alloc_required - if it's set, params will be allocated on NPU.
haven't checked on this yet
I think m_alloc_required is overall misleading. It is always required.
Also, you've missed this in my past L0 PR: dmatveev#5
More precisely, this part: https://github.com/dmatveev/openvino/blob/e7d62f1a4412f639d0fb112e4f5647eeff9a1b8e/src/plugins/intel_npu/src/plugin/npuw/just_sync_infer_request.cpp#L117
And then this part: https://github.com/dmatveev/openvino/blob/e7d62f1a4412f639d0fb112e4f5647eeff9a1b8e/src/plugins/intel_npu/src/plugin/npuw/just_sync_infer_request.cpp#L370
The backstory here is that, even if you've allocated your model-global input tensors yourself, they may be overwritten. Even our scripts carelessly do this, unfortunately. So what you need to do is keep track of the tensors you allocate (maybe you can just memorize the pointers in your new allocTensor method) and check whether the tensors you're working with are still "known" to you.
Once there's a set_tensor call - you lose it, and your m_alloc_required flag doesn't tell the truth anymore.
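A minimal sketch of that bookkeeping, assuming a hypothetical set of remembered data pointers (the member names m_alloc_mutex / m_allocated and the is_known helper are illustrative, not the PR's actual code):

    // Illustrative members of the infer request (assumed, not from the PR):
    //   std::mutex m_alloc_mutex;
    //   std::unordered_set<const void*> m_allocated;  // data pointers handed out by allocTensor()

    ov::Tensor ov::npuw::JustInferRequest::allocTensor(const ov::element::Type& type,
                                                       const ov::Shape& shape,
                                                       const std::string& device) {
        ov::Tensor t(type, shape);  // or the remote-context path added in this PR
        std::lock_guard<std::mutex> guard(m_alloc_mutex);
        m_allocated.insert(t.data());  // memorize the pointer we own
        return t;
    }

    // Hypothetical check used instead of trusting m_alloc_required alone:
    bool ov::npuw::JustInferRequest::is_known(const ov::SoPtr<ov::ITensor>& tensor) {
        // After a user set_tensor() call, the data pointer is no longer one of ours
        return m_allocated.count(tensor->data()) > 0;
    }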
Good point, thanks
src/plugins/intel_npu/src/plugin/npuw/base_sync_infer_request.cpp (Outdated - resolved)
src/plugins/intel_npu/src/plugin/npuw/just_sync_infer_request.cpp (Outdated - resolved)
src/plugins/intel_npu/src/plugin/npuw/just_sync_infer_request.cpp (Outdated - resolved)
    {
        std::lock_guard<std::mutex> guard(m_alloc_mutex);
        m_remote_ctx = m_npuw_model->get_plugin()->get_core()->get_default_context(device)._ptr;
        remote_tensor = m_remote_ctx->create_host_tensor(type, shape);
        allocated_tensor = ov::make_tensor(remote_tensor);
why do you need a mutex here? you call allocTensor from multiple threads?
Just in case, since we do guard allocation in banks. Less error-prone in the future.
src/plugins/intel_npu/src/plugin/npuw/just_sync_infer_request.hpp (Outdated - resolved)
    m_spatial_io[real_idx].input_tails[p.idx] = ov::get_tensor_impl(
        allocTensor(iport.get_element_type(), iport.get_shape(), *proto_comp_model_desc.device_it));
just thinking.. if you're using allocTensor only where we store ITensors, why can't allocTensor return the ITensor so you don't need to call get_tensor_impl everywhere?
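A sketch of what that signature change could look like; the body is an assumption pieced together from the snippets in this PR, not the final code:

    ov::SoPtr<ov::ITensor> ov::npuw::JustInferRequest::allocTensor(const ov::element::Type& type,
                                                                   const ov::Shape& shape,
                                                                   const std::string& device) {
        if (!m_alloc_required || device == "CPU") {
            // Plain host allocation, wrapped into the ITensor interface
            return ov::get_tensor_impl(ov::Tensor(type, shape));
        }
        // Remote (L0 host) allocation path, already returning an ITensor
        std::lock_guard<std::mutex> guard(m_alloc_mutex);
        m_remote_ctx = m_npuw_model->get_plugin()->get_core()->get_default_context(device)._ptr;
        return m_remote_ctx->create_host_tensor(type, shape);
    }

The use site above would then drop the wrapper:

    m_spatial_io[real_idx].input_tails[p.idx] =
        allocTensor(iport.get_element_type(), iport.get_shape(), *proto_comp_model_desc.device_it);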
Done
@@ -153,7 +154,7 @@ ov::npuw::JustInferRequest::JustInferRequest(const std::shared_ptr<ov::npuw::Com
     LOG_INFO("Preallocating input tensors...");
     for (size_t i = 0; i < m_npuw_model->inputs().size(); i++) {
         const auto& port = m_npuw_model->inputs()[i];
-        m_input_tensors.push_back(ov::get_tensor_impl(ov::Tensor(port.get_element_type(), port.get_shape())));
+        m_input_tensors.push_back(ov::get_tensor_impl(allocTensor(port.get_element_type(), port.get_shape())));
maybe we need an overload for allocTensor which takes ov::Input<ov::Node> / ov::Output<ov::Node>.
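For illustration, such an overload might look like this (hypothetical sketch, deriving type and shape from the port; the device argument is left explicit on purpose):

    ov::SoPtr<ov::ITensor> ov::npuw::JustInferRequest::allocTensor(const ov::Output<const ov::Node>& port,
                                                                   const std::string& device) {
        // Forward to the type/shape overload so the decision logic lives in one place
        return allocTensor(port.get_element_type(), port.get_shape(), device);
    }

The preallocation loop above would then shorten to something like:

    m_input_tensors.push_back(allocTensor(port, device));  // device choice discussed below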
note - you're not passing any device here. And you have ="NPU" as the default parameter...
About the default device - that was the idea
Done
    if (!m_alloc_required || device == "CPU") {
        return ov::Tensor(type, shape);
    }
this, and having device="NPU" by default, looks like a very obscure way to allocate the tensors in the right region.
So you're using this method in three contexts:
- For function results - where device is taken into account
- For global inputs, where device is not passed and defaults to "NPU" (meh)
- For global results, same?
Also, the logic completely discards the WEIGHTS_BANK_ALLOC setting we have here. OK, this is not weights, but that flag exists for a reason, you know. Why do we think we can bypass that problem here?
A clearer way it could've been done (see the sketch below):
- No default arguments here - just always pass the device yourself
- Add two methods to the CompiledModel to frame the decision-making logic:
  - std::string global_mem_device()
  - std::string funcall_mem_device(idx)
The first only takes device distribution into account.
The second takes the subgraph affinity into account (you can access it).
BOTH should also take the BANK_ALLOC into account if that's set.
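For reference, a sketch of what those two helpers could look like, assembled from the snippets discussed further down in this thread (signatures, const-ness, and comments are assumptions, not the merged code):

    std::string ov::npuw::CompiledModel::global_mem_device() const {
        // Force globally set device if set
        const std::string device_alloc = m_cfg.get<::intel_npu::NPUW_WEIGHTS_BANK_ALLOC>();
        if (!device_alloc.empty()) {
            return device_alloc;
        }
        // Device distribution: if any submodel is assigned to NPU, place global I/O there
        for (std::size_t idx = 0; idx < m_compiled_submodels.size(); ++idx) {
            const auto& comp_model_desc = m_compiled_submodels[idx];
            if (!comp_model_desc.compiled_model) {
                continue;
            }
            if (ov::npuw::util::starts_with(*comp_model_desc.device_it, "NPU")) {
                return "NPU";
            }
        }
        return "CPU";
    }

    std::string ov::npuw::CompiledModel::funcall_mem_device(const std::size_t idx) const {
        // Force globally set device if set
        const std::string device_alloc = m_cfg.get<::intel_npu::NPUW_WEIGHTS_BANK_ALLOC>();
        if (!device_alloc.empty()) {
            return device_alloc;
        }
        // Subgraph affinity: allocate where this particular submodel runs
        const auto& comp_model_desc = m_compiled_submodels[idx];
        return *comp_model_desc.device_it;
    }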
Done
Minor comments here
    for (std::size_t idx = 0; idx < m_compiled_submodels.size(); ++idx) {
        auto& comp_model_desc = m_compiled_submodels[idx];
        if (!comp_model_desc.compiled_model) {
            continue;
        }
        if (ov::npuw::util::starts_with(*comp_model_desc.device_it, "NPU")) {
            return "NPU";
        }
    }
Device distribution would be a simpler check
There is only log_device_distribution which goes over the submodels and prints info in-place. It's not stored anywhere.
ohh...
    auto& comp_model_desc = m_compiled_submodels[idx];
    if (ov::npuw::util::starts_with(*comp_model_desc.device_it, "NPU")) {
        return "NPU";
    }

    return "CPU";
This is strange, what it is supposed to be is:

Suggested change:
-    auto& comp_model_desc = m_compiled_submodels[idx];
-    if (ov::npuw::util::starts_with(*comp_model_desc.device_it, "NPU")) {
-        return "NPU";
-    }
-    return "CPU";
+    return *comp_model_desc.device_it;
So the only point was to take BANK_ALLOC into account if it is set - and do it in just one place
Done
return "CPU"; | ||
|
||
// Force globally set device if set | ||
const std::string device_alloc = m_cfg.get<::intel_npu::NPUW_WEIGHTS_BANK_ALLOC>(); | ||
if (!device_alloc.empty()) { | ||
return device_alloc; | ||
} | ||
|
||
auto& comp_model_desc = m_compiled_submodels[idx]; | ||
return *comp_model_desc.device_it; |
Static analysis will for sure fire an unreachable-code warning here, so the past body should've stayed under #if 0. But let's see if it works.
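For clarity, a sketch of the arrangement being asked for, assuming the function in question is the funcall_mem_device helper discussed above (signature and placement are illustrative): the past body stays in the file but is excluded from the build under #if 0, so the new logic remains reachable and no unreachable-code warning fires.

    std::string ov::npuw::CompiledModel::funcall_mem_device(const std::size_t idx) const {
    #if 0
        // Past body, kept for reference but not compiled
        if (ov::npuw::util::starts_with(*m_compiled_submodels[idx].device_it, "NPU")) {
            return "NPU";
        }
        return "CPU";
    #endif
        // Force globally set device if set
        const std::string device_alloc = m_cfg.get<::intel_npu::NPUW_WEIGHTS_BANK_ALLOC>();
        if (!device_alloc.empty()) {
            return device_alloc;
        }
        auto& comp_model_desc = m_compiled_submodels[idx];
        return *comp_model_desc.device_it;
    }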
EISW-142611