[NPUW] L0 allocation improvements #27011

Merged: 16 commits from as/npuw_alloc_tensors into openvinotoolkit:master on Oct 14, 2024

Conversation

smirnov-alexey (Contributor, Author):

EISW-142611

@@ -372,7 +373,7 @@ void ov::npuw::JustInferRequest::bind_global_parameters(std::size_t idx) {
     auto& comp_model_desc = m_npuw_model->m_compiled_submodels[idx];
     const auto real_idx = comp_model_desc.replaced_by.value_or(idx);

-    const bool do_copy = needs_copy(idx);
+    const bool do_copy = !m_alloc_required && needs_copy(idx);
smirnov-alexey (Contributor, Author):

Not sure if it's correct

smirnov-alexey (Contributor, Author):

Here we check global parameters. The idea is that allocation of global parameters depends solely on m_alloc_required: if it's set, the parameters will be allocated on the NPU.

dmatveev (Contributor):

haven't checked on this yet

dmatveev (Contributor):

I think m_alloc_required is overall misleading. It is always required.

Also, you've missed this in my past L0 PR: dmatveev#5

More precisely, this part: https://github.com/dmatveev/openvino/blob/e7d62f1a4412f639d0fb112e4f5647eeff9a1b8e/src/plugins/intel_npu/src/plugin/npuw/just_sync_infer_request.cpp#L117

And then this part: https://github.com/dmatveev/openvino/blob/e7d62f1a4412f639d0fb112e4f5647eeff9a1b8e/src/plugins/intel_npu/src/plugin/npuw/just_sync_infer_request.cpp#L370

The backstory here is that, even if you've allocated your model-global input tensors yourself, they may be overwritten. Even our scripts carelessly do this, unfortunately. So what you need to do is to keep track of the tensors you allocate (maybe you can just memorize the pointers in your new allocTensor method) and check whether the tensors you're working with are still "known" to you.

Once there's a set_tensor call, you lose it, and your m_alloc_required flag no longer tells the truth.
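
A minimal sketch of the kind of bookkeeping being suggested here (the class and member names below, such as m_allocated_tensors and is_alloc_tensor, are hypothetical illustrations, not the PR's actual code):

    // Hypothetical sketch: remember which tensors were allocated by allocTensor,
    // so a later set_tensor() that swaps in a user-provided tensor can be detected.
    #include <mutex>
    #include <unordered_set>
    #include <openvino/runtime/tensor.hpp>

    class AllocTracker {
        std::unordered_set<const void*> m_allocated_tensors;  // data pointers we own
        mutable std::mutex m_alloc_mutex;

    public:
        ov::Tensor allocTensor(const ov::element::Type& type, const ov::Shape& shape) {
            ov::Tensor t(type, shape);  // device-specific (L0) allocation omitted in this sketch
            std::lock_guard<std::mutex> guard(m_alloc_mutex);
            m_allocated_tensors.insert(t.data());
            return t;
        }

        // True only while the tensor is still one of "ours"; once set_tensor()
        // replaces it with a user tensor, this returns false and a copy is needed again.
        bool is_alloc_tensor(const ov::Tensor& t) const {
            std::lock_guard<std::mutex> guard(m_alloc_mutex);
            return m_allocated_tensors.count(t.data()) != 0;
        }
    };

    // Then, instead of relying on a static m_alloc_required flag:
    //   const bool do_copy = !is_alloc_tensor(global_tensor) && needs_copy(idx);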

smirnov-alexey (Contributor, Author):

Good point, thanks

dmatveev changed the title from "[NPUW] Support in/out/intermediate tensor allocation on NPU" to "[NPUW] L0 allocation improvements" on Oct 11, 2024.
Resolved review threads (now outdated): src/plugins/intel_npu/src/plugin/npuw/compiled_model.cpp, src/plugins/intel_npu/src/plugin/npuw/compiled_model.hpp

Comment on lines +937 to +941
{
    std::lock_guard<std::mutex> guard(m_alloc_mutex);
    m_remote_ctx = m_npuw_model->get_plugin()->get_core()->get_default_context(device)._ptr;
    remote_tensor = m_remote_ctx->create_host_tensor(type, shape);
    allocated_tensor = ov::make_tensor(remote_tensor);
dmatveev (Contributor):

Why do you need a mutex here? Do you call allocTensor from multiple threads?

smirnov-alexey (Contributor, Author), Oct 14, 2024:

Just in case, since we do guard allocation in the banks. It's less error-prone going forward.

Comment on lines 71 to 72
m_spatial_io[real_idx].input_tails[p.idx] = ov::get_tensor_impl(
    allocTensor(iport.get_element_type(), iport.get_shape(), *proto_comp_model_desc.device_it));
dmatveev (Contributor):

Just thinking: if you're using allocTensor only where we store ITensors, why can't allocTensor return the ITensor itself, so you don't need to call get_tensor_impl everywhere?
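
A sketch of what this suggestion amounts to (the signature below is illustrative, not the code actually merged):

    // Illustrative only: allocTensor returns the ITensor wrapper directly,
    // so call sites no longer need an extra ov::get_tensor_impl().
    ov::SoPtr<ov::ITensor> allocTensor(const ov::element::Type& type,
                                       const ov::Shape& shape,
                                       const std::string& device) {
        // Device-specific (remote/L0) allocation is omitted; a host tensor stands in for it.
        return ov::get_tensor_impl(ov::Tensor(type, shape));
    }

    // The call site above then simplifies to:
    //   m_spatial_io[real_idx].input_tails[p.idx] =
    //       allocTensor(iport.get_element_type(), iport.get_shape(), *proto_comp_model_desc.device_it);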

smirnov-alexey (Contributor, Author):

Done

@@ -153,7 +154,7 @@ ov::npuw::JustInferRequest::JustInferRequest(const std::shared_ptr<ov::npuw::Com
     LOG_INFO("Preallocating input tensors...");
     for (size_t i = 0; i < m_npuw_model->inputs().size(); i++) {
         const auto& port = m_npuw_model->inputs()[i];
-        m_input_tensors.push_back(ov::get_tensor_impl(ov::Tensor(port.get_element_type(), port.get_shape())));
+        m_input_tensors.push_back(ov::get_tensor_impl(allocTensor(port.get_element_type(), port.get_shape())));
dmatveev (Contributor):

Maybe we need an overload of allocTensor that takes an ov::Input<ov::Node> / ov::Output<ov::Node>.
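
A minimal sketch of such an overload (illustrative; the parameter types and const-ness are assumptions, not the merged code):

    // Illustrative overload: derive element type and shape from the port itself.
    ov::SoPtr<ov::ITensor> allocTensor(const ov::Output<const ov::Node>& port,
                                       const std::string& device) {
        return allocTensor(port.get_element_type(), port.get_shape(), device);
    }

    // The loop above could then call:
    //   m_input_tensors.push_back(allocTensor(m_npuw_model->inputs()[i], device));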

dmatveev (Contributor):

Note: you're not passing any device here, and you have ="NPU" as the default parameter.

smirnov-alexey (Contributor, Author):

About the default device - that was the idea

smirnov-alexey (Contributor, Author):

Done


Comment on lines 926 to 928
if (!m_alloc_required || device == "CPU") {
    return ov::Tensor(type, shape);
}
dmatveev (Contributor):

This, plus having device="NPU" by default, looks like a very obscure way to allocate the tensors in the right region.

So you're using this method in three contexts:

  1. For function results - where device is taken into account
  2. For global inputs, where device is not passed and defaults to "NPU" (meh)
  3. For global results, same?

Also, the logic completely discards the WEIGHTS_BANK_ALLOC setting we have here. OK, this is not weights, but that flag exists for a reason, you know. Why do we think we can bypass that problem here?

A clearer way this could've been done:

  1. No default arguments here - just always pass the device yourself
  2. Add two methods to the CompiledModel to frame the decision-making logic:
    • std::string global_mem_device()
    • std::string funcall_mem_device(idx)

The first only takes device distribution into account.
The second takes the subgraph affinity into account (you can access it).
BOTH should also take the BANK_ALLOC into account if that's set (see the sketch below).
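
A sketch of what these two methods could look like, pieced together from the snippets discussed further down in this review (hedged: the placement of the BANK_ALLOC check and the exact signatures are assumptions, not the merged code verbatim):

    // Sketch only, assembled from the review discussion; not verbatim PR code.
    std::string ov::npuw::CompiledModel::global_mem_device() const {
        // A globally forced allocation device (NPUW_WEIGHTS_BANK_ALLOC) wins, if set.
        const std::string device_alloc = m_cfg.get<::intel_npu::NPUW_WEIGHTS_BANK_ALLOC>();
        if (!device_alloc.empty()) {
            return device_alloc;
        }
        // Otherwise: if any compiled submodel runs on NPU, allocate globally
        // visible tensors on NPU; fall back to CPU otherwise.
        for (std::size_t idx = 0; idx < m_compiled_submodels.size(); ++idx) {
            const auto& comp_model_desc = m_compiled_submodels[idx];
            if (!comp_model_desc.compiled_model) {
                continue;
            }
            if (ov::npuw::util::starts_with(*comp_model_desc.device_it, "NPU")) {
                return "NPU";
            }
        }
        return "CPU";
    }

    std::string ov::npuw::CompiledModel::funcall_mem_device(const std::size_t idx) const {
        // Same global override first...
        const std::string device_alloc = m_cfg.get<::intel_npu::NPUW_WEIGHTS_BANK_ALLOC>();
        if (!device_alloc.empty()) {
            return device_alloc;
        }
        // ...otherwise allocate on the device this particular subgraph is assigned to.
        const auto& comp_model_desc = m_compiled_submodels[idx];
        return *comp_model_desc.device_it;
    }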

smirnov-alexey (Contributor, Author):

Done

dmatveev (Contributor) left a review:

Minor comments here

Comment on lines +505 to +513
for (std::size_t idx = 0; idx < m_compiled_submodels.size(); ++idx) {
    auto& comp_model_desc = m_compiled_submodels[idx];
    if (!comp_model_desc.compiled_model) {
        continue;
    }
    if (ov::npuw::util::starts_with(*comp_model_desc.device_it, "NPU")) {
        return "NPU";
    }
}
dmatveev (Contributor):

Device distribution would be a simpler check

smirnov-alexey (Contributor, Author):

There is only log_device_distribution, which goes over the submodels and prints the info in place. It's not stored anywhere.

dmatveev (Contributor):

ohh...

Comment on lines 525 to 530
auto& comp_model_desc = m_compiled_submodels[idx];
if (ov::npuw::util::starts_with(*comp_model_desc.device_it, "NPU")) {
    return "NPU";
}

return "CPU";
dmatveev (Contributor):

This is strange; what it is supposed to be is:

Suggested change:

    auto& comp_model_desc = m_compiled_submodels[idx];
    if (ov::npuw::util::starts_with(*comp_model_desc.device_it, "NPU")) {
        return "NPU";
    }
    return "CPU";

becomes:

    return *comp_model_desc.device_it;

dmatveev (Contributor):

So the only point was to take BANK_ALLOC into account if it is set, and to do it in just one place.

smirnov-alexey (Contributor, Author):

Done


dmatveev enabled auto-merge on October 14, 2024 at 14:00.
Comment on lines 522 to 531
return "CPU";

// Force globally set device if set
const std::string device_alloc = m_cfg.get<::intel_npu::NPUW_WEIGHTS_BANK_ALLOC>();
if (!device_alloc.empty()) {
return device_alloc;
}

auto& comp_model_desc = m_compiled_submodels[idx];
return *comp_model_desc.device_it;
dmatveev (Contributor):

Static analysis will surely flag unreachable code here, so the old body should've stayed under #if 0. But let's see if it works.
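
One way to read this suggestion, as a hypothetical sketch (the function name and surrounding structure are assumed from the earlier funcall_mem_device discussion, not taken from the merged code):

    std::string ov::npuw::CompiledModel::funcall_mem_device(const std::size_t idx) {
    #if 0
        // Old body parked out of the build instead of being left before/after an
        // early return, which static analysis would otherwise flag as unreachable.
        auto& comp_model_desc = m_compiled_submodels[idx];
        if (ov::npuw::util::starts_with(*comp_model_desc.device_it, "NPU")) {
            return "NPU";
        }
        return "CPU";
    #endif
        // Force globally set device if set
        const std::string device_alloc = m_cfg.get<::intel_npu::NPUW_WEIGHTS_BANK_ALLOC>();
        if (!device_alloc.empty()) {
            return device_alloc;
        }
        auto& comp_model_desc = m_compiled_submodels[idx];
        return *comp_model_desc.device_it;
    }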

dmatveev added this pull request to the merge queue on Oct 14, 2024.
Merged via the queue into openvinotoolkit:master with commit 062762b on Oct 14, 2024. 134 checks passed.
dmatveev deleted the as/npuw_alloc_tensors branch on October 14, 2024 at 20:20.
Labels: category: NPU (OpenVINO NPU plugin), category: NPUW (NPUW plugin)