
[CPU] Introduce FullyConnected, FCQuantized, FCCompressed, Placeholder #26239

Open · wants to merge 1 commit into master
Conversation

@EgorDuplensky (Contributor) commented Aug 26, 2024

Details:

  1. Introduce the following operations to the internal opset

    • FullyConnected (MatMul with a transposed constant second input)
    • FullyConnectedCompressed (FullyConnected with weights compression)
    • FullyConnectedQuantizedLegacy (FullyConnected with quantized activations and weights and a dequantization scale and zero point pulled through the op by LPT)
    • FullyConnectedQuantized (FullyConnected with quantization scales and zero points on activations, weights and outputs). Planned to be used in the scope of dynamic quantization; can also be used for static quantization in the future.
    • Placeholder (a no-op input which replaces a default / optional value for an operation, e.g. the bias for FullyConnected)
  2. The following transformations were added / updated:

    • ConvertFullyConnectedToFullyConnectedCompressed (replaces the proprietary FuseFCAndWeightsDecompression)
    • ConvertFCToFCQuantizedLegacy (replaces the proprietary FuseConvMatmulFCDeconvAndDQScales)
    • FullyConnectedBiasFusion (added into the CPU folder for now; needs to be checked and reviewed by the GPU team before adaptation to the internal opset). Replaces the proprietary FuseConvolutionMatMulDeconvAndBias.
    • ConvertMatMulToFC updated to use ov::op::internal::FullyConnected; planned to be moved to the internal opset after review from the GPU team.

Tickets:

  • 149923
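For intuition, a minimal numeric sketch of what FullyConnected computes (plain C++, not the actual internal op; the weights are stored as [Co, Ci], i.e. already transposed relative to MatMul's second input):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y[n][co] = b[co] + sum_ci x[n][ci] * w[co][ci]
// Weights are laid out [Co, Ci], i.e. transposed relative to MatMul's B input.
std::vector<std::vector<float>> fully_connected(
        const std::vector<std::vector<float>>& x,  // activations [N, Ci]
        const std::vector<std::vector<float>>& w,  // weights     [Co, Ci]
        const std::vector<float>& b) {             // bias        [Co]
    const size_t N = x.size();
    const size_t Co = w.size();
    const size_t Ci = w.empty() ? 0 : w[0].size();
    std::vector<std::vector<float>> y(N, std::vector<float>(Co, 0.f));
    for (size_t n = 0; n < N; n++)
        for (size_t co = 0; co < Co; co++) {
            float acc = b[co];
            for (size_t ci = 0; ci < Ci; ci++)
                acc += x[n][ci] * w[co][ci];
            y[n][co] = acc;
        }
    return y;
}
```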

Fix Windows warning

Introduce FullyConnectedQuantizedLegacy

std::shared_ptr<Node> clone_with_new_inputs(const ov::OutputVector& new_args) const override;

virtual std::shared_ptr<Node> fuse_bias(const ov::Output<Node>& bias) const;
Contributor:
Do we need a method for that, especially virtual? I think bias fusion can be simply done via clone_with_new_inputs for any FC type:

        ov::OutputVector inputs;
        for (size_t i = 0; i < fc->get_input_size(); i++) {
            inputs.push_back(fc->get_input_source_output(i)); // would be nice to have a such helper in ov::Node
        }
        inputs[2] = bias;
        fc->clone_with_new_inputs(inputs);

Contributor Author:

Agree, the virtual method is not necessary anymore.
It was implemented before I added a constructor which accepts an OutputVector.
But maybe we could still benefit from some static class function.
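Such a static helper could be sketched roughly as follows (illustrative only; std::string stands in for ov::Output<Node>, and the BIAS slot index 2 follows the snippet above):

```cpp
#include <cassert>
#include <string>
#include <vector>

constexpr size_t BIAS = 2;  // assumed input order: data, weights, bias, ...

// Build the input list to pass to clone_with_new_inputs, with the bias
// slot replaced by the new bias output.
std::vector<std::string> with_fused_bias(std::vector<std::string> inputs,
                                         const std::string& bias) {
    assert(inputs.size() > BIAS);
    inputs[BIAS] = bias;
    return inputs;
}
```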

const ov::Output<Node>& bias,
const ov::Output<Node>& deq_scales,
const ov::Output<Node>& deq_zero_points,
const ov::element::Type output_type = ov::element::undefined);
Contributor:
Can it be expressed as:

    FullyConnectedQuantized(X, W, bias, placeholder, placeholder, placeholder, placeholder, deq_scales, deq_zero_points)

?

Contributor Author (@EgorDuplensky, Sep 19, 2024):

We will probably have to do this if we are not going to change the current MatcherPass logic.
I mean that currently MatcherPass requires a pattern to have the same number of inputs as the node being matched.
If we relax this requirement (maybe not by default, but by providing some configuration option to the MatcherPass) so that we can specify only the node inputs we really care about, then it becomes possible to have operations with a variable number of inputs (still strictly ordered, but with the optional tail inputs omitted), and we would not need to add so many placeholders for the inputs.
But this could potentially bring some other problems.
@itikhono What do you think?

op.set_transpose_b(true);

auto out_shapes =
ov::op::v0::shape_infer(&op,
Contributor (@vladimir-paramuzov, Sep 19, 2024):

I think we can have a shape_infer function for the FullyConnected op to avoid duplicating the MatMul op creation and setup.

Contributor Author:

Agree, we'd better implement one.
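A dedicated shape_infer for FullyConnected could be as simple as this sketch (plain C++, static shapes only; the real implementation would also need to handle dynamic dimensions and ranks):

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

using Shape = std::vector<int64_t>;

// Output keeps all activation dims except the last one, which becomes
// Co = weights[0] (weights are [Co, Ci]).
Shape fc_shape_infer(const Shape& activation, const Shape& weights) {
    if (activation.empty() || weights.size() != 2)
        throw std::invalid_argument("unexpected ranks");
    if (activation.back() != weights[1])
        throw std::invalid_argument("channel count mismatch");
    Shape out = activation;
    out.back() = weights[0];  // Co
    return out;
}
```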

validate_and_infer_types();
}

FullyConnectedQuantizedLegacy::FullyConnectedQuantizedLegacy(const ov::Output<Node>& X,
Contributor:

Do we need these c-tors with an explicit args list if we have one with OutputVector already?

Contributor Author (@EgorDuplensky, Sep 19, 2024):

This is not strictly necessary, but in my opinion it simplifies the transformation code.
It is also related to the possibility of having a variable number of inputs.
For example, this is the code which clones the GPU version of FullyConnectedCompressed:

std::shared_ptr<ov::Node> FullyConnectedCompressed::clone_with_new_inputs(const ov::OutputVector& new_args) const {
    check_new_args_count(this, new_args);

    if (new_args.size() == 4)
        return std::make_shared<FullyConnectedCompressed>(new_args.at(0),
                                                          new_args.at(1),
                                                          new_args.at(2),
                                                          new_args.at(3),
                                                          m_output_type);
    else if (new_args.size() == 5)
        return std::make_shared<FullyConnectedCompressed>(new_args.at(0),
                                                          new_args.at(1),
                                                          new_args.at(2),
                                                          new_args.at(3),
                                                          new_args.at(4),
                                                          m_output_type);
    else if (new_args.size() == 6)
        return std::make_shared<FullyConnectedCompressed>(new_args.at(0),
                                                          new_args.at(1),
                                                          new_args.at(2),
                                                          new_args.at(3),
                                                          new_args.at(4),
                                                          new_args.at(5),
                                                          m_output_type);
    else
        OPENVINO_THROW("Unexpected inputs count for FullyConnectedCompressed op: ", new_args.size());
}

And this is how it looks with an extra constructor:

std::shared_ptr<ov::Node> FullyConnectedCompressed::clone_with_new_inputs(const ov::OutputVector& new_args) const {
    check_new_args_count(this, new_args);

    return std::make_shared<FullyConnectedCompressed>(new_args, m_output_type);
}

Contributor:

But to have the second version we need only the c-tor with the OutputVector arg, right? So I'm curious whether we can just remove the others. From the transformations perspective it can be either simpler (if we just take the args from another node and add/modify something) or almost the same in other cases (only extra braces are needed).

@EgorDuplensky changed the title [CPU] Introduce FullyConnectedQuantized op and bias fusing → [CPU] Introduce FullyConnected, FCQuantized, FCCompressed, Placeholder Sep 24, 2024

class TRANSFORMATIONS_API FullyConnectedCompressed : public ov::op::internal::FullyConnected {
public:
OPENVINO_OP("FullyConnectedCompressed", "gpu_opset");
Contributor:

"gpu_opset"? Should be "ie_internal_opset" I guess

validate_and_infer_types();
}

// FullyConnectedCompressed::FullyConnectedCompressed(const ov::Output<Node>& X,
Contributor:

Delete?

Contributor:

What about other functions like "NameFromType"?

", scale data size: ",
scaleSize);

if (scaleSize > DQScales.size())
Contributor:

Is this condition possible at all?

DQScales[i] *= scalesData[i];
}
}
if (std::all_of(DQScales.begin(), DQScales.end(), [&](float val) {
Contributor:

This logic looks like the opposite of the code on line 52. Do we really need to broadcast the value first?

@@ -104,10 +104,11 @@ DnnlMemoryDescPtr DnnlMatMulPrimitive::makeTransposedWeightDescriptor(const Dnnl
const auto& weiDesc = srcDesc->getDnnlDesc();
auto wDims = weiDesc.get_dims();
auto wDataType = weiDesc.get_data_type();
std::swap(wDims[wDims.size() - 1], wDims[wDims.size() - 2]);
dnnl::memory::dim batchDim = std::accumulate(wDims.begin(), wDims.end() - 1, 1, std::multiplies<dnnl::memory::dim>());
Contributor:

This logic is duplicated in many files. Maybe it is worth extracting normalizeDimsTo2D into some FC utility file.
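A shared helper along the lines the comment suggests might look like this (a sketch; the name normalizeDimsTo2D follows the existing CPU-plugin utility mentioned above):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

using Dims = std::vector<int64_t>;

// Collapse all leading dims into a single batch dim:
// [d0, d1, ..., dn-1] -> [d0 * ... * dn-2, dn-1].
Dims normalizeDimsTo2D(const Dims& dims) {
    const int64_t batch = std::accumulate(dims.begin(), dims.end() - 1,
                                          int64_t{1}, std::multiplies<int64_t>());
    return {batch, dims.back()};
}
```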


// activation weight output_shape
// NCHW CoCHW NCo
// TNC CoC TNCo
// NC CoC NCo
VectorDims outputShape(out_rank, 1);
// set Co
- outputShape.back() = weightShape[0];
+ outputShape.back() = std::accumulate(weightShape.begin(), weightShape.end() - 1, 1, std::multiplies<Dim>());
Contributor:

I am wondering, does FullyConnected::validate_and_infer_types() always return a 2D output shape?

@@ -96,7 +96,9 @@ endif()
endfunction()

if(ENABLE_CPU_SPECIFIC_TARGET_PER_TEST)
create_target_per_test_for_directory(${CMAKE_CURRENT_SOURCE_DIR}/custom/subgraph_tests/src ov_cpu_func_subgraph)
create_target_per_test_for_directory(${CMAKE_CURRENT_SOURCE_DIR}/custom/subgraph_tests/src ov_cpu_func_subgraph)
Contributor:

Different line alignment

},
has_single_consumer);

auto m_fc_c = ov::pass::pattern::wrap_type<ov::op::internal::FullyConnectedCompressed>(
Contributor:

What about a compressed FC with a zero-points input? Will it be matched?

};

CPU_REGISTER_PASS_X64(manager, pass::ConvertFullyConnectedToFullyConnectedCompressed,
supported_compression_types,
Contributor:

  1. The original graph optimizer pass has a limitation on activation precision. Is it handled inside the transformation?
  2. I also don't see the limitation on avx2 precision.
  3. https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_cpu/src/graph_optimizer.cpp#L436-L449 seems to be missing.
  4. We usually place HW-specific limitations inside the isSupportedOperation function. It is worth aligning the behavior.

Contributor (@v-Golubev) left a comment:

Ideally, we need to cover the new transformations with unit tests.

@@ -21,6 +21,7 @@
// Common transformations
#include "transformations/common_optimizations/mark_precision_sensitive_shapeof_subgraphs.hpp"
#include "transformations/common_optimizations/add_fake_quantize_fusion.hpp"
#include "transformations/common_optimizations/reshape_prelu.hpp"
Contributor:

Useless include

Contributor:

Please clean up this file: it contains some debug code.

ov::element::f4e2m1,
};

CPU_REGISTER_PASS_X64(manager, pass::ConvertFullyConnectedToFullyConnectedCompressed,
Contributor:

From my perspective, ideally the conversion to Compressed FC should happen at the beginning of the whole CPU transformation pipeline (as it is done for CompressedGather).
Do I understand correctly that the conversion will be moved in follow-up PR?

@@ -2,7 +2,8 @@
// SPDX-License-Identifier: Apache-2.0
//

#include "transformations/cpu_opset/common/op/fully_connected.hpp"
// #include "transformations/cpu_opset/common/op/fully_connected.hpp"
Contributor:

Suggested change
// #include "transformations/cpu_opset/common/op/fully_connected.hpp"

@@ -12,7 +12,8 @@
#include "openvino/pass/pattern/op/or.hpp"
#include "transformations/rt_info/dequantization_node.hpp"
#include "transformations/cpu_opset/common/op/power_static.hpp"
#include "transformations/cpu_opset/common/op/fully_connected.hpp"
#include "ov_ops/fully_connected.hpp"
// #include "transformations/cpu_opset/common/op/fully_connected.hpp"
Contributor:

Suggested change
// #include "transformations/cpu_opset/common/op/fully_connected.hpp"

fc_input_b = transpose->clone_with_new_inputs({fc_input_b->output(0), transpose_const});
ov::disable_constant_folding(fc_input_b);
result_nodes.push_back(fc_input_b);
fc_input_scale = transpose->clone_with_new_inputs({scale->output(0), transpose_const});
Contributor:

In case of a scalar scale we don't need to transpose it. Should we add the corresponding check, as is done for the zero point?

return in_ps.rank().is_static() && out_ps.rank().is_static() && in_ps.size() == 3 && out_ps.size() == 2;
};

auto weights_m = wrap_type<ov::op::v0::Constant>(compressed_constant);
Contributor:

Please use the existing predicate instead of custom one:

Suggested change
auto weights_m = wrap_type<ov::op::v0::Constant>(compressed_constant);
auto weights_m = wrap_type<ov::op::v0::Constant>(ov::pass::pattern::type_matches_any(supported_compression_types));

Comment on lines +23 to +29
auto quantized_weights = [](const ov::Output<ov::Node>& output) {
return output.get_element_type() == ov::element::i8;
};

auto quantized_activations = [](const ov::Output<ov::Node>& output) {
return output.get_element_type() == ov::element::u8 || output.get_element_type() == ov::element::i8;
};
Contributor:

Please use type_matches and type_matches_any instead of custom predicates

Comment on lines +58 to +59
(std::dynamic_pointer_cast<const ov::op::v0::Constant>(fc->get_input_node_shared_ptr(BIAS)) == nullptr &&
std::dynamic_pointer_cast<const ov::op::internal::Placeholder>(fc->get_input_node_shared_ptr(BIAS)) == nullptr)) {
Contributor:

We can write this much more concisely:

Suggested change
(std::dynamic_pointer_cast<const ov::op::v0::Constant>(fc->get_input_node_shared_ptr(BIAS)) == nullptr &&
std::dynamic_pointer_cast<const ov::op::internal::Placeholder>(fc->get_input_node_shared_ptr(BIAS)) == nullptr)) {
(!ov::is_type<const ov::op::v0::Constant>(fc->get_input_node_shared_ptr(BIAS)) &&
!ov::is_type<const ov::op::internal::Placeholder>(fc->get_input_node_shared_ptr(BIAS)))) {

Comment on lines +15 to +20
: FullyConnected(OutputVector(arguments.begin(), arguments.begin() + 3), output_type) {
for (size_t i = 3; i < arguments.size(); i++) {
set_argument(i, arguments[i]);
}
validate_and_infer_types();
}
Contributor:

To be honest, I don't really like constructors from OutputVector. They are not safe enough, since they allow the user to pass a vector of any size to the constructor; e.g. if arguments.size() < 3, we read out of bounds here, so we can get a hardly understandable error at runtime.
I understand the benefits of this constructor: it allows making some other class methods shorter. Maybe we can make this constructor non-public then, and make sure before each construction that the arguments size is applicable (it seems this is true for all the existing use cases)?
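An explicit argument-count guard at the top of the constructor would turn the out-of-bounds read into a clear error; an illustrative sketch (plain C++, with std::string standing in for ov::Output<Node>):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Validate the argument count before slicing the required prefix, so a
// too-short vector fails with a clear message instead of reading out of
// bounds when the constructor does arguments.begin() + min_required.
std::vector<std::string> split_required_args(const std::vector<std::string>& arguments,
                                             size_t min_required) {
    if (arguments.size() < min_required)
        throw std::invalid_argument("FullyConnected expects at least " +
                                    std::to_string(min_required) + " inputs, got " +
                                    std::to_string(arguments.size()));
    return {arguments.begin(), arguments.begin() + min_required};
}
```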

@v-Golubev self-assigned this Oct 8, 2024

This PR will be closed in a week because of 2 weeks of no activity.

@github-actions bot added the Stale label Oct 18, 2024