
[CPU] Introduce FullyConnected, FCQuantized, FCCompressed, Placeholder #26239

Open · wants to merge 1 commit into master
Conversation

@EgorDuplensky (Contributor) commented Aug 26, 2024

Details:

  1. Introduce the following operations to the internal opset

    • FullyConnected (MatMul with a transposed constant second input)
    • FullyConnectedCompressed (FullyConnected with weights compression)
    • FullyConnectedQuantizedLegacy (FullyConnected with quantized activations and weights and a dequantization scale and zero point pulled through the op by LPT)
    • FullyConnectedQuantized (FullyConnected with quantization scales and zero points on activations, weights and outputs). Planned to be used in the scope of dynamic quantization; can also be used for static quantization in the future.
    • Placeholder (a no-op input which replaces a default / optional value for an operation, e.g. the bias for FullyConnected)
  2. The following transformations were added / updated:

    • ConvertFullyConnectedToFullyConnectedCompressed (replaces the proprietary FuseFCAndWeightsDecompression)
    • ConvertFCToFCQuantizedLegacy (replaces the proprietary FuseConvMatmulFCDeconvAndDQScales)
    • FullyConnectedBiasFusion (added into the CPU folder for now; needs to be checked and reviewed by the GPU team before adaptation to the internal opset). Replaces the proprietary FuseConvolutionMatMulDeconvAndBias.
    • ConvertMatMulToFC updated to use ov::op::internal::FullyConnected; planned to be moved to the internal opset after review from the GPU team.

Tickets:

  • 149923
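For intuition, a minimal numeric sketch of what FullyConnected computes (plain C++, not the actual internal op; the weights are stored as [Co, Ci], i.e. already transposed relative to MatMul's second input):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y[n][co] = b[co] + sum_ci x[n][ci] * w[co][ci]
// Weights are laid out [Co, Ci], i.e. transposed relative to MatMul's B input.
std::vector<std::vector<float>> fully_connected(
        const std::vector<std::vector<float>>& x,  // activations [N, Ci]
        const std::vector<std::vector<float>>& w,  // weights     [Co, Ci]
        const std::vector<float>& b) {             // bias        [Co]
    const size_t N = x.size();
    const size_t Co = w.size();
    const size_t Ci = w.empty() ? 0 : w[0].size();
    std::vector<std::vector<float>> y(N, std::vector<float>(Co, 0.f));
    for (size_t n = 0; n < N; n++)
        for (size_t co = 0; co < Co; co++) {
            float acc = b[co];
            for (size_t ci = 0; ci < Ci; ci++)
                acc += x[n][ci] * w[co][ci];
            y[n][co] = acc;
        }
    return y;
}
```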

Fix Windows warning

Introduce FullyConnectedQuantizedLegacy

std::shared_ptr<Node> clone_with_new_inputs(const ov::OutputVector& new_args) const override;

virtual std::shared_ptr<Node> fuse_bias(const ov::Output<Node>& bias) const;
Contributor:
Do we need a method for that, especially virtual? I think bias fusion can be simply done via clone_with_new_inputs for any FC type:

        ov::OutputVector inputs;
        for (size_t i = 0; i < fc->get_input_size(); i++) {
            inputs.push_back(fc->get_input_source_output(i)); // would be nice to have a such helper in ov::Node
        }
        inputs[2] = bias;
        fc->clone_with_new_inputs(inputs);

Contributor Author:

Agree, the virtual method is not necessary anymore.
It was implemented before I added a constructor which accepts an OutputVector.
But maybe we could still benefit from some static class function.
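Such a static helper could be sketched roughly as follows (illustrative only; std::string stands in for ov::Output<Node>, and the BIAS slot index 2 follows the snippet above):

```cpp
#include <cassert>
#include <string>
#include <vector>

constexpr size_t BIAS = 2;  // assumed input order: data, weights, bias, ...

// Build the input list to pass to clone_with_new_inputs, with the bias
// slot replaced by the new bias output.
std::vector<std::string> with_fused_bias(std::vector<std::string> inputs,
                                         const std::string& bias) {
    assert(inputs.size() > BIAS);
    inputs[BIAS] = bias;
    return inputs;
}
```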

const ov::Output<Node>& bias,
const ov::Output<Node>& deq_scales,
const ov::Output<Node>& deq_zero_points,
const ov::element::Type output_type = ov::element::undefined);
Contributor:
Can it be expressed as:

    FullyConnectedQuantized(X, W, bias, placeholder, placeholder, placeholder, placeholder, deq_scales, deq_zero_points)

?

Contributor Author (@EgorDuplensky, Sep 19, 2024):

We will probably have to do this if we are not going to change the current MatcherPass logic.
I mean that currently MatcherPass requires a pattern to have the same number of inputs as the node being matched.
If we relax this requirement (maybe not by default, but by providing some configuration option to the MatcherPass) so that we can specify only the node inputs we really care about, then it becomes possible to have operations with a variable number of inputs (still strictly ordered, but with the optional tail inputs omitted), and we would not need to add so many placeholders for the inputs.
But this could potentially bring some other problems.
@itikhono What do you think?

op.set_transpose_b(true);

auto out_shapes =
ov::op::v0::shape_infer(&op,
Contributor (@vladimir-paramuzov, Sep 19, 2024):

I think we can have a shape_infer function for the FullyConnected op to avoid duplicating the MatMul op creation and setup.

Contributor Author:

Agree, we'd better implement one.
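A dedicated shape_infer for FullyConnected could be as simple as this sketch (plain C++, static shapes only; the real implementation would also need to handle dynamic dimensions and ranks):

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

using Shape = std::vector<int64_t>;

// Output keeps all activation dims except the last one, which becomes
// Co = weights[0] (weights are [Co, Ci]).
Shape fc_shape_infer(const Shape& activation, const Shape& weights) {
    if (activation.empty() || weights.size() != 2)
        throw std::invalid_argument("unexpected ranks");
    if (activation.back() != weights[1])
        throw std::invalid_argument("channel count mismatch");
    Shape out = activation;
    out.back() = weights[0];  // Co
    return out;
}
```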

validate_and_infer_types();
}

FullyConnectedQuantizedLegacy::FullyConnectedQuantizedLegacy(const ov::Output<Node>& X,
Contributor:

Do we need these c-tors with an explicit args list if we have one with OutputVector already?

Contributor Author (@EgorDuplensky, Sep 19, 2024):

This is not strictly necessary, but in my opinion it simplifies the transformation code.
It is also related to the possibility of having a variable number of inputs.
For example, this is the code which clones the GPU version of FullyConnectedCompressed:

std::shared_ptr<ov::Node> FullyConnectedCompressed::clone_with_new_inputs(const ov::OutputVector& new_args) const {
    check_new_args_count(this, new_args);

    if (new_args.size() == 4)
        return std::make_shared<FullyConnectedCompressed>(new_args.at(0),
                                                          new_args.at(1),
                                                          new_args.at(2),
                                                          new_args.at(3),
                                                          m_output_type);
    else if (new_args.size() == 5)
        return std::make_shared<FullyConnectedCompressed>(new_args.at(0),
                                                          new_args.at(1),
                                                          new_args.at(2),
                                                          new_args.at(3),
                                                          new_args.at(4),
                                                          m_output_type);
    else if (new_args.size() == 6)
        return std::make_shared<FullyConnectedCompressed>(new_args.at(0),
                                                          new_args.at(1),
                                                          new_args.at(2),
                                                          new_args.at(3),
                                                          new_args.at(4),
                                                          new_args.at(5),
                                                          m_output_type);
    else
        OPENVINO_THROW("Unexpected inputs count for FullyConnectedCompressed op: ", new_args.size());
}

And this is how it looks with an extra constructor:

std::shared_ptr<ov::Node> FullyConnectedCompressed::clone_with_new_inputs(const ov::OutputVector& new_args) const {
    check_new_args_count(this, new_args);

    return std::make_shared<FullyConnectedCompressed>(new_args, m_output_type);
}

Contributor:

But to have the second version we need only the c-tor with the OutputVector arg, right? So I'm curious whether we can just remove the others. From the transformations perspective it can be either simpler (if we just take the args from another node and add/modify something) or almost the same in other cases (only extra braces are needed).

@EgorDuplensky changed the title [CPU] Introduce FullyConnectedQuantized op and bias fusing → [CPU] Introduce FullyConnected, FCQuantized, FCCompressed, Placeholder Sep 24, 2024

class TRANSFORMATIONS_API FullyConnectedCompressed : public ov::op::internal::FullyConnected {
public:
OPENVINO_OP("FullyConnectedCompressed", "gpu_opset");
Contributor:

"gpu_opset"? Should be "ie_internal_opset" I guess

validate_and_infer_types();
}

// FullyConnectedCompressed::FullyConnectedCompressed(const ov::Output<Node>& X,
Contributor:

Delete?

Contributor:

What about other functions like "NameFromType"?

", scale data size: ",
scaleSize);

if (scaleSize > DQScales.size())
Contributor:

Is this condition possible at all?

DQScales[i] *= scalesData[i];
}
}
if (std::all_of(DQScales.begin(), DQScales.end(), [&](float val) {
Contributor:

This logic looks like the opposite of the code on line 52. Do we really need to broadcast the value first?

@@ -104,10 +104,11 @@ DnnlMemoryDescPtr DnnlMatMulPrimitive::makeTransposedWeightDescriptor(const Dnnl
const auto& weiDesc = srcDesc->getDnnlDesc();
auto wDims = weiDesc.get_dims();
auto wDataType = weiDesc.get_data_type();
std::swap(wDims[wDims.size() - 1], wDims[wDims.size() - 2]);
dnnl::memory::dim batchDim = std::accumulate(wDims.begin(), wDims.end() - 1, 1, std::multiplies<dnnl::memory::dim>());
Contributor:

This logic is duplicated in many files. Maybe it is worth extracting normalizeDimsTo2D into some FC utility file.
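A shared helper along the lines the comment suggests might look like this (a sketch; the name normalizeDimsTo2D follows the existing CPU-plugin utility mentioned above):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

using Dims = std::vector<int64_t>;

// Collapse all leading dims into a single batch dim:
// [d0, d1, ..., dn-1] -> [d0 * ... * dn-2, dn-1].
Dims normalizeDimsTo2D(const Dims& dims) {
    const int64_t batch = std::accumulate(dims.begin(), dims.end() - 1,
                                          int64_t{1}, std::multiplies<int64_t>());
    return {batch, dims.back()};
}
```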


// activation weight output_shape
// NCHW CoCHW NCo
// TNC CoC TNCo
// NC CoC NCo
VectorDims outputShape(out_rank, 1);
// set Co
- outputShape.back() = weightShape[0];
+ outputShape.back() = std::accumulate(weightShape.begin(), weightShape.end() - 1, 1, std::multiplies<Dim>());
Contributor:

I am wondering, does FullyConnected::validate_and_infer_types() always return a 2D output shape?

@@ -96,7 +96,9 @@ endif()
endfunction()

if(ENABLE_CPU_SPECIFIC_TARGET_PER_TEST)
create_target_per_test_for_directory(${CMAKE_CURRENT_SOURCE_DIR}/custom/subgraph_tests/src ov_cpu_func_subgraph)
create_target_per_test_for_directory(${CMAKE_CURRENT_SOURCE_DIR}/custom/subgraph_tests/src ov_cpu_func_subgraph)
Contributor:

Different line alignment

},
has_single_consumer);

auto m_fc_c = ov::pass::pattern::wrap_type<ov::op::internal::FullyConnectedCompressed>(
Contributor:

What about a compressed FC with a zero-points input? Will it be matched?

};

CPU_REGISTER_PASS_X64(manager, pass::ConvertFullyConnectedToFullyConnectedCompressed,
supported_compression_types,
Contributor:

  1. The original graph optimizer pass has a limitation on activation precision. Is it handled inside the transformation?
  2. I also don't see the limitation on avx2 precision.
  3. https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_cpu/src/graph_optimizer.cpp#L436-L449 seems to be missing.
  4. We usually place HW-specific limitations inside the isSupportedOperation function. It is worth aligning the behavior.

Contributor (@v-Golubev) left a comment:

Ideally, we need to cover the new transformations with unit tests.

@@ -21,6 +21,7 @@
// Common transformations
#include "transformations/common_optimizations/mark_precision_sensitive_shapeof_subgraphs.hpp"
#include "transformations/common_optimizations/add_fake_quantize_fusion.hpp"
#include "transformations/common_optimizations/reshape_prelu.hpp"
Contributor:

Useless include

Contributor:

Please clean up this file: it contains some debug code.

ov::element::f4e2m1,
};

CPU_REGISTER_PASS_X64(manager, pass::ConvertFullyConnectedToFullyConnectedCompressed,
Contributor:

From my perspective, ideally the conversion to Compressed FC should happen at the beginning of the whole CPU transformation pipeline (as it is done for CompressedGather).
Do I understand correctly that the conversion will be moved in follow-up PR?

@@ -2,7 +2,8 @@
// SPDX-License-Identifier: Apache-2.0
//

#include "transformations/cpu_opset/common/op/fully_connected.hpp"
// #include "transformations/cpu_opset/common/op/fully_connected.hpp"
Contributor:

Suggested change
// #include "transformations/cpu_opset/common/op/fully_connected.hpp"

@@ -12,7 +12,8 @@
#include "openvino/pass/pattern/op/or.hpp"
#include "transformations/rt_info/dequantization_node.hpp"
#include "transformations/cpu_opset/common/op/power_static.hpp"
#include "transformations/cpu_opset/common/op/fully_connected.hpp"
#include "ov_ops/fully_connected.hpp"
// #include "transformations/cpu_opset/common/op/fully_connected.hpp"
Contributor:

Suggested change
// #include "transformations/cpu_opset/common/op/fully_connected.hpp"

fc_input_b = transpose->clone_with_new_inputs({fc_input_b->output(0), transpose_const});
ov::disable_constant_folding(fc_input_b);
result_nodes.push_back(fc_input_b);
fc_input_scale = transpose->clone_with_new_inputs({scale->output(0), transpose_const});
Contributor:

In case of a scalar scale we don't need to transpose it. Should we add the corresponding check, as is done for the zero point?

return in_ps.rank().is_static() && out_ps.rank().is_static() && in_ps.size() == 3 && out_ps.size() == 2;
};

auto weights_m = wrap_type<ov::op::v0::Constant>(compressed_constant);
Contributor:

Please use the existing predicate instead of custom one:

Suggested change
auto weights_m = wrap_type<ov::op::v0::Constant>(compressed_constant);
auto weights_m = wrap_type<ov::op::v0::Constant>(ov::pass::pattern::type_matches_any(supported_compression_types));

Comment on lines +23 to +29
auto quantized_weights = [](const ov::Output<ov::Node>& output) {
return output.get_element_type() == ov::element::i8;
};

auto quantized_activations = [](const ov::Output<ov::Node>& output) {
return output.get_element_type() == ov::element::u8 || output.get_element_type() == ov::element::i8;
};
Contributor:

Please use type_matches and type_matches_any instead of custom predicates

Comment on lines +58 to +59
(std::dynamic_pointer_cast<const ov::op::v0::Constant>(fc->get_input_node_shared_ptr(BIAS)) == nullptr &&
std::dynamic_pointer_cast<const ov::op::internal::Placeholder>(fc->get_input_node_shared_ptr(BIAS)) == nullptr)) {
Contributor:

We can write this much more concisely:

Suggested change
(std::dynamic_pointer_cast<const ov::op::v0::Constant>(fc->get_input_node_shared_ptr(BIAS)) == nullptr &&
std::dynamic_pointer_cast<const ov::op::internal::Placeholder>(fc->get_input_node_shared_ptr(BIAS)) == nullptr)) {
(!ov::is_type<const ov::op::v0::Constant>(fc->get_input_node_shared_ptr(BIAS)) &&
!ov::is_type<const ov::op::internal::Placeholder>(fc->get_input_node_shared_ptr(BIAS)))) {

Comment on lines +15 to +20
: FullyConnected(OutputVector(arguments.begin(), arguments.begin() + 3), output_type) {
for (size_t i = 3; i < arguments.size(); i++) {
set_argument(i, arguments[i]);
}
validate_and_infer_types();
}
Contributor:

To be honest, I don't really like constructors from OutputVector. They are not safe enough, since they allow the user to pass a vector of any size to the constructor; e.g. if arguments.size() < 3, we read out of bounds here, so we can get a hardly understandable error at runtime.
I understand the benefits of this constructor: it allows making some other class methods shorter. Maybe we can make this constructor non-public then, and make sure before each construction that the arguments size is applicable (it seems this is true for all the existing use cases)?
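An explicit argument-count guard at the top of the constructor would turn the out-of-bounds read into a clear error; an illustrative sketch (plain C++, with std::string standing in for ov::Output<Node>):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Validate the argument count before slicing the required prefix, so a
// too-short vector fails with a clear message instead of reading out of
// bounds when the constructor does arguments.begin() + min_required.
std::vector<std::string> split_required_args(const std::vector<std::string>& arguments,
                                             size_t min_required) {
    if (arguments.size() < min_required)
        throw std::invalid_argument("FullyConnected expects at least " +
                                    std::to_string(min_required) + " inputs, got " +
                                    std::to_string(arguments.size()));
    return {arguments.begin(), arguments.begin() + min_required};
}
```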

@v-Golubev self-assigned this Oct 8, 2024

This PR will be closed in a week because of 2 weeks of no activity.

@github-actions bot added the Stale label Oct 18, 2024