[CPU] Introduce FullyConnected, FCQuantized, FCCompressed, Placeholder #26239
Conversation
src/common/transformations/src/transformations/op_conversions/convert_fc_to_compressed.cpp (resolved review thread)
```cpp
std::shared_ptr<Node> clone_with_new_inputs(const ov::OutputVector& new_args) const override;

virtual std::shared_ptr<Node> fuse_bias(const ov::Output<Node>& bias) const;
```
Do we need a method for that, especially a virtual one? I think bias fusion can simply be done via clone_with_new_inputs for any FC type:

```cpp
ov::OutputVector inputs;
for (size_t i = 0; i < fc->get_input_size(); i++) {
    inputs.push_back(fc->get_input_source_output(i));  // would be nice to have such a helper in ov::Node
}
inputs[2] = bias;
fc->clone_with_new_inputs(inputs);
```
Agreed, the virtual method is not necessary anymore.
It was implemented before I added a constructor which accepts an OutputVector.
But maybe we could still benefit from some static function on the class.
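For illustration, such a static helper could look roughly like this (a sketch building on the clone_with_new_inputs approach above; the name `with_bias` and the assumption that the bias sits in input slot 2 are hypothetical):

```cpp
#include <memory>
#include "openvino/core/node.hpp"

// Hypothetical static helper: clone an FC node with only the bias input replaced.
static std::shared_ptr<ov::Node> with_bias(const std::shared_ptr<ov::Node>& fc,
                                           const ov::Output<ov::Node>& bias) {
    ov::OutputVector inputs = fc->input_values();  // copy all current inputs
    inputs[2] = bias;                              // assumed bias slot
    return fc->clone_with_new_inputs(inputs);
}
```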
```cpp
const ov::Output<Node>& bias,
const ov::Output<Node>& deq_scales,
const ov::Output<Node>& deq_zero_points,
const ov::element::Type output_type = ov::element::undefined);
```
Can it be expressed as:

```cpp
FullyConnectedQuantized(X, W, bias, placeholder, placeholder, placeholder, placeholder, deq_scales, deq_zero_points)
```
?
Probably we will have to do this if we are not going to change the current MatcherPass logic. Currently MatcherPass requires the pattern to have the same number of inputs as the node being matched.
If we relax this requirement (maybe not by default, but via some configuration option on the MatcherPass) so that we only specify the node inputs we really care about, then it will be possible to have operations with a variable number of inputs (still strictly ordered, but with the optional tail inputs omitted), and we would not need to add so many placeholders for the inputs.
But potentially that can bring other problems.
@itikhono What do you think?
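Under the current strict-arity matching, the optional tail inputs end up padded with the Placeholder op introduced in this PR; roughly like this (a sketch with assumed constructors and shapes, for illustration only):

```cpp
#include <memory>
#include "openvino/op/parameter.hpp"

// Assumed ctors/shapes: pad the unused optional inputs with Placeholder so
// the op keeps a fixed arity that the strict MatcherPass pattern can match.
auto X = std::make_shared<ov::op::v0::Parameter>(ov::element::u8, ov::Shape{1, 16});
auto W = std::make_shared<ov::op::v0::Parameter>(ov::element::i8, ov::Shape{32, 16});
auto bias = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, ov::Shape{32});
auto deq_scales = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, ov::Shape{32, 1});
auto deq_zero_points = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, ov::Shape{32, 1});
auto placeholder = std::make_shared<ov::op::internal::Placeholder>();

auto fcq = std::make_shared<ov::op::internal::FullyConnectedQuantized>(
    ov::OutputVector{X, W, bias,
                     placeholder, placeholder, placeholder, placeholder,
                     deq_scales, deq_zero_points});
```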
```cpp
op.set_transpose_b(true);

auto out_shapes =
    ov::op::v0::shape_infer(&op,
```
I think we can have a shape_infer function for the FullyConnected op to avoid duplicating the MatMul op creation and setup.
Agreed, we'd better implement one.
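A minimal sketch of what such a dedicated shape inference could compute (the function name and free-standing form are assumptions, not the actual OpenVINO API):

```cpp
#include <cstddef>
#include <vector>

// For activation [..., K] and (transposed) weights [N, K], the
// FullyConnected output shape is [..., N].
std::vector<size_t> fc_shape_infer(const std::vector<size_t>& activation_shape,
                                   const std::vector<size_t>& weights_shape) {
    std::vector<size_t> out_shape = activation_shape;
    out_shape.back() = weights_shape.front();  // replace K with N (Co)
    return out_shape;
}
```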
```cpp
    validate_and_infer_types();
}

FullyConnectedQuantizedLegacy::FullyConnectedQuantizedLegacy(const ov::Output<Node>& X,
```
Do we need these c-tors with an explicit args list if we already have one with an OutputVector?
This is not strictly necessary, but in my opinion it simplifies the transformations code.
It is also related to the possibility of having a variable number of inputs.
For example, this is the code which clones the GPU version of FullyConnectedCompressed:
```cpp
std::shared_ptr<ov::Node> FullyConnectedCompressed::clone_with_new_inputs(const ov::OutputVector& new_args) const {
    check_new_args_count(this, new_args);
    if (new_args.size() == 4)
        return std::make_shared<FullyConnectedCompressed>(new_args.at(0),
                                                          new_args.at(1),
                                                          new_args.at(2),
                                                          new_args.at(3),
                                                          m_output_type);
    else if (new_args.size() == 5)
        return std::make_shared<FullyConnectedCompressed>(new_args.at(0),
                                                          new_args.at(1),
                                                          new_args.at(2),
                                                          new_args.at(3),
                                                          new_args.at(4),
                                                          m_output_type);
    else if (new_args.size() == 6)
        return std::make_shared<FullyConnectedCompressed>(new_args.at(0),
                                                          new_args.at(1),
                                                          new_args.at(2),
                                                          new_args.at(3),
                                                          new_args.at(4),
                                                          new_args.at(5),
                                                          m_output_type);
    else
        OPENVINO_THROW("Unexpected inputs count for FullyConnectedCompressed op: ", new_args.size());
}
```
And this is how it looks with an extra constructor:
```cpp
std::shared_ptr<ov::Node> FullyConnectedCompressed::clone_with_new_inputs(const ov::OutputVector& new_args) const {
    check_new_args_count(this, new_args);
    return std::make_shared<FullyConnectedCompressed>(new_args, m_output_type);
}
```
But to have the second version we only need the c-tor with the OutputVector arg, right? So I'm curious if we can just remove the others. From the transformations perspective it can even be simpler (if we just take the args from another node and add/modify something); in other cases it's almost the same (it will only need extra braces).
```cpp
class TRANSFORMATIONS_API FullyConnectedCompressed : public ov::op::internal::FullyConnected {
public:
    OPENVINO_OP("FullyConnectedCompressed", "gpu_opset");
```
"gpu_opset"? Should be "ie_internal_opset" I guess
```cpp
    validate_and_infer_types();
}

// FullyConnectedCompressed::FullyConnectedCompressed(const ov::Output<Node>& X,
```
Delete?
What about other functions like "NameFromType"?
", scale data size: ", | ||
scaleSize); | ||
|
||
if (scaleSize > DQScales.size()) |
Is this condition possible at all?
```cpp
        DQScales[i] *= scalesData[i];
    }
}
if (std::all_of(DQScales.begin(), DQScales.end(), [&](float val) {
```
This logic looks like the opposite of the code on line 52. Do we really need to broadcast the value first?
```diff
@@ -104,10 +104,11 @@ DnnlMemoryDescPtr DnnlMatMulPrimitive::makeTransposedWeightDescriptor(const Dnnl
     const auto& weiDesc = srcDesc->getDnnlDesc();
     auto wDims = weiDesc.get_dims();
     auto wDataType = weiDesc.get_data_type();
     std::swap(wDims[wDims.size() - 1], wDims[wDims.size() - 2]);
+    dnnl::memory::dim batchDim = std::accumulate(wDims.begin(), wDims.end() - 1, 1, std::multiplies<dnnl::memory::dim>());
```
This logic is duplicated in many files. Maybe it is worth extracting normalizeDimsTo2D into some FC utility file.
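For illustration, the shared helper could look roughly like this (a sketch; the signature and the assumption that the leading dims collapse into a single batch dim are mine, not the existing API):

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Collapse all leading dimensions into one batch dimension, so a rank-N
// tensor is treated as a 2D [batch, innermost] matrix by the FC primitive.
std::vector<int64_t> normalizeDimsTo2D(const std::vector<int64_t>& dims) {
    const int64_t batch = std::accumulate(dims.begin(), dims.end() - 1,
                                          int64_t{1}, std::multiplies<int64_t>());
    return {batch, dims.back()};
}
```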
```diff
 // activation  weight  output_shape
 // NCHW        CoCHW   NCo
 // TNC         CoC     TNCo
 // NC          CoC     NCo
 VectorDims outputShape(out_rank, 1);
 // set Co
-outputShape.back() = weightShape[0];
+outputShape.back() = std::accumulate(weightShape.begin(), weightShape.end() - 1, 1, std::multiplies<Dim>());
```
I am wondering: does FullyConnected::validate_and_infer_types() always produce a 2D output shape?
```cmake
@@ -96,7 +96,9 @@
endif()
endfunction()

if(ENABLE_CPU_SPECIFIC_TARGET_PER_TEST)
    create_target_per_test_for_directory(${CMAKE_CURRENT_SOURCE_DIR}/custom/subgraph_tests/src ov_cpu_func_subgraph)
    create_target_per_test_for_directory(${CMAKE_CURRENT_SOURCE_DIR}/custom/subgraph_tests/src ov_cpu_func_subgraph)
```
Different line alignment.
```cpp
    },
    has_single_consumer);

auto m_fc_c = ov::pass::pattern::wrap_type<ov::op::internal::FullyConnectedCompressed>(
```
What about a compressed FC with a zero-points input? Will it be matched?
```cpp
};

CPU_REGISTER_PASS_X64(manager, pass::ConvertFullyConnectedToFullyConnectedCompressed,
                      supported_compression_types,
```
- The original graph optimizer pass has a limitation on activations precision. Is it handled inside the transformation?
- I also don't see the limitation on avx2 precision.
- https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_cpu/src/graph_optimizer.cpp#L436-L449 seems to be missed.
- We usually place HW-specific limitations inside the isSupportedOperation function. It is worth aligning the behavior.
Ideally, we need to cover the new transformations with unit tests.
```diff
@@ -21,6 +21,7 @@
 // Common transformations
 #include "transformations/common_optimizations/mark_precision_sensitive_shapeof_subgraphs.hpp"
 #include "transformations/common_optimizations/add_fake_quantize_fusion.hpp"
+#include "transformations/common_optimizations/reshape_prelu.hpp"
```
Useless include
Please clean up this file: it contains some debug code.
```cpp
    ov::element::f4e2m1,
};

CPU_REGISTER_PASS_X64(manager, pass::ConvertFullyConnectedToFullyConnectedCompressed,
```
From my perspective, ideally the conversion to compressed FC should happen at the beginning of the whole CPU transformation pipeline (as is done for CompressedGather).
Do I understand correctly that the conversion will be moved in a follow-up PR?
```diff
@@ -2,7 +2,8 @@
 // SPDX-License-Identifier: Apache-2.0
 //

-#include "transformations/cpu_opset/common/op/fully_connected.hpp"
+// #include "transformations/cpu_opset/common/op/fully_connected.hpp"
```
Suggested change: remove the commented-out include line `// #include "transformations/cpu_opset/common/op/fully_connected.hpp"`.
```diff
@@ -12,7 +12,8 @@
 #include "openvino/pass/pattern/op/or.hpp"
 #include "transformations/rt_info/dequantization_node.hpp"
 #include "transformations/cpu_opset/common/op/power_static.hpp"
-#include "transformations/cpu_opset/common/op/fully_connected.hpp"
+#include "ov_ops/fully_connected.hpp"
+// #include "transformations/cpu_opset/common/op/fully_connected.hpp"
```
Suggested change: remove the commented-out include line `// #include "transformations/cpu_opset/common/op/fully_connected.hpp"`.
```cpp
fc_input_b = transpose->clone_with_new_inputs({fc_input_b->output(0), transpose_const});
ov::disable_constant_folding(fc_input_b);
result_nodes.push_back(fc_input_b);
fc_input_scale = transpose->clone_with_new_inputs({scale->output(0), transpose_const});
```
In the case of a scalar scale we don't need to transpose it. Should we add the corresponding check, as is done for the zero point?
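A minimal sketch of such a guard, mirroring the zero-point handling (the variable names come from the snippet above; the exact shape check is an assumption):

```cpp
// Only transpose the scale when it is not a scalar; a scalar is
// layout-independent, so transposing it would add a useless no-op node.
if (ov::shape_size(scale->get_output_shape(0)) > 1) {
    fc_input_scale = transpose->clone_with_new_inputs({scale->output(0), transpose_const});
    ov::disable_constant_folding(fc_input_scale);
    result_nodes.push_back(fc_input_scale);
} else {
    fc_input_scale = scale;
}
```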
```cpp
    return in_ps.rank().is_static() && out_ps.rank().is_static() && in_ps.size() == 3 && out_ps.size() == 2;
};

auto weights_m = wrap_type<ov::op::v0::Constant>(compressed_constant);
```
Please use the existing predicate instead of a custom one:

```diff
-auto weights_m = wrap_type<ov::op::v0::Constant>(compressed_constant);
+auto weights_m = wrap_type<ov::op::v0::Constant>(ov::pass::pattern::type_matches_any(supported_compression_types));
```
```cpp
auto quantized_weights = [](const ov::Output<ov::Node>& output) {
    return output.get_element_type() == ov::element::i8;
};

auto quantized_activations = [](const ov::Output<ov::Node>& output) {
    return output.get_element_type() == ov::element::u8 || output.get_element_type() == ov::element::i8;
};
```
Please use `type_matches` and `type_matches_any` instead of custom predicates.
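For instance, the two lambdas above could then become (a sketch using the existing pattern predicates):

```cpp
auto quantized_weights = ov::pass::pattern::type_matches(ov::element::i8);
auto quantized_activations =
    ov::pass::pattern::type_matches_any({ov::element::u8, ov::element::i8});
```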
```cpp
(std::dynamic_pointer_cast<const ov::op::v0::Constant>(fc->get_input_node_shared_ptr(BIAS)) == nullptr &&
 std::dynamic_pointer_cast<const ov::op::internal::Placeholder>(fc->get_input_node_shared_ptr(BIAS)) == nullptr)) {
```
We can write it much shorter:

```diff
-(std::dynamic_pointer_cast<const ov::op::v0::Constant>(fc->get_input_node_shared_ptr(BIAS)) == nullptr &&
- std::dynamic_pointer_cast<const ov::op::internal::Placeholder>(fc->get_input_node_shared_ptr(BIAS)) == nullptr)) {
+(!ov::is_type<const ov::op::v0::Constant>(fc->get_input_node_shared_ptr(BIAS)) &&
+ !ov::is_type<const ov::op::internal::Placeholder>(fc->get_input_node_shared_ptr(BIAS)))) {
```
```cpp
    : FullyConnected(OutputVector(arguments.begin(), arguments.begin() + 3), output_type) {
    for (size_t i = 3; i < arguments.size(); i++) {
        set_argument(i, arguments[i]);
    }
    validate_and_infer_types();
}
```
To be honest, I don't really like the constructors taking an OutputVector. They are not safe enough, since they allow the user to pass a vector of any size to the constructor; e.g. if arguments.size() < 3, we read out of bounds here, which can produce a hard-to-understand runtime error.
I understand the benefit of this constructor: it makes some other class methods shorter. Maybe we can make this constructor non-public then, and make sure at each construction site that the arguments size is valid (this seems to be true for all the existing use cases)?
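A sketch of the guarded variant this suggests; the class name and the OPENVINO_ASSERT-based check are assumptions about how it could be enforced (the check has to live in the initializer expression, since the base-class initializer runs before the constructor body):

```cpp
FullyConnectedCompressed::FullyConnectedCompressed(const ov::OutputVector& arguments,
                                                   const ov::element::Type output_type)
    : FullyConnected(
          // Validate the size before slicing; a too-small vector now fails
          // with a clear message instead of reading out of bounds.
          [&]() -> ov::OutputVector {
              OPENVINO_ASSERT(arguments.size() >= 3,
                              "FullyConnectedCompressed expects at least 3 inputs, got ",
                              arguments.size());
              return ov::OutputVector(arguments.begin(), arguments.begin() + 3);
          }(),
          output_type) {
    for (size_t i = 3; i < arguments.size(); i++) {
        set_argument(i, arguments[i]);
    }
    validate_and_infer_types();
}
```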
This PR will be closed in a week because of 2 weeks of no activity.
Details:
Introduce the following operations to the internal opset:
- FullyConnected (MatMul with a transposed constant second input)
- FullyConnectedCompressed (FullyConnected with weights compression)
- FullyConnectedQuantizedLegacy (FullyConnected with quantized activations and weights, and the dequantize scale and zero point pulled through the Op by LPT)
- FullyConnectedQuantized (FullyConnected with quantization scales and zero points on activations, weights and outputs). Planned to be used in scope of dynamic quantization; can be used for static quantization as well in the future.
- Placeholder (a NoOp input which replaces a default / optional value for an operation, i.e. the bias for FullyConnected)

The following transformations were added / updated:
- ConvertFullyConnectedToFullyConnectedCompressed (replaces the proprietary FuseFCAndWeightsDecompression)
- ConvertFCToFCQuantizedLegacy (replaces the proprietary FuseConvMatmulFCDeconvAndDQScales)
- FullyConnectedBiasFusion (added into the CPU folder for now; needs to be checked and reviewed by the GPU team before adaptation to the internal opset). Replaces the proprietary FuseConvolutionMatMulDeconvAndBias
- ConvertMatMulToFC updated to use ov::op::internal::FullyConnected; planned to be moved to the internal opset after review from the GPU team

Tickets: