Description
architecture
- frontend: mark the subgraphs that should be optimized by an x engine
- inference preparation: extract each marked subgraph and transform the program desc by replacing the subgraph with an x engine op
- engine op: build an x engine and execute it like a normal operator
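The middle step, replacing a marked subgraph with a single engine op, can be sketched as follows. This is a minimal illustration assuming the program desc is just an ordered list of op descs; `OpDesc` and `replace_subgraph` are illustrative names, not Paddle's API:

```python
# Hypothetical sketch: replace a marked run of ops in a program desc with a
# single engine op that carries the subgraph as an attribute.

class OpDesc:
    def __init__(self, op_type, attrs=None):
        self.op_type = op_type
        self.attrs = attrs or {}

def replace_subgraph(program, marked):
    """Return a new program where every maximal run of ops whose
    type is in `marked` becomes one `tensorrt_engine` op."""
    out, subgraph = [], []
    for op in program:
        if op.op_type in marked:
            subgraph.append(op)
        else:
            if subgraph:
                out.append(OpDesc("tensorrt_engine",
                                  {"subgraph_block": subgraph}))
                subgraph = []
            out.append(op)
    if subgraph:
        out.append(OpDesc("tensorrt_engine", {"subgraph_block": subgraph}))
    return out

program = [OpDesc("op0"), OpDesc("op1"), OpDesc("op2"), OpDesc("op4")]
new_program = replace_subgraph(program, marked={"op1", "op2"})
# new_program op types: ['op0', 'tensorrt_engine', 'op4']
```

The engine op keeps the original subgraph as an attribute so the backend can convert it later.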
phases
frontend
- manually partition graph
- TODO LATER: automatically partition the graph
Some initial ideas: add a special with-block, used only for inference:
a = op0(b, c)
with infer.accelerate_by_tensorrt:
a1 = op1(b, c)
a2 = op1(b, c)
c = op2(b, c)
with infer.accelerate_by_some_other_engine:
a3 = op3(c)
...
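The with-block idea above can be sketched with a context manager that records which engine, if any, is active while each op is created. All names here (`Marker`, `accelerate_by`) are illustrative, not Paddle APIs:

```python
from contextlib import contextmanager

# Hypothetical sketch: mark ops created inside a with-block so the
# inference-preparation pass can later cut them out as one subgraph.

class Marker:
    def __init__(self):
        self.engine = None          # engine name active inside a with-block
        self.marked_ops = []        # (op_name, engine) pairs recorded so far

    @contextmanager
    def accelerate_by(self, engine):
        prev, self.engine = self.engine, engine
        try:
            yield
        finally:
            self.engine = prev

    def record(self, op_name):
        self.marked_ops.append((op_name, self.engine))

marker = Marker()

def op(name):
    # stand-in for an operator-creating API call
    marker.record(name)

op("op0")
with marker.accelerate_by("tensorrt"):
    op("op1")
    op("op1")
    op("op2")
with marker.accelerate_by("some_other_engine"):
    op("op3")

# marker.marked_ops now tells the backend which ops belong to which engine:
# [('op0', None), ('op1', 'tensorrt'), ('op1', 'tensorrt'),
#  ('op2', 'tensorrt'), ('op3', 'some_other_engine')]
```

Nesting the engine name on a stack (`prev`) keeps the sketch correct even if with-blocks were nested.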
backend
- inference prepare
- partition graph and transform the infer program desc
- tensorrt_engine_op.build
- convert: take the subgraph's block desc as input and add TensorRT layers into the TensorRT engine
- transform the weight format from fluid to tensorrt
- add the tensorrt layers
- inference execute
- for each op in the infer program desc
- op.run
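The execute phase is then just the ordinary op loop; the engine op participates like any other op. A minimal sketch, with illustrative names:

```python
# Hypothetical sketch of the inference execute phase: once the program desc
# is transformed, execution is the plain op-by-op loop.

class Op:
    def __init__(self, name):
        self.name = name

    def run(self, scope):
        scope.append(self.name)   # stand-in for real kernel execution

def execute(program_desc, scope):
    # for each op in the infer program desc: op.run
    for op in program_desc:
        op.run(scope)

scope = []
execute([Op("op0"), Op("tensorrt_engine"), Op("op4")], scope)
# scope == ["op0", "tensorrt_engine", "op4"]
```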
x engine
- get a subgraph's block desc (from attribute)
- build x engine once
- execute the x engine any number of times
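The build-once / execute-many contract above can be sketched with a lazy build. `XEngine` and its methods are illustrative names under these assumptions, not a real API:

```python
# Hypothetical sketch of the "x engine" contract: get the subgraph's block
# desc from an op attribute, build once, execute any number of times.

class XEngine:
    def __init__(self, block_desc):
        self.block_desc = block_desc   # subgraph block desc from the op attribute
        self._built = False
        self.build_count = 0           # only to demonstrate build-once below

    def build(self):
        # expensive one-time conversion of the block desc into engine form
        if not self._built:
            self.build_count += 1
            self._built = True

    def execute(self, inputs):
        self.build()                   # lazy build on first execute
        # stub: a real engine would run its optimized network here
        return [x for x in inputs]

engine = XEngine(block_desc=["op1", "op2"])
out1 = engine.execute([1, 2, 3])
out2 = engine.execute([4, 5])
# engine.build_count == 1 even after two executions
```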
x op
- construct x op from x network
convert
- construct an x network from a subgraph's block desc
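One common way to structure this convert step is a registry of per-op-type converters that each add the matching layer to the x network. A minimal sketch; the registry and converter names are assumptions for illustration, not a real API:

```python
# Hypothetical sketch of convert: walk the subgraph's block desc and let a
# per-op-type converter add the corresponding layer to the x network.

CONVERTERS = {}

def register(op_type):
    def deco(fn):
        CONVERTERS[op_type] = fn
        return fn
    return deco

@register("mul")
def convert_mul(op, network):
    network.append("fc_layer")          # e.g. maps to an engine FC layer

@register("relu")
def convert_relu(op, network):
    network.append("activation_layer")

def convert_block(block_desc):
    """Construct an x network (here, a plain list) from a block desc."""
    network = []
    for op in block_desc:
        CONVERTERS[op["type"]](op, network)
    return network

net = convert_block([{"type": "mul"}, {"type": "relu"}])
# net == ["fc_layer", "activation_layer"]
```

A registry keeps each converter small and makes unsupported op types easy to detect (a missing key) during inference preparation.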