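Problem description and steps to reproduce:
In a simple custom network, a tensor of shape (B, C, H, W) needs to be reduced to (B, C, W).
This is implemented with a ReduceMax op followed by a Reshape op. The ReduceMax op turns out to run on the CPU and is very slow (about 140 ms).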
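For reference, the intended reduction can be sketched in NumPy as below. This is only an illustration of the math, not the exported model: the function name `reduce_height` and the small W used in the demo are assumptions made for the example.

```python
import numpy as np


def reduce_height(x: np.ndarray) -> np.ndarray:
    """Collapse (B, C, H, W) to (B, C, W) by taking the max over H."""
    # ReduceMax over axis 2 with keepdims=True yields (B, C, 1, W).
    reduced = x.max(axis=2, keepdims=True)
    # Reshape drops the singleton H axis.
    b, c, _, w = reduced.shape
    return reduced.reshape(b, c, w)


# Demo with INT8 data; W is shrunk from 10000 to 100 to keep it light.
x = np.random.randint(-128, 128, size=(1, 64, 32, 100), dtype=np.int8)
y = reduce_height(x)
print(y.shape)  # (1, 64, 100)
```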
Measured results on an rk3588 development board:
D RKNN: [11:25:59.947] ID OpType DataType Target InputShape OutputShape DDR Cycles NPU Cycles Total Cycles Time(us) MacUsage(%) RW(KB) FullName
D RKNN: [11:25:59.947] 0 InputOperator INT8 CPU \ (1,10,32,10000) 0 0 0 4 \ 5000.00 InputOperator:voxels_input
D RKNN: [11:25:59.947] 1 ConvRelu INT8 NPU (1,10,32,10000),(64,10,1,1),(64) (1,64,32,10000) 811751 200000 811751 3871 6.89 25001.50 Conv:Conv_0
D RKNN: [11:25:59.947] 2 ReduceMax INT8 CPU (1,64,32,10000) (1,64,1,10000) 0 0 0 139036 \ 20625.00 ReduceMax:ReduceMax_2
D RKNN: [11:25:59.947] 3 Reshape INT8 CPU (1,64,1,10000),(4) (1,64,10000,1) 0 0 0 1048 \ 1250.03 Reshape:Squeeze_3_2reshape
D RKNN: [11:25:59.947] 4 OutputOperator INT8 CPU (1,64,10000,1) \ 0 0 0 40 \ 625.00 OutputOperator:pillar_features
D RKNN: [11:25:59.947] Total Operator Elapsed Time(us): 143999
---
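Replacing the ReduceMax op with MaxPool gives the same behavior: it also runs on the CPU and is very slow (about 130 ms). Measured results on the rk3588 development board: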
D RKNN: [13:11:54.589] ID OpType DataType Target InputShape OutputShape DDR Cycles NPU Cycles Total Cycles Time(us) MacUsage(%) RW(KB) FullName
D RKNN: [13:11:54.589] 0 InputOperator INT8 CPU \ (1,10,32,10000) 0 0 0 4 \ 5000.00 InputOperator:voxels_input
D RKNN: [13:11:54.589] 1 ConvRelu INT8 NPU (1,10,32,10000),(64,10,1,1),(64) (1,64,32,10000) 811751 200000 811751 3873 6.89 25001.50 Conv:Conv_0
D RKNN: [13:11:54.589] 2 MaxPool INT8 CPU (1,64,32,10000) (1,64,1,10000) 0 0 0 130099 \ 20625.00 MaxPool:MaxPool_2
D RKNN: [13:11:54.589] 3 Reshape INT8 CPU (1,64,1,10000),(4) (1,64,10000,1) 0 0 0 779 \ 1250.03 Reshape:Squeeze_3_2reshape
D RKNN: [13:11:54.589] 4 OutputOperator INT8 CPU (1,64,10000,1) \ 0 0 0 28 \ 625.00 OutputOperator:pillar_features
D RKNN: [13:11:54.589] Total Operator Elapsed Time(us): 134783
---
Can this be optimized so that the ReduceMax op runs on the NPU and executes faster? Also, why does the MaxPool op run on the CPU rather than the NPU?
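One variant that might be worth testing is splitting the single (32, 1) pooling window into a chain of small pooling steps, since the result is mathematically identical. Whether the RKNN compiler actually places small-kernel MaxPool ops on the NPU is an assumption that is not verified here and should be checked against the operator support list for the toolkit version in use. The NumPy sketch below only demonstrates the equivalence; `maxpool_h` and `cascaded_reduce` are hypothetical helper names.

```python
import numpy as np


def maxpool_h(x: np.ndarray, k: int) -> np.ndarray:
    """Max pool along H with kernel (k, 1) and stride (k, 1)."""
    b, c, h, w = x.shape
    if h % k != 0:
        raise ValueError(f"H={h} is not divisible by kernel size {k}")
    # Group H into h // k windows of size k and take the max of each window.
    return x.reshape(b, c, h // k, k, w).max(axis=3)


def cascaded_reduce(x: np.ndarray, kernels=(2, 2, 2, 2, 2)) -> np.ndarray:
    """Apply small pools in sequence; the kernel sizes must multiply to H."""
    for k in kernels:
        x = maxpool_h(x, k)
    if x.shape[2] != 1:
        raise ValueError("kernel sizes do not reduce H to 1")
    return x  # shape (B, C, 1, W)


x = np.random.randint(-128, 128, size=(1, 4, 32, 50), dtype=np.int8)
same = np.array_equal(cascaded_reduce(x), x.max(axis=2, keepdims=True))
print(same)  # True
```

With H = 32, five (2, 1) pools or the sequence (4, 4, 2) both reach H = 1. In the exported model each step would be an ordinary MaxPool with a matching kernel and stride.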