[KV260 Vision AI Starter Kit Trial] Hardware Acceleration: Accelerating Matrix Multiplication with the PL (Vitis HLS)

4. Hardware Acceleration: Accelerating Matrix Multiplication with the PL (Vitis HLS)

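Review plan, first four installments:

1. Unboxing report; sharing the PC's network connection with the KV260 over Ethernet

2. A powerful Zynq companion: configuring PYNQ and debugging FPGA logic with XVC (Xilinx Virtual Cable)

3. Hardware acceleration: accelerating FFT with the PL (Vivado)

4. Hardware acceleration: accelerating matrix multiplication with the PL (Vitis HLS)

Review plan, last four installments:

5. Vitis AI: setting up the development environment and checking a model with the inspector

6. Vitis AI: model calibration and quantization

7. Vitis AI: training a custom model through transfer learning

8. Vitis AI: compiling a custom model and deploying it to the KV260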

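Matrix multiplication is a basic mathematical operation that can represent and apply transformations such as rotation, scaling, projection, and affine transforms. It is used widely in computer science, for example in image processing, machine learning, data mining, cryptography, and information retrieval, so fast matrix multiplication matters a great deal for scientific computing.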

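Accelerating matrix multiplication in hardware on the KV260 brings the following benefits:

The PL (FPGA) in the KV260 is programmable logic that can implement highly customized, pipelined datapaths, so it can exploit the parallelism inherent in matrix multiplication.
Unlike a CPU, which executes instructions largely in sequence, an FPGA can split a large matrix multiplication into many fine-grained operations and run them in parallel pipelines.
Custom logic on the FPGA can keep its operands in on-chip memory instead of going to external memory for every access, which lowers latency.

Overall, an FPGA-based matrix multiplication accelerator can be considerably more efficient than a general-purpose CPU, by an order of magnitude or more depending on the design and data type, which makes it well suited to large-scale scientific computing and deep learning workloads.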

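Goal of this post

Use Xilinx Vitis HLS to design an efficient matrix multiplication kernel. The kernel reads matrices A and B from DDR over an AXI4 (full) master interface and writes the result back to matrix C. A and B are generated on the ARM core, which passes their memory addresses to the HLS kernel through an AXI4-Lite interface.

For comparison, I also run the same matrix multiplication on the ARM core with numpy.dot(), without hardware acceleration, and compare the two.

The system block diagram is shown below:

Vitis HLS project:

The HLS project is the focus of this installment:

The core matrix multiplication code is as follows: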

loop_count:
    for (int i = 0; i < rep_count; i++) {
    arraypart1:
        for (int row = 0; row < size; row++) {
        arraypart2:
            for (int col = 0; col < size; col++) {
            arraypart3:
                for (int j = 0; j < MAX_SIZE; j++) {
                    int result = (col == 0) ? 0 : temp_sum[j];
                    result += A[row][col] * B[col][j];
                    temp_sum[j] = result;
                    if (col == size - 1) C[row][j] = result;
                }
            }
        }
    }

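The variables in this loop nest are:

rep_count: the number of times the multiplication is repeated, so that the run is long enough to time reliably

row: the row index of the element of C being computed (and of the row of A being read); the matrices are size × size

col: the index along the shared dimension, selecting the column of A and the row of B; each element of C accumulates size products, so for a 3×3 matrix C(0,0) takes three multiply-accumulate steps

j: the column index of B and C; this innermost loop has the fixed bound MAX_SIZE and updates a whole row of partial sums in temp_sum on each step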
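
The loop nest can be checked against a plain software model. The sketch below is a Python transcription of arraypart1 to arraypart3, compared with NumPy's matrix product. The function name matmul_model, the test size of 4, and the zero padding up to MAX_SIZE are assumptions made for illustration; rep_count is left out because it only repeats the same computation.

```python
import numpy as np

MAX_SIZE = 50  # same compile-time bound as the HLS kernel


def matmul_model(A, B, size):
    """Software model of the arraypart1/arraypart2/arraypart3 loop nest."""
    C = np.zeros((MAX_SIZE, MAX_SIZE), dtype=np.int64)
    temp_sum = np.zeros(MAX_SIZE, dtype=np.int64)
    for row in range(size):              # arraypart1: row of A and C
        for col in range(size):          # arraypart2: shared dimension
            for j in range(MAX_SIZE):    # arraypart3: column of B and C
                result = 0 if col == 0 else temp_sum[j]
                result += A[row, col] * B[col, j]
                temp_sum[j] = result
                if col == size - 1:      # last product: the sum is complete
                    C[row, j] = result
    return C[:size, :size]


# Small test case, zero-padded to MAX_SIZE x MAX_SIZE (illustrative only)
rng = np.random.default_rng(0)
size = 4
A = np.zeros((MAX_SIZE, MAX_SIZE), dtype=np.int64)
B = np.zeros((MAX_SIZE, MAX_SIZE), dtype=np.int64)
A[:size, :size] = rng.integers(0, 256, size=(size, size))
B[:size, :size] = rng.integers(0, 256, size=(size, size))

C = matmul_model(A, B, size)
matches = np.array_equal(C, A[:size, :size] @ B[:size, :size])
print(matches)
```

In this model each pass of the j loop adds one product to every entry of a row of C, which is the work the hardware can perform in parallel across j.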

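To make the loop easier to follow, I drew the following diagram:

One question is worth raising here: could we compute each element directly?

For example: C(0,0) = A(0,0)*B(0,0) + A(0,1)*B(1,0) + A(0,2)*B(2,0);

Answer: the compiler will not complain if each element is computed directly, but doing so is likely to hurt performance and resource utilization. Computing one element of C in a single step needs a whole row of A and a whole column of B at the same time, which increases the number of memory accesses per step and the latency. If instead a loop accumulates partial results, with B and C completely partitioned along their second dimension, each step reads only one element of A and one row of B, and the partitioning lets all columns be processed in parallel. This improves throughput and efficiency.

In short, the direct form is not recommended, because it causes the following problems:

The code has to be changed whenever the matrix dimension changes;
Adders and multipliers are wasted;
The critical path gets longer, so latency is higher.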
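
The memory-access argument can be made concrete by counting how many reads hit each memory bank in one step. The sketch below assumes that partitioning B along its second dimension places each column of B in its own bank, so element B[r][c] lives in bank c; the helper bank_load and the 3×3 size are made up for illustration, and real banks may offer more than one port.

```python
from collections import Counter

SIZE = 3  # small example dimension


def bank_load(accesses):
    """Count reads per bank, assuming element B[r][c] is stored in bank c."""
    return Counter(c for _r, c in accesses)


# Direct form: C(row, 0) in one step needs B[0][0], B[1][0], B[2][0],
# which all sit in bank 0.
direct = bank_load([(k, 0) for k in range(SIZE)])

# Accumulating form: one step reads B[col][0], B[col][1], B[col][2],
# one element from each bank (col = 0 here).
accumulate = bank_load([(0, j) for j in range(SIZE)])

print(max(direct.values()))      # reads landing on the busiest bank
print(max(accumulate.values()))
```

Under this assumption the direct form puts SIZE reads on a single bank in the same step, while the accumulating form puts at most one read on any bank.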

The complete Vitis HLS code is as follows:

#include <stdio.h>
#include <string.h>

#define MAX_SIZE 50

const unsigned int c_dim = MAX_SIZE;

extern "C" {
void matmul_partition(int* in1, int* in2, int* out_r, int size, int rep_count) {
#pragma HLS interface m_axi     port = in1   bundle = gmem0 offset = slave
#pragma HLS interface s_axilite port = in1   bundle = control
#pragma HLS interface m_axi     port = in2   bundle = gmem0 offset = slave
#pragma HLS interface s_axilite port = in2   bundle = control
#pragma HLS interface m_axi     port = out_r bundle = gmem0 offset = slave
#pragma HLS interface s_axilite port = out_r bundle = control
#pragma HLS interface s_axilite port = size 	 bundle = control
#pragma HLS interface s_axilite port = rep_count bundle = control
#pragma HLS interface s_axilite port = return    bundle = control

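    // Local on-chip copies of the matrices
    int A[MAX_SIZE][MAX_SIZE];
    int B[MAX_SIZE][MAX_SIZE];
    int C[MAX_SIZE][MAX_SIZE];
    int temp_sum[MAX_SIZE];  // partial sums for one row of C

// Split B and C by column and temp_sum by element so that one step of
// arraypart3 can access every j in parallel
#pragma HLS ARRAY_PARTITION variable = B dim = 2 complete
#pragma HLS ARRAY_PARTITION variable = C dim = 2 complete
#pragma HLS ARRAY_PARTITION variable = temp_sum dim = 1 complete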

read_A:
    for (int itr = 0, i = 0, j = 0; itr < size * size; itr++, j++) {
#pragma HLS LOOP_TRIPCOUNT min = c_dim* c_dim max = c_dim * c_dim
        if (j == size) {
            j = 0;
            i++;
        }
        A[i][j] = in1[itr];
    }

read_B:
    for (int itr = 0, i = 0, j = 0; itr < size * size; itr++, j++) {
#pragma HLS LOOP_TRIPCOUNT min = c_dim* c_dim max = c_dim * c_dim
        if (j == size) {
            j = 0;
            i++;
        }
        B[i][j] = in2[itr];
    }

loop_count:
    for (int i = 0; i < rep_count; i++) {
    arraypart1:
        for (int row = 0; row < size; row++) {
#pragma HLS LOOP_TRIPCOUNT min = c_dim max = c_dim
        arraypart2:
            for (int col = 0; col < size; col++) {
#pragma HLS LOOP_TRIPCOUNT min = c_dim max = c_dim
            arraypart3:
                for (int j = 0; j < MAX_SIZE; j++) {
#pragma HLS LOOP_TRIPCOUNT min = c_dim max = c_dim
                    int result = (col == 0) ? 0 : temp_sum[j];
                    result += A[row][col] * B[col][j];
                    temp_sum[j] = result;
                    if (col == size - 1) C[row][j] = result;
                }
            }
        }
    }

writeC:
    for (int itr = 0, i = 0, j = 0; itr < size * size; itr++, j++) {
#pragma HLS LOOP_TRIPCOUNT min = c_dim* c_dim max = c_dim * c_dim
        if (j == size) {
            j = 0;
            i++;
        }
        out_r[itr] = C[i][j];
    }
}
}
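
The read_A, read_B and writeC loops walk a flat buffer while tracking the row and column with two counters, rather than computing itr / size and itr % size, presumably to avoid division hardware. A minimal Python model of that index walk (walk_indices is a name made up for illustration) shows that it visits the same (i, j) pairs as a row-major divmod:

```python
def walk_indices(size):
    """Mimic the (itr, i, j) counters used by read_A, read_B and writeC."""
    i = j = 0
    visited = []
    for itr in range(size * size):
        if j == size:      # end of a row: wrap the column, advance the row
            j = 0
            i += 1
        visited.append((i, j))
        j += 1             # the C loop increments j together with itr
    return visited


pairs = walk_indices(3)
print(pairs[:4])
```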

The key performance figures are shown below; every loop achieves II = 1.

+ Performance & Resource Estimates: 
    
    PS: '+' for module; 'o' for loop; '*' for dataflow
    +---------------------------------------------------------------+------+------+---------+-----------+----------+---------+------+----------+-----------+-----------+------------+-------------+-----+
    |                            Modules                            | Issue|      | Latency |  Latency  | Iteration|         | Trip |          |           |           |            |             |     |
    |                            & Loops                            | Type | Slack| (cycles)|    (ns)   |  Latency | Interval| Count| Pipelined|   BRAM    |    DSP    |     FF     |     LUT     | URAM|
    +---------------------------------------------------------------+------+------+---------+-----------+----------+---------+------+----------+-----------+-----------+------------+-------------+-----+
    |+ matmul_partition                                             |     -|  0.00|        -|          -|         -|        -|     -|        no|  108 (37%)|  175 (14%)|   4337 (1%)|  12522 (10%)|    -|
    | + matmul_partition_Pipeline_loop_count_arraypart1_arraypart2  |     -|  0.39|        -|          -|         -|        -|     -|        no|          -|  151 (12%)|  2096 (~0%)|    5277 (4%)|    -|
    |  o loop_count_arraypart1_arraypart2                           |     -|  7.30|        -|          -|         5|        1|     -|       yes|          -|          -|           -|            -|    -|
    | + matmul_partition_Pipeline_read_A                            |     -|  0.00|     2504|  2.504e+04|         -|     2504|     -|        no|          -|    1 (~0%)|   161 (~0%)|    288 (~0%)|    -|
    |  o read_A                                                     |     -|  7.30|     2502|  2.502e+04|         4|        1|  2500|       yes|          -|          -|           -|            -|    -|
    | + matmul_partition_Pipeline_read_B                            |     -|  0.00|     2503|  2.503e+04|         -|     2503|     -|        no|          -|          -|   148 (~0%)|    288 (~0%)|    -|
    |  o read_B                                                     |     -|  7.30|     2501|  2.501e+04|         3|        1|  2500|       yes|          -|          -|           -|            -|    -|
    | + matmul_partition_Pipeline_writeC                            |     -|  0.00|     2503|  2.503e+04|         -|     2503|     -|        no|          -|          -|   166 (~0%)|    511 (~0%)|    -|
    |  o writeC                                                     |     -|  7.30|     2501|  2.501e+04|         3|        1|  2500|       yes|          -|          -|           -|            -|    -|
    +---------------------------------------------------------------+------+------+---------+-----------+----------+---------+------+----------+-----------+-----------+------------+-------------+-----+
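
As a rough cross-check, the cycle counts in the report can be turned into an expected run time. The sketch below rests on several assumptions: a 10 ns clock (the report lists 2504 cycles as 2.504e+04 ns), a Vivado design that actually clocks the kernel at that rate (the post does not state this), the four phases running back to back, and the N = 50, rep_count = 5000 values used in the PYNQ test.

```python
CLOCK_NS = 10.0     # assumed clock period, inferred from the report
N = 50              # matrix dimension used in the test
REP_COUNT = 5000    # repetitions used in the test

# Latencies of the load/store phases, taken from the report (cycles)
READ_A_CYCLES = 2504
READ_B_CYCLES = 2503
WRITE_C_CYCLES = 2503

# Flattened compute loop: II = 1, iteration latency 5, so roughly
# one cycle per (rep, row, col) iteration plus the pipeline fill.
COMPUTE_CYCLES = REP_COUNT * N * N + 5

total_cycles = READ_A_CYCLES + READ_B_CYCLES + COMPUTE_CYCLES + WRITE_C_CYCLES
estimated_seconds = total_cycles * CLOCK_NS * 1e-9

print(total_cycles)
print(f"{estimated_seconds:.3f} s")
```

Under these assumptions the kernel itself needs about 0.125 s for 5000 multiplications, and almost all of it is spent in the compute loop rather than in the DDR transfers.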

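Run C synthesis: Run C Synthesis → Export RTL

Vivado project

With the Vitis HLS project finished, the next step is to use the IP in Vivado.

First, add the HLS module exported in the previous step to Vivado's IP catalog as a custom IP, as shown below:

Then create the Block Design as shown in the figure below, and run the build through to bitstream generation.

As in the previous installment, two files are needed once the bitstream has been generated:

mul.bit

mul.hwh

See the previous installment for how to obtain them.

Calling the kernel from PYNQ and measuring performance

In PYNQ, create a new notebook, matmul.ipynb, and copy over the files generated in the previous step:

First import the required packages: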

import numpy as np
import cProfile
from pynq import Overlay, allocate
from pynq.lib.debugbridge import DebugBridge

Then load the custom overlay:

ovmul = Overlay('./mul.bit')
ovmul.ip_dict

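This lists all the IPs attached to the AXI bus; there are three, matching the Vivado project.

First define the matrix multiplication on the PS, using NumPy's dot() function.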

N = 50
rep_cont = 5000

np.random.seed(0)
A = np.random.randint(0, 256, size=(N,N), dtype=np.uint32)
B = np.random.randint(0, 256, size=(N,N), dtype=np.uint32)

def matmul_ps ():
    for i in range(rep_cont):
        C = np.dot(A, B)
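
The host code uses np.uint32 while the HLS kernel uses signed int, so it is worth checking that the results cannot overflow either type. A minimal sketch of that bound, using only the N = 50 and randint(0, 256) values above (the constant names are illustrative):

```python
N = 50
MAX_VALUE = 255            # randint(0, 256) excludes the upper bound
INT32_MAX = 2**31 - 1      # largest value of the kernel's signed int

# Each element of C is a sum of N products of two values <= MAX_VALUE
worst_case = N * MAX_VALUE * MAX_VALUE

print(worst_case)
print(worst_case <= INT32_MAX)
```

With these inputs every element of C stays far below the signed 32-bit limit, so the unsigned host data and the signed kernel arithmetic give the same values.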

Then call the HLS kernel in the PL to run the accelerated computation:

in1_1 = allocate(shape=(N, N), dtype=np.uint32)
in1_2 = allocate(shape=(N, N), dtype=np.uint32)
out_r = allocate(shape=(N, N), dtype=np.uint32)
np.copyto(in1_1, A)
np.copyto(in1_2, B)

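def matmul_pl ():
    matmul = ovmul.matmul_partition_0
    # Each 64-bit pointer argument occupies two 32-bit registers. The names
    # below assume the <arg>_1 (low word) / <arg>_2 (high word) naming;
    # check matmul.register_map for the names in your build.
    matmul.register_map.in1_1 = in1_1.device_address & 0xFFFFFFFF    # A, low 32 bits
    matmul.register_map.in1_2 = in1_1.device_address >> 32           # A, high 32 bits
    matmul.register_map.in2_1 = in1_2.device_address & 0xFFFFFFFF    # B, low 32 bits
    matmul.register_map.in2_2 = in1_2.device_address >> 32           # B, high 32 bits
    matmul.register_map.out_r_1 = out_r.device_address & 0xFFFFFFFF  # C, low 32 bits
    matmul.register_map.out_r_2 = out_r.device_address >> 32         # C, high 32 bits
    matmul.register_map.size = N
    matmul.register_map.rep_count = rep_cont
    matmul.register_map.CTRL.AP_START = 1
    # Busy-wait until the kernel reports completion
    while matmul.register_map.CTRL.AP_DONE == 0:
        pass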
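
Each buffer address is written to the kernel as two 32-bit words. A minimal sketch of that split in plain Python, assuming the address is an ordinary Python integer that may exceed 32 bits (split_address is a hypothetical helper, not part of PYNQ, and the sample address is invented):

```python
def split_address(addr):
    """Split a 64-bit address into (low, high) 32-bit words."""
    low = addr & 0xFFFFFFFF            # bits 0..31
    high = (addr >> 32) & 0xFFFFFFFF   # bits 32..63
    return low, high


# Invented example address above the 4 GiB boundary
low, high = split_address(0x8_7654_3210)
print(hex(low), hex(high))
```

The shift has to be applied to the full-width value; narrowing the address to 32 bits first and shifting afterwards can no longer recover the high word.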

Everything is now ready, so let's start the test:

cProfile.run ('matmul_ps ()')
---
5004 function calls in 2.835 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.830    2.830    2.835    2.835 2661157745.py:6(matmul_ps)
        1    0.000    0.000    2.835    2.835 <string>:1(<module>)
     5000    0.005    0.000    0.005    0.000 multiarray.py:741(dot)
        1    0.000    0.000    2.835    2.835 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

For 5000 matrix multiplications, the PS takes 2.835 seconds.

cProfile.run ('matmul_pl ()')
---
43912 function calls in 0.189 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.035    0.035    0.189    0.189 1250270229.py:7(matmul_pl)
        1    0.000    0.000    0.189    0.189 <string>:1(<module>)
        1    0.001    0.001    0.001    0.001 overlay.py:357(__getattr__)
        1    0.000    0.000    0.000    0.000 overlay.py:464(is_loaded)
     4392    0.008    0.000    0.012    0.000 overlay.py:765(register_map)
     4383    0.054    0.000    0.136    0.000 registers.py:135(__getitem__)
        7    0.000    0.000    0.000    0.000 registers.py:165(__setitem__)
        1    0.000    0.000    0.000    0.000 registers.py:202(_reordered_setitem)
     4390    0.005    0.000    0.005    0.000 registers.py:219(_debug)
     4390    0.038    0.000    0.044    0.000 registers.py:28(_calc_index)
        6    0.000    0.000    0.000    0.000 registers.py:378(_set_value)
     4384    0.005    0.000    0.005    0.000 registers.py:381(_get_value)
        1    0.000    0.000    0.189    0.189 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
     4392    0.003    0.000    0.003    0.000 {built-in method builtins.hasattr}
     4390    0.004    0.000    0.004    0.000 {built-in method builtins.hex}
     8780    0.006    0.000    0.006    0.000 {built-in method builtins.isinstance}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     4390    0.030    0.000    0.030    0.000 {method 'format' of 'str' objects}
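
Taking the two cProfile totals at face value, the speedup is a short calculation. The sketch below uses only the 2.835 s and 0.189 s figures reported above; note that the PL figure is the wall time of matmul_pl(), which includes the Python loop polling AP_DONE.

```python
PS_SECONDS = 2.835   # cProfile total for matmul_ps()
PL_SECONDS = 0.189   # cProfile total for matmul_pl()
REPS = 5000          # multiplications per run

speedup = PS_SECONDS / PL_SECONDS
ps_us_per_matmul = PS_SECONDS / REPS * 1e6
pl_us_per_matmul = PL_SECONDS / REPS * 1e6

print(f"speedup: {speedup:.1f}x")
print(f"PS: {ps_us_per_matmul:.1f} us per multiplication")
print(f"PL: {pl_us_per_matmul:.1f} us per multiplication")
```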