TensorRT Installation Notes (8.2.5)

太阳花的小绿豆  2022-10-23

Official site: https://developer.nvidia.com/tensorrt

0 Introduction to TensorRT

NVIDIA® TensorRT™ is an SDK for optimizing trained deep learning models to enable high-performance inference. TensorRT contains a deep learning inference optimizer for trained deep learning models, and a runtime for execution. After you have trained your deep learning model in a framework of your choice, TensorRT enables you to run it with higher throughput and lower latency.

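According to NVIDIA's official description, TensorRT is an SDK for already-trained models that enables high-performance inference on NVIDIA devices. So which optimizations does TensorRT actually apply to a trained model? The official TensorRT site summarizes them in a figure, which boils down to the following six points:

Reduced Precision: quantize the model to INT8 or FP16 (while keeping accuracy unchanged or only slightly reduced) to speed up inference.
Layer and Tensor Fusion: fuse multiple layers (both horizontally and vertically) to optimize GPU memory and bandwidth usage.
Kernel Auto-Tuning: select the best data layers and algorithms for the GPU platform in use.
Dynamic Tensor Memory: minimize the memory footprint and reuse tensor memory efficiently.
Multi-Stream Execution: process multiple input streams in parallel using a scalable design.
Time Fusion: optimize recurrent neural networks over time steps with dynamically generated kernels.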
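
To make the Reduced Precision point more concrete, here is a minimal sketch of symmetric INT8 quantization in plain Python. It is only an illustration under simple assumptions (a single scale per tensor, derived from the largest absolute value). It is not TensorRT's actual calibration algorithm, and the names quantize_int8 and dequantize are made up for this example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.02, -1.27, 0.63, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("int8 values:", q)
print("scale:", scale, "max round-trip error:", max_error)
```

The round-trip error of each value is bounded by about half the scale, which is why accuracy usually drops only slightly when the value range of a tensor is well covered.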

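1 Installing TensorRT

I recommend installing TensorRT by following the official guide directly. The latest official TensorRT quick start guide:
https://docs.nvidia.com/deeplearning/tensorrt/quick-start-guide/index.html
Or the quick start guide for a specific TensorRT version (using 8.2.5, the latest stable release at the time of writing, as the example):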
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-825/quick-start-guide/index.html

NVIDIA lists the following three installation methods, but I personally still prefer the TAR Package installation (for the other methods see https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html):

Container Installation
Debian Installation
pip Wheel File Installation

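1.1 pip installation (trtexec is not available)

If you are comfortable with Docker, Container Installation is the recommended route; this article starts with pip Wheel File Installation. The pip Wheel File Installation section of the official quick start guide (8.2.5) states explicitly that only Python 3.6 to 3.9 and CUDA 11.x are supported, and only on Linux with the x86_64 CPU architecture. The wheels are expected to work on CentOS 7 or newer and Ubuntu 18.04 or newer.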

The pip-installable nvidia-tensorrt Python wheel files only support Python versions 3.6 to 3.9 and CUDA 11.x at this time and will not work with other Python or CUDA versions. Only the Linux operating system and x86_64 CPU architecture is currently supported. These wheel files are expected to work on CentOS 7 or newer and Ubuntu 18.04 or newer.
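
The constraints quoted above can be checked before running pip. The following is a small sketch that uses only the standard library; the function name pip_wheel_supported is hypothetical, and the check covers the Python version, operating system and CPU architecture only (it does not look at the CUDA version).

```python
import platform
import sys


def pip_wheel_supported(py_version, system, machine):
    """Return True if the environment matches the quoted wheel constraints."""
    # Python 3.6 to 3.9 inclusive, compared on (major, minor).
    python_ok = (3, 6) <= tuple(py_version[:2]) <= (3, 9)
    return python_ok and system == "Linux" and machine == "x86_64"


if __name__ == "__main__":
    ok = pip_wheel_supported(sys.version_info, platform.system(), platform.machine())
    print("nvidia-tensorrt 8.2.5 wheel constraints satisfied:", ok)
```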

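Beyond the requirements above, also pay attention to the GPU driver version, because each CUDA version requires a minimum driver version. The TensorRT (8.2.5) installed here requires CUDA 11.x, so check that your GPU driver is new enough; the nvidia-smi command shows the installed driver version. The mapping between CUDA versions and GPU driver versions is listed on the NVIDIA site: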
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
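
As a rough sketch, the driver check can be scripted. The helper names below are hypothetical, and the minimum version 450.80.02 is only an assumed example for CUDA 11.x on Linux; take the real value from the release notes table linked above. The nvidia-smi query is wrapped in a function and is only run when the script is executed directly.

```python
import subprocess


def parse_version(text):
    """Turn a dotted version string such as '470.57.02' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_satisfies(installed, minimum):
    """Compare two dotted driver versions numerically, component by component."""
    return parse_version(installed) >= parse_version(minimum)


def query_driver_version():
    """Ask nvidia-smi for the driver version of the first GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()[0].strip()


if __name__ == "__main__":
    ASSUMED_MINIMUM = "450.80.02"  # assumption: verify against the NVIDIA table
    installed = query_driver_version()
    print(installed, "ok" if driver_satisfies(installed, ASSUMED_MINIMUM) else "too old")
```

Comparing tuples of integers avoids the pitfalls of comparing the version strings directly, where "9.0" would sort after "450.80.02".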

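Once the GPU driver meets the requirement, I recommend first creating a fresh conda virtual environment so that other environments are not affected. Here I create one named tensorrt with Python 3.8.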

conda create -n tensorrt python=3.8

After creating the virtual environment, activate it:

conda activate tensorrt

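Next install nvidia-pyindex and nvidia-tensorrt. Note that if no version is given for nvidia-tensorrt, the latest version is installed by default. This article uses 8.2.5, so the currently available 8.2.5.1 is installed here: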

pip install nvidia-pyindex
pip install nvidia-tensorrt==8.2.5.1

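After installation, verify it following the official steps: start Python and print the version and related information. If nothing raises an error, the installation succeeded.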

import tensorrt
print(tensorrt.__version__)
assert tensorrt.Builder(tensorrt.Logger())

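However, when I later tried to convert a model with trtexec as the official tutorial does, the tool could not be found. I suspect the pip installation only provides the TensorRT runtime and does not ship the trtexec tool.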
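
A quick way to confirm whether trtexec is reachable from the current environment is to look it up on PATH. This is a minimal sketch using the standard library; the function name find_tool is made up for this example.

```python
import shutil


def find_tool(name="trtexec"):
    """Return the full path of an executable found on PATH, or None if missing."""
    return shutil.which(name)


if __name__ == "__main__":
    location = find_tool()
    if location is None:
        print("trtexec not found on PATH")
    else:
        print("trtexec found at", location)
```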

1.2 TAR Package installation

The installation mainly follows the official procedure: https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html#installing-tar
Before installing, prepare the following environment:

CUDA 10.2, 11.0 update 1, 11.1 update 1, 11.2 update 2, 11.3 update 1, 11.4 update 3, 11.5 update 1 or 11.6
cuDNN 8.3.2
Python 3 (Optional)

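Go to the official TensorRT download page (login required).

Download the matching package. Here I downloaded TensorRT 8.2 GA Update 4 for Linux x86_64 and CUDA 11.0, 11.1, 11.2, 11.3, 11.4 and 11.5 TAR Package.

After the download finishes, extract the archive: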

tar -xzvf TensorRT-8.2.5.1.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz

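Extraction creates a TensorRT-8.2.5.1 folder. Add the path of TensorRT-8.2.5.1/lib to the LD_LIBRARY_PATH environment variable. Note that I placed the TensorRT-8.2.5.1 folder under /root, so I set /root/TensorRT-8.2.5.1/lib; adjust this to wherever you extracted the archive. Likewise, add the path of TensorRT-8.2.5.1/bin to the PATH environment variable; it contains the trtexec tool used later: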

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/TensorRT-8.2.5.1/lib
export PATH=$PATH:/root/TensorRT-8.2.5.1/bin
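
If the exports do not seem to take effect (for example in a new shell), the following sketch checks whether a directory is actually listed in a search-path variable. The function name dir_on_search_path is hypothetical, and /root/TensorRT-8.2.5.1 is simply the location used in this article.

```python
import os


def dir_on_search_path(directory, env_value):
    """Return True if directory is one of the entries in a PATH-style string."""
    wanted = os.path.normpath(directory)
    entries = [entry for entry in env_value.split(os.pathsep) if entry]
    return any(os.path.normpath(entry) == wanted for entry in entries)


if __name__ == "__main__":
    lib_dir = "/root/TensorRT-8.2.5.1/lib"  # adjust to your extraction path
    bin_dir = "/root/TensorRT-8.2.5.1/bin"
    print("lib on LD_LIBRARY_PATH:",
          dir_on_search_path(lib_dir, os.environ.get("LD_LIBRARY_PATH", "")))
    print("bin on PATH:",
          dir_on_search_path(bin_dir, os.environ.get("PATH", "")))
```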

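Next, go into the TensorRT-8.2.5.1/python folder and install the TensorRT wheel file. The folder contains whl files for different Python versions; my virtual environment uses Python 3.8, so I install the cp38 whl file: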

cd TensorRT-8.2.5.1/python
pip install tensorrt-8.2.5.1-cp38-none-linux_x86_64.whl
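
The wheel that matches the running interpreter can be worked out from sys.version_info. The sketch below assumes that the other wheels in the folder follow the same naming pattern as the cp38 file shown above, which this article does not confirm; the function name tensorrt_wheel_name is made up for this example.

```python
import sys


def tensorrt_wheel_name(trt_version="8.2.5.1", py_version=None):
    """Build the expected wheel file name for a given Python (major, minor)."""
    major, minor = (py_version or sys.version_info)[:2]
    # Assumed pattern, taken from the cp38 file name used in this article.
    return f"tensorrt-{trt_version}-cp{major}{minor}-none-linux_x86_64.whl"


if __name__ == "__main__":
    print("pip install", tensorrt_wheel_name())
```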

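Then go into the TensorRT-8.2.5.1/graphsurgeon folder (the path is relative to the directory where the archive was extracted) and install the graphsurgeon wheel file: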

cd TensorRT-8.2.5.1/graphsurgeon
pip install graphsurgeon-0.4.5-py2.py3-none-any.whl

Then go into the TensorRT-8.2.5.1/onnx-graphsurgeon folder (again relative to the extraction directory) and install the onnx-graphsurgeon wheel file:

cd TensorRT-8.2.5.1/onnx-graphsurgeon
pip install onnx_graphsurgeon-0.3.12-py2.py3-none-any.whl

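After installation, you can again start Python and print the version and related information. If nothing raises an error, the installation succeeded.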

import tensorrt
print(tensorrt.__version__)
assert tensorrt.Builder(tensorrt.Logger())

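2 Workflow for converting a model to TensorRT

According to the official documentation, the TensorRT conversion workflow consists of the following 5 steps:

Export the Model
Select A Batch Size: choose a batch size that fits your actual project
Select A Precision: choose a precision type, such as INT8, FLOAT16 or FLOAT32
Convert The Model
Deploy The Model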

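Which model formats can be exported and converted to a TensorRT model? The official documentation mentions three options:

using TF-TRT: use TF-TRT (TensorFlow-TensorRT)
automatic ONNX conversion from .onnx files: convert from the general ONNX format (note that you must first convert your model to ONNX yourself)
manually constructing a network using the TensorRT API (either in C++ or Python): build the model yourself with the TensorRT API (not very beginner-friendly and fairly difficult)

The workflow figure in the official documentation shows the same thing. For a PyTorch model, for example, we generally convert to the general ONNX format first and then to a TensorRT model, and at deployment time we can choose either C++ or Python.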

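3 Example: converting a PyTorch model to TensorRT

As described above, the usual route for converting a PyTorch model to TensorRT is to convert it to the general ONNX format first and then to TensorRT.

3.1 Converting the PyTorch model to ONNX

This example uses the ResNet34 provided by PyTorch: instantiate ResNet34 directly from torchvision, load the weights I trained on the flower_photos dataset, and then convert the model to ONNX. The example code is as follows: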

import torch
import torch.onnx
import onnx
import onnxruntime
import numpy as np
from torchvision.models import resnet34

device = torch.device("cpu")


def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()


def main():
    weights_path = "resNet34(flower).pth"
    onnx_file_name = "resnet34.onnx"
    batch_size = 1
    img_h = 224
    img_w = 224
    img_channel = 3

    # create model and load pretrain weights
    model = resnet34(pretrained=False, num_classes=5)
    model.load_state_dict(torch.load(weights_path, map_location='cpu'))

    model.eval()
    # input to the model
    # [batch, channel, height, width]
    x = torch.rand(batch_size, img_channel, img_h, img_w, requires_grad=True)
    torch_out = model(x)

    # export the model
    torch.onnx.export(model,             # model being run
                      x,                 # model input (or a tuple for multiple inputs)
                      onnx_file_name,    # where to save the model (can be a file or file-like object)
                      input_names=["input"],
                      output_names=["output"],
                      verbose=False)

    # check onnx model
    onnx_model = onnx.load(onnx_file_name)
    onnx.checker.check_model(onnx_model)

    ort_session = onnxruntime.InferenceSession(onnx_file_name)

    # compute ONNX Runtime output prediction
    ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(x)}
    ort_outs = ort_session.run(None, ort_inputs)

    # compare ONNX Runtime and Pytorch results
    # assert_allclose: Raises an AssertionError if two objects are not equal up to desired tolerance.
    np.testing.assert_allclose(to_numpy(torch_out), ort_outs[0], rtol=1e-03, atol=1e-05)
    print("Exported model has been tested with ONNXRuntime, and the result looks good!")


if __name__ == '__main__':
    main()

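Note that after converting the PyTorch model to ONNX, the script loads the exported model with ONNX Runtime, feeds it the same input, and uses np.testing.assert_allclose to compare the outputs before and after conversion. Here rtol is the relative tolerance and atol is the absolute tolerance; if the difference exceeds the specified tolerance, an error is raised. After conversion, a resnet34.onnx file is generated in the current folder.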
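
How rtol and atol combine is easy to get wrong, so here is a scalar sketch of the rule that np.testing.assert_allclose applies element by element: the difference must not exceed atol + rtol * |desired|. The function name within_tolerance is made up for this illustration, and the defaults mirror the values used in the script above.

```python
def within_tolerance(actual, desired, rtol=1e-3, atol=1e-5):
    """Scalar version of the check: |actual - desired| <= atol + rtol * |desired|."""
    return abs(actual - desired) <= atol + rtol * abs(desired)


# For outputs near 1.0 the relative term dominates (allowed error about 1e-3).
print(within_tolerance(1.0005, 1.0))
print(within_tolerance(1.002, 1.0))
# For outputs near 0.0 only the absolute term is left (allowed error 1e-5).
print(within_tolerance(2e-5, 0.0))
```

The absolute term matters mostly for outputs close to zero, where a purely relative check would reject tiny, harmless differences.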

3.2 Converting ONNX to TensorRT

There are several ways to convert ONNX to a TensorRT engine; the simplest is the trtexec tool. Section 3.1 already converted the PyTorch ResNet34 to ONNX, so trtexec can now convert it directly to a TensorRT engine:

trtexec --onnx=resnet34.onnx --saveEngine=trt_output/resnet34.trt

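Where:

--onnx points to the path of the generated ONNX model file
--saveEngine is the path where the TensorRT engine is saved (one small catch: the target directory must be created beforehand, otherwise an error is raised)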
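
Because the output directory has to exist before trtexec runs, the call can be wrapped in a small script that creates it first. This is a sketch only: the helper names are hypothetical, it uses just the --onnx and --saveEngine flags described above plus the --fp16 flag this article mentions for FP16 conversion, and it assumes trtexec is on PATH.

```python
import os
import subprocess


def build_trtexec_command(onnx_path, engine_path, fp16=False):
    """Assemble the trtexec argument list without running it."""
    command = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        command.append("--fp16")
    return command


def convert(onnx_path, engine_path, fp16=False):
    """Create the output directory if needed, then run trtexec."""
    out_dir = os.path.dirname(engine_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    subprocess.run(build_trtexec_command(onnx_path, engine_path, fp16), check=True)


if __name__ == "__main__":
    convert("resnet34.onnx", "trt_output/resnet34.trt")
```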

During conversion the terminal prints output like the following:

[06/23/2022-08:08:14] [I] === Model Options ===
[06/23/2022-08:08:14] [I] Format: ONNX
[06/23/2022-08:08:14] [I] Model: /root/project/resnet34.onnx
[06/23/2022-08:08:14] [I] Output:
[06/23/2022-08:08:14] [I] === Build Options ===
[06/23/2022-08:08:14] [I] Max batch: explicit batch
[06/23/2022-08:08:14] [I] Workspace: 16 MiB
[06/23/2022-08:08:14] [I] minTiming: 1
[06/23/2022-08:08:14] [I] avgTiming: 8
[06/23/2022-08:08:14] [I] Precision: FP32
[06/23/2022-08:08:14] [I] Calibration:
[06/23/2022-08:08:14] [I] Refit: Disabled
[06/23/2022-08:08:14] [I] Sparsity: Disabled
[06/23/2022-08:08:14] [I] Safe mode: Disabled
[06/23/2022-08:08:14] [I] DirectIO mode: Disabled
[06/23/2022-08:08:14] [I] Restricted mode: Disabled
[06/23/2022-08:08:14] [I] Save engine: trt_ouput/resnet34.trt
[06/23/2022-08:08:14] [I] Load engine:
[06/23/2022-08:08:14] [I] Profiling verbosity: 0
[06/23/2022-08:08:14] [I] Tactic sources: Using default tactic sources
[06/23/2022-08:08:14] [I] timingCacheMode: local
[06/23/2022-08:08:14] [I] timingCacheFile:
[06/23/2022-08:08:14] [I] Input(s)s format: fp32:CHW
[06/23/2022-08:08:14] [I] Output(s)s format: fp32:CHW
[06/23/2022-08:08:14] [I] Input build shapes: model
[06/23/2022-08:08:14] [I] Input calibration shapes: model
......
[06/23/2022-08:08:41] [I] === Performance summary ===
[06/23/2022-08:08:41] [I] Throughput: 550.406 qps
[06/23/2022-08:08:41] [I] Latency: min = 1.85938 ms, max = 2.23706 ms, mean = 1.87513 ms, median = 1.87372 ms, percentile(99%) = 1.90234 ms
[06/23/2022-08:08:41] [I] End-to-End Host Latency: min = 1.87573 ms, max = 3.56226 ms, mean = 3.38754 ms, median = 3.47742 ms, percentile(99%) = 3.50659 ms
[06/23/2022-08:08:41] [I] Enqueue Time: min = 0.402954 ms, max = 2.53369 ms, mean = 0.68202 ms, median = 0.653564 ms, percentile(99%) = 0.830811 ms
[06/23/2022-08:08:41] [I] H2D Latency: min = 0.0581055 ms, max = 0.0943298 ms, mean = 0.063807 ms, median = 0.0615234 ms, percentile(99%) = 0.0910645 ms
[06/23/2022-08:08:41] [I] GPU Compute Time: min = 1.79099 ms, max = 2.14551 ms, mean = 1.80203 ms, median = 1.80127 ms, percentile(99%) = 1.8125 ms
[06/23/2022-08:08:41] [I] D2H Latency: min = 0.00610352 ms, max = 0.0129395 ms, mean = 0.00928149 ms, median = 0.00949097 ms, percentile(99%) = 0.0119934 ms
[06/23/2022-08:08:41] [I] Total Host Walltime: 3.00324 s
[06/23/2022-08:08:41] [I] Total GPU Compute Time: 2.97876 s
[06/23/2022-08:08:41] [I] Explanations of the performance metrics are printed in the verbose logs.
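
When comparing several builds (for example FP32 against FP16), it can help to pull the numbers out of the performance summary automatically. The following sketch parses lines in the format shown in the log above with regular expressions; the function names are made up, and the parsing assumes the log format stays exactly as printed here.

```python
import re


def parse_throughput(log_text):
    """Return the throughput in qps, or None if the line is missing."""
    match = re.search(r"Throughput:\s*([0-9.]+)\s*qps", log_text)
    return float(match.group(1)) if match else None


def parse_metric(log_text, name):
    """Return min/max/mean/median in ms for a summary line such as 'Latency'."""
    # Anchor on '[I] <name>:' so 'Latency' does not also match 'H2D Latency'.
    line = re.search(r"\[I\]\s+" + re.escape(name) + r":\s*(.*)", log_text)
    if line is None:
        return None
    pairs = re.findall(r"(\w+) = ([0-9.]+) ms", line.group(1))
    return {key: float(value) for key, value in pairs}


sample = (
    "[06/23/2022-08:08:41] [I] Throughput: 550.406 qps\n"
    "[06/23/2022-08:08:41] [I] Latency: min = 1.85938 ms, max = 2.23706 ms, "
    "mean = 1.87513 ms, median = 1.87372 ms, percentile(99%) = 1.90234 ms\n"
)
print(parse_throughput(sample))
print(parse_metric(sample, "Latency"))
```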

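For details on how to use trtexec, run trtexec --help. For example, to convert a model with FP16 precision, just add the --fp16 flag.

3.3 Loading the TensorRT model

This part mainly follows the official notebook tutorial: https://github.com/NVIDIA/TensorRT/blob/main/quickstart/SemanticSegmentation/tutorial-runtime.ipynb

Below is a sample I wrote based on the official demo; it compares the outputs of ONNX and TensorRT.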

import numpy as np
import tensorrt as trt
import onnxruntime
import pycuda.driver as cuda
import pycuda.autoinit


def normalize(image: np.ndarray) -> np.ndarray:
    """
    Normalize the image to the given mean and standard deviation
    """
    image = image.astype(np.float32)
    mean = (0.485, 0.456, 0.406)
    std = (0.229, 0.224, 0.225)
    image /= 255.0
    image -= mean
    image /= std
    return image


def onnx_inference(onnx_path: str, image: np.ndarray):
    # load onnx model
    ort_session = onnxruntime.InferenceSession(onnx_path)

    # compute onnx Runtime output prediction
    ort_inputs = {ort_session.get_inputs()[0].name: image}
    res_onnx = ort_session.run(None, ort_inputs)[0]
    return res_onnx


def trt_inference(trt_path: str, image: np.ndarray):
    # Load the network in Inference Engine
    trt_logger = trt.Logger(trt.Logger.WARNING)
    with open(trt_path, "rb") as f, trt.Runtime(trt_logger) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())

    with engine.create_execution_context() as context:
        # Set input shape based on image dimensions for inference
        context.set_binding_shape(engine.get_binding_index("input"), (1, 3, image.shape[-2], image.shape[-1]))
        # Allocate host and device buffers
        bindings = []
        for binding in engine:
            binding_idx = engine.get_binding_index(binding)
            size = trt.volume(context.get_binding_shape(binding_idx))
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            if engine.binding_is_input(binding):
                input_buffer = np.ascontiguousarray(image)
                input_memory = cuda.mem_alloc(image.nbytes)
                bindings.append(int(input_memory))
            else:
                output_buffer = cuda.pagelocked_empty(size, dtype)
                output_memory = cuda.mem_alloc(output_buffer.nbytes)
                bindings.append(int(output_memory))

        stream = cuda.Stream()
        # Transfer input data to the GPU.
        cuda.memcpy_htod_async(input_memory, input_buffer, stream)
        # Run inference
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
        # Transfer prediction output from the GPU.
        cuda.memcpy_dtoh_async(output_buffer, output_memory, stream)
        # Synchronize the stream
        stream.synchronize()

        res_trt = np.reshape(output_buffer, (1, -1))

    return res_trt


def main():
    image_h = 224
    image_w = 224
    onnx_path = "resnet34.onnx"
    trt_path = "trt_output/resnet34.trt"

    image = np.random.randn(image_h, image_w, 3)
    normalized_image = normalize(image)

    # Convert the resized images to network input shape
    # [h, w, c] -> [c, h, w] -> [1, c, h, w]
    normalized_image = np.expand_dims(np.transpose(normalized_image, (2, 0, 1)), 0)

    onnx_res = onnx_inference(onnx_path, normalized_image)
    ir_res = trt_inference(trt_path, normalized_image)
    np.testing.assert_allclose(onnx_res, ir_res, rtol=1e-03, atol=1e-05)
    print("Exported model has been tested with TensorRT Runtime, and the result looks good!")


if __name__ == '__main__':
    main()

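3.4 Other notes

Finally, a word on model quantization. Quantization can be loosely divided into two categories (this is not rigorous): QAT (Quantization Aware Training), which quantizes during training, and PTQ (Post Training Quantization), which quantizes after training. Because there are now so many deep learning frameworks and runtimes (for example TensorFlow's tf-lite, PyTorch's torchscript, onnx, tensorrt, openvino and so on), there is also a pile of quantization tools. For QAT I recommend NVIDIA's pytorch-quantization tool. For PTQ, I recommend tensorrt if you deploy on NVIDIA GPUs, and openvino is worth trying if you deploy on CPUs.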
