Knowledge Distillation with the NST Algorithm in Practice: Distilling ResNet18 from CoatNet

AI浩  2022-12-28

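Contents

Abstract
Final conclusion
Models
- ResNet18, ResNet34
- CoatNet
Data preparation
Training the Teacher model
- Steps
Student network
- Steps
Distilling the student network
- Steps
Comparing results
Summary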

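Abstract

High-complexity detection models can reach SOTA accuracy, but they are often hard to deploy directly. Model compression trades efficiency against accuracy. Knowledge distillation is an effective compression technique: its core idea is to force a lightweight student model to learn the knowledge extracted by a teacher model, which improves the student's performance. Existing knowledge distillation methods fall into three broad categories:

Feature-based (for example VID, NST, FitNets, fine-grained feature imitation)
Relation-based (for example IRG, Relational KD, CRD, similarity-preserving knowledge distillation)
Response-based (for example Hinton's original knowledge distillation work)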
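
For reference, the response-based category can be sketched in a few lines. This is a minimal illustration of the temperature-softened logit loss in Hinton's formulation, not code from this article: the function name `response_kd_loss` and the default temperature are assumptions chosen for the example.

```python
import torch
import torch.nn.functional as F


def response_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between temperature-softened teacher and student outputs.

    `temperature=4.0` is an arbitrary illustrative default.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2


torch.manual_seed(0)
teacher = torch.randn(8, 10)   # hypothetical teacher logits: batch of 8, 10 classes
student = torch.randn(8, 10)   # hypothetical student logits
loss_same = response_kd_loss(teacher, teacher)   # identical outputs: loss is ~0
loss_diff = response_kd_loss(student, teacher)   # different outputs: loss is positive
```

Only the final logits are compared here, which is what separates this family from the feature-based and relation-based methods listed above.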

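In this walkthrough we use NST, a feature-based distillation algorithm. NST distills the feature map produced by the last layer of a block, so the model has to expose that block's output, and we modify the models accordingly. To keep the parameters of the corresponding Teacher and Student layers consistent, we choose CoatNet as the Teacher and ResNet18 as the Student.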
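
To make the idea concrete, here is a minimal sketch of an NST-style loss. It assumes the commonly used formulation: a squared maximum mean discrepancy (MMD) with a degree-2 polynomial kernel between L2-normalized channel activations. The names `poly_kernel` and `nst_loss` and the tensor shapes are illustrative assumptions; the loss used in the actual training script may differ.

```python
import torch
import torch.nn.functional as F


def poly_kernel(a, b):
    """Degree-2 polynomial kernel between every channel of `a` and every channel of `b`.

    a: (B, Ca, HW), b: (B, Cb, HW) -> (B, Cb, Ca)
    """
    return (a.unsqueeze(1) * b.unsqueeze(2)).sum(-1).pow(2)


def nst_loss(feat_s, feat_t):
    """Squared MMD between student and teacher channel activation patterns."""
    # Flatten spatial dims: (B, C, H, W) -> (B, C, H*W), then L2-normalize each channel.
    fs = F.normalize(feat_s.flatten(2), dim=2)
    ft = F.normalize(feat_t.flatten(2), dim=2)
    return (poly_kernel(ft, ft).mean()
            + poly_kernel(fs, fs).mean()
            - 2 * poly_kernel(fs, ft).mean())


torch.manual_seed(0)
feat_student = torch.randn(2, 8, 7, 7)    # hypothetical student feature map
feat_teacher = torch.randn(2, 16, 7, 7)   # hypothetical teacher map with more channels
loss = nst_loss(feat_student, feat_teacher)
loss_identical = nst_loss(feat_teacher, feat_teacher)  # ~0 for identical features
```

In this formulation the channel counts may differ between the two networks, but the spatial size (H x W) of the two feature maps has to match, because each channel is compared as a vector over spatial positions.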

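Final conclusion

Let's start with the conclusion. The Teacher network is the coatnet_2 model from CoatNet, and the Student network is ResNet18. The results are shown in the table below.

Network	epochs	ACC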
CoatNet	100	91%
ResNet18	100	89%
ResNet18 +NST	100	90%

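Models

The models are not the ones bundled with PyTorch; they are adapted from a ResNet implementation I wrote up earlier. The ResNet architecture is shown in the figure below:

[Figure: ResNet architecture]

ResNet18, ResNet34

ResNet18 and ResNet34 use the same residual block structure:

[Figure: residual block structure]

The code is as follows: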
resnet.py

import torch
import torchvision
from torch import nn
from torch.nn import functional as F
# from torchsummary import summary


class ResidualBlock(nn.Module):
    """
    Sub-module: the residual block.
    """

    def __init__(self, inchannel, outchannel, stride=1, shortcut=None):
        super(ResidualBlock, self).__init__()
        self.left = nn.Sequential(
            nn.Conv2d(inchannel, outchannel, 3, stride, 1, bias=False),
            nn.BatchNorm2d(outchannel),
            nn.ReLU(inplace=True),
            nn.Conv2d(outchannel, outchannel, 3, 1, 1, bias=False),
            nn.BatchNorm2d(outchannel)
        )
        self.right = shortcut

    def forward(self, x):
        out = self.left(x)
        residual = x if self.right is None else self.right(x)
        out += residual
        return F.relu(out)


class ResNet(nn.Module):
    """
    Main module: ResNet (shared by ResNet18 and ResNet34).
    The network consists of several layers, each containing several residual blocks.
    ResidualBlock implements the residual block; _make_layer builds each layer.
    """

    def __init__(self, blocks, num_classes=1000):
        super(ResNet, self).__init__()
        self.model_name = 'resnet34'

        # Stem: initial convolution and pooling applied to the input image
        self.pre = nn.Sequential(
            nn.Conv2d(3, 64, 7, 2, 3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, 2, 1))

        # Repeated layers; block counts come from `blocks` (3, 4, 6, 3 for ResNet34)
        self.layer1 = self._make_layer(64, 64, blocks[0])
        self.layer2 = self._make_layer(64, 128, blocks[1], stride=2)
        self.layer3 = self._make_layer(128, 256, blocks[2], stride=2)
        self.layer4 = self._make_layer(256, 512, blocks[3], stride=2)

        # Fully connected layer used for classification
        self.fc = nn.Linear(512, num_classes)

    def _make_layer(self, inchannel, outchannel, block_num, stride=1):
        """
        Build one layer made up of several residual blocks.
        """
        shortcut = nn.Sequential(
            nn.Conv2d(inchannel, outchannel, 1, stride, bias=False),
            nn.BatchNorm2d(outchannel),
            nn.ReLU()
        )

        layers = []
        layers.append(ResidualBlock(inchannel, outchannel, stride, shortcut))

        for i in range(1, block_num):
            layers.append(ResidualBlock(outchannel, outchannel))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.pre(x)
        l1_out = self.layer1(x)
        l2_out = self.layer2(l1_out)
        l3_out = self.layer3(l2_out)
        l4_out = self.layer4(l3_out)
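        # The 7x7 average pooling assumes a 224x224 input (layer4 output is 7x7)
        p_out = F.avg_pool2d(l4_out, 7)
        fea = p_out.view(p_out.size(0), -1)
        out = self.fc(fea)
        # Return every layer's feature map plus the pooled feature and the logits
        return l1_out, l2_out, l3_out, l4_out, fea, out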

def ResNet18():
    return ResNet([2, 2, 2, 2])


def ResNet34():
    return ResNet([3, 4, 6, 3])


if __name__ == '__main__':
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = ResNet34()
    model.to(device)
    # summary(model, (3, 224, 224))

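The main change is the return value: forward now returns the output of every layer (l1_out to l4_out) together with the pooled feature and the logits.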
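
To see what those returned feature maps look like, the following sketch computes their channel count and side length from the kernel, stride, and padding values used in resnet.py. The helper names `conv_out` and `resnet_stage_shapes` are hypothetical and exist only for this illustration.

```python
def conv_out(size, kernel, stride, padding):
    """Output side length of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet_stage_shapes(input_size=224):
    """Return (channels, side) for l1_out..l4_out of the ResNet in resnet.py."""
    size = conv_out(input_size, kernel=7, stride=2, padding=3)  # self.pre convolution
    size = conv_out(size, kernel=3, stride=2, padding=1)        # self.pre max-pool
    shapes = []
    for channels, stride in [(64, 1), (128, 2), (256, 2), (512, 2)]:
        # Only the first block of each layer changes the resolution (3x3 conv, padding 1).
        size = conv_out(size, kernel=3, stride=stride, padding=1)
        shapes.append((channels, size))
    return shapes


shapes = resnet_stage_shapes(224)
print(shapes)  # [(64, 56), (128, 28), (256, 14), (512, 7)]
```

With a 224x224 input, l4_out is 512 x 7 x 7, which is why `F.avg_pool2d(l4_out, 7)` reduces it to a single 512-dimensional vector. A different input size would need a different pooling window or adaptive pooling.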

CoatNet

Code:
coatnet.py

import torch
import torch.nn as nn

from einops import rearrange
from einops.layers.torch import Rearrange


def conv_3x3_bn(inp, oup, image_size, downsample=False):
    stride = 2 if downsample else 1
    return nn.Sequential(
        nn.Conv2d(inp, oup, 3, stride, 1, bias=False),
        nn.BatchNorm2d(oup),
        nn.GELU()
    )


class PreNorm(nn.Module):
    def __init__(self, dim, fn, norm):
        super().__init__()
        self.norm = norm(dim)
        self.fn = fn

    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)


class SE(nn.Module):
    def __init__(self, inp, oup, expansion=0.25):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(oup, int(inp * expansion), bias=False),
            nn.GELU(),
            nn.Linear(int(inp * expansion), oup, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)
        y = self.fc(y).view(b, c, 1, 1)
        return x * y


class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout=0.):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)


class MBConv(nn.Module):
    def __init__(self, inp, oup, image_size, downsample=False, expansion=4):
        super().__init__()
        self.downsample = downsample
        stride = 2 if self.downsample else 1
        hidden_dim = int(inp * expansion)

        if self.downsample:
            self.pool = nn.MaxPool2d(3, 2, 1)
            self.proj = nn.Conv2d(inp, oup, 1, 1, 0, bias=False)

        if expansion == 1:
            self.conv = nn.Sequential(
                # dw
                nn.Conv2d(hidden_dim, hidden_dim, 3, stride,
                          1, groups=hidden_dim, bias=False),
                nn.BatchNorm2d(hidden_dim),
                nn.GELU(),
                # pw-linear
                nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
                nn.BatchNorm2d(oup),
            )
        else:
            self.conv = nn.Sequential(
                # pw
                # down-sample in the first conv
                nn.Conv2d(inp, hidden_dim, 1, stride, 0, bias=False),
                nn.BatchNorm2d(hidden_dim),
                nn.GELU(),
                # dw
                nn.Conv2d(hidden_dim, hidden_dim, 3, 1, 1,
                          groups=hidden_dim, bias=False),
                nn.BatchNorm2d(hidden_dim),
                nn.GELU(),
                SE(inp, hidden_dim),
                # pw-linear
                nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
                nn.BatchNorm2d(oup),
            )
        
        self.conv = PreNorm(inp, self.conv, nn.BatchNorm2d)

    def forward(self, x):
        if self.downsample:
            return self.proj(self.pool(x)) + self.conv(x)
        else:
            return x + self.conv(x)


class Attention(nn.Module):
    def __init__(self, inp, oup, image_size, heads=8, dim_head=32, dropout=0.):
        super().__init__()
        inner_dim = dim_head * heads
        project_out = not (heads == 1 and dim_head == inp)

        self.ih, self.iw = image_size

        self.heads = heads
        self.scale = dim_head ** -0.5

        # parameter table of relative position bias
        self.relative_bias_table = nn.Parameter(
            torch.zeros((2 * self.ih - 1) * (2 * self.iw - 1), heads))

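        # indexing="ij" keeps the original row-major behaviour and avoids the deprecation warning (PyTorch >= 1.10)
        coords = torch.meshgrid(torch.arange(self.ih), torch.arange(self.iw), indexing="ij")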
        coords = torch.flatten(torch.stack(coords), 1)
        relative_coords = coords[:, :, None] - coords[:, None, :]

        relative_coords[0] += self.ih - 1
        relative_coords[1] += self.iw - 1
        relative_coords[0] *= 2 * self.iw - 1
        relative_coords = rearrange(relative_coords, 'c h w -> h w c')
        relative_index = relative_coords.sum(-1).flatten().unsqueeze(1)
        self.register_buffer("relative_index", relative_index)

        self.attend = nn.Softmax(dim=-1)
        self.to_qkv = nn.Linear(inp, inner_dim * 3, bias=False)

        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, oup),
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()

    def forward(self, x):
        qkv = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = map(lambda t: rearrange(
            t, 'b n (h d) -> b h n d', h=self.heads), qkv)

        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale

        # Use "gather" for more efficiency on GPUs
        relative_bias = self.relative_bias_table.gather(
            0, self.relative_index.repeat(1, self.heads))
        relative_bias = rearrange(
            relative_bias, '(h w) c -> 1 c h w', h=self.ih*self.iw, w=self.ih*self.iw)
        dots = dots + relative_bias

        attn = self.attend(dots)
        out = torch.matmul(attn, v)
        out = rearrange(out, 'b h n d -> b n (h d)')
        out = self.to_out(out)
        return out


class Transformer(nn.Module):
    def __init__(self, inp, oup, image_size, heads=8, dim_head=32, downsample=False, dropout=0.):
        super().__init__()
        hidden_dim = int(inp * 4)

        self.ih, self.iw = image_size
        self.downsample = downsample

        if self.downsample:
            self.pool1 = nn.MaxPool2d(3, 2, 1)
            self.pool2 = nn.MaxPool2d(3, 2, 1)
            self.proj = nn.Conv2d(inp, oup, 1, 1, 0, bias=False)

        self.attn = Attention(inp, oup, image_size, heads, dim_head, dropout)
        self.ff = FeedForward(oup, hidden_dim, dropout)

        self.attn = nn.Sequential(
            Rearrange('b c ih iw -> b (ih iw) c'),
            PreNorm(inp, self.attn, nn.LayerNorm),
            Rearrange('b (ih iw) c -> b c ih iw', ih=self.ih, iw=self.iw)
        )

        self.ff = nn.Sequential(
            Rearrange('b c ih iw -> b (ih iw) c'),
            PreNorm(oup, self.ff, nn.LayerNorm),
            Rearrange('b (ih iw) c -> b c ih iw', ih=self.ih, iw=self.iw)
        )

    def forward(self, x

[Listing truncated in the source.]
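
The `Attention` class above looks up a learned relative position bias through `relative_index`. The plain-Python sketch below mirrors that index arithmetic for a small grid so the mapping can be inspected by hand. The helper name `relative_index` as a standalone function is an assumption for illustration; the listing itself builds the same values with tensor operations.

```python
def relative_index(ih, iw):
    """Index into the (2*ih-1)*(2*iw-1) bias table for every (query, key) pair."""
    coords = [(r, c) for r in range(ih) for c in range(iw)]  # row-major positions
    table = []
    for r1, c1 in coords:
        row = []
        for r2, c2 in coords:
            dr = r1 - r2 + ih - 1          # shift row offset into [0, 2*ih-2]
            dc = c1 - c2 + iw - 1          # shift column offset into [0, 2*iw-2]
            row.append(dr * (2 * iw - 1) + dc)
        table.append(row)
    return table


idx = relative_index(2, 2)
for row in idx:
    print(row)
# [4, 3, 1, 0]
# [5, 4, 2, 1]
# [7, 6, 4, 3]
# [8, 7, 5, 4]
```

For a 2x2 grid the table has 9 entries (indices 0 to 8), and every pair with the same relative offset shares one entry: the diagonal, where query and key coincide, always maps to index 4.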
                