Jarvis's Blog

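Recording life, building on what I learn

Group Week Report

Published 2023-04-10, updated 2024-12-16

Model Compression and Deployment Group Progress (2023.4.3-2023.4.9)

高扬城

I-ViT module reproduction: finished reproducing I-LayerNorm;
Replied to the review comments on the grant application;
Drafted the first-version plan and feasibility analysis for the Yangtze River basin illegal-activity monitoring project;
Prepared the group-meeting slides;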

李亚伟

workshop

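After pretraining with the L1 Charbonnier loss, continued training with the L2 loss -> PSNR rose from 27.76 to 27.81
Improved the attention module: (1) enlarged the receptive field, (2) replaced some convolutions with grouped convolutions -> Params dropped from 25,664 to 24,160 and FLOPs from 2.949 G to 2.776 G, but runtime rose from 27.8 ms to 30.0 ms because TFLite supports the grouped-convolution operator poorly

Best result so far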

Model	Description	Dataset	Val PSNR	Val SSIM	Params	Runtime on oneplus7T [ms]	FLOPs [G]
SWAT_5	Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=1), replace fc with 1*1 conv, replace pixel normalization with layer normalization, enlarge train step numbers to 250,000	REDS	27.811176	0.7763541	25,664	27.6 (FP16_TFLite GPU Delegate)	2.949

work
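- Collected papers that use exactly the same experimental setup on the REDS dataset and summarized their metrics
- The pruning / weight-clustering code previously ran only on a single-frame, single-input single-output model; the current model is multi-input multi-output, and after adaptation the code now runs on it

Next steps

高扬城

Finish reproducing the I-ViT model;
Finish the group-meeting slides;
Try deploying models with TVM on a Jetson Nano or Raspberry Pi;

李亚伟

work
- Finish the INT8/FP16 quantization of the current multi-input multi-output model

Model Compression and Deployment Group Progress (2023.7.03-2023.7.16)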

李亚伟

video super-resolution on mobile device
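- Cleaned up the FANI code and uploaded it to GitHub
NeurIPS reviewing
Added slides: model compression and deployment part
Deployed ZeroDCE on a Jetson Nano: far from real time, since low-light enhancement of a single 512×512 image takes > 2 min.

苗康

Surveyed the quantization papers recommended by 李亚伟 (ZeroQ, HAWQ-V3) and debugged several quantization demos from the micronet project: QAT/PTQ -> QAFT
Took leave to go home and handle family matters

Next steps

李亚伟

Survey recent progress in compression and quantization to find the next research direction
Qualcomm has open-sourced its 8-bit floating-point quantization project (FP8 quantization); test it to see whether there is room for follow-up work
Implement camera following for the trt_pose pose-estimation project
Study the documentation of the Bingda robot car

苗康

Discuss the next topic direction with members of the model compression group
Get familiar with recent progress in model compression

Model Compression and Deployment Group Progress (2023.8.07-2023.8.13)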

苗康

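Wrote the ICDM review comments
Helped senior labmate 韦炎炎 write a project proposal on the theme "enhancement and real-time analysis of surveillance footage in complex environments", surveying two narrower directions through two review papers: "A review of computer vision-based structural health monitoring at local and global levels", on using computer vision for structural health inspection of buildings, and "Anomaly Detection in Road Traffic Using Visual Surveillance: A Survey", on computer-vision and visual-surveillance techniques for understanding traffic violations and other road anomalies
Wrote the thesis proposal, on lightweight super-resolution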

李亚伟

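PRCV reviewing
FP8 quantization survey
- FP8 Quantization: The Power of the Exponent (Qualcomm, NeurIPS 2022)
  1. FP8 is better suited to scenarios with many outliers
  2. With PTQ its accuracy is better than INT8; with QAT it is slightly worse than INT8
- FP8 Formats for Deep Learning (NVIDIA/Arm/Intel, arXiv 2022.09) -> FP8 as a unified data format for training and inference
  1. FP8 can speed up training and reduce the resources it needs, while easing deployment and preserving the trained accuracy
  2. INT8-quantized models usually need calibration or fine-tuning; the mismatch between training and inference data types complicates deployment and usually costs accuracy
- FP8 versus INT8 for efficient deep learning inference (Qualcomm, arXiv 2023.06) -> FP8 currently cannot replace INT8 inference in performance or accuracy; INT4-INT8-INT16 is currently the best choice for edge inference
  1. With PTQ and pronounced outliers, FP8 has an accuracy advantage over INT8; this case can usually be handled with W8A16 mixed precision or QAT
  2. FP8 inference is expensive in hardware: FP8 MAC units are 50% to 180% less efficient than INT8
  3. For higher efficiency, some INT4 quantization tools already exist, but so far there is no FP4-related work
- Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models (MSRA, arXiv 2023.05) -> layer-wise mixed-precision LLM
Deployed the low-light enhancement model ZeroDCE++ on a Jetson Nano: a single 512×512 image takes about 10 ms, with fluctuations (up to 4931.46 ms/image), which basically meets the real-time requirement
Deployed Face Tracking on a Jetson Nano; combined with the earlier Pose Estimation it cannot reach the real-time requirement of 30 frames/s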

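Next steps

苗康

Settle on the template paper for the project proposal and flesh out the proposal
Look for a new research direction

李亚伟

Write the thesis proposal
Try integrating Face Tracking and Pose Estimation so that the camera follows the person while estimating their pose

Model Compression and Deployment Group Progress (2023.8.14-2023.8.27)

苗康

Finished writing the project proposal; under a senior labmate's guidance, revised its background, related work, and technical roadmap
Thesis proposal
Surveyed and studied SenseTime's quantization tool MQBench. Differences between the training pipelines of different works lead to inconsistent reproduction results, and MQBench addresses this with unified algorithm implementations and quantization strategies. I plan to make it my research direction for the coming period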

李亚伟

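Surveyed airborne wide-area persistent surveillance solutions and made slides
Thesis proposal
Jetson Nano project: Face Tracking + Pose Estimation
- Pose-estimation inference with NVIDIA's official trt_pose project was slow, so deployment now builds on RTMPose, the lightweight pose-estimation project released by Shanghai AI Lab in 2023
- Studied SenseTime's MMDeploy and MMPose projects, compiled and installed their dependencies, and set up the deployment environment on the Jetson Nano
- Finished the C++ code that drives the servos to adjust the camera position; it will later be called from Python files through the ctypes library

Next steps

苗康

Study and apply MQBench in depth

李亚伟

Finish the Jetson Nano project: Face Tracking + Pose Estimation
Survey inference frameworks such as TensorRT/TNN/MNN/NCNN, focusing on accelerating RTMPose inference with TensorRT
Take part in the Greater Bay Area algorithm competition: video frame interpolation + monocular depth estimation

Model Compression and Deployment Group Progress (2023.9.25-2023.10.08)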

苗康

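Took part in the video frame interpolation track of the competition; the results were poor.
Discussed with a senior labmate and fixed grammar and formatting problems in the ICDM paper.
Found a CVPR 2023 paper, "CABM: Content-Aware Bit Mapping for Single Image Super-Resolution Network with Large Input", whose motivation is largely the same as that of my ICDM paper; the difference is that it selects the quantization bit-width per patch, whereas my paper selects the number of blocks per patch. That paper also borrows heavily from the 21CVPR paper "ARM: Any-Time Super-Resolution Method" and was still accepted, which suggests this direction is still worth exploring.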

王明申

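Reviewed master's theses for the random quality inspection
Joined the Greater Bay Area competition with a senior labmate to learn the competition workflow and model tuning for different tasks.
EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models (arXiv 23.10.05): PTQ-level time cost with QAT-level accuracy
- Motivation: existing PTQ methods fail to give good results for W4A4 quantization of diffusion models; the drawback of QLoRA is that the LoRA weights cannot be merged into the quantized model weights.
- Proposes a quantization-aware low-rank adapter (QALoRA): the LoRA weights are merged with the FP model weights and quantized together to the target bit-width. Weight quantization is channel-wise; activation quantization is layer-wise.
- Activation quantization: applies the LSQ quantization method at every denoising step, optimizing the activation quantization scale of each step separately.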
李亚伟

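Greater Bay Area monocular depth estimation competition:
- Misunderstood the data: it involves the ground truth of 6 different datasets, and the meaning of the labels was not fully clear (e.g. units in mm or m, skymask, validmask, etc.)
- Fed the datasets with a clean structure (only imgs and gts) into the current SOTA model ZoeDepth and fine-tuned its metric bins module; the accuracy after training was worse than that of the original model fine-tuned only on NYU Depth V2
- Current submission: rank 42/60 on leaderboard A; leaderboard B decides the final ranking and has not been released yet
ICDM camera-ready revision / National Scholarship application defense

Next steps

苗康

Study the quantization part of CABM, then port it to my baseline paper "Adaptive patch exiting for scalable single image super-resolution" to see how it performs.

王明申

Look for angles to improve quantized SISR from the two papers CADyQ (ECCV 2022) and CABM (CVPR 2023).
Look for quantization topics in other areas, such as large diffusion models.

李亚伟

Following SISR quantization approaches (CADyQ, ECCV 2022; DAQ, WACV 2022), build a quantization baseline on the current SOTA models (BasicVSR++/VRT/RVRT)

Model Compression and Deployment Group Progress (2023.10.09-2023.10.15)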

苗康

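Registered and submitted the ICDM final version
Studied the code of several SR quantization works (ARM, CABM, CADyQ); one part of the ARM code is unclear, and the authors have not replied yet
Settled on the approach: determine the number of network blocks per patch, then determine the quantization bit-width from the block count; the theoretical justification is still weak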

王明申

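Reproduced the baseline results of the SR quantization models APE, CADyQ, and CABM
Looked for SR quantization tricks
EQ-Net: Elastic Quantization Neural Networks (ICCV 2023)
OFA motivation: hardware supports many different quantization forms, and existing solutions are limited by the need for repeated training and optimization
- For weight quantization, proposes regularization based on skewness and kurtosis
- Proposes GPG, a group progressive guidance scheme similar to knowledge distillation, and CQAP, an MLP that selects granularity and symmetry; finally uses a genetic algorithm to speed up the search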

李亚伟

Video Super-Resolution Quantization
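- Building a channel-wise distribution-aware quantization pipeline on BasicVSR++, the current SOTA model on Vid4, Vimeo90k, and REDS (there is no baseline yet for video super-resolution quantization, and the code is fairly difficult)
- Trying to bring in methods that work on other video perception tasks (Human Pose Estimation, Semantic Segmentation, Video Object Segmentation): following ResQ (ICCV 2023), quantize the residual between the activations of adjacent frames, whose smaller variance helps reduce the quantization error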
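The variance argument behind residual quantization can be illustrated with a small pure-Python sketch. It is not ResQ's actual algorithm: the synthetic "frames", the 4-bit symmetric min/max quantizer, and the helper name `uniform_quant_error` are all assumptions made for illustration.

```python
import random


def uniform_quant_error(values, num_bits=4):
    """Mean squared error of symmetric uniform quantization with a min/max range."""
    max_abs = max(abs(v) for v in values) or 1.0
    levels = 2 ** (num_bits - 1) - 1          # e.g. 7 positive levels for 4 bits
    scale = max_abs / levels                  # step size grows with the value range
    return sum((v - round(v / scale) * scale) ** 2 for v in values) / len(values)


random.seed(0)
# Hypothetical activations of two adjacent frames: the second is a small
# perturbation of the first, mimicking temporal redundancy in video.
prev_frame = [random.gauss(0.0, 1.0) for _ in range(4096)]
cur_frame = [p + random.gauss(0.0, 0.05) for p in prev_frame]
residual = [c - p for c, p in zip(cur_frame, prev_frame)]

err_direct = uniform_quant_error(cur_frame)    # quantize the activation itself
err_residual = uniform_quant_error(residual)   # quantize only the frame-to-frame residual

print(f"direct MSE:   {err_direct:.6f}")
print(f"residual MSE: {err_residual:.6f}")
```

Because the residual spans a much narrower range, the same number of bits gives a much finer step size, which is the effect the note above refers to.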
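ICDM registration and submission

Next steps

苗康

Continue building the model architecture along the approach above, while searching related papers for a better theoretical explanation

王明申

Keep looking for SR quantization tricks
Read the code of existing SR quantization works in depth

李亚伟

Finish the BasicVSR++ quantization pipeline combined with ResQ

Model Compression and Deployment Group Progress (2023.10.16-2023.10.22)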

苗康

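Continued last week's plan: patch-based SR quantization, debugging code and modifying the model structure
Revised the patent
Surveyed several papers on pruning for super-resolution: Aligned Structured Sparsity Learning for Efficient Image Super-Resolution (NeurIPS 2021) and Learning Efficient Image Super-Resolution Networks via Structure-Regularized Pruning (ICLR 2022). Conclusion: because SR networks contain many skip and residual connections, pruning does not generalize well in the SR field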

王明申

SCIS reviewing
Read the SR quantization code by stepping through it with pdb

李亚伟

Video Super-Resolution Quantization
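- Following GPTQ, finished the basic part of BasicVSR++ quantization (no ViT involved)
- Reading papers to see whether the other SOTA models (ViTs) have modules that need dedicated improvement:
  - CVPR2022: TTVSR
  - NeurIPS2022: PSRT, RVRT
  - CVPR2023: IART
SelecQ LaTeX layout adjustments; journal registration and submission
The university HPC instance expired; built a Docker image on the lab's Inspur cluster
Patent revision

Next steps

苗康

Continue building the model structure

王明申

Read the ICCV 2023 workshop talks on Low-Bit Quantized Neural Networks
Read quantization papers on Transformers or LLMs

李亚伟

Improve the quantization module for video SR ViTs using ResQ
Make slides for the Deep Neural Networks course

Model Compression and Deployment Group Progress (2023.10.23-2023.10.29)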

苗康

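Built a first patch-based SR quantization framework; the results were very poor, which I attribute to code problems, and since the novelty is also questionable I am setting it aside for now
Collaborating within the group on the video super-resolution work, studying the BasicVSR++ video SR model.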

王明申

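Read papers
- VSR: BasicVSR (CVPR 2021), BasicVSR++ (CVPR 2022)
- Quantization:
  - EfficientViT (ICCV 2023, MIT): a lightweight Transformer that is general across classification, segmentation, and restoration, with very good results
  - Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective (CVPR 2023): improves quantization results by analyzing the oscillation caused by quantization error
Got the BasicVSR++ code running
Quantized the YOLOv7 model with automatic quantization-node insertion in the TensorRT quantization framework; mAP dropped by 0.03%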

李亚伟

Video Super-Resolution Quantization
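- Tried Baidu's PaddleSlim to quantize BasicVSR++ with static and dynamic quantization (PTQ)
Made slides for the Deep Neural Networks course

Next steps

苗康

Take part in the video super-resolution work

王明申

Read quantization papers
Study network model quantization and complete quantization for the VSR task

李亚伟

Help finish PaddleSlim quantization of BasicVSR++

Model Compression and Deployment Group Progress (2023.10.30-2023.11.05)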

李亚伟

Video Super-Resolution Quantization
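- BasicVSR++ PTQ: the quantization process has a bug that is being fixed
  1. Convert the BasicVSR++ torch model to ONNX and check it
  2. Activation calibration, producing the quantization parameters scale and zero_point
  3. Weight adjustment to improve quantization accuracy
  4. Quantization error analysis to locate quantization problems
- note: BasicVSR++ PTQ currently uses the open-source tool Dipoorlet, whose code is concise, clear, and easy to modify, so ideas can be verified faster than with Baidu's PaddleSlim; once the VSR quantization method matures it can be ported to PaddleSlim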
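Step 2 above, activation calibration producing `scale` and `zero_point`, can be sketched in plain Python as below. This is a minimal min/max calibration for asymmetric 8-bit quantization, written only for illustration; it is not Dipoorlet's implementation, and the function names and sample activations are assumptions.

```python
def calibrate_minmax(activations, num_bits=8):
    """Return (scale, zero_point) for asymmetric uniform quantization."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # The range is forced to include 0 so that 0.0 is exactly representable.
    lo = min(min(activations), 0.0)
    hi = max(max(activations), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0   # fall back to 1.0 for all-zero input
    zero_point = int(round(qmin - lo / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, num_bits=8):
    """Map a float to an integer code, clipping to the representable range."""
    q = round(x / scale) + zero_point
    return max(0, min(2 ** num_bits - 1, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to an approximate float."""
    return (q - zero_point) * scale


# Hypothetical calibration activations collected from a few forward passes.
calib = [-1.0, -0.25, 0.0, 0.5, 3.0]
scale, zp = calibrate_minmax(calib)
recon = [dequantize(quantize(x, scale, zp), scale, zp) for x in calib]
max_err = max(abs(a - b) for a, b in zip(calib, recon))
print(scale, zp, max_err)
```

Real PTQ tools typically offer more robust range estimators (percentile, MSE, or KL based) than plain min/max, because a single outlier stretches the range and coarsens the step size for every other value.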
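Read papers for ideas to improve PTQ accuracy

苗康

One review for Neural Network
Completed YOLOv5 PTQ following the official PaddleSlim documentation example:
- Converted the PyTorch .pt file to ONNX
- Fed the ONNX file to the PaddleSlim script, which outputs the model and weight files
- Porting and deployment to TensorRT is not yet clear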

王明申

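Read papers
- VSR: ESPCN (CVPR 2017)
- Quantization:
  1. Efficient LLM Inference on CPUs (arXiv 2311.00502, Intel; almost lossless accuracy)
  2. DAQ: Channel-Wise Distribution-Aware Quantization for Deep Image Super-Resolution Networks (WACV 2022, ISRQ; the ablation study analyzes the effect of Gaussian, Uniform, Laplacian, and Gamma distributions on channel-wise quantization)
ICASSP 2024 reviewing
Finished manual insertion of quantization nodes into the YOLOv7 network and, through sensitive-layer analysis, analyzed the network layer by layer and printed the top 10 layers whose quantization hurts accuracy most.

Next steps

苗康

Port the PaddleSlim YOLOv5 PTQ workflow to BasicVSR++
Try implementing the ResQ idea with PaddleSlim

王明申

Read quantization papers to find new ideas for video SR quantization
Study network model quantization

李亚伟

Fix the bugs hit when quantizing BasicVSR++ with Dipoorlet

Model Compression and Deployment Group Progress (2023.11.06-2023.11.12)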

苗康

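Made slides for the Principles of Deep Neural Networks course.
Used PaddleSlim's AutoCompression interface to compress YOLOv5 automatically (quantization plus distillation); NumPy API conflicts in the code are being debugged.
Compared YOLOv5 deployment across several frameworks on a 3090 GPU.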

王明申

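Read papers
- VSR: DRDVSR (CVPR 2018)
- Quantization: Overcoming Distribution Mismatch in Quantizing Image Super-Resolution Networks (resolves the conflict between the regularization loss and the restoration loss)
Neurocomputing reviewing
Finished studying PTQ and QAT on YOLOv7: manually inserting QDQ nodes, analyzing per-layer quantization sensitivity, handling the highly sensitive layers, applying a unified scale to the multiple output nodes that feed a Concat node, then reducing the quantization loss through training iterations and exporting the quantized model to ONNX.
Analyzed the difficulties of VSR quantization: early VSR models have simpler network structures, while recent ones may contain modules that are unfriendly to generic quantization, so adapted QDQ nodes must be added by hand, which is hard and prevents fair comparison experiments.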

李亚伟

Video Super-Resolution Quantization
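- BasicVSR++ with Dipoorlet PTQ: the quantization process does not support dynamic input, i.e. videos of arbitrary length (time_step); opened a GitHub issue, no reply yet
- BasicVSR++ with MQBench PTQ: the BasicVSR++ forward pass contains dynamic control flow, i.e. branch conditions that depend on runtime values (inputs/activations). MQBench captures the forward computation graph with torch.fx symbolic_trace, which by design does not support dynamic control flow. Currently trying:
  1. Replacing the dynamic control flow in the model with static equivalents
  2. Learning torch.compile (TorchDynamo), newly released in torch 2.0, and trying it on the dynamic control flow that is widespread in the model's forward pass
Set up a RustDesk relay server to reduce remote-desktop latency

Next steps

苗康

Fix the auto-compression bug, run YOLOv5 PTQ with PaddleSlim, and study its quantization analysis and accuracy reconstruction tools.

王明申

Start from earlier VSR models and try quantizing them the same way YOLOv7 was quantized with TensorRT.
Learn how to quantize ops of different structures.

李亚伟

Push forward plain quantization (naive PTQ) of VSR models
Internship-related work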

MAI 2023 Mobile VSR Workshop Log

Published 2023-02-17, updated 2024-12-16

Workshop and Challenges @ CVPR 2023

Efficient Super-Resolution Challenge (ESR)

Classic baselines:
- Information multi-distillation block, IMDN (2019)
- Residual feature distillation block, RFDN (2020)
- Residual Local Feature Network, RLFN (ByteESR2022)
During initial debugging, even a small change in a directory name caused unexpected errors elsewhere :(
Many teams used Quantization Aware Training (QAT)
The network architectures and weights of the 2022 leaderboard entries are all provided
results

Model	Dataset	Val PSNR	Val Time [ms]	Params [M]	FLOPs [G]	Acts [M]	Mem [M]	Conv
trained_rfdn_best	DIV2K_val(801-900)	28.73	37.62	0.433	27.10	112.03	788.13	64
RFDN_baseline_1	DIV2K_val(801-900)	29.04	41.38	0.433	27.10	112.03	788.13	64
RFDN_baseline_2	DIV2K_val(801-900)	29.04	43.86	0.433	27.10	112.03	788.13	64
RFDN_baseline_3	DIV2K_val(801-900)	29.04	37.59	0.433	27.10	112.03	788.13	64
RFDN_baseline_4	DIV2K_val(801-900)	29.04	34.20	0.433	27.10	112.03	788.13	64
IMDN_baseline_1	DIV2K_val(801-900)	29.13	45.11	0.894	58.53	154.14	471.78	43
IMDN_baseline_2	DIV2K_val(801-900)	29.13	45.03	0.894	58.53	154.14	471.78	43
IMDN_baseline_3	DIV2K_val(801-900)	29.13	44.44	0.894	58.53	154.14	471.78	43

Mobile AI workshop 2023

Testing can be done on your own phone or on the provided remote devices (slow, with latency)
2022 tracks

Track	Sponsor	Evaluate_Platform	Final_Phase_Team/Participants
Bokeh Effect Rendering	Huawei	Kirin 9000’s Mali GPU	6/90
Depth Estimation	Raspberry Pi	Raspberry Pi 4	7/70
Learned Smartphone ISP	OPPO	Snapdragon 8 Gen 1	11/140
Image Super-Resolution	Synaptics	Synaptics VS680	28/250
Video Super-Resolution	MediaTek	MediaTek Dimensity 9000	11/160

2021 tracks

Track	Sponsor
Learned Smartphone ISP	MediaTek
Image Denoising	Samsung
Image Super-Resolution	Synaptics
Video Super-Resolution	OPPO
Depth Estimation	Raspberry Pi
Camera Scene Detection	Computer Vision Lab, ETH Zurich, Switzerland

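Planned track: Image Super-Resolution, starting in March -> changed to track: Video Super-Resolution
Trained the 2021 anchor-based plain net (ABPN) twice
- One run stopped with an error at epoch 200
- One run completed all 600 epochs, but the loss fluctuated and did not converge
Tested the TF-Lite model provided by the original authors on Android; the testing workflow is now understood

Mobile AI workshop: Video Super-Resolution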

papers
1. Ntire 2019 challenge on video super-resolution: Methods and results
2. Ntire 2020 challenge on image and video deblurring
3. Pynet-v2 mobile: Efficient on-device photo processing with neural networks
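  - Image Signal Processing (ISP): the phone imaging pipeline is light -> CMOS sensor -> imaging engine (ISP) -> AI (GPU) -> picture. When the lens and CMOS convert the optical signal into a digital signal of 0s and 1s, details may be lost or wrong; the main task of the ISP unit is to "correct", "verify", and "compensate".
  - The mobile version of PyNET, designed for easy mobile deployment, targets end-to-end learned ISP; it is very recent: 2022 26th International Conference on Pattern Recognition (ICPR). IEEE, 2022.
  - CNN based
4. MicroISP: Processing 32MP photos on mobile devices with deep learning. In: European Conference on Computer Vision (2022)
5. Real-Time Video Super-Resolution on Smartphones with Deep Learning, Mobile AI 2021 Challenge: Report
  - Results and Discussion
    - Team Diggers (University of Electronic Science and Technology of China), the winning solution, is based on Keras/TensorFlow and is the only one that uses recurrent connections to exploit inter-frame dependencies for better reconstruction; all other solutions are based on single-frame super-resolution.
6. Power Efficient Video Super-Resolution on Mobile NPUs with Deep Learning, Mobile AI & AIM 2022 challenge: Report
  - tutorial: https://github.com/MediaTek-NeuroPilot/mai22-real-time-video-sr. baseline: MobileRNN
  - scoring: Final Score = α · PSNR + β · (1 - power consumption), with α = 1.66 and β = 50; the score weighs two metrics, PSNR and power consumption
  - Discussion:
    - The majority of models followed a simple single-frame restoration approach to improve the runtime and power efficiency, and the networks are all fairly shallow
    - GenMedia Group (a South Korean company) made small improvements to ABPN, the previous year's winning single-image SR solution; it ranked 6th but had the best PSNR (28.40), one of only two solutions above 28, the other being team 221B's RNN-based method
    - RNN-based solutions are slower at inference and consume more energy
    - Summary: as of 2022, CNNs suit on-device video super-resolution because they balance runtime, energy consumption, and restoration quality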
7. Sliding Window Recurrent Network for Efficient Video Super-Resolution
  - SWRN makes use of the information from neighboring frames to reconstruct the HR frame, which gives richer detail than single-frame super-resolution methods.
  - A bidirectional hidden state is used to recurrently collect temporal spatial relations over all frames.
  - Pioneer network: SRCNN
  - Video super-resolution: the most important parts are frame alignment
    - VESPCN and TOFlow: optical flow to align frames
    - TDAN and EDVR: deformable convolution. Especially, EDVR enjoys the merits of implicit alignment and its PCD module.
    - Incorporates recurrent networks, use the hidden state to record the important temporal information.
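  - On the test platform: runtime 10.1 ms, 0.80 W@30FPS. This is why its final score was low, even though its PSNR and SSIM are both better than those of the first-place MVideoSR (Xiaomi) -> look for ways to speed up computation and reduce energy consumption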
8. Lightweight Video Super-Resolution for Compressed Video -> Compression-informed Lightweight VSR (CILVSR)
  - Recurrent Frame-based VSR Network (FRVSR, RBPN, RRN)
  - Spatio-Temporal VSR Network (SOF-VSR, STVSR, TDAN, TOFlow, TDVSR-L)
  - Generative Adversarial Network (GAN)-based SR Network
  - Video Compression-informed VSR Network (FAST, COMISR, CDVSR, CIAF)
9. RCBSR: Re-parameterization Convolution Block for Super-Resolution
  - ECBSR baseline
  - Multiple paths ECB re-parametrization
  - FGNAS
10. Deformable 3D Convolution for Video Super-Resolution
  - deformable 3D convolution
11. Efficient Image Super-Resolution Using Vast-Receptive-Field Attention (VapSR; torch code available)
  - improving the attention mechanism
    - large kernel convolutions
    - depth-wise separable convolutions
    - pixel normalization -> train steadily
  - Compared with ByteDance's RLFN -> SOTA performance with fewer parameters
12. LiDeR: Lightweight Dense Residual Network for Video Super-Resolution on Mobile Devices (no code)
  - Targets phones, simple structure, REDS 320x180 X4 upscaling -> psnr: 27.51, ssim: 0.769 (open question: was this result actually measured on a phone?)
  - Fast on REDS 320x180 X4 upscaling: 139 FPS, versus FSRCNN at 45 FPS and ESPCN at 52 FPS
  - Test platform: TensorFlow Lite FP16 with the TF-Lite GPU delegate on a Xiaomi Mi 11 (Qualcomm Snapdragon 888 SoC, Qualcomm Adreno 660 GPU, 8 GB RAM)
13. Fast Online Video Super-Resolution with Deformable Attention Pyramid
  - recurrent VSR architecture based on a deformable attention pyramid (DAP)
  - Compared with RRN (mobile_rrn, the official MAI VSR example, is very slow) -> not suitable for MAI VSR
    - Run [ms] / fps [1/s] / FLOPs [G] / MACs [G]:
      - 28 / 35.7 / 387.5 / 193.6
      - 38 / 26.3 / 330.0 / 164.8
2022 challenge methods (ranked)
1. MVideoSR (no code)
  - paper title: ELSR: Extreme Low-Power Super Resolution Network For Mobile Devices
  - affiliation: Video Algorithm Group, Camera Department, Xiaomi Inc., China
  - methods:
    1. core idea: mobile friendly network which consumes as little energy as possible, discard some complex operations such as optical flow, multi-frame feature alignment, and start from single frame baselines.
    2. multi-branch distillation structure show significant increase in energy consumption while a slight increase in PSNR compared with the plain convolutional network of similar parameters. abandon multi-branch network architectures, and focus on plain convolutional SR networks.
    3. though attention modules(ESA, CCA and PA) bring performance improvement, the extra energy consumption introduced is still unacceptable
    4. architecture
      - description: single-frame input network with only 6 layers, of which only 5 have learnable parameters, including 4 Conv layers and a PReLU activation layer. Pixel-Shuffle operation (also known as depth2space) is used at last to upscale the size of output without introducing more calculation. The intermediate feature channels are all set to 6.
2. ZX VIP (no code)
  - paper title: RCBSR: Re-parameterization Convolution Block for Super-Resolution
  - affiliation: Audio & Video Technology Platform Department, ZTE Corp., China
  - methods:
    1. core idea: trade-off between SR quality and the energy consumption, ECBSR as baseline. In consideration of the low power consumption optimize the baseline from three aspects,network architecture, NAS and training strategy.
    2. network architecture:re-parameterization technique in the deploy stage, replace the activate function PReLU with ReLU.the power consumption of tflite model with ReLU is less than PReLU. Meanwhile there is no apparent discrepancy in PSNR.Finally, in order to further reduce power consumption, the output of first CNN layer is added into the backbone output instead of original input because original input needs to be copied the number of channels. We use sub-pixel convolution to upsample image in the network.
    3. NAS: The objective function of FGNAS is task-specific loss and regularizer penalty FLOPs. FGNAS -> Kim, H., Hong, S., Han, B., Myeong, H., Lee, K.M.: Fine-grained neural architecture search. arXiv preprint arXiv:1911.07478 (2019)
    4. training strategy:replace L1 loss function with Charbonnier loss function because it causes the problem that the restored image is too smooth and lack of sense of reality.
    5. architecture
3. Fighter (no code)
  - title: Fast Real-Time Video Super-Resolution
  - affiliation: None, China
  - methods:
    1. shallow CNN model with depthwise separable convolutions and one residual connection. The number of convolution channels in the model was set to 8, the depth-to-space op was used at the end of the model to produce the final output.
    2. architecture
4. XJTU-MIGU SUPER (no code)
  - title: Light and Fast On-Mobile VSR
  - affiliation: School of Computer Science and Technology, Xi’an Jiaotong University, China MIGU Video Co. Ltd, China
  - methods:
    1. small CNN-based model, trained for 2600 epochs in total :(
    2. architecture
5. BOE-IOT-AIBD (no code)
  - title: Lightweight Quantization CNN-Net for Mobile Video Super-Resolution
  - affiliation: BOE Technology Group Co., Ltd., China
  - methods:
    1. based on the CNN-Net architecture, its structure is illustrated in Fig 6. The authors applied model distillation, and used the RFDN CNN as a teacher model.
    2. architecture
6. GenMedia Group (no code)
  - title: SkipSkip Video Super-Resolution
  - affiliation: GenGenAI, South Korea
  - methods:
    1. inspired by the last year’s top solution from the MAI image super-resolution challenge. added one extra skip connection to the mentioned anchor-based plain net (ABPN) model.
    2. architecture
7. NCUT VGroup (no code)
  - title: EESRNet: A Network for Energy Efficient Super Resolution
  - affiliation: North China University of Technology, China Institute of Automation, Chinese Academy of Sciences, China
  - methods:
    1. also based their solution on the ABPN model.
    2. architecture


ideas

Try making BasicVSR++ lightweight
Improve ABPN by adding the main ideas of BasicVSR++
Try applying Pynet_v2 to video super-resolution -> relatively complicated and tailored for ISP, so halted

Train the MRNN baseline first

Environment

Python==3.8.10
Tensorflow-gpu==2.9.0
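- TensorFlow / CUDA / cuDNN / Python version compatibility table: https://www.tensorflow.org/install/source_windows
Cuda==11.2
- CUDA: a computing platform and programming model for accelerating applications on GPUs. The CUDA version refers to the version of the CUDA software
- CUDA Toolkit: the software package containing the CUDA libraries and the CUDA toolchain, used to develop and compile CUDA applications.
  - CUDA libraries: the core libraries needed for CUDA programming, such as the CUDA Runtime library, the CUDA Driver library, and cuBLAS. They provide the basic GPU-accelerated functions and algorithms. cuDNN builds on them but is downloaded and installed separately, as described below.
  - CUDA toolchain: tools that assist in developing and debugging CUDA programs, such as the nvcc compiler, the CUDA-GDB debugger, and the Visual Profiler performance analysis tool.
- note: run nvidia-smi to see the highest CUDA version supported by the installed GPU driver
- note: run nvcc --version to see the CUDA toolchain version
- CUDA Toolkit vs. driver version table: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

Cudnn==v8.7.0

Official site: https://developer.nvidia.com/cudnn
cat /etc/os-release shows the Linux distribution version
uname -m shows the CPU architecture; cuDNN ships builds for different architectures: x86_64, PPC, SBSA
After extracting with tar -xvf, install with the commands below and grant read permission to all users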

#!/bin/bash
sudo cp path_to_cudnn/include/cudnn*    /usr/local/cuda-11.2/include
sudo cp path_to_cudnn/lib/libcudnn*    /usr/local/cuda-11.2/lib64
sudo chmod a+r /usr/local/cuda-11.2/include/cudnn*   /usr/local/cuda-11.2/lib64/libcudnn*

After installing cuDNN and CUDA, set the environment variables PATH and LD_LIBRARY_PATH in /etc/profile

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
export PATH=$PATH:/usr/local/cuda/bin
export CUDA_HOME=/usr/local/cuda

The path /usr/local/cuda can be made a symbolic link to /usr/local/cuda-11.2

ln -s /usr/local/cuda-11.2 /usr/local/cuda

Alternatively, manage CUDA versions with the Linux update-alternatives command-line tool: first register each installed version with sudo update-alternatives --install /usr/local/cuda (link path) cuda (alternative group name) /usr/local/cuda-xx (actual path) x (priority), then switch the default CUDA version with sudo update-alternatives --config cuda. Under the hood this changes the symlink chain /usr/local/cuda -> /etc/alternatives/cuda -> /usr/local/cuda-xx.x
Check the cuDNN version with the command below (this is the command for newer cuDNN releases)

cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2 # -A 2 also prints the 2 lines after each matching line

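Results
1. Training with the default config.yml was too slow (about 1 week), so it was stopped midway
2. Training with the modified config.yml finished in about 8 hours, but the loss was large
3. Based on previous years' challenge reports for this track, gave up training the provided mobilernn baseline and considered other CNN-based models

A baseline can be picked from the NTIRE 2022 efficient super-resolution challenge and adapted to mobile with pruning, distillation, etc.
- course project for NCSU’s Computer Science 791-025: Real-Time AI & High-Performance Machine Learning. The go-to toolkit:
  1. Pruning via NNI
  2. Quantization via NNI
  3. Hyper Parameter Optimization via NNI
  4. Color Optimization: RGB -> YCbCr
- Chose RLFN (ByteDance), the NTIRE 2022 ESR winner, as the baseline, planning to convert its model to TensorFlow and test VSR directly on the REDS dataset -> abandoned the RLFN torch->onnx->tensorflow route because of dependency compatibility problems along the way
- Reimplemented RLFN directly in TensorFlow -> training succeeded but accuracy fell short ('psnr': 25.574987, 'ssim': 0.69084775); needs debugging and improvement
- The top priority was to check whether my TensorFlow RLFN matches the original torch RLFN -> ceased
- Other models could first be converted to TensorFlow with torch_to_tensorflow and inspected visually -> feasible, and the source code is not complicated; the hard part is matching torch / onnx / onnx-tf / tensorflow-gpu versions. Waiting for the official scripts when the competition starts
- The urgent task is not version matching but getting a previous year's baseline running and modifying it -> ran this project to see pruning, quantization, and hyperparameter tuning in practice: https://github.com/briancpark/video-super-resolution.git -> it is all calls into the NNI library
- Train baseline SWRN: https://github.com/shermanlian/swrn
  - Structural re-parameterization: the parameters of one structure are converted into another set of parameters, which are then used to parameterize another structure. As long as the parameter conversion is equivalent, swapping the two structures is equivalent.
  - First tested the provided ckpt-98 -> 'psnr': 27.931335, 'ssim': 0.7803563
  - Reduce block_num of recon_trunk_forward / recon_trunk_backward / recon_trunk to 2 and train from scratch to see the result
- Built the pipeline following ELSR from MVideoSR, last year's track winner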
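  - L1 loss (Mean Absolute Error, MAE) -> mean of the absolute differences between predictions and labels. It is insensitive to outliers and therefore more robust, but its gradient keeps a constant magnitude even for errors near zero, so it does not shrink as training converges and can cause oscillation
  - L2 loss (Mean Squared Error, MSE) -> mean of the squared differences between predictions and labels. It is sensitive to outliers and less robust, but its gradient shrinks as the error decreases, which avoids the oscillation.
  - In TensorFlow, L1 and L2 are built-in losses in the tf.keras.losses module; the Charbonnier loss is not built in and has to be written as a custom function:
    - L1 loss: tf.keras.losses.mean_absolute_error(y_true, y_pred)
    - L2 loss: tf.keras.losses.mean_squared_error(y_true, y_pred)
    - L1 Charbonnier loss: custom implementation following the formula below
    - Relative-error loss: tf.keras.losses.mean_absolute_percentage_error(y_true, y_pred) is the closest built-in, but it averages the absolute relative error (scaled by 100), not the squared relative error of the M2 formula below
    - note: all of these functions take y_true and y_pred, the ground-truth and predicted values.
  - L1 Loss: $L_1 = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$
  - L2 Loss (MSE): $L_2 = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2$
  - L1 Charbonnier Loss: $L_{Charbonnier} = \frac{1}{N} \sum_{i=1}^{N} \sqrt{ \left( y_i - \hat{y}_i \right)^2 + \epsilon^2 }$
  - M2 Loss: $L_{M2} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i - \hat{y}_i}{y_i + \epsilon} \right)^2$, where $\epsilon$ is a small number such as $10^{-6}$ that prevents division by zero.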
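The L1, L2, and Charbonnier formulas above can be checked with a framework-free sketch. The function names, the sample values, and the default `eps=1e-3` are assumptions for illustration; in a training pipeline the same expressions would be written with tensor ops.

```python
import math


def l1_loss(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def l2_loss(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def charbonnier_loss(y_true, y_pred, eps=1e-3):
    """Smooth approximation of L1: sqrt(diff^2 + eps^2), averaged over samples."""
    return sum(math.sqrt((t - p) ** 2 + eps ** 2)
               for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.0, 1.5, 2.0, 5.0]   # one small error and one outlier

print(l1_loss(y_true, y_pred))           # 0.625
print(l2_loss(y_true, y_pred))           # 1.0625
print(charbonnier_loss(y_true, y_pred))  # slightly above the L1 value
```

The outlier (error 2.0) contributes 4.0 to the squared sum but only 2.0 to the absolute sum, which is the robustness difference described above; Charbonnier stays within `eps` of L1 while remaining differentiable at zero error.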
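Look at which operations TFLite accelerates and change the model accordingly
Try porting the torch VapSR from single-image SR to video SR
- A small pitfall in the TensorFlow model code: tf.keras.Sequential([upconv1,pixel_shuffle,lrelu,upconv2,pixel_shuffle]). When two identical pixel_shuffle modules are needed in a tf.keras.Sequential, they must be two separately named instances; otherwise the Sequential always ends up with only one pixel_shuffle module
- 'from .XXX import YYY' is a relative import: the Python interpreter resolves the module or package relative to the current package, so the current file current.py must be inside a Python package (creating an empty __init__.py file makes the folder a Python package)
- [B,H,W,48] - 1×1 conv channel expansion -> [B,H,W,64]; the 1×1 conv weight is a tensor of size [64,48,1,1] (torch layout; the TensorFlow kernel layout is [1,1,48,64])
- Blueprint conv (M: input_channels, N: out_channels); BSConvU: for each of the N output channels, an M×1 weight vector first aggregates the input channels into a single feature map, which is then filtered by one K×K kernel, giving N feature maps in total. This is equivalent to a 1×1 pointwise convolution (M -> N) followed by a K×K depthwise convolution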
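The parameter saving from BSConvU follows directly from its pointwise-plus-depthwise structure. A quick count, ignoring biases and using the 48 -> 64 channel example from the note above with an assumed 3×3 kernel:

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_out * c_in * k * k


def bsconv_u_params(c_in, c_out, k):
    """Weights of BSConvU: 1x1 pointwise (c_in -> c_out) + k x k depthwise on c_out channels."""
    pointwise = c_in * c_out
    depthwise = c_out * k * k
    return pointwise + depthwise


standard = conv_params(48, 64, 3)        # 64 * 48 * 9
bsconv = bsconv_u_params(48, 64, 3)      # 48 * 64 + 64 * 9
pointwise_only = conv_params(48, 64, 1)  # the [64, 48, 1, 1] tensor mentioned above

print(standard, bsconv, pointwise_only)  # 27648 3648 3072
```

For this layer the blueprint-separable variant needs roughly 7.6 times fewer weights than a standard 3×3 convolution.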
Pruning and quantizing VapSR_2
- List comprehension: list = [expression for item in iterable], where expression is the expression added to the list, item is each element of the iterable, and iterable is the object being iterated over. Example: metric_list = [func for name, func in self.metric_functions.items()]
- Wrapped every tf.keras.layers.Conv2D in vapsr_3 with tfmot.sparsity.keras.prune_low_magnitude() for pruning
- Methods defined in a class take self as their first parameter by default; one method calls another through self.method(), not a bare method()
- Keras offers three ways to build a network: keras.models.Sequential(), keras.models.Model(), and subclassing. Note: from TensorFlow 2.x on, the classes tf.keras.Sequential() and tf.keras.Model() can be used directly instead of the keras.models API
  - Keras provides two APIs: the Sequential API and the Functional API. The Sequential API is a simple linear stack of layers that suits many simple models. More complex models, such as those with multiple inputs or outputs, need the Functional API.
  - The Functional API is implemented through tf.keras.Model() and offers a more flexible way to define the model structure and the connections between layers. With it we can create models with multiple inputs and outputs, share layers, and define arbitrary computation graphs, none of which the Sequential API supports.
  - The Functional API is therefore the more flexible and powerful choice for complex models, and tf.keras.Model() gives a convenient, consistent way to define and build them.
- / is true division, e.g. 5 / 2 gives 2.5; it returns a float even when both operands are integers. // is floor division, e.g. 5 // 2 gives 2.
- For a tf.Variable, updating with = and + is a poor choice because it simply creates a new object instead of operating on the existing variable in place. assign() and assign_add() are TensorFlow's in-place operations: they write the result into the existing variable rather than creating a new one.
- Multi-line comments in shell scripts (.sh): << COMMENT ... COMMENT. In shell, << is the Here Document syntax, which embeds a block of text or code in the script.
- Model type changes during pruning
  1. initial: type(self.model) == <class 'VapSR_3.vapsr_3'> (i.e. Keras Subclass Model)
    - A Keras subclass model is a way of creating custom models that is more flexible than the Sequential and Functional APIs: the user defines a Python class that inherits from tf.keras.Model. Its advantages are freedom to build nonlinear, complex model structures and easy reuse of model code.
  2. apply tensorflow.keras.Model() method -> type(functional_model) == <class 'keras.engine.functional.Functional'>
  3. add tfmot.sparsity.keras.prune_low_magnitude() wrapper -> type(pruned_model) == <class 'keras.engine.functional.Functional'>; calling tfmot.sparsity.keras.prune_low_magnitude(functional_model, **pruning_params) directly still raises ValueError: Subclassed models are not supported currently. :(
  4. add tfmot.sparsity.keras.prune_low_magnitude() wrapper with another method -> type(pruned_model_1) == <class 'keras.engine.functional.Functional'>
  5. pruning -> type(pruned_model) == <class 'keras.engine.functional.Functional'>
  6. Although type(pruned_model) == <class 'keras.engine.functional.Functional'>, passing it to stripped_pruned_model = tfmot.sparsity.keras.strip_pruning(pruned_model) raises ValueError: Expected model argument to be a functional Model instance, but got a subclassed model instead: <keras.saving.saved_model.load.BSConvU object at 0x7f66f06fa520>
  7. pruned_model.layers == [<keras.engine.input_layer.InputLayer object at 0x7f66f06fa580>, <keras.saving.saved_model.load.BSConvU object at 0x7f66f06fa520>, <keras.engine.sequential.Sequential object at 0x7f66f06fa4f0>, <keras.layers.convolutional.conv2d.Conv2D object at 0x7f66f0717c40>, <keras.layers.core.tf_op_layer.TFOpLambda object at 0x7f66f80a7fd0>, <keras.engine.sequential.Sequential object at 0x7f66f80a7cd0>]
- The errors when removing the tfmot.sparsity.keras.prune_low_magnitude() wrapper after pruning never stopped -> build VapSR directly as a Functional Model
- Detailed comparison of Layer Norm / Batch Norm / Instance Norm / Pixel Norm
  - Batch Norm: normalizes each feature channel (C), using the mean and variance over all samples in the batch (N) and the spatial dimensions (H, W).
  - Layer Norm: normalizes each sample (N), using the mean and variance over all feature channels (C) and spatial dimensions (H, W).
  - Instance Norm: normalizes each sample (N) and each feature channel (C), using the mean and variance over the spatial dimensions (H, W).
  - Pixel Norm: normalizes each sample (N) and each pixel location (H, W), using the mean and variance over all feature channels (C).
  - The concrete implementation can be discussed at the group meeting (as a short final part)
  - pixel norm in the original VapSR authors' torch implementation
  - ```
   import torch.nn as nn

   # Attention and default_init_weights are defined elsewhere in the VapSR code base.
   class VAB(nn.Module):
       def __init__(self, d_model, d_atten):
           super().__init__()
           self.proj_1 = nn.Conv2d(d_model, d_atten, 1)
           self.activation = nn.GELU()
           self.atten_branch = Attention(d_atten)
           self.proj_2 = nn.Conv2d(d_atten, d_model, 1)
           # LayerNorm over the last (channel) axis acts as the pixel norm
           self.pixel_norm = nn.LayerNorm(d_model)
           default_init_weights([self.pixel_norm], 0.1)

       def forward(self, x):
           shortcut = x.clone()
           x = self.proj_1(x)
           x = self.activation(x)
           x = self.atten_branch(x)
           x = self.proj_2(x)
           x = x + shortcut
           x = x.permute(0, 2, 3, 1)  # (B, H, W, C)
           x = self.pixel_norm(x)     # normalize each pixel over its C channels
           x = x.permute(0, 3, 1, 2).contiguous()  # (B, C, H, W)

           return x
```
  - Reference: https://blog.csdn.net/weixin_39228381/article/details/107939602
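Going by the four definitions listed before the code block, the normalization granularity can be made concrete by counting how many (mean, variance) pairs each scheme computes for an activation of shape (N, H, W, C). The sketch below is framework-free; the function names are hypothetical, and `pixel_norm` omits the learnable scale and shift that `nn.LayerNorm` adds.

```python
def num_statistics(norm, n, h, w, c):
    """Number of (mean, variance) pairs computed for an (N, H, W, C) tensor."""
    groups = {
        "batch": c,          # one pair per channel, shared across N, H, W
        "layer": n,          # one pair per sample, shared across H, W, C
        "instance": n * c,   # one pair per sample and channel, shared across H, W
        "pixel": n * h * w,  # one pair per sample and pixel, shared across C
    }
    return groups[norm]


def pixel_norm(channels, eps=1e-5):
    """Normalize the channel vector of a single pixel (no affine parameters)."""
    mean = sum(channels) / len(channels)
    var = sum((v - mean) ** 2 for v in channels) / len(channels)
    return [(v - mean) / (var + eps) ** 0.5 for v in channels]


for name in ("batch", "layer", "instance", "pixel"):
    print(name, num_statistics(name, n=2, h=4, w=4, c=8))

normalized = pixel_norm([1.0, 2.0, 3.0, 4.0])
print(normalized)
```

This matches the torch code above: `nn.LayerNorm(d_model)` applied to a tensor permuted to (B, H, W, C) normalizes over the last axis only, so it computes one mean and variance per pixel.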
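- x = tf.constant([[1.,2.,4.,5.,7.,8.],[6.,7.,9.,10.,11.,12.],[2.,3.,5.,6.,8.,9.],[4.,5.,7.,8.,10.,11.]])
- mean, variance = tf.nn.moments(x, axes, shift=None, keepdims=False, name=None): the mean and variance are calculated by aggregating the contents of x across axes. Example: for x.shape == [4,2,3], tf.nn.moments(x, 1) gives mean.shape == [4,3]; with keepdims=True it gives mean.shape == [4,1,3]
- Still need to spend time working out how tf LayerNormalization and GroupNormalization behave when axis is a list/tuple of several axes, i.e. how many means and variances are actually computed, and so how to control normalization granularity freely with these two built-in layers. A workaround is probably to use transpose to reorder the axes (the equivalent of torch permute) and get the behavior indirectly.
- *args and **kwargs are both special Python syntax for passing a variable number of arguments. The main difference:
  - *args passes a variable number of positional arguments, delivered to the function as a tuple;
  - **kwargs passes a variable number of keyword arguments, delivered to the function as a dictionary.
- Next: finish the pruning / clustering / quantization pipeline as soon as possible and bring runtime down to about 10 ms
- A return in a recursive function does not end the program with a value; it returns the value to the enclosing recursive call, level by level, until the outermost call returns
- add_pruning_wrapper():
  - Rebuilding the model with Sequential.add() works when the original model is itself a Sequential, but the original model's call() method cannot be carried over
  - Difficulties with in-place replacement via setattr(object, name, new_model):
    1. While recursing, the object and name that own the current tf.keras.layers.Conv2D are unknown
    2. With pruned_model = copy.deepcopy(model), the pruning wrapper is applied on the copied pruned_model; a subclassed tf.keras.Model() class is a custom object, so the methods get_config() and from_config() must all be rewritten
  - Difference between model.__dict__ and dir(model)
    1. model.__dict__ returns a dictionary whose keys are the instance's attribute names (model.__dict__.keys()) and whose values are the attribute values (model.__dict__.values()). dir(model) returns a list of all attribute names of the instance.
    2. Specifically, model.__dict__ contains only attributes defined on the instance itself, not inherited ones, while dir(model) lists both its own and inherited attribute names.
    3. model.__dict__ holds only attributes stored on the instance, while the list from dir(model) may also include read-only properties and methods defined on the class.
  - pruned_model.layers[3] == <keras.layers.convolutional.conv2d.Conv2D object at 0x7ff557c57a60> is a built-in Keras Conv2D layer rather than a custom one inheriting from tf.keras.layers.Layer, so it does not show up in the __dict__ attribute.
- strip_pruning_wrapper():
  - tfmot.sparsity.keras.strip_pruning(): Only sequential and functional models are supported for now.
  - recursively strip pruning wrapper -> success
- lr_scheduler: CosineDecayRestarts
- The pruning_train and clustering_train losses differ greatly from the pretraining loss (50+ vs 10+), which looks wrong
- quantization
  1. tensorflow quantize: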
    - def quantize_scope(*args)
    - def quantize_model(to_quantize, quantized_layer_name_prefix='quant_')
    - def quantize_annotate_model(to_annotate)
    - def _add_quant_wrapper(layer)
    - def quantize_annotate_layer(to_annotate, quantize_config=None)
    - def quantize_apply(model, scheme=default_8bit_quantize_scheme.Default8BitQuantizeScheme(), quantized_layer_name_prefix='quant_')
    - def _extract_original_model(model_to_unwrap)
    - def _quantize(layer)
    - def _unwrap_first_input_name(inbound_nodes)
    - def _wrap_fixed_range(quantize_config, num_bits, init_min, init_max, narrow_range)
    - def _is_serialized_node_data(nested)
    - def _nested_to_flatten_node_data_list(nested)
    - def fix_input_output_range(model, num_bits=8, input_min=0.0, input_max=1.0, output_min=0.0, output_max=1.0, narrow_range=False)
    - def _is_functional_model(model)
    - def remove_input_range(model)
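  2. Difference between * and **, and how they compare with C++ pointers:
    - * and ** are special symbols in Python used for argument passing and for unpacking tuples and dictionaries. They only look like C++ pointer syntax; they do something different.
    - * unpacks a tuple (or any iterable) into individual elements
    - ** unpacks a dictionary into individual key-value pairs
    - In a function definition, * collects a variable number of positional arguments and ** collects a variable number of keyword arguments, e.g.: def foo(*args, **kwargs): …
    - In C++, * declares a pointer variable and ** declares a pointer to a pointer. Python has no such declarations: a Python name is a reference to an object rather than a raw memory address, so there is no pointer arithmetic or pointer type casting as in C++.
      - Unlike in C++, an object reference in Python is a high-level abstraction that hides the object's actual memory address, so Python references and C++ pointers are not the same concept. Memory does not need to be managed explicitly; the Python interpreter handles the details automatically. A Python reference is therefore more like a symbol that maps indirectly to an actual memory address.
  3. self and cls:
    - cls is the conventional name of the first parameter of a class method in Python. It refers to the class itself rather than an instance of the class, analogous to self in an instance method.
    - In a class method, cls is used to access class-level attributes and methods and to create new instances of the class.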
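  4. Fix the bugs and measure the runtime on the phone; targets: PSNR -> 28, SSIM -> 0.8, runtime -> 30ms
    - From VapSR_3_2 onward, the runtime test no longer runs on the phone
    - Create the converter with tf.lite.TFLiteConverter.from_saved_model('path_to_model'); after converting to a tflite model, the model structure can be inspected with netron to analyze possible errors
    - Creating the converter with tf.lite.TFLiteConverter.from_keras_model() or tf.lite.TFLiteConverter.from_saved_model() always runs into two problems
      1. model input_size: [1,1,1,3] output_size[1,1,1,3] is abnormal
      2. Make sure you apply/link the Flex delegate before inference.
      3. In summary, the recommended approach is to save the model in SavedModel format with model.save('path_to_model'), define concrete_func = model.signatures[tf.saved_model.DEFAULT_SERVING_SIGNATURE_DEF_KEY], and use tf.lite.TFLiteConverter.from_concrete_functions([concrete_func]) to avoid both problems
  5. Implement custom quantization strategies for layer activations and weights by combining QuantizeConfig and Quantizer
  6. A context manager manages the context of a block of code
    - Common context managers in Python include the file object returned by open() in with open() as f and the session created by tf.Session() in with tf.Session() as sess.
    - When the with block ends, Python automatically calls the context manager's __exit__ method to make sure resources are released and cleaned up. The context manager can also do initialization work in its __enter__ method. Inside the with block, the object returned by the context manager can be used to operate on the resources in that context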
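The Python notes in this list (model.__dict__ vs dir(model), * / ** unpacking, cls in class methods, and context managers) can be tried out without TensorFlow. The following is a minimal sketch in plain Python; the Resource class and all of its names are hypothetical and only serve as an illustration.

```python
class Resource:
    """Hypothetical resource used to illustrate the Python notes above."""

    opened = 0  # class-level attribute, shared by all instances

    def __init__(self, name):
        self.name = name      # instance attributes live in self.__dict__
        self.closed = False

    @classmethod
    def from_parts(cls, *parts, **options):
        # cls is the class itself. *parts collects positional arguments
        # into a tuple, **options collects keyword arguments into a dict.
        sep = options.get("sep", "_")
        return cls(sep.join(parts))

    def __enter__(self):
        # Initialization work when the with block starts.
        type(self).opened += 1
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Called when the with block ends, even if an exception was raised.
        self.closed = True
        return False  # do not swallow exceptions


parts = ("swat", "3", "5")
options = {"sep": "-"}
# * unpacks the tuple into positional arguments,
# ** unpacks the dict into keyword arguments.
res = Resource.from_parts(*parts, **options)

with res as r:
    inside = r.closed  # still False inside the block

instance_attrs = sorted(vars(res))              # same keys as res.__dict__
has_method_in_dict = "from_parts" in vars(res)  # methods live on the class
has_method_in_dir = "from_parts" in dir(res)    # dir() also lists class names
```

vars(res) only lists name and closed, while dir(res) also lists from_parts, opened and everything inherited from object.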
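Reproducing Qualcomm's torch QuickSRNet 8-bit quantization in tensorflow
- android_aarch64 refers to Android devices based on the 64-bit ARM architecture, also known as ARMv8-A. It is typically used in high-end devices such as smartphones and tablets.
- android_arm refers to Android devices based on the 32-bit ARM architecture. It is typically used in low-end devices such as budget smartphones, tablets and IoT devices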
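A bug that bothered me for at least 3 weeks: TFlite GPU Delegate init Batch size mismatch -> solved
- Based on this link, worked around the tflite gpu delegate's lack of support for fully connected layers in advance, i.e. replaced them with 1*1 convolution layers
- Tested the suspicious modules one by one from scratch and finally located the problem in the pixel norm module (built from tf.reshape and tf.keras.layers.LayerNormalization); replacing it with plain LayerNormalization solved it, and PSNR even improved a little :)
A strange problem: when converting a Functional Model to a tflite model, with import tensorflow.keras.backend as K, using K.clip() in the model always reports that K is undefined -> solved by switching directly to tf.keras.backend.clip()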
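Low GPU utilization when training the small Mobile VSR model
1. Not caused by slow CPU data loading and preprocessing; adding threads did not help
2. Not caused by a large batch size either; reducing the batch size did not help
3. There are probably two ways to raise GPU utilization: one is to enlarge the model, the other is to use the nvidia DALI data loading acceleration library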
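Receptive field calculation
- Suppose the input image size is $W_{in}\times H_{in}$, the convolution kernel size is $k\times k$, the stride is $s$, and the receptive field of the current convolution layer is $F_{in}$. The receptive field $F_{out}$ of the next layer is then:
  
  $F_{out} = F_{in} + (k - 1) \times \text{dilation rate}$
  
  where $\text{dilation rate}$ is the dilation rate of the kernel; without dilated convolution, $\text{dilation rate} = 1$. This form holds while all preceding layers have stride 1; in general the added term is also multiplied by the product of the strides of the preceding layers. If the next layer is a pooling layer, then $s = k$ and dilation is not considered.
  
  Take an input image of size $224\times 224$ and a first convolution layer with a $3\times 3$ kernel, stride 1 and no dilation. The receptive field of the first convolution layer is then $3$.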
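The recursion above can be sketched as a small helper. This is a minimal sketch, not the code used for the experiments: the function name and the (kernel_size, stride, dilation) layer encoding are assumptions, and the cumulative stride factor (jump) is the standard extension of the formula for layers that follow a strided layer.

```python
def receptive_field(layers):
    """Return the receptive field after a stack of conv/pool layers.

    Each layer is a (kernel_size, stride, dilation) tuple (hypothetical
    encoding chosen for this sketch).
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # product of the strides of all preceding layers
    for kernel_size, stride, dilation in layers:
        # F_out = F_in + (k - 1) * dilation * (product of earlier strides)
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf


first_conv = receptive_field([(3, 1, 1)])             # one 3x3 conv
two_convs = receptive_field([(3, 1, 1), (3, 1, 1)])   # two stacked 3x3 convs
dilated = receptive_field([(3, 1, 2)])                # 3x3 conv, dilation 2
# 3x3 conv, then 2x2 pooling with stride 2, then another 3x3 conv
conv_pool_conv = receptive_field([(3, 1, 1), (2, 2, 1), (3, 1, 1)])
```

With stride 1 everywhere the helper reduces to the formula above; the first 3x3 convolution gives a receptive field of 3, matching the worked example.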

results

milestone_0:

Model	Description	Dataset	Val PSNR	Val SSIM	Params	Runtime on oneplus7T [ms]
SWRN_0	Origin	REDS	27.931335	0.7803562	43,472	25.6
SWRN_1	recon_trunk block num=2	REDS	27.820051	0.77666414	36,512	26.9
ELSR_0(vsr 22 winner)	Origin	REDS	26.716854	0.73988235	3,468	19.3
RLFN_0(esr 22 winner)	Origin	REDS	26.78721	0.7389487	306,992	-
VapSR_0	Origin	REDS	28.103758	0.7864979	154,252	5191.0
VapSR_1	Replace feature extraction conv and VAB's 2 conv1X1 with blueprint conv	REDS	28.02941	0.7845887	155,916	5798.0
VapSR_2	Replace feature extraction conv with blueprint conv and reduce Attention's kernel size=3X3	REDS	28.021387	0.7831156	131,276	2694.0
VapSR_3	Correct custom realization of pixel normalization	REDS	28.018507	0.7836466	131,276	-
VapSR_3_1	Reduce VAB blocks from 11 to 5	REDS	27.826998	0.7771207	73,484	1222.0
VapSR_3_2	Realize Pixel Normalization with tf.reshape() and tf.keras.layers.LayerNormalization(); Reduce VAB blocks from 5 to 4	REDS	27.550034	0.7687168	64,108	error
VapSR_4	apply pruning, weights clustering to conv kernels	REDS	27.833515(suspect)	0.7771123(suspect)	32,054(64,108)
VapSR_4_2_0	Functional VapSR_4 with pixel norm realized by layer normalization, VAB activation: GELU	REDS	27.666351	0.77187574	64,108
VapSR_4_2_1	Functional VapSR_4 with pixel norm realized by layer normalization, VAB activation: RELU	REDS	27.539206	0.7669671	64,108
VapSR_4_3	Functional VapSR_4 with custom pixel normalization, without layer normalization	REDS	27.651005	0.7715401	63,852

AI benchmark setting for Runtime test:

Input Values range(min,max): 0,255
Inference Mode: FP16
Acceleration: TFLite GPU Delegate

milestone_1:

Model	Description	Dataset	Val PSNR	Val SSIM	Params	Runtime on oneplus7T [ms]	FLOPs [G]
VapSR_4_1	Functional VapSR_4 with pixel norm realized by layer normalization, VAB activation: RELU, Attention using Partial conv	REDS	27.790268	0.77721727	59,468	654.0 (INT8_CPU)	7.462
SWAT_0	Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=4)	REDS	27.842232	0.77754354	50,624	271.0 (FP16_CPU)	5.803
SWAT_1	Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=2), replace fc with 1*1 conv	REDS	27.759375	0.77492595	33,984	252.0 (FP16_CPU)	3.900
SWAT_2	Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=1), replace fc with 1*1 conv	REDS	27.760305	0.77487457	25,664	-	-
SWAT_3	Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=1), replace fc with 1*1 conv, replace pixel normalization with layer normalization	REDS	27.761642	0.7748446	25,664	27.8 (FP16_TFLite GPU Delegate)	2.949
SWAT_3_1	Sliding Window, VAB Attention(large receptive field=17), Partial Conv(point_wise: standard, depth_wise: groups=out_dim), Channel Shuffle(mix_ratio=1), replace fc with 1*1 conv, replace pixel normalization with layer normalization	REDS	27.754656	0.77461684	24,160	30.0 (FP16_TFLite GPU Delegate)	2.776
SWAT_3_2	Sliding Window, VAB Attention(receptive field=17), Partial Conv(feature fusion maintains standard conv 1*1), Channel Shuffle(mix_ratio=2), replace fc with 1*1 conv, replace pixel normalization with layer normalization	REDS	27.74189	0.7742521	26,016	32.4 (FP16_TFLite GPU Delegate)	2.996
SWAT_4	Sliding Window, VAB Attention, Replace partial conv with standard convolution, Remove Channel Shuffle, replace pixel normalization with layer normalization	REDS	27.785185	0.77523285	53,696	38.5 (FP16_TFLite GPU Delegate)	6.202
SWAT_5	Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=1), replace fc with 1*1 conv, replace pixel normalization with layer normalization, enlarge train step numbers to 250,000	REDS	27.811176	0.7763541	25,664	27.6 (FP16_TFLite GPU Delegate)	2.949
SWAT_6	Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=1), replace fc with 1*1 conv, replace pixel normalization with layer normalization, enlarge train step numbers to 150,000, Remove convs of hidden forward/backward	REDS	27.738842	0.7743317	21,056	23.6 (FP16_TFLite GPU Delegate)	2.417

AI benchmark setting for Runtime test:

Input Values range(min,max): 0,255
Inference Mode: INT8/FP16
Acceleration: CPU/TFLite GPU Delegate

OnePlus Recovery

Posted on 2023-02-13, updated on 2024-12-02

Procedure

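Back up data: photo albums, list of installed apps
Root unlock
Flash the TWRP Recovery image via ADB
Flash the ROM (3 methods)
1. Find the phone in the computer's drive list and copy the ROM to the phone's internal storage. When the copy finishes, tap Install in the recovery main menu, tap the ROM package, and swipe to flash.
2. Push the ROM to the phone from the command line (filename.zip is the ROM name; you can drag the ROM file into the command window to fill in the full path automatically, or type the first few letters of the file name and press Tab to autocomplete it)
  - Check that the device is connected: adb devices
  - Push the ROM: adb push ROM_name.zip /sdcard/
  - When the push finishes, tap Install in the recovery main menu, tap the ROM package, and swipe to flash.
3. In the recovery main menu, tap "Advanced", tap ADB sideload, swipe the button at the bottom, then run in a PowerShell window: .\adb sideload ROM_name.zip
Flash Magisk
1. Purpose:
  - Devices that are not Google "certified" cannot download and install Netflix from the Play Store, apps such as Google Pay and Pokémon Go do not run properly on rooted devices, and ROMs with modified system files cannot receive normal OTA updates through OEM channels…
  - The way Magisk works is a bit like magic: when the mounted Magisk partition is hidden or even unmounted, the integrity of the original system partition is left completely intact, so playing games that check for root, running apps that require a certified device, and even installing OTA updates that verify system integrity all work without problems.
2. Method:
  1. Before flashing, install the Magisk App first to check the device information and decide on the next steps. Download the apk file from the official project page, https://github.com/topjohnwu/Magisk/releases, and install it.
    - Tip: since Magisk 22, the flashable .zip package and the .apk installer for the manager are no longer separate; they are the same file and differ only in extension. The .apk is provided by default, and renaming its extension to .zip makes it flashable.
  2. Open the installed Magisk App. As in the last screenshot above, you will see an item named Ramdisk. Make sure its value is "Yes" / "True" before going on to the next step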

Mobile Video Super-Resolution Work Log

Posted on 2023-02-07, updated on 2024-12-02

Time:2023.2.7-2023.4.15

Paper Reading

PD-Quant: Post-Training Quantization based on Prediction Difference Metric (2022.12)
- Analyzes the Local Metrics (MSE or cosine distance of the activation before and after quantization in layers) used to optimize the quantization parameters S and Z
- PD Loss: introduces the Prediction Difference to determine the Activation Scaling Factors
- Distribution Correction (DC): adjusts the intermediate activation distribution on the calibration dataset
Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation
Data-Free Network Compression via Parametric Non-uniform Mixed Precision Quantization (CVPR 2022)
Computer Vision – ECCV 2022 Workshops
- Learning Multiple Probabilistic Degradation Generators for Unsupervised Real World Image Super Resolution (unsupervised image super-resolution)
- Evaluating Image Super-Resolution Performance on Mobile Devices: An Online Benchmark (benchmark for directly deploying SR models)
- Efficient Image Super-Resolution Using Vast-Receptive-Field Attention (image super-resolution with large-receptive-field attention)
- DSR: Towards Drone Image Super-Resolution (drone image super-resolution)
- Image Super-Resolution with Deep Variational Autoencoders (variational autoencoders for SISR)
- Light Field Angular Super-Resolution via Dense Correspondence Field Reconstruction (light field angular super-resolution)
- CIDBNet: A Consecutively-Interactive Dual-Branch Network for JPEG Compressed Image Super-Resolution (JPEG compressed image super-resolution)
- XCAT - Lightweight Quantized Single Image Super-Resolution Using Heterogeneous Group Convolutions and Cross Concatenation (single image super-resolution)
- RCBSR: Re-parameterization Convolution Block for Super-Resolution (structural re-parameterization for video super-resolution)
- Multi-patch Learning: Looking More Pixels in the Training Phase (multi-patch training strategy for SISR)
- Fast Nearest Convolution for Real-Time Efficient Image Super-Resolution (Nearest Convolution replaces copying the original image for the depth_to_space operation)
- Real-Time Channel Mixing Net for Mobile Image Super-Resolution (single image super-resolution: channel mixing using 1*1 conv)
- Sliding Window Recurrent Network for Efficient Video Super-Resolution (video super-resolution)
- EESRNet: A Network for Energy Efficient Super-Resolution (video super-resolution)
- HST: Hierarchical Swin Transformer for Compressed Image Super-Resolution (compressed image super-resolution)
- Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration (compressed image super-resolution)
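Video Super-Resolution With Convolutional Neural Networks (2016)
- Simply concatenates the current frame with its neighboring frames to improve super-resolution quality
Frame-Recurrent Video Super-Resolution (2017)
- Uses the HR result predicted for the previous frame to compensate the super-resolution of the current frame
Enhanced Deep Residual Networks for Single Image Super-Resolution (2017)
- ResBlock: uses fewer activations such as ReLU than earlier work
- Upsample: conv+shuffle
TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution (CVPR2020)
- A temporally deformable convolution alignment network used to reduce super-resolution artifacts
Efficient Reference-based Video Super-Resolution (ERVSR): Single Reference Image Is All You Need (WACV2023)
- Super-resolves the whole low-resolution video sequence with a single reference frame: instead of using the LR frame at every time step as reference, only the frame at the center time step is used
- Similarity estimation and alignment are based on an attention mechanism
- Motivation: faster inference and lower memory consumption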
BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond(CVPR2021)
BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment(CVPR2022)
Multi-scale attention network for image super-resolution(ECCV2018)
- Multi-scale cross block (MSCB): 3 parallel convolutions with different dilations extract features, which are then fused
- Multi-path wide-activated attention block (MWAB): 3 parallel branches, convolution + spatial attention + channel attention, concatenated
- Drawback: the global average pooling used by conventional channel attention does not necessarily account for inter-channel correlation correctly
Deep Video Super-Resolution using Hybrid Imaging System (2023)
- Task: reconstruct an HR high-frame-rate video from an LR high-frame-rate video (main video) and an HR low-frame-rate video (auxiliary video)
- The model has 3 parts:
  1. Super-resolve the main video to produce the base HR frames
  2. Extract detail features from the auxiliary video and align them
  3. Aggregate and fuse the information from both videos
STDAN: Deformable Attention Network for Space-Time Video Super-Resolution (2023)
- Deformable attention network
- Long short-term feature interpolation (LSTFI)
- Spatial-temporal deformable feature aggregation (STDFA)
ShuffleMixer: An Efficient ConvNet for Image Super-Resolution (NTIRE2022)
- Large convolution and channel split-shuffle operation: large kernels combined with channel split and shuffle
- Add the Fused-MBConv after every two shuffle mixer layers to overcome insufficient local feature extraction
An Implicit Alignment for Video Super-Resolution (ArXiv 2023)
- static upsample evolution: makes static interpolation upsampling such as bilinear and nearest interpolation dynamic
- implicit attention based alignment integrated with local window key&value position encoding and query (motion estimation/flow) position encoding: attention-based implicit alignment combined with local-window key/value position encoding and motion-compensation position encoding
Rethinking Alignment in Video Super-Resolution Transformers (NeurIPS 2022)
- Element-wise (Hadamard) product: tf.multiply(A,B) = A * B
- Matrix product: tf.matmul(A,B) = A @ B
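To make the difference between the two products concrete, here is a minimal sketch in plain Python. The helpers elementwise and matmul are hypothetical stand-ins for tf.multiply and tf.matmul on 2-D lists, written so the example runs without TensorFlow; they are not TensorFlow code.

```python
def elementwise(a, b):
    """Element-wise product of two equally shaped 2-D lists (like A * B)."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def matmul(a, b):
    """Matrix product of two 2-D lists (like A @ B)."""
    columns = list(zip(*b))  # transpose b so its columns can be iterated
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

hadamard = elementwise(A, B)  # multiplies matching entries
product = matmul(A, B)        # rows of A dotted with columns of B
```

The element-wise product needs both operands to have the same shape, while the matrix product needs the number of columns of A to equal the number of rows of B.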

Idea

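Find the problems in earlier papers and try to improve on or solve them -> competing on runtime alone is a losing game
transformer PTQ -> not considered for now; focus on improving performance for the workshop
From the standpoint of getting a paper out of the first work, consider applying new compression ideas to MAI video super resolution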
- dataset -> train: REDS, test: REDS4(Clips 000, 011, 015, 020 of REDS training set)
- mobile video super resolution related paper
- frontier -> Optical Flow
Try blind video super resolution -> abandoned

Compared Solutions	Model Size, KB	PSNR	SSIM	Runtime, ms
MVideoSR	17	27.34	0.7799	3.05
ZX_VIP	20	27.52	0.7872	3.04
Fighter	11	27.34	0.7816	3.41
XJTU-MIGU SUPER	50	27.77	0.7957	3.25
BOE-IOT-AIBD	40	27.71	0.7820	1.97
GenMedia Group	135	28.40	0.8105	3.10
NCUT VGroup	35	27.46	0.7822	1.39
Mortar ICT	75	22.91	0.7546	1.76
RedCat AutoX	62	27.71	0.7945	7.26
221B	186	28.19	0.8093	10.1

Survey the latest research progress on the datasets REDS / Vimeo-90K / Vid4 / UDM10 / SPMCS / RealVSR

Paper	Source	Training Set	Testing Set
Achieving on-Mobile Real-Time Super-Resolution with Neural Architecture and Pruning Search	ICCV 2021	DIV2K	Set5, Set14, B100 and Urban100
LiDeR: Lightweight Dense Residual Network for Video Super-Resolution on Mobile Devices	2022 IEEE 14th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP)	REDS	REDS
Cross-Resolution Flow Propagation for Foveated Video Super-Resolution	Winter Conference on Applications of Computer Vision. 2023	REDS	REDS
Online Video Super-Resolution with Convolutional Kernel Bypass Graft	arxiv 2022.8	REDS	REDS
Real-Time Super-Resolution for Real-World Images on Mobile Devices	2022 IEEE 5th International Conference on Multimedia Information Processing and Retrieval (MIPR)	DIV2K	DIV2K, Set5, Set14, BSD100, Manga109, and Urban100
Towards High-Quality and Efficient Video Super-Resolution via Spatial-Temporal Data Overfitting	CVPR 2023	VSD4K	VSD4K
Rethinking Alignment in Video Super-Resolution Transformers	NeurIPS 2022	REDS	REDS

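SWAT's PSNR should ideally be pushed above 28; finish pruning, weight clustering, INT8/FP16 quantization
Test the tensorflow model and the tflite model after finetuning ->
Compared methods must be evaluated under the same settings -> set up a comparison leaderboard
Experiment: keep the overall SWRN framework and swap in a VAB built on Partial Standard Conv -> PSNR: 27.76, no clear improvement
Read up to understand: how the attention mechanism is implemented, how it works, and whether it needs to be cascaded/stacked
Apply MobileOne structural re-parameterization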

Metrics

Full-Reference

Peak Signal to Noise Ratio (PSNR)
Structural SIMilarity (SSIM)
Gradient Magnitude Similarity Deviation (GMSD)
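As a reminder of what the PSNR columns in the result tables measure, here is a minimal sketch of the standard definition, PSNR = 10 * log10(MAX^2 / MSE). The function name, the flat-list input and the default peak value of 255 are assumptions made for this sketch; it is not the evaluation code behind the reported numbers.

```python
import math


def psnr(reference, distorted, max_value=255.0):
    """Peak Signal to Noise Ratio in dB between two equally sized pixel lists."""
    if not reference or len(reference) != len(distorted):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [10, 20, 30, 40]
distorted = [11, 19, 31, 39]          # every pixel is off by one, so MSE = 1
value = psnr(reference, distorted)    # equals 20 * log10(255) when MSE = 1
identical = psnr(reference, reference)
```

Because PSNR is logarithmic in the MSE, small gains in dB correspond to a noticeable relative drop in squared error.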

No-Reference

Naturalness Image Quality Evaluator (NIQE)
Blind/Referenceless Image Spatial QUality Evaluator (BRISQUE)
Distortion Identification-based Image Verity and INtegrity Evaluation (DIIVINE)
BLind Image Integrity Notator using DCT-Statistics (BLIINDS)

Results

Milestone_0

Model	Description	Dataset	Val PSNR	Val SSIM	Params	Runtime on oneplus7T [ms]	FLOPs [G]
VapSR_4_1	Functional VapSR_4 with pixel norm realized by layer normalization, VAB activation: RELU, Attention using Partial conv	REDS	27.790268	0.77721727	59,468	654.0 (INT8_CPU)	7.462
SWAT_0	Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=4)	REDS	27.842232	0.77754354	50,624	271.0 (FP16_CPU)	5.803
SWAT_1	Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=2), replace fc with 1*1 conv	REDS	27.759375	0.77492595	33,984	252.0 (FP16_CPU)	3.900
SWAT_2	Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=1), replace fc with 1*1 conv	REDS	27.760305	0.77487457	25,664	-	-
SWAT_3	Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=1), replace fc with 1*1 conv, replace pixel normalization with layer normalization	REDS	27.761642	0.7748446	25,664	27.8 (FP16_TFLite GPU Delegate)	2.949
SWAT_3_1	Sliding Window, VAB Attention(large receptive field=17), Partial Conv(point_wise: standard conv, depth_wise: group conv), Channel Shuffle(mix_ratio=1), replace fc with 1*1 conv, replace pixel normalization with layer normalization	REDS	27.761642	0.7748446	25,664	27.8 (FP16_TFLite GPU Delegate)	2.949
SWAT_3_2	Sliding Window, VAB Attention(large receptive field=17), Partial Conv(point_wise: standard conv, depth_wise: group conv), Channel Shuffle(mix_ratio=2), replace fc with 1*1 conv, replace pixel normalization with layer normalization	REDS	27.74189	0.7742521	26,016	32.4 (FP16_TFLite GPU Delegate)	2.996
SWAT_4	Sliding Window, VAB Attention, Replace partial conv with standard convolution, Remove Channel Shuffle, replace pixel normalization with layer normalization	REDS	27.785185	0.77523285	53,696	38.5 (FP16_TFLite GPU Delegate)	6.202
SWAT_5	Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=1), replace fc with 1*1 conv, replace pixel normalization with layer normalization, enlarge train step numbers to 250,000	REDS	27.811176	0.7763541	25,664	27.6 (FP16_TFLite GPU Delegate)	2.949
SWAT_6	Sliding Window, VAB Attention, Partial Conv, Modified Channel Shuffle (mix_ratio:1), Remove convs of hidden forward/backward	REDS	27.738842	0.7743317	21,056	- (FP16_TFLite GPU Delegate)	2.417
SWAT_7	Sliding Window, 3-branch VAB Attention, Partial Conv, Remove Channel Shuffle, Replace pixel normalization with layer normalization	REDS	27.645552	0.77121794	18,144	- (FP16_TFLite GPU Delegate)	2.090
SWAT_8	Sliding Window, VAB Attention modified 2	REDS	27.782675	0.77573705	45,424	- (FP16_TFLite GPU Delegate)	5.200
SWAT_9	Sliding Window, Non Activation Block	REDS	27.636255	0.7709387	23,648	288.0 (FP16_TFLite GPU Delegate)	2.113

AI benchmark setting for Runtime test:

Input Values range(min,max): 0,255
Inference Mode: INT8/FP16
Acceleration: CPU/TFLite GPU Delegate

Milestone_1

Model	Description	Dataset	Val PSNR	Val SSIM	Params	Runtime on oneplus7T [ms]	FLOPs [G]
SWAT_3_3	Sliding Window, VAB Attention(large receptive field=13 with channel shuffle[Dense(units)]), Partial Conv(standard conv), Replace pixel normalization with layer normalization	REDS	27.761633	0.7752705	27,472	30.3 (FP16_TFLite GPU Delegate)	3.165
SWAT_3_4	Sliding Window, VAB Attention(large receptive field=13 without channel shuffle[Dense(units)], stack 2 blocks), Partial Conv(standard conv), Replace pixel normalization with layer normalization	REDS	27.80347	0.77701694	32,832	40.9 (FP16_TFLite GPU Delegate)	3.798
SWAT_3_5	Sliding Window, VAB Attention(large receptive field=17 without channel shuffle[Dense(units)], stack 2 blocks, pointwise conv for channel fusion without PConv), Partial Conv(standard conv), Replace pixel normalization with layer normalization	REDS	27.840628	0.7774375	37,312	39.4 (FP16_TFLite GPU Delegate)	4.302
SWAT_3_6	Sliding Window, VAB Attention(large receptive field=17 without channel shuffle[Dense(units)], stack 2 blocks, pointwise conv for channel fusion without PConv), Partial Conv(standard conv), Replace pixel normalization with layer normalization, Shallow feature extraction using standard conv	REDS	27.8165	0.7774126	42,624	40.7 (FP16_TFLite GPU Delegate)	4.916
SWAT_3_7	Sliding Window, VAB Attention(large receptive field=17 without channel shuffle[Dense(units)], stack 2 blocks, pointwise conv for channel fusion without PConv), Partial Conv(standard conv), Replace pixel normalization with layer normalization, Remove concat and unpack of hidden state	REDS	27.182861	0.7562948	29,136	29.0 (FP16_TFLite GPU Delegate)	3.357
SWAT_3_8	Sliding Window, VAB Attention(large receptive field=17 without channel shuffle[Dense(units)], stack 2 blocks, pointwise conv for channel fusion without PConv), Partial Conv(standard conv), Replace pixel normalization with layer normalization, Remove concat and unpack of hidden state, Increase channels of fusion attention	REDS	27.564552	0.7688081	56,032	39.1 (FP16_TFLite GPU Delegate)	6.456
SWAT_3_9	Sliding Window, VAB Attention and IMDB hybrid	REDS	27.95189	0.7806478	53,312	42.8 (FP16_TFLite GPU Delegate)	6.367
SWAT_3_10	Sliding Window, Finetuned VAB Attention	REDS	27.846352	0.77762717	53,512	49.0 (FP16_TFLite GPU Delegate)	6.170
ABPN_0	Origin	REDS	27.92307	0.779504	62,048	38.1/35.7 (INT8/FP16_TFLite GPU Delegate)	7.137
ABPN_1	GenMedia Group Modified(L1 Charbonnier loss; crop_size:64)	REDS	27.858198	0.7780704	58,304	37.1/33.0 (INT8/FP16_TFLite GPU Delegate)	6.699
ABPN_2	GenMedia Group Modified(MAE loss; crop_size:96)	REDS	27.875465	0.7783027	58,304	37.1/33.0 (INT8/FP16_TFLite GPU Delegate)	6.699
AFAVSR_0	Multiple frames aggregation attention (num_feat=48, d_atten=64, num_blocks=2)	REDS	27.837406	0.77741796	68,368	44.5 (FP16_TFLite GPU Delegate)	7.872
AFAVSR_1	Multiple frames aggregation attention (num_feat=16, d_atten=32, num_blocks=8)	REDS	27.829765	0.7763255	44,016	36.6 (FP16_TFLite GPU Delegate)	5.069
AFAVSR_2	Multiple frames aggregation attention (num_feat=16, d_atten=32, num_blocks=2)	REDS				(FP16_TFLite GPU Delegate)
AFAVSR_3	All batch frames aggregation attention (num_feat=32, d_atten=64, num_blocks=2)	REDS	-	-	-	-	-
SORT_0	Sliding Window, IMDB	REDS	27.738451	0.77409536	17,356	20.6 (FP16_TFLite GPU Delegate)	2.084
SORT_1	Sliding Window, IMDB, ConvTail num_out_channel=48	REDS	27.75588	0.7749552	19,660	21.6 (FP16_TFLite GPU Delegate)	2.351
SORT_2	Sliding Window, IMDB, multi-branch distillation channel num hyperparameter tuning	REDS	27.93981	0.7808094	45,264	35.6 (FP16_TFLite GPU Delegate)	5.385
SORT_3	Sliding Window, IMDB, multi-branch distillation channel num hyperparameter tuning, Replace SEL with CCA (Contrast-Aware Channel Attention)	REDS	27.867216	0.7790734	39,144	35.3 (FP16_TFLite GPU Delegate)	4.414
SORT_4	Sliding Window, Modified IMDB equipped with channel attention mechanism	REDS	27.769545	0.7755401	39,120	41.3(FP16_TFLite GPU Delegate)	5.725
SORT_5	Sliding Window, Modified IMDB equipped with larger channel width and channel reduction/aggregation using 1*1 convs	REDS	28.13419	0.78656757	166,944	85.9(FP16_TFLite GPU Delegate)	19.566
SORT_6	Sliding Window, Modified IMDB equipped with dynamic channel width	REDS	27.944357	0.7809873	48,216	- (FP16_TFLite GPU Delegate)	-

AI benchmark setting for Runtime test:

Input Values range(min,max): 0,255
Inference Mode: INT8/FP16
Acceleration: CPU/TFLite GPU Delegate

Milestone_2

Model	Description	Dataset	Val PSNR	Val SSIM	Params	Runtime on oneplus7T [ms]	FLOPs [G]
VSR_0	Sliding Window, Non Activation Block	REDS	27.673386	0.7725643	26,368	57.9 (FP16_TFLite GPU Delegate)	2.417
VSR_1	Attention Alignment_0, Non Activation Block	REDS	27.508242	0.76671493	17,440	42.0 (FP16_TFLite GPU Delegate)	1.677
VSR_2	Attention Alignment_1, Non Activation Block,Rectify BSConvolution	REDS	27.53437	0.7678055	17,776	error (FP16_TFLite GPU Delegate)	2.035
VSR_3	VSR_2 Ablation: Attention Alignment_1	REDS	27.414068	0.76361054	17,413	60.3 (FP16_TFLite GPU Delegate)	1.793
VSR_4	VSR_2 -> modify Non Activation Block using partial conv	REDS	27.784992	0.7769825	43,120	error (FP16_TFLite GPU Delegate)	4.958
VSR_5	VSR_4 Ablation: RGB out channels sharing upsample result	REDS	27.835686	0.7776796	47,728	- (FP16_TFLite GPU Delegate)	5.491
VSR_6	VSR_5 Finetune: Non Activation Block channel numbers modify	REDS	27.783165	0.7768693	28,976	error (FP16_TFLite GPU Delegate)	3.699
VSR_7	Light weight hidden states attention alignment; Blue Print convolution for shallow feature extraction; Multi-Stage ExcavatoR(MSER) combined with partial convolution and simplified channel attention	REDS	27.470276	0.7664948	81,806	66.1 (FP16_TFLite GPU Delegate)	7.938
VSR_8	Light weight hidden states attention alignment; Blue Print convolution for shallow feature extraction; Nonlinear activation free block	REDS	27.91092	0.77971315	66,312	64.7 (FP16_TFLite GPU Delegate)	7.269
VSR_9	vsr_9 ablation: feature alignment	REDS	27.91092	0.77971315	39,792	44.5 (FP16_TFLite GPU Delegate)	4.218
VSR_10	motivation: IMDB + PartialConv + VapSR + BSConv	REDS	27.963232	0.780958	44,256	48.6 (FP16_TFLite GPU Delegate)	5.103
VSR_11	VSR_10 ablation: hidden state conv using bias	REDS	27.948818	0.7809571	44,288	47.2 (FP16_TFLite GPU Delegate)	5.103
VSR_12	VSR_10 ablation: hidden state process using modified IMDB	REDS	27.953104	0.7807622	57,696	62.8 (FP16_TFLite GPU Delegate)	6.649

AI benchmark setting for Runtime test:

Input Values range(min,max): 0,255
Inference Mode: INT8/FP16
Acceleration: CPU/TFLite GPU Delegate

Milestone_3

Model	Description	Dataset	Val PSNR	Val SSIM	Params	Runtime on oneplus7T [ms]	FLOPs [G]
MVSR_0	modified IMDB IMDB + PartialConv + VapSR + BSConv; deprecate hidden state forward and backward; light weight feature alignment	REDS	27.915539	0.7799377	35,777	38.8 (FP16_TFLite GPU Delegate)	4.068
MVSR_1	modified IMDB IMDB + PartialConv + VapSR + BSConv; deprecate hidden state forward and backward; light weight frame alignment	REDS	27.932716	0.7810435	34,473	44.3 (FP16_TFLite GPU Delegate)	3.976
MVSR_2	MVSR_1 Ablation: light weight frame alignment	REDS	27.929586	0.78039753	34,208	35.4 (FP16_TFLite GPU Delegate)	3.944
MVSR_3	MVSR_1 Ablation: large receptive field in SMDB -> reduce: 3x3 + 3x3 dilated	REDS	27.892586	0.7790079	32,169	41.1 (FP16_TFLite GPU Delegate)	3.711
MVSR_4	MVSR_2 Ablation: large receptive field in SMDB -> increase: 7x7 + 7x7 dilated	REDS	27.958328	0.78145003	37,664	42.4 (FP16_TFLite GPU Delegate)	4.343
MVSR_5	MVSR_1 Ablation: large receptive field in SMDB -> increase: 7x7 + 7x7 dilated	REDS	27.936714	0.7809204	37,929	49.8 (FP16_TFLite GPU Delegate)	4.375
MVSR_6	modified IMDB IMDB + PartialConv based pixel attention version_0 + VapSR + BSConv; light weight frame alignment	REDS	27.884369	0.7790964	34,473	44.4 (FP16_TFLite GPU Delegate)	4.246
MVSR_7	modified IMDB IMDB + PartialConv based pixel attention version_1 + VapSR + BSConv; light weight frame alignment	REDS	27.858534	0.77831227	35,769	44.5 (FP16_TFLite GPU Delegate)	4.387
MVSR_8	MVSR_1 Ablation: SEL -> Channel Attention	REDS	27.610485	0.7696045	29,145	40.5 (FP16_TFLite GPU Delegate)	3.001
MVSR_9	MVSR_1 Ablation: Channel fuse + SEL -> FlashModule + Channel fuse	REDS	28.043566	0.7842476	96,249	72.8 (FP16_TFLite GPU Delegate)	10.684
MVSR_10	Partial conv idea applied to MSDB and Attention(i.e. SEL)	REDS	27.86422	0.7783118	27,081	41.2 (FP16_TFLite GPU Delegate)	3.031
MVSR_11	MVSR_10 finetune: deprecate MSDB's channel fuse; add MSDB blocks	REDS	27.90566	0.7793118	32,553	48.7 (FP16_TFLite GPU Delegate)	3.634
MVSR_12	MVSR_11 ablation: MSDB's group convolution -> standard convolution	REDS	27.953104	0.7807622	68,169	38.4 (FP16_TFLite GPU Delegate)	7.737
MVSR_13	MVSR_12 AttentionAlign module evolution	REDS	27.966156	0.7809557	68,157	39.8 (FP16_TFLite GPU Delegate)	7.735
MVSR_13_1	MVSR_13 evolution: ConvTail used for increasing dimension -> BSConv	REDS	27.879667	0.7790071	62,541	40.6 (FP16_TFLite GPU Delegate)	7.080
MVSR_13_2	MVSR_13 ablation: fractional/partial ratio 1/2 -> 1/4	REDS	27.877321	0.7783568	37,517	37.1 (FP16_TFLite GPU Delegate)	4.152
MVSR_13_3	MVSR_13 ablation: fractional/partial ratio 1/2 -> 1/8	REDS	27.79014	0.77567685	29,829	35.5 (FP16_TFLite GPU Delegate)	3.240
MVSR_13_4	MVSR_13 ablation: fractional/partial ratio 1/2 -> 3/4	REDS	27.955465	0.78150684	119,149	84.1 (FP16_TFLite GPU Delegate)	13.663
MVSR_13_4_revalid	MVSR_13 ablation: fractional/partial ratio 1/2 -> 3/4	REDS	27.956861	0.7814712	119,149	84.1 (FP16_TFLite GPU Delegate)	13.663
MVSR_13_5	MVSR_13 ablation: fractional/partial ratio 1/2 -> 7/8	REDS	27.993414	0.7823691	152,277	103.0 (FP16_TFLite GPU Delegate)	17.506
MVSR_13_6	MVSR_13 ablation: fractional/partial ratio 1/2 -> 3/8	REDS	27.948065	0.78071	50,293	39.0 (FP16_TFLite GPU Delegate)	5.651
MVSR_13_7	MVSR_13 ablation: fractional/partial ratio 1/2 -> 5/8	REDS	27.983498	0.7823881	91,109	86.4 (FP16_TFLite GPU Delegate)	10.406
MVSR_14	MVSR_13 ablation: - frame attention align -> standard conv 1 x 1 acts as frame information propagation operator	REDS	27.930904	0.78043896	68,272	29.7 (FP16_TFLite GPU Delegate)	7.746
MVSR_15	MVSR_13 ablation: MSDB block number 4 -> 3	REDS	27.91523	0.77966064	52,965	33.4 (FP16_TFLite GPU Delegate)	6.016
MVSR_16	MVSR_13 ablation: No partial/fractional; No BSconv (Blueprint Separable conv); No receptive field decomposition	REDS	27.928417	0.7801167	920,381	399.0 (FP16_TFLite GPU Delegate)	106.045
MVSR_17	MVSR_13 evolution: MSDB using standard conv 3 x 3, PPA using split large receptive field conv 5 x 5 + 5 x 5 dilated	REDS	27.902325	0.7794845	47,101	36.8 (FP16_TFLite GPU Delegate)	5.313
MVSR_18	MVSR_17 ablation: BSconv	REDS	27.893446	0.77926654	47,325	34.4 (FP16_TFLite GPU Delegate)	5.340
MVSR_19	MVSR_13 evolution: MSDB blocks 4 -> 3; Enlarge receptive field of PPA 3 -> 17	REDS	27.914143	0.7799854	60,861	35.9 (FP16_TFLite GPU Delegate)	6.924
MVSR_20	MVSR_13 ablation: No receptive field decomposition	REDS	27.93524	0.78071207	251,613	119.0 (FP16_TFLite GPU Delegate)	28.875
MVSR_21	MVSR_13 ablation: No frame align; No fractional/partial; No BSconv; No receptive field decomposition	REDS	27.941408	0.7807615	920,128	400.0 (FP16_TFLite GPU Delegate)	106.014
MVSR_21_1	MVSR_13 ablation: No frame align (directly extraction from 3 consecutive frames); No fractional/partial; No BSconv; No receptive field decomposition	REDS	27.913836	0.779473	920,992	399.0 (FP16_TFLite GPU Delegate)	106.114
MVSR_22	MVSR_13 ablation: No BSconv; No receptive field decomposition	REDS	27.936152	0.7799803	251,837	118.0 (FP16_TFLite GPU Delegate)	28.902
MVSR_23	MVSR_13 ablation: PFE PPA Standard conv -> Depthwise conv	REDS	27.867388	0.7784562	32,541	49.5 (FP16_TFLite GPU Delegate)	3.632
MVSR_24	MVSR_13 ablation: - Partial/Fractional Extraction	REDS	27.952333	0.78090274	186,141	94.3 (FP16_TFLite GPU Delegate)	21.449
MVSR_24_revalid	MVSR_13 ablation: - Partial/Fractional Extraction (keep fc)	REDS	27.940563	0.7804851	190,493	102.0 (FP16_TFLite GPU Delegate)	21.935
MVSR_25	MVSR_13 ablation: - BSConv	REDS	27.929697	0.780183	68,381	38.8 (FP16_TFLite GPU Delegate)	7.762
MVSR_26	MVSR_13 ablation: - Large Receptive Field Decomposition	REDS	27.972654	0.7818956	251,613	119.0 (FP16_TFLite GPU Delegate)	28.875
MVSR_27	MVSR_13 ablation: - FC in PFE, PPA	REDS	27.945955	0.78067327	63,805	37.8 (FP16_TFLite GPU Delegate)	7.249

AI benchmark setting for Runtime test:

Input Values range(min,max): 0,255
Inference Mode: INT8/FP16
Acceleration: CPU/TFLite GPU Delegate

Benchmark_0

Rank	Model	Source	Dataset	Test PSNR	Test SSIM	Params	Runtime on oneplus7T [ms]
1	Diggers	Real-Time Video Super-Resolution based on Bidirectional RNNs(2021 SOTA)	REDS(train_videos: 240, test_videos: 30)	27.98	-	39,640	-
2	VSR_12	Ours	REDS(train_videos: 240, test_videos: 30)	27.981062	0.7824855	57,696	62.8
3	MVSR_4	Ours	REDS(train_videos: 240, test_videos: 30)	27.958328	0.78145003	37,664	42.4
4	MVSR_12	Ours	REDS(train_videos: 240, test_videos: 30)	27.953104	0.7807622	68,169	38.4
5	SORT_2	Ours	REDS(train_videos: 240, test_videos: 30)	27.93981	0.7808094	45,264	35.6
6	SWRN	Sliding Window Recurrent Network for Efficient Video Super-Resolution (2022 SOTA)	REDS(train_videos: 240, test_videos: 30)	27.92	0.77	43,472	31.0
7	MVSR_11	Ours	REDS(train_videos: 240, test_videos: 30)	27.90566	0.7793118	32,553	48.7
8	SWAT_3_5	Ours	REDS(train_videos: 240, test_videos: 30)	27.840628	0.7774375	37,312	39.4
9	EESRNet	EESRNet: A Network for Energy Efficient Super-Resolution(2022)	REDS(train_videos: 240, test_videos: 30)	27.84	-	62,550	-
10	LiDeR	LiDeR: Lightweight Dense Residual Network for Video Super-Resolution on Mobile Devices (2022)	REDS(train_videos: 240, test_videos: 30)	27.51	0.76	-	-
11	EVSRNet	EVSRNet: Efficient Video Super-Resolution with Neural Architecture Search(2021)	REDS(train_videos: 240, test_videos: 30)	27.42	-	-	-
12	RCBSR	RCBSR: Re-parameterization Convolution Block for Super-Resolution(2022)	REDS(train_videos: 240, test_videos: 30)	27.28	0.775	-	-

Benchmark_1

Model	Source	Dataset	Test PSNR	Test SSIM	Params
SSL-uni	Structured Sparsity Learning for Efficient Video Super-Resolution (CVPR2023)	REDS(train:266 test:4)	30.24	0.86	500,000

PaperWriting

No.1

BSConvU for shallow feature extraction
Recurrent neural network for free flow of feature information across frames
Multi-distillation module through dynamic routing and large-ERF attention
RGB channels share the same bilinear upsample result
Nearest conv for shorter residual inference time compared with bilinear residual

No.2

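Motivation: video super-resolution on mobile devices, Inference Time ↓, PSNR ↑, SSIM ↑
Use only the previously predicted HR frame as the reference to compensate the current LR frame -> super-resolution can run in real time while recording, instead of only on a finished video
Assume the intermediate feature maps do not contribute equally to the output; how can the high-contribution feature maps be aggregated? -> use Partial Convolution to accelerate inference (analysis)
Reduce the number of activations in the model -> exploit the ability of Multiply to produce a nonlinear mapping
Share the upsampling compensation across the three RGB channels -> check whether the per-channel upsampling compensation of a conventional model is highly consistent; if it is, sharing it cuts computation and speeds up inference (analysis)
Blueprint convolution for shallow feature extraction -> the final result is actually better than with standard convolution
Attention-based fusion of multi-scale features (downsampled to different scales) <- motivation: neurons in the same region of the primate visual cortex have different receptive fields; by analogy, a single layer captures more precise spatial information or more texture information from different scales/receptive fields
Fusion through short-range shortcuts -> faster inference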

No.3

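Motivation: video super-resolution on mobile devices, Inference Time ↓, PSNR ↑, SSIM ↑
Auxiliary forward/backward hidden states for feature alignment -> higher PSNR of the super-resolved result
Assume the intermediate feature maps do not contribute equally to the output; how can the high-contribution feature maps be aggregated? -> use Partial Convolution to accelerate inference (analysis)
Reduce the number of activations in the model -> exploit the ability of Multiply to produce a nonlinear mapping, which speeds up inference
Consider dynamic depth (adaptive existing) -> faster inference -> deprecated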

PaperReference

Rethinking Alignment in Video Super-Resolution Transformers (NeurIPS 2022) -> for ViT-based video super-resolution (VSR), frame/feature alignment is not a required step
An Implicit Alignment for Video Super-Resolution (ArXiv 2023) -> improves bilinear interpolation/resampling
Video Super-Resolution Transformer
Efficient Reference-based Video Super-Resolution (ERVSR): Single Reference Image Is All You Need (WACV 2023) -> uses the middle frame of the sequence as a reference frame to help super-resolve the current frame
MULTI-STAGE FEATURE ALIGNMENT NETWORK FOR VIDEO SUPER-RESOLUTION
ELSR: Extreme Low-Power Super Resolution Network For Mobile Devices
LiDeR: Lightweight Dense Residual Network for Video Super-Resolution on Mobile Devices
Look Back and Forth: Video Super-Resolution with Explicit Temporal Difference Modeling
COLLAPSIBLE LINEAR BLOCKS FOR SUPER-EFFICIENT SUPER RESOLUTION
Revisiting Temporal Alignment for Video Restoration
BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment
EVSRNet：Efficient Video Super-Resolution with Neural Architecture Search
BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond
Revisiting Temporal Modeling for Video Super-resolution -> official baseline of the first MAI VSR challenge
TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution (CVPR 2020)
Video Super-resolution with Temporal Group Attention (CVPR 2020)
3DSRnet: Video Super-resolution using 3D Convolutional Neural Networks
Frame-Recurrent Video Super-Resolution
Video Super-Resolution With Convolutional Neural Networks

Model Quantization Papers

Posted on 2022-12-06, updated on 2024-12-02

Seminal work: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Surveys:
1. Quantizing deep convolutional networks for efficient inference: A whitepaper
2. A White Paper on Neural Network Quantization
Getting started:
1. ZeroQ: A Novel Zero Shot Quantization Framework
2. HAWQ-V3: Dyadic Neural Network Quantization
3. Up or Down? Adaptive Rounding for Post-Training Quantization

Title	Class
ZeroQ	DFQ
SQuant	DFQ
ACIQ	PTQ
GDFQ	DFQ

Logs of killing issues

Posted on 2022-11-17, updated on 2024-12-02

Issue	Method
PytorchCV package	Image Classification and Segmentation Models
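Parameter vs. argument	A parameter appears in the function definition; an argument is the value passed in the call
Positional	Passed without a leading "name="; the order is fixed
Keyword	Passed with a leading "name="; the order may vary
Class	Inside its methods, the instance created from the class is referred to as self
Method	A function defined inside a class
Self	A method differs from an ordinary function in one particular way: it must take an extra first parameter, which by convention is named self
model.modules(), model.named_modules(), model.children(), model.named_children(), model.parameters()	Each returns an iterable. Python treats any object with an __iter__() or __getitem__() method as iterable
Basics of convolutional neural networks	MAC (multiply-accumulate)
FLOPs	Abbreviation of floating-point operations, which include mul / add / div, etc.
FLOPS	Floating-point operations per second
MACs	Multiply-accumulate operations, each performing a <- a + (b x c)
TVM: End-to-End Optimization Stack for Deep Learning	TVM is a deep learning acceleration framework led by Tianqi Chen. It sits between DL frameworks (such as TensorFlow and PyTorch) and hardware backends (such as CUDA and OpenCL), combining the usability of the former with the execution efficiency of the latter.
Pytorch hook functions	Hooks give access to intermediate values that are otherwise inconvenient to obtain
Magic Method: __call__()	Lets an instance be called like a function
Magic Method: __new__()	Static method that creates the class instance
Magic Method: __repr__()	Controls what is shown when an instance is printed. The default output is "class name + object at + memory address"; override it to show the attributes you want
Magic Method: __del__()	Called when the object is about to be destroyed
Magic Method: __dir__()	Lists all attribute names and method names of the object
Magic attribute: __dict__	A dict of the object's attribute names and values (an attribute, not a method)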
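conda configuration file .condarc	.condarc is conda's configuration file, located in the user's home directory (Windows: C:\Users\username, Linux: /home/username/)
Show the conda configuration	conda config --show
Add a conda channel (mirror)	conda config --add channels …
Remove a conda channel (mirror)	conda config --remove channels …
conda proxy	conda config --set proxy_servers.http … ; conda config --set proxy_servers.https …
Deep learning: activation functions	Sigmoid, tanh, ReLU, ReLU6 and the variants PReLU, RReLU, and Leaky ReLU, ELU, SELU, Swish, Mish, Maxout, hard-sigmoid, hard-swish
CUDA Toolkit	NVIDIA's official CUDA Toolkit is a complete installer that offers the NVIDIA driver, the development toolkit for writing CUDA programs, and other optional components
cudatoolkit	When Anaconda installs a framework that uses CUDA, such as PyTorch, it automatically installs cudatoolkit, which mainly contains the dynamic libraries an application depends on when it uses CUDA features. With cudatoolkit installed, already-compiled CUDA programs run directly as long as the system has an NVIDIA driver compatible with that cudatoolkit; the full official CUDA Toolkit is not needed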
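Linux ls -l	Shows detailed information on the files and subdirectories in a directory, in 9 columns
Location of bashrc on Linux	Per user: ~/.bashrc; system-wide: /etc/bash.bashrc (Debian/Ubuntu) or /etc/bashrc (Red Hat family)
Purpose of .bashrc	Custom commands and environment variables. Environment variable names are uppercase by convention, and Linux is case-sensitive
Changing PATH in .bashrc	"export PATH=$PATH:new_path" appends a new path after the existing PATH; when a command is run, the system searches each location in turn. $PATH stands for the previously set paths and must not be left out. Unlike DOS/Windows, Unix-like systems separate path entries with a colon (:), not a semicolon (;)
Applying .bashrc changes	source ~/.bashrc
What if .bashrc is missing?	Copy a pristine .bashrc into the user's home directory: cp /etc/skel/.bashrc ~/
nvcc: command not found	1. nvcc is installed in /usr/local/cuda/bin; 2. add the paths: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64; export PATH=$PATH:/usr/local/cuda/bin; 3. reload the configuration: source ~/.bashrc
"~", "/", "./" on Linux	"~": the home directory of the currently logged-in user. "/": the root directory, the top of the directory tree. "./": the current directory. "..": the parent directory
nvcc -V	Shows the version of the CUDA toolkit that is actually installed
nvidia-smi	Shows the current NVIDIA driver version and the highest CUDA version that driver supports. The installed CUDA version may be lower than that, so installing a newer driver keeps several CUDA versions usable
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running	An automatic kernel upgrade left the driver unable to load under the new kernel; fixed by reinstalling the GPU driver, CUDA, and cuDNN. See: https://qii404.me/2021/07/03/ubuntu-install-nvidia-driver.html
vi/vim command :wq	Write and quit
vi/vim command :wq!	Force write and quit
vi/vim command :x	Write only if the buffer was modified, then quit
linux netcat	Check whether port 22 is open on host 192.168.56.10 -> nc -zv 192.168.56.10 22
Extracting a zip file on Linux	unzip …
Paths containing spaces on Linux	1. Escape the space with "\" 2. Wrap the path in double quotes "" or single quotes ''
linux sudo: command not found	apt-get install sudo (run as root)
linux echo	1. Print: echo -e "hello\tworld" 2. Overwrite: echo Hello World > log.txt 3. Append: echo Hello World >> log.txt
Having to source .bashrc in every new terminal	On login, bash reads /etc/profile, then the first of ~/.bash_profile, ~/.bash_login, ~/.profile that exists; ~/.bashrc (and through it /etc/bashrc) is read only if one of those files sources it. So add the following to /etc/profile or ~/.bash_profile: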

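#!/bin/bash
# ~/.bash_profile: load the per-user bashrc if it exists
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

# /etc/profile: load the system-wide bashrc if it exists
# (/etc/bashrc on Red Hat family, /etc/bash.bashrc on Debian/Ubuntu)
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi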

Issue	Method
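args is short for arguments	*args passes a variable-length list of positional arguments to a function; *args must come before **kwargs
kwargs is short for keyword arguments	**kwargs passes a variable-length dict of keyword arguments to a function
Python startswith()	Checks whether a string starts with the given substring
Python os.path.join()	Joins path components; accepts any number of them
Python tuple	Tuple: an ordered, immutable collection, tup=(1,2,3,4)
Python list	List: ordered and mutable, list=[1,2,3,4]
Python dictionary	Dict: a key-value mapping that keeps insertion order since Python 3.7 (unordered before that), dic={'a':12,'b':34}
Python set()	Creates an unordered collection of unique elements
Python lambda	A lambda is a small anonymous function. It accepts any number of arguments but only one expression. Syntax: lambda arguments : expression
Python pdb	pdb is Python's built-in debugger module
python print with % formatting	print("His name is %s" % ("Aviad")); print("He is %d years old" % (25)); print("His height is %.2f m" % (1.83234))
python print with f-strings	print(f'The type of {A} is {type(A)}')
python print escape character \	\n is a newline, \t a tab, \r a carriage return, \f a form feed
python print with str.format	print("{1} {0} {1}".format("hello", "world")) # positional indices; prints 'world hello world'
Python3 assert	Syntax: assert expression, equivalent to if not expression: raise AssertionError
Python logging	The standard library for logging
Python Try…Except	Exception handling, useful when debugging
Python a is b	a is b is the identity operator: it compares the identity (id) of two objects and returns True only if they are the same object. For efficiency, CPython caches small integers in [-5, 256] and some simple strings, so with a=2 and b=2, a is b returns True
Python a == b	a == b is a comparison operator: it returns True if the two objects have the same value, otherwise False
python super().__init__()	super().__init__() calls the parent class's __init__ method; super() can call the parent's other methods in the same way.
torch.arange(start,end)	produces values in [start, end)
Python sum(iterable,start)	Result = sum of the iterable + start
Python __name__	__name__ is a built-in module attribute that stores the module's name. A module can be imported or run directly: when imported, __name__ holds the module name (the .py file name without the extension); when run directly, it holds "__main__".
Python list(set(a))	set(a) converts list a into a set, an unordered collection of unique elements, and list() converts the set back into a list; duplicates are removed but the original order is not guaranteed
Python apply()	apply(func, args, kwargs) exists only in Python 2; in Python 3 call func(*args, **kwargs) instead
Python *	In a function call, * before a tuple/list unpacks it into separate positional arguments
Python **	In a function call, ** before a dict unpacks it into separate keyword arguments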
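Python leading single underscore	Semi-private by convention; not enforced by the language
Python leading double underscore	Private; the name is mangled with the class name
Python leading and trailing double underscores	Names of Python built-in attributes or magic methods. They are implemented by Python itself; avoid defining your own attributes or methods with this naming pattern.
Python copy()	However complex the data structure, a shallow copy copies only one level. A three-level nested list, for example, behaves like a pointer to a pointer in C: the nested objects are shared
Python deepcopy()	Recursively copies the whole object; the new variable shares nothing with the original.
Python import…	Imports a module; use it as module.function
Python from…import…	Imports a function from a module; use the function name directly
Python @staticmethod	Static method. It receives no self argument and is not required to take any argument, and it can be called on the class directly. A static method is an independent function that merely lives under a class name. It is a way for a class to wrap an external function, which helps code structure and readability.
Python @classmethod	Class method. It receives cls, the class itself, instead of the instance self, and can use it to access class attributes, call class methods, create instances, and so on. A class method operates on the class itself: when we need to interact with the class rather than with an instance, there is no need to pass the instance
Python @abstractmethod	Abstract method, used to enforce an interface. A class (derived from abc.ABC) that has abstract methods cannot be instantiated; a subclass must override every method decorated with @abstractmethod before it can be instantiated, and undecorated methods need not be overridden
Python @property	Makes a method look like an attribute. The decorated method can contain processing logic while exposing a uniform, friendly access style
Python class	Class: a code block defined with the keyword class, representing a category
Python object	Object: an instance of a class, in which the class's parameters have been given concrete values
Python method	Method: a function defined with def inside a class
Python function	Function: defined with def outside any class
Python attribute	Attribute: a variable that belongs to a class or an instance; variables bound on the class or on self inside its methods are attributes, while plain local variables inside a method are not
Python None	Unlike C, Python has no NULL, but it has None with a similar meaning. None represents the absence of a value; it is a special Python object whose type is NoneType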
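
Several rows above are easy to check by running them. The following is a minimal sketch, not part of the original notes; the names `describe` and `nested` are made up for illustration. It shows `*args`/`**kwargs` packing and unpacking, `is` versus `==`, and shallow versus deep copies.

```python
import copy


def describe(*args, **kwargs):
    """Return the positional and keyword arguments exactly as received."""
    return args, kwargs


# * unpacks a list/tuple into positional arguments,
# ** unpacks a dict into keyword arguments.
packed = describe(*[1, 2], **{"mode": "fast"})

# == compares values; `is` compares object identity.
a = [1, 2, 3]
b = [1, 2, 3]
same_value = a == b
same_object = a is b

# A shallow copy duplicates only the outer list; the inner lists are shared.
nested = [[1, 2], [3, 4]]
shallow = copy.copy(nested)
deep = copy.deepcopy(nested)
nested[0].append(99)

print(packed)                   # ((1, 2), {'mode': 'fast'})
print(same_value, same_object)  # True False
print(shallow[0], deep[0])      # [1, 2, 99] [1, 2]
```

The shallow copy sees the appended 99 because it still points at the same inner list, while the deep copy does not.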

Issue	Method
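torch.nn.Parameter()	torch.nn.Parameter is a subclass of torch.Tensor whose main role is to act as a trainable parameter of an nn.Module. Unlike a plain torch.Tensor, an nn.Parameter assigned as a module attribute is automatically registered as a trainable parameter, i.e. it shows up in the parameters() iterator, whereas a plain tensor attribute of the module does not. Note that requires_grad defaults to True for an nn.Parameter, i.e. it is trainable, which is the opposite of the default for a torch.Tensor.
Pytorch .item()	.item() extracts the value from a tensor that holds exactly one element; for tensors with more elements use .tolist()
Pytorch model.train()	Puts Batch Normalization and Dropout layers in training mode.
Pytorch model.eval()	Puts Batch Normalization and Dropout layers in evaluation mode (BN uses its running statistics, Dropout is disabled). eval mode does not change how gradients are computed: gradients are computed and stored exactly as in training mode, so eval() by itself neither disables autograd nor prevents backpropagation.
Pytorch torch.no_grad()	with torch.no_grad() turns off autograd tracking inside the block, which saves computation and GPU memory. No gradients are recorded for the wrapped code, but the behavior of Dropout and BN layers is not affected.
Pytorch torch.max(input,dim)	dim: the dimension along which the elements of input are compared
Pytorch torch.tensor()	Constructs a tensor with no autograd history (also known as a “leaf tensor”, see Autograd mechanics) by copying data.
Pytorch torch.Tensor()	A torch.Tensor is a multi-dimensional matrix containing elements of a single data type.
Pytorch torch.autograd.Variable()	Formerly the core class of autograd, a thin wrapper around Tensor that supported backpropagation. Since PyTorch 0.4, tensors and Variables are merged and a tensor is used directly as an autograd variable, although Variable can still be used
Pytorch custom autograd Function	A custom operator in PyTorch's dynamic graph (an "edge" of the graph) subclasses torch.autograd.Function and implements forward and backward. A custom operator is invoked through its apply method.
Pytorch torch.save(net,path)	Saves the model, where model = network structure + network parameters
Pytorch torch.save(net.state_dict(),path)	Saves the network parameters only
When forward() is called in Pytorch	nn.Module is the model-building class of the nn package and the base class of all neural network modules; we subclass it to define a model. Module defines __call__(), which calls forward(), so calling the model in the forward pass invokes __call__() and therefore forward() automatically
Pytorch nn.Sequential()	Wraps layers such as conv, fc, and relu into a single unit
Pytorch torch.squeeze()	torch.squeeze(input, dim=None, *, out=None) → Tensor. Removes the dimensions of size 1 from the input tensor and keeps the others. If dim is given, only that dimension is checked and squeezed when its size is 1; otherwise every size-1 dimension of the input is squeezed
Pytorch import torch import torch.nn as nn	An abbreviation: with only import torch you have to write torch.nn.Conv2d; with import torch.nn as nn you can write nn.Conv2d. Both behave the same; import … as … just creates an alias that keeps the code shorter.
Pytorch torch.cuda.is_available()	Checks whether a GPU is available
Pytorch torch.cuda.device_count()	Returns the number of GPUs
Pytorch torch.cuda.current_device()	Returns the index of the CUDA device currently in use
Pytorch torch.cuda.get_device_capability(device)	Returns the CUDA compute capability (major, minor) of the given GPU
Pytorch torch.cuda.get_device_name(device)	Returns the name of the given GPU
Pytorch torch.cuda.manual_seed(seed)	Sets the random seed for the current GPU
Pytorch register_buffer()	When a model holds tensors that should not be updated but should still be saved through model.state_dict(), use register_buffer(). A buffer is a tensor, so requires_grad can still be set on it, but the optimizer only updates parameters and never updates buffers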

Issue	Method
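Embedding layer in deep learning	Reduces dimensionality through a matrix multiplication: the information in the original matrix is mapped, by some learned mapping, into a matrix of a new dimension, which saves storage. It can also map to a higher dimension
ImageNet dataset	ImageNet is a computer vision recognition project and the largest image recognition database to date. It was started in 2009 by Professor Fei-Fei Li of Stanford University and colleagues. ImageNet currently holds 14,197,122 images in 21,841 categories (synsets). What is usually called the ImageNet dataset is the subset used in the ILSVRC2012 competition: train has 1,281,167 labeled images in 1000 classes, about 1300 per class; val has 50,000 images, 50 per class; test has 100,000 images, 100 per class.
Hessian matrix	Matrix of second-order derivatives
Switching Python versions on Ubuntu	1. List the available Python alternatives: update-alternatives --list python 2. Choose one of the listed versions: update-alternatives --config python
Per-image metric for object detection / image segmentation	IoU (Intersection over Union)
Per-algorithm metrics for object detection / image segmentation	Measured over the whole dataset. Accuracy (Pixel Accuracy): detected objects as a share of all objects to be detected (detected plus missed). Precision (Pixel Precision): correctly detected objects as a share of all detected objects
LSTMs	Long Short Term Memory networks
Parameter count (Params), conv	Input feature map f=(B,c1,H,W), conv kernel k*k with c2 output channels, bias=True, followed by BN, which adds two learnable parameters per channel (scale and shift): Params = c1*c2*k*k + 3*c2
Parameter count (Params), fc	With M input neurons, N output neurons, and bias=True: Params = M*N + N
Compute (FLOPs)	Number of multiply-adds: how many multiply-adds it takes to produce each output pixel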
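
The two parameter-count rows and the multiply-add row above can be written as small functions. This is a sketch under the table's own assumptions (conv followed by BN with two learnable values per channel); the function names and the example layer sizes are illustrative, and `conv_macs` uses the common convention of counting one multiply-add per kernel element per output pixel.

```python
def conv_bn_params(c_in, c_out, k, bias=True, bn=True):
    """Parameters of a k x k convolution, optionally with bias and BatchNorm."""
    weights = c_in * c_out * k * k
    bias_terms = c_out if bias else 0
    bn_terms = 2 * c_out if bn else 0  # BN scale and shift, one pair per channel
    return weights + bias_terms + bn_terms


def fc_params(m, n, bias=True):
    """Parameters of a fully connected layer with m inputs and n outputs."""
    return m * n + (n if bias else 0)


def conv_macs(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulates of a k x k convolution over an h_out x w_out output."""
    return h_out * w_out * c_out * c_in * k * k


print(conv_bn_params(3, 16, 3))     # 3*16*3*3 + 3*16 = 480
print(fc_params(512, 10))           # 512*10 + 10 = 5130
print(conv_macs(3, 16, 3, 32, 32))  # 32*32*16*3*3*3 = 442368
```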

Concept	Interpretation
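APU (Accelerated Processing Unit)	A concept first proposed and produced by AMD; it accelerates the chip's processing of data and graphics.
NPU (Neural-network Processing Unit)	Processes certain data on its own and hands the other kinds of incoming data to other units
GPU (Graphics Processing Unit)	Dedicated to graphics data; it can also take over part of the CPU's work.
CPU (Central Processing Unit)	The computing power of the system and the core of an electronic device; it processes instructions and all logic-type data.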

CUDA Tutorial

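Posted on 2022-09-21, updated on 2024-12-02

Contents

CPU architecture overview
Parallel programming overview
Setting up the CUDA development environment and tools
GPU architecture overview
GPU programming model
CUDA programming (1)
CUDA programming (2)
CUDA programming (3)
CUDA program analysis and debugging tools
Basic CUDA program optimization
Advanced CUDA program optimization
Latest NVIDIA GPU and CUDA features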

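1. CPU architecture overview

Desktop applications are dominated by memory accesses, branches, and moving data around; numerical computation is a small share
Fetch, decode, execute, memory access: pipelining provides instruction-level parallelism and shortens the clock cycle, but increases latency and chip area. Issues it raises:
- Execution order of instructions that depend on each other
- How to handle branches
- Pipeline length
  Bypassing
  Stalls
  Branches
  Branch Prediction
- + Modern predictors are more than 90% accurate
- - More area and more latency
  Predication: GPUs use predication
Raising IPC (instructions per cycle): Superscalar
- Register Renaming
- Out-of-Order Execution: reorder instructions for maximum throughput
- + IPC close to the ideal
- - More area and more power
Parallelism inside the CPU
- Instruction-level Parallelism (ILP)
  - Superscalar
  - Out-of-Order
- Data-level Parallelism (DLP)
  - Vectors
- Thread-level Parallelism (TLP)
  - Simultaneous Multithreading (SMT)
  - Multicore
  - Locks, Coherence and Consistency
    - Problem: multiple threads read and write the same data. Solution: locks
    - Problem: whose data is correct? (Coherence) Solution: cache coherence protocol
    - Problem: what data counts as correct? (Consistency) Solution: memory consistency model
Power wall / memory wall

Conclusions

CPUs are optimized for serial programs
- Pipelines, Branch Prediction, Superscalar, Out-of-Order (OoO)
- Reduce execution time with high clock speeds and high utilization
Slow memory bandwidth will be a major problem
Parallel processing is the way forward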

2. Parallel programming overview

Concepts and terms
- Flynn's taxonomy
  - SISD: Single Instruction, Single Data
  - SIMD: Single Instruction, Multiple Data
  - MISD: Multiple Instruction, Single Data
  - MIMD: Multiple Instruction, Multiple Data
- Task
- Parallel Task
- Serial Execution
- Parallel Execution
- Shared Memory
- Distributed Memory
- Communications
- Synchronization
- Granularity
- Observed Speedup
- Parallel Overhead
- Scalability
Parallel programming models
- Shared Memory Model
- Threads Model
- Message Passing Model
- Data Parallel Model
Designing parallel programs and systems
- Automatic vs. manual parallelization
- Understand the problem and the program
- Partitioning: data decomposition and task decomposition
- Communication: a major factor in scalability
- Synchronization
- Data dependencies
- Load balancing
- Granularity
- I/O
- Cost
- Performance analysis and tuning
Amdahl’s Law
- The achievable speedup depends on the fraction of the program that can be parallelized; the scalability of parallelization is bounded by that fraction
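
Amdahl's Law can be stated as speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction and n the number of processors. The sketch below is an illustration added here, with made-up example numbers; it shows how the speedup saturates at 1 / (1 - p).

```python
def amdahl_speedup(parallel_fraction, processors):
    """Upper bound on speedup when only part of a program runs in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# With 90% of the work parallelizable, the speedup can never reach 10x,
# no matter how many processors are added.
for n in (2, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.9, n), 2))
```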

3. Setting up the CUDA development environment

windows cuda zone
linux

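4. GPU architecture overview

Why GPUs (Graphics Processing Units) are needed
- A GPU is a heterogeneous many-core processor optimized for throughput
  - What an efficient GPU workload looks like
    - Thousands of independent pieces of work
      - Keep the many ALUs busy
      - Hide latency by switching among many fragments
    - Can share an instruction stream
      - Suited to SIMD processing
    - Preferably compute-intensive
      - A reasonable ratio of communication to computation
      - Not limited by memory bandwidth
Three ways to speed up GPU processing
- 1. Use many “slimmed down cores” to run in parallel
- 2. Pack cores full of ALUs (by sharing instruction stream across groups of fragments)
  - Option 1: Explicit SIMD vector instructions
  - Option 2: Implicit sharing managed by hardware
- 3. Avoid latency stalls by interleaving execution of many groups of fragments
Examples of real GPU designs
- NVIDIA GTX 480: Fermi
- NVIDIA GTX 680: Kepler
GPU memory design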

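5. GPU programming model

Contents
- How the CPU and GPU interact
- GPU thread organization (reinforced repeatedly)
- GPU memory model
- Basic programming issues
CPU-GPU interaction
- Each has its own physical memory space
- Connected through the PCIe bus (8 GB/s to 16 GB/s)
- Interaction is expensive
Thread organization
- A kernel has a large number of threads
- Threads are grouped into thread blocks ('Blocks')
  - Threads within a block can share 'Shared Memory'
  - They can synchronize with '__syncthreads()'
- A kernel launches a 'grid' containing a number of thread blocks
  - Set by the user
- Threads and thread blocks have unique identifiers
Programming model
- Conventionally, GPUs process graphics and images
- Operations apply to pixels, and every pixel gets a similar operation
- SIMD (single instruction multiple data) applies
- Single Instruction Multiple Thread (SIMT)
  - The GPU version of SIMD
  - A massive-thread model gives high parallelism
  - Thread switching hides latency
  - Multiple threads execute the same instruction stream
  - The GPU hosts and schedules a large number of threads
CUDA programming: Extended C
- Declspecs (declaration specifiers)
  - __global__, __device__, __shared__, local, __constant__
- Keywords
  - threadIdx, blockIdx
- Intrinsics
  - __syncthreads
- Runtime API
  - Memory, symbol, execution, management
- Function launch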

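6. CUDA programming (1)

GPU architecture at a glance
- GPUs are especially suited to
  - Compute-intensive, highly parallel computation
  - Graphics
- Transistors are mainly devoted to
  - Performing computation
  - Rather than caching data or controlling instruction flow
History of GPU computing
- 2001/2002: researchers start treating the GPU as a data-parallel coprocessor
  - The new field of GPGPU is born
- 2007: NVIDIA releases CUDA
  - CUDA stands for Compute Unified Device Architecture
  - GPGPU evolves into GPU Computing
- 2008: Khronos releases the OpenCL specification
Some facts about CUDA
- A hierarchy of thread groups
- Shared memories
- Barrier synchronization
CUDA terminology
- Host: usually the CPU
  - Programmed in ANSI standard C
- Device: usually the GPU (data-parallel)
  - Programmed in an extension of ANSI standard C
- The host and the device each have their own memory
- CUDA programming
  - Consists of host code and device code
- Kernel: a data-parallel function, similar to a shader in OpenGL or a kernel in OpenCL
- Calling a kernel creates lightweight threads on the device
  - Threads are created and scheduled by hardware
- CUDA kernels
  - Executed in parallel by N different CUDA threads
- Thread Hierarchies
  - Grid: a one- or multi-dimensional array of thread blocks
    - One- or two-dimensional (three-dimensional grids are also supported from compute capability 2.0 on)
- Block: a group of threads
  - One-, two-, or three-dimensional
    - For example, to index arrays, matrices, and volumes
- Every block in a grid has the same number of threads
- Threads inside a block can
  - Synchronize
  - Access shared memory
- Thread blocks execute independently of each other
  - In any order: in parallel or serially
  - Scheduled in any order by any number of processors
  - So the number of processors is scalable
- The host can transfer data to and from the device
  - Global memory
    - cudaMalloc() allocates global memory on the device
    - cudaFree() frees the memory
    - cudaMemcpy() transfers memory
      - Host to host
      - Host to device cudaMemcpyHostToDevice
      - Device to host cudaMemcpyDeviceToHost
      - Device to device
  - Constant memory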
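
To make the grid/block/thread hierarchy above concrete, here is a small Python emulation of the usual one-dimensional CUDA indexing idiom, `blockIdx.x * blockDim.x + threadIdx.x`. It runs serially on the CPU and is only a sketch: the function names are hypothetical, and a real kernel would compute the same index in parallel on the device.

```python
def global_thread_ids(grid_dim, block_dim):
    """Enumerate the global index each (block, thread) pair would compute."""
    ids = []
    for block_idx in range(grid_dim):        # blockIdx.x
        for thread_idx in range(block_dim):  # threadIdx.x
            ids.append(block_idx * block_dim + thread_idx)
    return ids


def blocks_needed(n_elements, block_dim):
    """Smallest grid size that covers n_elements (ceiling division)."""
    return (n_elements + block_dim - 1) // block_dim


print(global_thread_ids(2, 4))   # [0, 1, 2, 3, 4, 5, 6, 7]
print(blocks_needed(1000, 256))  # 4
```

Because the grid is rounded up, the last block may contain threads whose index is past the end of the data; a kernel normally guards against this with a bounds check.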

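7. CUDA programming (2)

Built-ins and functions
- Function declarations
  - __global__ void KernelFunc(): the return type must be void. Executed on the: device. Only callable from the: host
  - __device__ float DeviceFunc(): used to be inlined by default, which has since changed. Executed on the: device. Only callable from the: device
  - __host__ float HostFunc(). Executed on the: host. Only callable from the: host
- __global__ and __device__ functions
  - Avoid recursion where possible (discouraged)
  - Do not use static variables
  - Use malloc sparingly (now allowed but discouraged)
  - Be careful with function calls made through pointers
- Vector data types
  - type name
    - char[1-4], uchar[1-4]
    - short[1-4], ushort[1-4]
    - int[1-4], uint[1-4]
    - long[1-4], ulong[1-4]
    - longlong[1-4], ulonglong[1-4]
    - float[1-4]
    - double1, double2
  - Usable in both host and device code
    - Constructed with the function make_<type name>
    - Accessed through .x, .y, .z, .w
- Math functions
  - Intrinsic functions
    - Device only
    - Faster but less accurate
    - Prefixed with __, for example: __expf, __logf, __powf, …
Synchronizing threads
- Threads within a block can synchronize
  - Calling __syncthreads creates a barrier
  - Each thread waits at the call until all threads in the block have reached it, then all threads continue with the following instructions
- Threads should take roughly the same time -> otherwise most threads in the block wait too long, which lowers efficiency
- Why synchronize only within a block -> global synchronization is expensive
- __syncthreads() can cause stalls or deadlock, for example when it sits in a branch that not every thread of the block takes
Scheduling threads
- Terms: Streaming Processor (SP), Streaming Multi-Processor (SM)
- G80 architecture
  - 16 SMs
  - 8 SPs each, 128 SPs in total
  - Up to 768 resident threads per SM
  - 12,288 threads executing concurrently in total
- GT200 architecture
  - 30 SMs
  - 8 SPs each, 240 SPs in total
  - Up to 8 blocks, or 1024 threads, resident per SM
  - Up to 240 blocks, or 30,720 threads, executing concurrently
- Warp: a group of threads within a block
  - G80/GT200: 32 threads
  - Run on the same SM
  - The basic unit of thread scheduling
  - Consecutive threadIdx values
  - An implementation detail: in theory the hardware guarantees that the threads of a warp are at the same point of execution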
- SM implements zero-overhead warp scheduling
  - At any time,only one of the warps is executed by SM
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - All threads in a warp execute the same instruction when selected
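
The G80 and GT200 figures listed above follow from simple multiplication. The sketch below reproduces them; the function names are illustrative, and the inputs are the numbers quoted in the notes.

```python
WARP_SIZE = 32  # threads per warp on G80/GT200, as stated above


def resident_threads(num_sms, threads_per_sm):
    """Total threads that can be resident on the chip at once."""
    return num_sms * threads_per_sm


def warps_per_sm(threads_per_sm, warp_size=WARP_SIZE):
    """Number of warps an SM holds when it is fully occupied."""
    return threads_per_sm // warp_size


print(resident_threads(16, 768))   # G80:   12288
print(resident_threads(30, 1024))  # GT200: 30720
print(warps_per_sm(768))           # G80:   24 warps per SM
print(warps_per_sm(1024))          # GT200: 32 warps per SM
```

Having many more resident warps than execution units is what lets the scheduler switch to a ready warp while others wait on memory.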

Memory model

Device code can:
R/W per-thread register
R/W per-thread local memory
R/W per-block shared memory
R/W per-grid global memory
Read Only per-grid constant memory
Host code can
R/W per-grid global and constant memory
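Registers
- Private to each thread
- Fast, on-chip, read/write
Local Memory
- Stored in global memory; scoped to each thread
- Holds automatic-variable arrays, unless every access uses a constant index
Shared Memory
- Per block
- Fast, on-chip, read/write
- Full-speed random access
Global Memory
- Long latency (hundreds of cycles)
- Off-chip, read/write
- Random access hurts performance
- Readable and writable from the host
Constant Memory
- Short latency and high bandwidth when all threads read the same location; read-only for device code
- Stored in global memory, but cached
- Readable and writable from the host

Variable declaration	Memory	Scope	Lifetime
Automatic variables other than arrays	register	thread	kernel
Automatic array variables	local	thread	kernel
__shared__ int sharedVar;	shared	block	kernel
__device__ int globalVar;	global	grid	application
__constant__ int constantVar;	constant	grid	application

Matrix multiply revisited
Atomic functions

8. CUDA programming (3)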

Model Compression Overview

Posted on 2022-09-07, updated on 2024-12-02

I. INTRODUCTION

1.Designing efficient NN model architectures

present situation

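Manually optimize the micro-architecture, e.g. the kernel type (depthwise convolution or low-rank factorization)
Manually optimize the macro-architecture, e.g. modules (residual, inception)
Automated optimization, e.g. Automated machine learning (AutoML) and Neural Architecture Search (NAS)

2.Co-designing NN architecture and hardware together

Design the hardware and the NN architecture together, or adapt the NN architecture to different hardware platforms. The main motivation is that the overhead of the different components of an NN depends on the hardware.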

3.Pruning

unstructured pruning
- motivation: removes neurons with small sensitivity, wherever they occur
- positive: little impact on the generalization performance
- negative: leads to sparse matrix operations, which are known to be hard to accelerate, and which are typically memory-bound
structured pruning
- motivation: a group of parameters (e.g., entire convolutional filters) is removed.
- positive: still permitting dense matrix operations.
- negative: aggressive structured pruning often leads to significant accuracy degradation.

4.Knowledge distillation

motivation: training a large model and then using it as a teacher to train a more compact model.
positive: combining knowledge distillation with prior methods (i.e. quantization and pruning) has been successful
negative: a major challenge here is to achieve a high compression ratio with distillation alone; aggressive compression causes non-negligible accuracy degradation.

5.Quantization

present situation: has shown great and consistent success in both training and inference. This survey focuses on inference.
shortcoming: it is very difficult to go below half-precision without significant tuning, and most of the recent quantization research has focused on inference.

6.Similarity of Quantization and Neuroscience

motivation: work in neuroscience that suggests that the human brain stores information in a discrete/quantized form, rather than in a continuous form.

II. GENERAL HISTORY OF QUANTIZATION

III. BASIC CONCEPTS OF QUANTIZATION

Problem Setup and Notations
Uniform Quantization
Symmetric and Asymmetric Quantization
Range Calibration Algorithms: Static vs Dynamic Quantization
Quantization Granularity
Non-Uniform Quantization
Fine-tuning Methods
- Quantization-Aware Training
- Stochastic Quantization

IV. ADVANCED CONCEPTS: QUANTIZATION BELOW 8 BITS

Simulated and Integer-only Quantization
Mixed-Precision Quantization
Hardware Aware Quantization
Distillation-Assisted Quantization
Extreme Quantization
- Quantization Error Minimization
- Improved Loss function
- Improved Training Method
Vector Quantization

V. QUANTIZATION AND HARDWARE PROCESSORS

VI. FUTURE DIRECTIONS FOR RESEARCH IN QUANTIZATION

Quantization Software
Hardware and NN Architecture Co-Design
Coupled Compression Methods
Quantized Training

Tensor Dimensions

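Posted on 2022-09-06, updated on 2024-12-02

1. Number of a tensor's dimensions

import torch
t = torch.ones(1, 2, 3, 4)  # create a tensor with dim=4, size=[1,2,3,4]
t  # display t
t.size()  # display the size of t

As the printed t shows, a tensor's dim equals the number of '[' counted from the outside up to the first element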

2. Index of a tensor's dimensions

t.flatten(start_dim=0).size()
t.flatten(start_dim=1).size()
t.flatten(start_dim=2).size()
t.flatten(start_dim=3).size()
t.flatten(start_dim=-1).size()

Flattening t from each dim shows that the dim indices of t are [0,1,2,3], and that dim 3 is the same as dim -1.

note

For the torch.flatten() method, see: https://pytorch.org/docs/stable/generated/torch.flatten.html#torch.flatten
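
The two observations above can be reproduced without PyTorch. The sketch below is a pure-Python stand-in: nested lists play the role of the tensor, and the helper names `ones`, `nesting_depth`, and `flatten_shape` are hypothetical. `flatten_shape` computes the size that flattening from `start_dim` to the last dim would produce.

```python
from math import prod


def ones(shape):
    """Build nested lists of 1.0 with the given shape."""
    if not shape:
        return 1.0
    return [ones(shape[1:]) for _ in range(shape[0])]


def nesting_depth(obj):
    """Count the '[' from the outside down to the first element."""
    depth = 0
    while isinstance(obj, list):
        depth += 1
        obj = obj[0]
    return depth


def flatten_shape(shape, start_dim=0):
    """Size after merging every dim from start_dim through the last one."""
    start = start_dim % len(shape)  # maps -1 to the last dim
    return list(shape[:start]) + [prod(shape[start:])]


t = ones((1, 2, 3, 4))
print(nesting_depth(t))  # 4
for start_dim in (0, 1, 2, 3, -1):
    print(start_dim, flatten_shape((1, 2, 3, 4), start_dim))
```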

Week Report

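Posted on 2022-09-05, updated on 2024-12-02

Progress report (2022.8.31-2022.9.4)

Read the ZeroQ quantization code and debugged it on the lab cluster; got the compression of ResNet18 running
Read part of the BaselineIR code by senior labmate 赵随意 and trained it on the cluster,
fixing a wrong dataset path and a crash midway through training.

Plan for this week (2022.8.31-2022.9.4)

Read the latest quantization survey and some recent zero-shot quantization papers, and organize ideas.
Work through the parts of the ZeroQ quantization code that are still unclear

Progress report (2022.9.5-2022.9.11)

Understood the ZeroQ code against the paper and modified total_loss in the distill-data part. Experiments show that, for classification, some components of the original total_loss barely affect accuracy after quantization: top-1 accuracy fluctuates by about 0.01%.
Read the manuscript of a JMLC submission and related papers, and commented on the unclear points.
Read the latest (2021) model quantization survey.

Plan for this week (2022.9.5-2022.9.11)

floating point quantization: read the paper FP8 Quantization: The Power of the Exponent (2022.8) and related papers, and try to implement it.
fixed point quantization: read the paper Post training 4-bit quantization of convolutional networks for rapid-deployment (2019)

Progress report (2022.9.12-2022.9.18)

Read the paper FP8 Quantization: The Power of the Exponent (2022.8) and tried to implement it
Read QPyTorch: A Low-Precision Arithmetic Simulation Framework (2019.10) and part of the framework's source code
Translated a teaching statement from Chinese to English

Plan for this week (2022.9.12-2022.9.18)

floating point quantization: continue studying the FP8 coding part
fixed point quantization: read the paper Post training 4-bit quantization of convolutional networks for rapid-deployment (2019)

Progress report (2022.9.19-2022.9.25)

floating point quantization: FP8 coding -> slow progress, set aside for now
fixed point quantization:
- Read the paper Post training 4-bit quantization of convolutional networks for rapid-deployment (2019) and understood ACIQ
- Read the paper SQUANT (2022)

Plan for this week (2022.9.19-2022.9.25)

Understand the SQUANT source code
Paper reviewing

Progress report (2022.9.26-2022.10.9)

Understood the SQUANT source code
Paper reviewing

Plan for this week (2022.9.26-2022.10.9)

Revisit ZeroQ and the 2021 survey
Refactor the SQuant code as the baseline for future work
Paper reviewing

Progress report (2022.10.10-2022.10.16)

Worked out how the SQuant code transforms tensor shapes to quantize at kernel-wise and channel-wise granularity, and refactored the messy and dead parts of the code
Paper reviewing: EC0065699_O_基于改进YOLOX的公路路面裂缝检测网络 (a road pavement crack detection network based on improved YOLOX)

Plan for this week (2022.10.10-2022.10.16)

Revisit ZeroQ and the 2021 survey
Get started with quantization-aware training (QAT) and run a demo on MNIST

Progress report (2022.10.17-2022.10.23)

Reviewed ZeroQ and the 2021 survey; prepared for the group meeting
Deraining-in-special-environments project: filled in the proposal form and made the slides
Getting started with QAT: the MNIST demo is not running yet

Plan for this week (2022.10.17-2022.10.23)

Experiment: use the data distilled by ZeroQ to set the clipping range of SQuant's intermediate-layer activations and check the effect
Get the QAT demo on MNIST running
Finish the Frontiers of the Discipline coursework; prepare for the mathematical statistics exam

Progress report (2022.10.24-2022.10.30)

Experiment: used the data distilled by ZeroQ to set the clipping range of SQuant's intermediate-layer activations, with the following results: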

Experiment	Model	Dataset	W-bit	A-bit	Top-1 Accuracy	Top-5 Accuracy	Activation Clip Range Setting
Gaussian_data(μ=0,σ=1)	Resnet18	ImageNet	8bit	8bit	73.012%	91.036%	sigma = 25
Gaussian_data(μ=0,σ=1)	Resnet18	ImageNet	8bit	8bit	73.066%	90.990%	sigma = 30 (larger clip range than sigma = 25)
ZeroQ_Refined_Data	Resnet18	ImageNet	8bit	8bit	72.854%	91.008%	sigma = 25
ZeroQ_Refined_Data	Resnet18	ImageNet	8bit	8bit	67.308%	87.524%	sigma = 0(clip range:[0, max])

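Got a quantization-aware training (QAT) demo running on MNIST
Read the paper Data-Free Quantization Through Weight Equalization and Bias Correction (2019)

Plan for this week (2022.10.24-2022.10.30)

Fix the bug from last week's experiment in which intermediate-layer activations were abnormally large
Experimental evaluation: compare how quantized accuracy is affected by collecting the clip range on the original model versus collecting it after the quantization clip error and round error have been added
Coursework and exam preparation

Progress report (2022.10.31-2022.11.06)

Experiment: clip range collected on the original model versus collected after adding the quantization clip error and round error; accuracy comparison: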

Experiment	Model	Dataset	W-bit	A-bit	Top-1 Accuracy	Top-5 Accuracy	Activation Clip Range Setting
Gaussian_data(μ=0,σ=1) + clip range collected with quantization error added	Resnet18	ImageNet	8bit	8bit	73.012%	91.036%	sigma = 25
Gaussian_data(μ=0,σ=1) + clip range collected on the original model	Resnet18	ImageNet	8bit	8bit	72.394%	90.656%	sigma = 25
ZeroQ_refined_data + clip range collected with quantization error added	Resnet18	ImageNet	8bit	8bit	72.854%	91.008%	sigma = 25
ZeroQ_refined_data + clip range collected on the original model	Resnet18	ImageNet	8bit	8bit	72.836%	90.948%	sigma = 25

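Conclusion: collecting the clip range on the original model gives slightly lower final accuracy than collecting it after the quantization error has been added, as expected

Fixed the bug from last week's experiment in which intermediate-layer activations were abnormally large
Wrote a first version of FP32-to-FP8 simulated quantization; the results on ResNet18 are not good

Plan for this week (2022.10.31-2022.11.06)

Improve the FP8 simulated quantization based on related papers
Group meeting preparation; exam preparation

Progress report (2022.11.07-2022.11.13)

Group meeting preparation; handled the assigned manuscript revisions
Prepared for the mathematical statistics exam

Plan for this week (2022.11.07-2022.11.13)

Prepare for the matrix theory exam

Progress report (2022.11.14-2022.11.20)

Exam-related matters
Organized the many web pages saved while solving earlier problems; learned Docker

Plan for this week (2022.11.14-2022.11.20)

Read papers on floating-point quantization and improve the FP8 quantization

Progress report (2022.11.21-2022.11.27)

FP8: for fixed FP8 formats, i.e. with sign:exponent:mantissa fixed at 1:4:3 or 1:5:2, fixing the exponent bias as in IEEE 754 FP32 does not reproduce the near-INT8 or better-than-INT8 results shown in the paper, and an INT8-style scaling strategy did not help either.
Paper reviewing

Plan for this week (2022.11.21-2022.11.27)

Course exams and coursework
Read the paper: On-Device Training Under 256KB Memory

Progress report (2022.12.05-2022.12.11)

Surveyed model pruning and sparsification
Tried deploying NCNN mobilenetssd (demo) on a phone

Plan for this week (2022.12.05-2022.12.11)

Paper reviewing
Prepare for the computer architecture exam

Progress report (2022.12.12-2022.12.18)

ICASSP reviewing
Learned NCNN and deployed demos on a phone: mobilenetssd, yolov7

Plan for this week (2022.12.12-2022.12.18)

Read the paper: PD-Quant (2022.12)
Survey work on Mixed Precision Quantization

Progress report (2022.12.19-2022.12.26)

Read the paper: PD-Quant (2022.12)
Surveyed work on Mixed Precision Quantization

Plan for this week (2022.12.19-2022.12.26)

Combine the quantization-loss metric introduced by PD-Quant with the earlier work and try to reproduce it
Continue the survey of Mixed Precision Quantization

Progress report (2022.12.27-2023.1.1)

For choosing the quantization parameters (scaling factor and offset), borrowed from PD-Quant the idea of measuring the difference that quantizing the current layer's activations causes in the final prediction. This accounts not only for the difference before and after quantizing the current layer's activations, but also for the difference accumulated over several subsequent layers.
Manuscript revision

Upcoming plan (2022.12.27-2023.1.1)

Keep improving the code that brings the globally accumulated difference into quantization, to improve performance

Progress report (2023.2.8-2023.2.12)

workshop (NTIRE 2023 Efficient Super-Resolution Challenge)

Efficient Super-Resolution Challenge (ESR): trained the RFDN baseline; test results below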
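
To see what fixing the FP8 format and bias implies for the representable range mentioned in the FP8 item above, the sketch below computes the largest and smallest normal values of a small floating-point format. It assumes an IEEE 754-style encoding in which the all-ones exponent is reserved for inf/NaN, and a default bias of 2^(e-1) - 1; the function name is hypothetical, and FP8 definitions that reserve fewer encodings reach a larger maximum than shown here.

```python
def fp_format_stats(exp_bits, man_bits, bias=None):
    """Return (bias, largest normal value, smallest normal value)."""
    if bias is None:
        bias = 2 ** (exp_bits - 1) - 1  # IEEE 754-style default bias
    max_exp_field = 2 ** exp_bits - 2   # all-ones exponent reserved for inf/NaN
    max_normal = (2 - 2 ** -man_bits) * 2.0 ** (max_exp_field - bias)
    min_normal = 2.0 ** (1 - bias)
    return bias, max_normal, min_normal


print(fp_format_stats(4, 3))          # 1:4:3 -> (7, 240.0, 0.015625)
print(fp_format_stats(5, 2))          # 1:5:2 -> (15, 57344.0, 6.103515625e-05)
print(fp_format_stats(4, 3, bias=4))  # a smaller bias shifts the range upward
```

With the bias fixed, the range is fixed too, whatever the tensor's actual distribution; changing the bias shifts the whole range by a power of two.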

Model	Dataset	Val PSNR	Val Time [ms]	Params [M]	FLOPs [G]	Acts [M]	Mem [M]	Conv
trained_rfdn_best	DIV2K_val(801-900)	28.73	37.62	0.433	27.10	112.03	788.13	64
RFDN_baseline_1	DIV2K_val(801-900)	29.04	41.38	0.433	27.10	112.03	788.13	64
RFDN_baseline_2	DIV2K_val(801-900)	29.04	43.86	0.433	27.10	112.03	788.13	64
RFDN_baseline_3	DIV2K_val(801-900)	29.04	37.59	0.433	27.10	112.03	788.13	64

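Upcoming plan (2023.2.8-2023.2.12)

Improve the model on top of RFDN

Progress report (2023.2.13-2023.2.19)

workshop (MAI 2023 Video Super Resolution)

First, train the MRRN baseline from the 2022 official repository

Environment setup

Python 3.8.10
Tensorflow 2.9.0
- TensorFlow / CUDA / cuDNN / Python version compatibility table: https://www.tensorflow.org/install/source_windows
Cuda 11.2

Cudnn v8.7.0

Official site: https://developer.nvidia.com/cudnn
uname -m shows the CPU architecture; cuDNN ships builds for different architectures: x86_64, PPC, SBSA
After extracting with tar -xvf, install with the commands below and grant read permission to all users

sudo cp path_to_cudnn/include/cudnn* /usr/local/cuda/include
sudo cp path_to_cudnn/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn* /usr/local/cuda/lib64/libcudnn*

After installing cuDNN and CUDA, set the environment variables PATH and LD_LIBRARY_PATH in /etc/profile

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
export PATH=$PATH:/usr/local/cuda/bin
export CUDA_HOME=/usr/local/cuda

The folder /usr/local/cuda-11.2 can be symlinked to /usr/local/cuda

ln -s /usr/local/cuda-11.2 /usr/local/cuda

Check the cuDNN version with the command below; for recent cuDNN releases the command is

cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2 # -A 2 prints the 2 lines that follow each matching line

Results
1. Training with the default config.yml was too slow (about one week), so it was stopped midway
2. Training with a modified config.yml finished in about 8 hours, but the loss was large
3. Based on previous years' summary papers for this track, dropped the provided mobilernn baseline and looked at other CNN-based models

Pick a baseline from the NTIRE 2022 efficient super-resolution challenge and adapt it to mobile with pruning, distillation, and similar techniques
- Chose the NTIRE 2022 ESR winning solution RLFN (ByteDance) as the baseline; the plan was to convert the model to TensorFlow and test VSR directly on REDS -> abandoned the RLFN torch->onnx->tensorflow route because of dependency compatibility problems along the way
- Reimplemented RLFN directly in TensorFlow -> accuracy after training is too low ('psnr': 25.651411, 'ssim': 0.6954131) and needs debugging and improvement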

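Plan for this week

workshop improvements
- Pruning via NNI
- Quantization via NNI
- Hyper Parameter Optimization via NNI
Fix minor errors in the research-foundation section of the grant proposal
Exam revision

Progress report (2023.2.20-2023.2.26)

workshop (MAI 2023 Video Super Resolution)
1. RLFN accuracy improved, 'psnr': 25.57 -> 25.91, 'ssim': 0.69 -> 0.71; the reimplemented baseline structure may be wrong and must match the original to recover accuracy
2. Tried the RNN-based approach SWRN, 'psnr': 28.19, 'ssim': 0.8093;
3. Surveyed 3D convolution for video super-resolution; re-parameterizing 3D convolutions to speed up inference is an option for later

Plan for this week (2023.2.20-2023.2.26)

workshop
- Recover the PSNR of the TensorFlow version of the baseline model
- Prune to speed up inference if time allows
Exam revision

Progress report (2023.2.27-2023.3.5)

workshop (MAI 2023 Video Super Resolution)

VapSR_2 currently has satisfactory PSNR and SSIM, but its inference time is too long and needs optimization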

Model	Description	Dataset	Val PSNR	Val SSIM	Params	Runtime on oneplus7T [ms]
SWRN_0	Origin	REDS	27.931335	0.7803562	43,472	25.6
SWRN_1	recon_trunk block num=2	REDS	27.820051	0.77666414	36,512	26.9
ELSR_0(vsr 22 winner)	Origin	REDS	26.716854	0.73988235	3,468	19.3
VapSR_0	Origin	REDS	28.103758	0.7864979	154,252	5191.0
VapSR_1	Replace feature extraction conv and VAB’s 2 con1X1 with blueprint conv	REDS	28.02941	0.7845887	155,916	5798.0
VapSR_2	Replace feature extraction conv with blueprint conv and reduce Attention’s kernel size=3X3	REDS	28.021387	0.7831156	131,276	2694.0

AI benchmark setting for Runtime test:

Input Values range(min,max): 0,255
Inference Mode: FP16
Acceleration: TFLite GPU Delegate

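Plan for this week (2023.2.27-2023.3.5)

workshop
- Compress the model with the tensorflow model optimization toolkit (TFMOT) and check the effect
- Look for new ideas
Revise for two exams

Progress report (2023.3.6-2023.3.12)

workshop (MAI 2023 Video Super Resolution)

tensorflow model optimization toolkit (TFMOT) Pruning: TensorFlow supports pruning of subclassed models poorly and the code has some problems. So far pruning is applied only to the blueprint convolution in the feature-extraction head and not to the middle part, which holds most of the parameters, so inference time drops very little.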

Model	Description	Dataset	Val PSNR	Val SSIM	Params	Runtime on oneplus7T [ms]
VapSR_2	Replace feature extraction conv with blueprint conv and reduce Attention’s kernel size=3X3	REDS	28.021387	0.7831156	131,276	2694.0
VapSR_3	Pruning	REDS	28.021387	0.7831156	131,276	2673.0

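Plan for this week (2023.3.6-2023.3.12)

workshop
- Rebuild the model as a Functional Model and then prune it
- Quantize and deploy the model
Paper reviewing
Group meeting preparation

Progress report (2023.3.13-2023.3.19)

workshop (MAI 2023 Video Super Resolution)

tensorflow model optimization toolkit (TFMOT) Pruning, Weight Clustering: applied 50% pruning and clustering with 10 weight cluster centers overall to all convolutional layers of the model; the parameter count is far lower than when pruning was applied only to the feature-extraction part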

Model	Description	Dataset	Val PSNR	Val SSIM	Params	Runtime on oneplus7T [ms]
VapSR_3	Pruning feature extraction part	REDS	28.021387	0.7831156	131,276	-
VapSR_4	apply pruning, weights clustering to conv kernels	REDS	27.3111	0.7376	32,054	-
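
As a rough illustration of what the two TFMOT steps above do to a kernel, here is a minimal pure-Python sketch, not the TFMOT implementation: magnitude pruning to 50% sparsity followed by 1-D k-means clustering into 10 centroids. The function names, the random kernel, and the linear centroid initialisation are all assumptions made for the example. Note that this naive clustering may move pruned zeros onto a non-zero centroid, so it does not preserve sparsity.

```python
import random

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    # Indices sorted by |w|; the first k are the ones to prune.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:k])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]

def cluster_weights(weights, n_clusters, n_iters=20):
    """1-D k-means: replace every weight by its nearest centroid."""
    lo, hi = min(weights), max(weights)
    # Centroids start evenly spaced over [lo, hi] (n_clusters must be >= 2).
    centroids = [lo + (hi - lo) * i / (n_clusters - 1) for i in range(n_clusters)]
    for _ in range(n_iters):
        buckets = [[] for _ in centroids]
        for w in weights:
            j = min(range(n_clusters), key=lambda c: abs(w - centroids[c]))
            buckets[j].append(w)
        # Empty buckets keep their previous centroid.
        centroids = [sum(b) / len(b) if b else c for b, c in zip(buckets, centroids)]
    return [min(centroids, key=lambda c: abs(w - c)) for w in weights]

random.seed(0)
kernel = [random.gauss(0.0, 1.0) for _ in range(200)]   # hypothetical flattened conv kernel
sparse = magnitude_prune(kernel, sparsity=0.5)
clustered = cluster_weights(sparse, n_clusters=10)
print(sparse.count(0.0), len(set(clustered)))
```

After clustering, the kernel holds at most 10 distinct values, which is what makes the weights cheap to store as indices into a small codebook.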

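Plan for this week (2023.3.13-2023.3.19)

workshop
- Implement 8-bit quantization of the model
- Look for new ways to reduce runtime

Progress report (2023.3.20-2023.3.26)

workshop (MAI 2023 Video Super Resolution)
- TensorFlow Model Optimization Toolkit (TFMOT) pruning, weight clustering -> INT8 quantization-aware training (QAT) -> tflite
  - Problem 1: severe accuracy drop after QAT: {'psnr': 27.666351, 'ssim': 0.77187574} -> {'psnr': 27.008348, 'ssim': 0.7406609}
  - Problem 2: conversion to a tflite model works, but measuring runtime with AI Benchmark on the phone keeps reporting an input type/shape mismatch. The bug could not be located; the suspected cause is that the app does not allow a custom input dtype
    - Ask for help on the AI Benchmark forum
    - Consider adapting TensorFlow's Android super-resolution example to test the converted tflite model
work
- Ported the partial convolution (PConv) module, following the CVPR 2023 paper "Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks"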
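
To see why PConv is attractive for a mobile model, the sketch below counts multiply-accumulates for a dense convolution and for a partial convolution that convolves only a fraction of the channels and passes the rest through untouched. The feature-map size, channel count, and the 1/4 channel ratio are assumptions chosen for the example, not values taken from the model above.

```python
def conv_macs(h, w, k, c_in, c_out):
    """Multiply-accumulate count of a dense k x k convolution (stride 1, same padding)."""
    return h * w * k * k * c_in * c_out

def pconv_macs(h, w, k, c, ratio=4):
    """PConv applies a dense conv to the first c // ratio channels only."""
    c_p = c // ratio
    return conv_macs(h, w, k, c_p, c_p)

# Hypothetical low-resolution feature map: 180 x 320 pixels, 16 channels, 3 x 3 kernel.
h, w, k, c = 180, 320, 3, 16
dense = conv_macs(h, w, k, c, c)
partial = pconv_macs(h, w, k, c)
print(dense, partial, dense // partial)
```

With a 1/4 channel ratio the convolved part costs 1/16 of the dense layer, since the cost scales with the square of the channel count.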

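Plan for this week (2023.3.20-2023.3.26)

workshop
- Solve the problems above
work
- Improve PSNR on the REDS dataset

Progress report (2023.3.27-2023.4.2)

workshop (MAI 2023 Video Super Resolution)

Fixed the issue that tflite inference on the phone could only run on the CPU and could not use the TFLite GPU Delegate or NNAPI acceleration
Latest results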

Model	Description	Dataset	Val PSNR	Val SSIM	Params	Runtime on oneplus7T [ms]	FLOPs [G]
SWAT_3	Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=1), replace fc with 1*1 conv, replace pixel normalization with layer normalization	REDS	27.761642	0.7748446	25,664	27.8 (FP16_TFLite GPU Delegate)	2.949

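work
- The mobile video super-resolution model is essentially settled, and some comparison experiments were run in parallel; pruning / weight clustering / quantization still need fine-tuning
- Literature collection to understand the current state and latest progress of mobile video super-resolution

Plan for this week (2023.3.27-2023.4.2)

workshop
- Adjust the quantization strategy of the trained model to see whether inference time can be reduced further
- Try new training losses / training strategies to see whether PSNR and SSIM can be improved further
work
- Organize ideas and write the paper

Progress report (2023.4.3-2023.4.9)

workshop (MAI 2023 Video Super Resolution)

After pre-training with the L1 Charbonnier loss, continued training with the L2 loss -> PSNR: 27.76 -> 27.81 (up)
Improved the attention module: 1. enlarged the receptive field; 2. replaced some convolutions with grouped convolutions -> Params: 25,664 -> 24,160 (down), FLOPs: 2.949 -> 2.776 (down), but runtime rose from 27.8 ms to 30.0 ms because tflite supports grouped convolution poorly
Best result so far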

Model	Description	Dataset	Val PSNR	Val SSIM	Params	Runtime on oneplus7T [ms]	FLOPs [G]
SWAT_5	Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=1), replace fc with 1*1 conv, replace pixel normalization with layer normalization, enlarge train step numbers to 250,000	REDS	27.811176	0.7763541	25,664	27.6 (FP16_TFLite GPU Delegate)	2.949
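
For reference, the two losses used in the pre-train / fine-tune schedule above can be written as follows. This is a minimal sketch on flat lists of pixel values; the `eps` value is an assumption, since the report does not state it. PSNR is a monotone function of the mean squared error, which is one plausible reason why a final stage with the L2 loss nudges PSNR upward.

```python
import math

def charbonnier_loss(pred, target, eps=1e-3):
    """Smooth L1-like loss: mean of sqrt((p - t)^2 + eps^2)."""
    return sum(math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(pred, target)) / len(pred)

def l2_loss(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def psnr(pred, target, max_val=1.0):
    """PSNR in dB for signals in [0, max_val]; depends on the data only through the MSE."""
    mse = l2_loss(pred, target)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)

pred, target = [0.5, 0.5], [0.4, 0.6]   # toy pixel values
print(charbonnier_loss(pred, target), l2_loss(pred, target), psnr(pred, target))
```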

work
- Collected papers that use exactly the same experimental setup on the REDS dataset
- The pruning / weight-clustering code previously ran on a single-frame, single-input single-output model; the current model is multi-input multi-output, and after adaptation the code now runs on it

Plan for this week (2023.4.3-2023.4.9)

work
- Summarize the results of the papers with exactly the same experimental setup
- Finish the model quantization part
- Write the paper

Progress report (2023.4.17-2023.4.23)

video super-resolution work
- Survey of channel/spatial/pixel attention RNNs
- Survey of dynamic pruning / sparsity
  1. Dynamic Channel Pruning: Feature Boosting and Suppression (ICLR 2019)
    - method: subsample feature map to scalar -> channel saliency predictor (fully connected) -> k-winners-take-all channel selection (top-k select)
    - summary: saves MACs and memory usage but cannot reduce inference latency. It fails to achieve real-world acceleration because its hardware-incompatible channel sparsity requires repeatedly indexing and copying the selected filters to new contiguous memory for multiplication.
  2. Dynamic Slimmable Network (CVPR 2021)
    - method: In-place distillation with In-place Ensemble Bootstrapping (IEB) scheme to train Dynamic Supernet -> sandwich gate sparsification (SGS) to train Dynamic Slimming Gate
    - summary: dynamic sliceable convolution via a double-headed dynamic gate, which achieves practical acceleration because filters remain contiguous and static during dynamic weight selection.
- Implemented super-resolution with the Information Multi-Distillation Block (IMDB); inference latency dropped to about 20 ms
Model	Description	Dataset	Val PSNR	Val SSIM	Params	Runtime on oneplus7T [ms]	FLOPs [G]
SORT_0	Sliding Window, IMDB	REDS	27.738451	0.77409536	17,356	20.6 (FP16_TFLite GPU Delegate)	2.084
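
The Feature Boosting and Suppression pipeline surveyed above (pool each channel to a scalar, predict per-channel saliency with a fully connected layer, keep the top-k channels) can be sketched as below. This is a toy illustration with made-up weights, not the paper's code; in particular, pooling by plain averaging is an assumption.

```python
def global_avg_pool(feature_map):
    """feature_map: list of channels, each a 2-D list -> one scalar per channel."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]

def saliency(pooled, fc_weight, fc_bias):
    """Fully connected predictor with ReLU: one score per output channel."""
    return [max(0.0, sum(w * x for w, x in zip(row, pooled)) + b)
            for row, b in zip(fc_weight, fc_bias)]

def top_k_mask(scores, k):
    """k-winners-take-all: keep the k highest scores, suppress the rest to zero."""
    winners = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    keep = set(winners)
    return [s if i in keep else 0.0 for i, s in enumerate(scores)]

# Two input channels, four output channels (hypothetical predictor weights).
feature_map = [[[1, 1], [1, 1]], [[0, 2], [2, 0]]]
fc_weight = [[1, 0], [0, 2], [1, 2], [-1, 0]]
fc_bias = [0.0, 0.0, 0.0, 0.0]

scores = saliency(global_avg_pool(feature_map), fc_weight, fc_bias)
mask = top_k_mask(scores, k=2)
print(scores, mask)
```

Only the output channels with a non-zero mask entry need to be computed. Gathering their filters into contiguous memory for every input is the step that the summary above identifies as hardware-unfriendly.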

Plan for this week (2023.4.17-2023.4.23)

video super-resolution work
- Try reworking the current model with dynamic routing (one week to show results; targets: PSNR up to 28.00, runtime down to 20 ms)
Paper reviewing

Progress report (2023.5.8-2023.5.14)

video super-resolution work
- Survey of video frame selection
- SWAT / SORT hyperparameter fine-tuning
- Dynamic routing was abandoned

Plan for this week (2023.5.8-2023.5.14)

video super-resolution work
- Write the paper and tune the model

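Progress report (2023.5.15-2023.5.21)

video super-resolution work
- Wrote part of the paper's introduction
- Added a Non Activation Block to the model and used the previous frame's HR prediction as an alignment aid (former predicted HR frame acting as auxiliary aligned frame)

Plan for this week (2023.5.15-2023.5.21)

video super-resolution work
- Model improvement: modify the Information Multi-Distillation Block (IMDB) -> aggregate the channels that contribute most to PSNR + partial conv
- Model improvement: lightweight feature alignment
- Paper writing: Introduction + Related Work

Progress report (2023.5.22-2023.5.28)

video super-resolution work

Paper writing
Model fine-tuning:
1. Activation Free Block: use multiplication instead of an activation function to produce non-linearity -> PSNR ↓, SSIM ↓
2. Use an attention mechanism to align the bilinear-upsampled frame that is added as a residual to the main network's super-resolution output; the alignment target is the previous frame's HR prediction (former predicted HR frame acting as auxiliary aligned frame) -> PSNR ↑, SSIM ↑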

Benchmark

Rank	Model	Source	Dataset	Test PSNR	Test SSIM	Params
1	Diggers	Real-Time Video Super-Resolution based on Bidirectional RNNs(2021 SOTA)	REDS(train_videos: 240, test_videos: 30)	27.98	-	39,640
2	VSR_12	Ours	REDS(train_videos: 240, test_videos: 30)	27.981062	0.7824855	57,696
3	SORT_2	Ours	REDS(train_videos: 240, test_videos: 30)	27.93981	0.7808094	45,264
4	SWRN	Sliding Window Recurrent Network for Efficient Video Super-Resolution (2022 SOTA)	REDS(train_videos: 240, test_videos: 30)	27.92	0.77	43,472
5	MVSR_0	Ours	REDS(train_videos: 240, test_videos: 30)	27.915539	0.7799377	35,777
6	SWAT_3_5	Ours	REDS(train_videos: 240, test_videos: 30)	27.840628	0.7774375	37,312
7	EESRNet	EESRNet: A Network for Energy Efficient Super-Resolution(2022)	REDS(train_videos: 240, test_videos: 30)	27.84	-	62,550
8	LiDeR	LiDeR: Lightweight Dense Residual Network for Video Super-Resolution on Mobile Devices (2022)	REDS(train_videos: 240, test_videos: 30)	27.51	0.76	-
9	EVSRNet	EVSRNet: Efficient Video Super-Resolution with Neural Architecture Search(2021)	REDS(train_videos: 240, test_videos: 30)	27.42	-	-
10	RCBSR	RCBSR: Re-parameterization Convolution Block for Super-Resolution(2022)	REDS(train_videos: 240, test_videos: 30)	27.28	0.775	-

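pose/gesture recognition deploy
- Studied how projects similar to trt_pose are deployed

Plan for this week (2023.5.22-2023.5.28)

video super-resolution work
- Paper writing
- Model fine-tuning: following An Implicit Alignment for Video Super-Resolution (arXiv 2023.04), try adding positional encoding to the attention mechanism of the frame-alignment module
pose/gesture recognition deploy
- Survey projects in which the camera view follows a target, and try deploying one

Progress report (2023.5.29-2023.6.04)

video super-resolution work
- Paper writing
- Model fine-tuning:
  1. light-weight attention based frame alignment: PSNR ↑ 0.003, SSIM ↑ 0.006, Runtime ↑ 9 ms
pose/gesture recognition deploy
- Surveyed projects in which the camera view follows a target

Plan for this week (2023.5.29-2023.6.04)

video super-resolution work
- Paper writing
pose/gesture recognition deploy
- Try deploying camera view following of a target

Progress report (2023.6.05-2023.6.11)

video super-resolution work
- Paper writing
- Ablation experiments

Plan for this week (2023.6.05-2023.6.11)

video super-resolution work
- Paper writing
- Ablation experiments

Progress report (2023.6.12-2023.7.01)

video super-resolution work
- Paper writing, revision, and submission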

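Progress report (2023.7.03-2023.7.16)

video super-resolution on mobile
- Cleaned up the project code and uploaded it to GitHub
Deployed ZeroDCE on Jetson Nano: far from real-time; low-light enhancement of a single 512×512 image takes > 2 min.
NeurIPS reviewing
Added to the PPT: model compression and deployment part

Plan for this week (2023.7.03-2023.7.16)

Survey the latest quantization progress and look for the next research direction
Qualcomm has open-sourced its 8-bit floating-point quantization project (FP8 quantization); test it to see whether there is room for follow-up work

Progress report (2023.7.17-2023.7.23)

Deployed the low-light enhancement model ZeroDCE++ on Jetson Nano: a single 512×512 image takes about 10 ms, with fluctuations (up to 4931.46 ms per image); this basically meets the real-time requirement.

Upcoming plan (2023.7.17-2023.7.23)

Vacation (Weihai, Weifang, Handan)
ChinaMM trip to Yunnan (Kunming, Lijiang, Dali)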

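Progress report (2023.8.07-2023.8.13)

Deployed Face Tracking on Jetson Nano; combined with the earlier Pose Estimation it does not reach the real-time requirement of 30 frames/s
PRCV reviewing
FP8 quantization survey
- FP8 Quantization: The Power of the Exponent (Qualcomm_NeurIPS 2022)
  1. FP8 is better suited to scenarios with many outliers
  2. With PTQ its accuracy is better than INT8; with QAT it is slightly worse than INT8
- FP8 FORMATS FOR DEEP LEARNING (NVIDIA/Arm/Intel_ArXiv 2022.09) -> FP8 as a unified data format for training and inference
  1. FP8 can speed up training and reduce the resources it requires, while easing deployment and preserving the trained accuracy
  2. INT8 quantized models usually need calibration or fine-tuning; the mismatch between training and inference data types complicates deployment, and accuracy usually drops
- FP8 versus INT8 for efficient deep learning inference (Qualcomm_ArXiv 2023.06) -> FP8 currently cannot replace INT8 inference in performance or accuracy; INT4-INT8-INT16 is currently the best choice for edge inference
  1. With PTQ and pronounced outliers, FP8 has an accuracy advantage over INT8; this case can usually be handled with W8A16 mixed precision or QAT
  2. FP8 inference has a high hardware cost: FP8 MAC units are 50% to 180% less efficient than INT8
  3. For higher efficiency there are already some INT4 quantization tools, but so far no FP4-related work
- Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models (MSRA_ArXiv 2023.05) -> layer-wise mixed-precision LLM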
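
The claim that FP8 copes better with outliers can be checked with a small simulation. The sketch below builds the value grid of a simplified 8-bit float with 4 exponent and 3 mantissa bits and compares it with symmetric INT8 on data containing a single large outlier. It is an assumption-laden toy: no encodings are reserved for NaN or infinity, so the largest value is 480 rather than the 448 of the real E4M3 format, and the test data is made up.

```python
def minifloat_grid(exp_bits=4, man_bits=3):
    """All values of a simplified sign/exponent/mantissa format (no NaN or inf codes)."""
    bias = 2 ** (exp_bits - 1) - 1
    values = set()
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:                      # subnormal numbers
                v = frac * 2.0 ** (1 - bias)
            else:                           # normal numbers
                v = (1.0 + frac) * 2.0 ** (e - bias)
            values.update((v, -v))
    return sorted(values)

def quantize_fp8(xs, grid):
    """Scale so the largest magnitude hits the grid maximum, then round to nearest."""
    scale = max(abs(x) for x in xs) / grid[-1]
    return [min(grid, key=lambda g: abs(x / scale - g)) * scale for x in xs]

def quantize_int8(xs):
    """Symmetric per-tensor INT8 with the step set by the largest magnitude."""
    scale = max(abs(x) for x in xs) / 127
    return [max(-127, min(127, round(x / scale))) * scale for x in xs]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# 100 small activations plus one outlier that stretches the range (made-up data).
values = [0.01 * i for i in range(1, 101)] + [100.0]
grid = minifloat_grid()
err_fp8 = mse(values, quantize_fp8(values, grid))
err_int8 = mse(values, quantize_int8(values))
print(len(grid), grid[-1], err_fp8, err_int8)
```

The outlier forces the INT8 step to about 0.79, so most of the small values collapse to zero or one step, while the floating-point grid keeps a roughly constant relative error across magnitudes.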

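Plan for this week (2023.8.07-2023.8.13)

Thesis proposal report -> success
FANI Android app -> not done
Internship résumé -> success
"Sky Eye" wide-area persistent surveillance system PPT -> success
Try modifying Qualcomm_FP8 + SQuant -> pending
Try integrating Face Tracking and Pose Estimation so that the camera angle follows the person while estimating pose -> pending
Greater Bay Area algorithm competition: video frame interpolation + monocular depth estimation -> pending
AcWing algorithms course -> pending
CSAPP course + labs -> pending
Using Dipoorlet and MQBench -> pending

Progress report (2023.8.14-2023.8.27)

Surveyed airborne wide-area persistent surveillance solutions and prepared the PPT
Thesis proposal report
Jetson Nano project: Face Tracking + Pose Estimation
- Pose estimation based on NVIDIA's official trt_pose project was slow at inference; now deploying RTMPose, a lightweight pose estimation project released by Shanghai AI Lab in 2023
- Studied SenseTime's MMDeploy and MMPose projects, compiled and installed the dependencies, and set up the deployment environment on the Jetson Nano
- Finished the C++ code that drives the servo to adjust the camera position; next, this C++ code will be called from a .py file through the ctypes library

Plan for this week (2023.8.14-2023.8.27)

Finish the Jetson Nano project: Face Tracking + Pose Estimation
Survey inference frameworks such as TensorRT/TNN/MNN/NCNN, focusing on using TensorRT to accelerate RTMPose inference
Take part in the Greater Bay Area algorithm competition: video frame interpolation + monocular depth estimation
Submit internship applications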

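Progress report (2023.8.28-2023.9.10)

Finish the Jetson Nano project: Face Tracking + Pose Estimation -> 70%
Survey inference frameworks such as TensorRT/TNN/MNN/NCNN, focusing on using TensorRT to accelerate RTMPose inference -> 0%
Take part in the Greater Bay Area algorithm competition: video frame interpolation + monocular depth estimation -> 0%

Plan for this week (2023.8.28-2023.9.10)

Submit internship applications (30-40)
Survey applications of ChatGPT
Greater Bay Area algorithm competition: video frame interpolation + monocular depth estimation, build a baseline
Draft the FANI patent and compile the status of existing patents

Progress report (2023.9.11-2023.9.17)

ICDM submission; FANI patent drafting
Compiled the status of the group's existing patents
Greater Bay Area algorithm competition: video frame interpolation -> 0%, monocular depth estimation -> 10%

Plan for this week (2023.9.11-2023.9.17)

Submit internship applications (30-40)
Survey applications of ChatGPT
Greater Bay Area algorithm competition: video frame interpolation -> 50% (baseline built); monocular depth estimation -> 50% (baseline built)

Progress report (2023.9.18-2023.10.08)

Greater Bay Area monocular depth estimation competition:
- Misunderstood the data: it involves ground truth from six different datasets in total, and the meaning of the labels was not fully understood (e.g. units in mm or m, skymask, validmask, etc.)
- Selected some datasets with a clear structure (containing only imgs and gts) and fed them to the current SOTA model ZoeDepth to fine-tune its metric bins module; the accuracy after training was worse than that of the original model fine-tuned only on NYU Depth V2
- Current submission: 42/60 on leaderboard A; leaderboard B decides the final ranking and its results are not out yet
Video super-resolution quantization
National scholarship application defense / teaching assistant application / survey and preparation of the group meeting PPT
Internship applications (40+ submitted in total)

Plan for this week (2023.9.18-2023.10.08)

Internship applications + written tests and interviews
Advance quantization of SOTA video super-resolution models -> goal: deployable on the OnePlus 7T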

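Progress report (2023.10.09-2023.10.15)

Video Super-Resolution Quantization
- Building a channel-wise, distribution-aware quantization pipeline on BasicVSR++, the current SOTA model on Vid4, Vimeo90k, and REDS (there is no baseline yet for video super-resolution quantization, and the code is fairly difficult)
- Trying to bring in methods that work in other video perception tasks (Human Pose Estimation, Semantic Segmentation, Video Object Segmentation): following ICCV 2023 ResQ, quantize the residuals between the activations of adjacent frames in the network; their smaller variance helps reduce quantization error
ICDM registration and submission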
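
The intuition behind quantizing inter-frame residuals can be shown with a toy example: with the same bit width, a signal with a smaller range gets a finer quantization step. This sketch is not the ResQ implementation. The activations are made up, and for simplicity it assumes the previous frame's activation is available unquantized when reconstructing the current one.

```python
def quantize(xs, n_bits=4):
    """Symmetric uniform quantizer whose step is set by the largest magnitude."""
    qmax = 2 ** (n_bits - 1) - 1
    peak = max(abs(x) for x in xs)
    step = peak / qmax if peak > 0 else 1.0
    return [max(-qmax, min(qmax, round(x / step))) * step for x in xs]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical activations of one layer for two consecutive, similar frames.
prev_frame = [0.1 * i for i in range(-20, 21)]
curr_frame = [x + 0.02 * ((i % 5) - 2) for i, x in enumerate(prev_frame)]

# Option 1: quantize the current activation directly.
direct = quantize(curr_frame)

# Option 2: quantize only the residual and add it back to the previous activation.
residual = [c - p for c, p in zip(curr_frame, prev_frame)]
via_residual = [p + r for p, r in zip(prev_frame, quantize(residual))]

err_direct = mse(direct, curr_frame)
err_residual = mse(via_residual, curr_frame)
print(err_direct, err_residual)
```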

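Plan for this week (2023.10.09-2023.10.15)

BasicVSR++/VRT/RVRT + ResQ/CADyQ/DAQ -> VSR quantization pipeline construction

Progress report (2023.10.16-2023.10.22)

Video Super-Resolution Quantization
- Following GPTQ, finished the basic part of BasicVSR++ quantization (no ViT involved)
- Reading papers to find out whether several other SOTA models (ViTs) have modules that need separate treatment:
  - CVPR2022: TTVSR
  - NeurIPS2022: PSRT, RVRT
  - CVPR2023: IART
SelecQ: adjusted the LaTeX typesetting; journal registration and submission
The university HPC instance expired; built a Docker image on the lab's Inspur cluster

Plan for this week (2023.10.16-2023.10.22)

Combine ResQ with PaddleSlim to improve the quantization module for BasicVSR++ video super-resolution
Prepare the PPT for the deep neural networks course
Revise the patent
After laying out the facts and reasoning, my advisor agreed to a three-month internship at SenseTime. Over the moon :)

Progress report (2023.10.23-2023.10.29)

Video Super-Resolution Quantization
- Tried Baidu's PaddleSlim to quantize BasicVSR++ with static and dynamic post-training quantization (PTQ)
Prepared the PPT for the deep neural networks course
Had wisdom teeth removed -> delayed some of the VSR quantization work

Plan for this week (2023.10.23-2023.10.29)

play with paddleslim and basicvsr++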

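Progress report (2023.10.30-2023.11.05)

Video Super-Resolution Quantization
- BasicVSR++ PTQ: the quantization pipeline has bugs that are being fixed
  1. Convert the BasicVSR++ torch model to ONNX and check it
  2. Activation calibration, producing the quantization parameters: scale and zero_point
  3. Weight adjustment to improve quantization accuracy
  4. Quantization error analysis to locate quantization problems
- note: BasicVSR++ PTQ is currently based on the open-source tool Dipoorlet. Its code is concise, clear, and easy to modify, so ideas can be validated faster than with Baidu's PaddleSlim framework; once the VSR quantization method matures it can be migrated to PaddleSlim
Read literature for ideas to improve PTQ accuracy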
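
Step 2 of the PTQ flow above, activation calibration producing scale and zero_point, can be sketched with plain min/max statistics. This is a generic asymmetric 8-bit scheme written for illustration; it is not Dipoorlet's code, which may use other calibration criteria, and the sample values are made up.

```python
def calibrate(samples, n_bits=8):
    """Min/max calibration -> (scale, zero_point) of an asymmetric unsigned quantizer."""
    qmin, qmax = 0, 2 ** n_bits - 1
    lo = min(0.0, min(samples))     # the representable range must contain zero
    hi = max(0.0, max(samples))
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, n_bits=8):
    q = round(x / scale) + zero_point
    return max(0, min(2 ** n_bits - 1, q))

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

# Hypothetical activations observed while running calibration data through a layer.
samples = [-1.0, 0.0, 0.5, 3.0]
scale, zero_point = calibrate(samples)
restored = [dequantize(quantize(x, scale, zero_point), scale, zero_point) for x in samples]
print(scale, zero_point, restored)
```

Step 4, quantization error analysis, then amounts to comparing `restored` with `samples` layer by layer to find where the error concentrates.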

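Plan for this week (2023.10.30-2023.11.05)

Fix the bugs encountered when quantizing BasicVSR++ with Dipoorlet

Progress report (23.11.06-23.11.12)

Video Super-Resolution Quantization
- BasicVSR++ with Dipoorlet PTQ: quantization does not support dynamic inputs, i.e. videos of arbitrary length (time_step); opened an issue on GitHub, no reply yet
- BasicVSR++ with MQBench PTQ: the forward pass of BasicVSR++ contains dynamic control flow, i.e. branch conditions that depend on computed variables (Input/Activation). MQBench captures the forward computation graph with torch.fx symbolic_trace, which by design does not support dynamic control flow. Currently trying:
  1. Replace the model's dynamic control flow with static control flow
  2. Study torch.compile (i.e. TorchDynamo), newly released in torch 2.0, and then try it on the dynamic control flow that is widespread in the model's forward pass
Set up a RustDesk relay server to reduce remote-desktop latency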
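
Why symbolic tracing breaks on data-dependent branches can be shown with a toy tracer. The `Proxy` class below is a hypothetical stand-in written for this note, not torch.fx itself: it records operations instead of computing values, so a Python `if` on a proxy has nothing concrete to test. torch.fx raises a comparable error in the same situation.

```python
class Proxy:
    """Toy symbolic value: records ops into a graph instead of computing them."""
    def __init__(self, name, graph):
        self.name, self.graph = name, graph

    def _emit(self, op, other):
        rhs = other.name if isinstance(other, Proxy) else repr(other)
        out = Proxy(f"v{len(self.graph)}", self.graph)
        self.graph.append(f"{out.name} = {op}({self.name}, {rhs})")
        return out

    def __add__(self, other): return self._emit("add", other)
    def __mul__(self, other): return self._emit("mul", other)
    def __gt__(self, other):  return self._emit("gt", other)

    def __bool__(self):
        # A proxy has no concrete value, so the branch cannot be decided at trace time.
        raise TypeError("symbolic value used in control flow")

def trace(fn):
    graph = []
    fn(Proxy("x", graph))
    return graph

def dynamic_forward(x):
    if x > 0:               # condition depends on the input value
        return x * 2
    return x + 1

def static_forward(x, double=True):
    if double:              # condition depends on a plain Python flag
        return x * 2
    return x + 1

print(trace(static_forward))
try:
    trace(dynamic_forward)
except TypeError as err:
    print("trace failed:", err)
```

Option 1 in the list above corresponds to turning `dynamic_forward` into something shaped like `static_forward`, where every branch is decided by configuration rather than by tensor values.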

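Plan for this week (23.11.06-23.11.12)

Advance conventional quantization (naive PTQ) of VSR models
Internship-related work

Progress report (23.11.13-23.11.19)

Video Super-Resolution Quantization
- BasicVSR++ with Dipoorlet PTQ: quantization does not support dynamic inputs, i.e. videos of arbitrary length (time_step); the official reply on GitHub is that this is not supported for now
Sensetime Internship
- Talked with my mentor about the general workflow of quantized deployment in real business settings and its key difficulties (torch model -> ONNX computation-graph intermediate representation -> target-platform SDK); have not yet worked on a real project
- Surveyed LLM quantization; the plan is to gradually extend coverage of quantized deployment across the whole chain LLM -> Transformer -> CNN

Plan for this week (23.11.13-23.11.19)

Finish the initial survey of LLM quantization (the document will be shared with everyone) and discuss with my mentor which method to reproduce first
Prepare the ICDM presentation -> not done

Progress report (23.11.20-23.11.26)

Sensetime Internship
- LLM quantization: surveyed 10 representative papers in total, covering QAT, weight-only quantization, and weight-and-activation quantization; gave my mentor a preliminary report for discussion

Plan for this week (23.11.20-23.11.26)

Video Super-Resolution Quantization
- Prepare the ICDM presentation
Sensetime Internship
- Set up the LLM quantization environment and ran GPTQ on llama 7B
- Study Intel neural-compressor QAT quantization
- Study PyTorch FX QAT quantization
- Study mmdeploy quantization
- Derive and verify whether, for RepVGG with QAT, the floating-point range (Real Range) of the graph node before quantization can be obtained losslessly after the multi-branch convs are merged
- Multimodal model codino -> onnx -> onnx runtime (ort): register the unsupported operator grid_sampler (following MMCV / MMDeploy), then deploy inference on an A6000
Auto Drive
- Two orientations: one improves metrics on a specific dataset / environment by any means regardless of compute cost (leaderboard chasing); the other optimizes metrics under lightweight, resource-constrained conditions
- The lightweight route requires optimizing for each hardware platform (e.g. Jetson Orin), for example:
  - Design lightweight network architectures and the corresponding operators (ops)
  - Consider deployment inference latency; there is meaningful room for exploration here because of the gap between a model's on-paper FLOPs/MACs and its actual inference latency
  - Models running on real vehicles are limited by compute: CNNs when compute is small, Transformers when compute is large. Large companies are moving toward unified models (e.g. the CVPR 2023 best paper UniAD, which has not yet been deployed on a real platform; my mentor is working on deploying it. A hard part is correctly converting the model to the ONNX intermediate representation before quantizing it, and this step is not finished yet)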

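Progress report (23.11.27-23.12.10)

School Task
- ICDM presentation: prepared the PPT and attended the conference
- Revised the VSR video super-resolution patent
- Survey of face super-resolution / enhancement
Sensetime Internship
- Set up the LLM quantization environment and ran GPTQ on llama 7B
- Studied PTQ in Intel neural-compressor / PyTorch FX / mmdeploy. The rough flow, taking torch fx as an example, is as follows (a more detailed quantization pipeline document with code will follow)
  1. Prepare fx: fuse the model, i.e. the usual optimizations such as conv+bn, using fx to transform the model
  2. Insert observer: observers are inserted on inputs, outputs, and weights
  3. Calibration: feed data for calibration and collect statistics such as the max and min of weights and activations
  4. Convert fx: replace the observers with the corresponding quantize/dequantize modules and merge them into the original layers
- Derived and verified whether combining the RepVGG re-parameterization method with QAT allows a mathematically equivalent transformation of the weight range. Specifically: QAT is applied before the multi-branch convs are merged to improve accuracy; after merging, can the floating-point range (Real Range) of the merged conv weight be obtained equivalently from the floating-point ranges of the multi-branch conv weights learned by QAT? This floating-point range is essentially equivalent to, and interconvertible with, the scale factor and zero point of the quantize/dequantize process -> conclusion: no lossless equivalence is possible
- Perception model (object detection): codino -> onnx -> onnx runtime (ort) inference deployment: registered the unsupported operator grid_sampler (following MMCV / MMDeploy), then ran model inference on the CPU; once the pre- and post-processing are in place, inference results will be verified on the GPU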
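
The RepVGG conclusion above has a simple counterexample behind it. Assuming the branches have already been padded to the same kernel shape and had their BN folded in, merging is an element-wise sum of kernels, and the (min, max) of a sum depends on where the large values sit, not only on each branch's (min, max). The numbers below are made up for the illustration.

```python
def weight_range(w):
    """Floating-point range (min, max) of a flattened kernel."""
    return (min(w), max(w))

def merge(branch_a, branch_b):
    """Re-parameterization of two parallel linear branches: kernels add element-wise."""
    return [a + b for a, b in zip(branch_a, branch_b)]

# Two cases whose branches have identical per-branch ranges ...
a1, b1 = [-1.0, 0.0, 2.0], [-0.5, 0.0, 1.0]
a2, b2 = [2.0, 0.0, -1.0], [-0.5, 0.0, 1.0]   # same values as a1, different positions

# ... but whose merged kernels have different ranges.
merged1 = weight_range(merge(a1, b1))
merged2 = weight_range(merge(a2, b2))
print(weight_range(a1), weight_range(a2), merged1, merged2)
```

Since the same branch ranges lead to two different merged ranges, no function of the branch ranges alone can return the merged range exactly. At best the branch ranges give a bound on it.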

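Follow-up plan (23.11.27-23.12.10)

Study and test several open-source LLM PTQ methods, such as AWQ, SmoothQuant, and ZeroQuantV2
Build the pre- and post-processing for the codino model and deploy onnx runtime inference on the GPU
Investigate whether QAT can be done with onnx runtime, to ease the burden of aligning QAT with the input/output/weight parameter ranges of the deployed ops

Progress report (23.12.11-23.12.24)

School Task
- Compiled and submitted the course report for Principles of Deep Learning and Deep Neural Networks
Sensetime Internship
- object detection model deploy: based on the OpenMMLab open-source frameworks mmcv and mmdetection; spent considerable time on code details, as follows
  1. Using inheritance-based model configs; the model backbone is InternImage
  2. Separating model pre- and post-processing: pre-processing is mainly a resize; in post-processing, the model output is an np.ndarray of shape [n, 5] that must be correctly converted into bboxes and classes and visualized
  3. Handling the unsupported operator grid_sampler when converting the torch model to ONNX
  4. Related repositories:
    - https://github.com/Sense-X/Co-DETR
    - https://github.com/OpenGVLab/InternImage
- Survey of quantization frameworks, including Torch FX Quantization, Intel Neural Compressor, and TensorRT Quantization; see the PDF summary for details
- Investigated and tried QAT based on ONNX (background: when a torch model is converted to ONNX, the names in the model structure change, so the weights and quantization parameters from QAT, i.e. scale factor and zero point, cannot be matched to the exported ONNX model)
- Studied the LLM metric perplexity (PPL) in detail, along with the computation details of the associated torch cross entropy
- Using the opencloud cluster; learned slurm commands and parameters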
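
The relation between perplexity and cross entropy noted above is short enough to write out: PPL is the exponential of the mean per-token negative log-likelihood. The sketch below is plain Python on toy logits. Averaging over tokens mirrors what I understand to be the default `mean` reduction of torch's cross entropy, stated here as an assumption rather than checked against a specific version.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits), via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def perplexity(all_logits, targets):
    """exp of the mean per-token cross entropy."""
    nll = [cross_entropy(l, t) for l, t in zip(all_logits, targets)]
    return math.exp(sum(nll) / len(nll))

# A model that is uniform over a 4-token vocabulary has perplexity 4.
uniform_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
ppl = perplexity(uniform_logits, targets=[0, 1, 2])
print(ppl)
```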

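Follow-up plan (23.12.11-23.12.24)

Test the following LLM PTQ methods: AWQ, SmoothQuant, Outlier Suppression+, LLM.int8()
Advance ONNX QAT
video super-resolution: post-training dynamic quantization of basicvsr++ based on torch fx (the activation scale factor is determined at inference time rather than precomputed from statistics) -> understand where and how Torch FX ops are inserted / fused / manipulated
Academic writing course assignment

Progress report (23.12.25-24.01.07)

School Task
- Academic writing course assignment
- Joined the InternLM (书生·浦语) large-model hands-on camp organized by Shanghai AI Laboratory; finished the first two sessions so far and deployed a ChatGPT-like Q&A demo. Brief notes:
  - InternLM L-1: the InternLM full-chain open-source system
  - InternLM L-2: easy and fun InternLM demos
- Read the multimodal paper CogVLM (https://github.com/THUDM/CogVLM)
  - Figure:
  - Overall idea: first map the image input through an MLP into the same space as the text embedding (left side of the figure above), then graft onto the pre-trained LLM the Attention and MLP parts used for deep alignment of the two modalities, and train only those parts (right side of the figure above)
Sensetime Internship
- onnx QAT project stopped: detailed testing showed that the mismatch between the ONNX model structure names after torch.onnx.export() and the names in the original torch model has been resolved in newer versions. The original version was torch 1.8.0; the current torch 1.13 and torch 2.x no longer have the problem
- Worked out where and how Torch FX ops are inserted / fused / modified; the whole quantization flow based on the torch fx graph now works end to end
- Tried building a library that integrates mainstream LLM quantization algorithms (including AWQ, SmoothQuant, GPTQ) and can be extended with new ones; studied and compared Intel Neural-Compressor / LMDeploy / OpenPPL PPQ; how to proceed still needs further discussion with my mentor
- Other things learned: quantization / MLLM (multimodal large language models); see the PPT for details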
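
As a concrete instance of the op fusion mentioned above, conv+bn folding rewrites the convolution's weight and bias so that the batch-norm layer disappears at inference time. The sketch below checks the algebra on a 1x1 convolution at a single pixel. It illustrates the standard folding formula and is not taken from torch fx's source; all numbers are made up.

```python
import math

def conv1x1(x, w, b):
    """1x1 convolution at a single pixel: one dot product per output channel."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode batch norm with fixed running statistics."""
    return [g * (v - m) / math.sqrt(s + eps) + bt
            for v, g, bt, m, s in zip(y, gamma, beta, mean, var)]

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BN into the conv: w' = w * k, b' = (b - mean) * k + beta, k = gamma / sqrt(var + eps)."""
    fused_w, fused_b = [], []
    for row, bi, g, bt, m, s in zip(w, b, gamma, beta, mean, var):
        k = g / math.sqrt(s + eps)
        fused_w.append([k * wi for wi in row])
        fused_b.append((bi - m) * k + bt)
    return fused_w, fused_b

x = [1.0, -2.0, 0.5]                              # 3 input channels
w = [[0.2, -0.1, 0.4], [1.0, 0.5, -0.3]]          # 2 output channels
b = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.0, 0.3]
mean, var = [0.05, -0.1], [0.9, 1.2]

reference = batch_norm(conv1x1(x, w, b), gamma, beta, mean, var)
fused = conv1x1(x, *fuse_conv_bn(w, b, gamma, beta, mean, var))
print(reference, fused)
```

Fusing before inserting observers matters because the observers should see the ranges of the fused weights, which are the ones actually quantized.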

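Follow-up plan (23.12.25-24.01.07)

Follow up on video super-resolution quantization
Advance deployment testing and integration of LLM / MLLM quantization algorithms
Finish the InternLM hands-on camp

Progress report (24.01.07-24.01.21)

Finished the InternLM hands-on camp (including building a knowledge base with InternLM and LangChain, low-cost single-GPU fine-tuning of large models with XTuner, and quantized deployment of large models with LMDeploy)
VSR model quantization: first validating the flow on an mmdetection detection model; stuck at capturing the computation graph. Only once this step is solved can ops be inserted freely during quantization. Ran into the following two kinds of detailed problems, now being solved
- The forward pass contains input-dependent dynamic control flow; torch.fx cannot trace such computation graphs
- forward calls len() on inputs, but tracing turns inputs into abstract proxies, and a proxy object does not support len()
LLM / MLLM quantization: the AWQ and SmoothQuant quantization algorithms now run end to end on the common model llama2-7b

Follow-up plan (24.01.07-24.01.21)

Get computation-graph capture working
Read recent VSR papers to catch up on the latest progress
Other assigned tasks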
