MAI 2023 Mobile VSR Workshop Log

Published 2023-02-17, updated 2024-12-16

Workshop and Challenges @ CVPR 2023

Efficient Super-Resolution Challenge(ESR)

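Classic baselines:
- Information Multi-Distillation Network, IMDN (2019)
- Residual Feature Distillation Network, RFDN (2020)
- Residual Local Feature Network, RLFN (ByteESR2022)
During initial debugging, even a small change to a directory name caused unexpected errors elsewhere :(
Many teams used Quantization Aware Training (QAT)
Network architectures and weights of the 2022 leaderboard entries are all provided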
results

Model	Dataset	Val PSNR	Val Time [ms]	Params [M]	FLOPs [G]	Acts [M]	Mem [M]	Conv
trained_rfdn_best	DIV2K_val(801-900)	28.73	37.62	0.433	27.10	112.03	788.13	64
RFDN_baseline_1	DIV2K_val(801-900)	29.04	41.38	0.433	27.10	112.03	788.13	64
RFDN_baseline_2	DIV2K_val(801-900)	29.04	43.86	0.433	27.10	112.03	788.13	64
RFDN_baseline_3	DIV2K_val(801-900)	29.04	37.59	0.433	27.10	112.03	788.13	64
RFDN_baseline_4	DIV2K_val(801-900)	29.04	34.20	0.433	27.10	112.03	788.13	64
IMDN_baseline_1	DIV2K_val(801-900)	29.13	45.11	0.894	58.53	154.14	471.78	43
IMDN_baseline_2	DIV2K_val(801-900)	29.13	45.03	0.894	58.53	154.14	471.78	43
IMDN_baseline_3	DIV2K_val(801-900)	29.13	44.44	0.894	58.53	154.14	471.78	43

Mobile AI workshop 2023

Testing can be done on your own phone or on the provided remote devices (slow, with latency)
2022 tracks

Track	Sponsor	Evaluate_Platform	Final_Phase_Team/Participants
Bokeh Effect Rendering	Huawei	Kirin 9000’s Mali GPU	6/90
Depth Estimation	Raspberry Pi	Raspberry Pi 4	7/70
Learned Smartphone ISP	OPPO	Snapdragon 8 Gen 1	11/140
Image Super-Resolution	Synaptics	Synaptics VS680	28/250
Video Super-Resolution	MediaTek	MediaTek Dimensity 9000	11/160

2021 tracks

Track	Sponsor
Learned Smartphone ISP	MediaTek
Image Denoising	Samsung
Image Super-Resolution	Synaptics
Video Super-Resolution	OPPO
Depth Estimation	Raspberry Pi
Camera Scene Detection	Computer Vision Lab, ETH Zurich, Switzerland

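Planned track: Image Super-Resolution, starting in March -> changed to track: Video Super-Resolution
Trained the 2021 anchor-based plain net (ABPN) twice
- the first run stopped with an error at epoch 200
- the 600-epoch run finished, but the loss oscillated and did not converge
Tested the TF-Lite model provided by the original authors on Android; the test workflow is now understood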

Mobile AI workshop: Video Super-Resolution

papers
1. Ntire 2019 challenge on video super-resolution: Methods and results
2. Ntire 2020 challenge on image and video deblurring
3. Pynet-v2 mobile: Efficient on-device photo processing with neural networks
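  - Image Signal Processing (ISP): the phone imaging pipeline is light -> CMOS sensor -> ISP imaging engine -> AI (GPU) -> picture. The lens and CMOS sensor may miss or corrupt details when converting the optical signal into a digital signal of 0s and 1s, and the ISP's main job is to "correct", "verify", and "compensate" for this.
  - The mobile version of the PyNET model, designed for easy on-device deployment, targets an end-to-end learned ISP. It is recent: 2022 26th International Conference on Pattern Recognition (ICPR). IEEE, 2022.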
  - CNN based
4. Microisp: Processing 32mp photos on mobile devices with deep learning. In: European Conference on Computer Vision (2022)
5. Real-Time Video Super-Resolution on Smartphones with Deep Learning,Mobile AI 2021 Challenge: Report
  - Results and Discussion
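    - Team Diggers: the winning solution, built on Keras/TensorFlow, from the University of Electronic Science and Technology of China. It was the only entry to use recurrent connections to exploit inter-frame dependencies for better reconstruction; all other solutions were based on single-frame super-resolution.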
6. Power Efficient Video Super-Resolution on Mobile NPUs with Deep Learning, Mobile AI & AIM 2022 challenge: Report
  - tutorial: https://github.com/MediaTek-NeuroPilot/mai22-real-time-video-sr. baseline:MobileRNN
  - scoring: Final Score = α · PSNR + β · (1 - power consumption), with α = 1.66 and β = 50; the two metrics that matter are PSNR and power consumption
  - Discussion:
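    - The majority of models followed a simple single-frame restoration approach to improve the runtime and power efficiency; the networks are all fairly shallow.
    - GenMedia Group (a South Korean company) made a small improvement to ABPN, the previous year's winning single-image SR solution. It ranked 6th but had the best PSNR (28.40), one of only two solutions with PSNR above 28; the other was team 221B's RNN-based method.
    - RNN-based solutions had slower inference and higher energy consumption.
    - Summary: as of 2022, CNNs are a good fit for on-device video super-resolution, because they balance runtime, energy consumption, and restoration quality.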
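A minimal sketch of the 2022 scoring rule quoted above, handy for comparing candidate models before submitting. The function name and the example PSNR/power values are illustrative assumptions; only the formula and the constants α = 1.66, β = 50 come from the report as quoted, and power consumption is assumed to be expressed in watts.

```python
def final_score(psnr: float, power_w: float,
                alpha: float = 1.66, beta: float = 50.0) -> float:
    """Final Score = alpha * PSNR + beta * (1 - power consumption)."""
    return alpha * psnr + beta * (1.0 - power_w)


# Hypothetical models: a small PSNR gain does not pay for a large power increase.
low_power = final_score(psnr=27.3, power_w=0.10)
high_psnr = final_score(psnr=27.9, power_w=0.80)
print(f"low power: {low_power:.2f}, high PSNR: {high_psnr:.2f}")
```

With these weights, saving 0.1 W is worth 5 points, about as much as 3 dB of PSNR, which is consistent with the shallow single-frame networks dominating the ranking.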
7. Sliding Window Recurrent Network for Efficient Video Super-Resolution
  - SWRN makes use of the information from neighboring frames to reconstruct the HR frame, which recovers richer details than single-frame super-resolution.
  - A bidirectional hidden state is used to recurrently collect temporal-spatial relations over all frames.
  - Pioneer network: SRCNN
  - Video super-resolution: the most important part is frame alignment
    - VESPCN and TOFlow: optical flow to align frames
    - TDAN and EDVR: deformable convolution. Especially, EDVR enjoys the merits of implicit alignment and its PCD module.
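    - Recurrent networks: use the hidden state to record the important temporal information.
  - On the test platform: runtime 10.1 ms, 0.80 W@30FPS. This is why the final score was low, even though its PSNR and SSIM are both better than those of the first-place MVideoSR (Xiaomi) -> look for ways to speed up computation and reduce energy consumption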
8. Lightweight Video Super-Resolution for Compressed Video -> Compression-informed Lightweight VSR (CILVSR)
  - Recurrent Frame-based VSR Network (FRVSR, RBPN, RRN)
  - Spatio-Temporal VSR Network (SOF-VSR, STVSR, TDAN, TOFlow, TDVSR-L)
  - Generative Adversarial Network (GAN)-based SR Network
  - Video Compression-informed VSR Network (FAST, COMISR, CDVSR, CIAF)
9. RCBSR: Re-parameterization Convolution Block for Super-Resolution
  - ECBSR baseline
  - Multiple paths ECB re-parametrization
  - FGNAS
10. Deformable 3D Convolution for Video Super-Resolution
  - deformable 3D convolution
11. Efficient Image Super-Resolution Using Vast-Receptive-Field Attention (VapSR, torch code available)
  - improving the attention mechanism
    - large kernel convolutions
    - depth-wise separable convolutions
    - pixel normalization -> train steadily
  - Compared with ByteDance's RLFN -> SOTA performance with fewer parameters
12. LiDeR: Lightweight Dense Residual Network for Video Super-Resolution on Mobile Devices (no code)
  - Targets phones; simple structure; REDS 320x180 X4 upscaling -> psnr: 27.51, ssim: 0.769 (open question: was this result actually measured on the phone or not?)
  - REDS 320x180 X4 upscaling runs fast at 139 FPS -> FSRCNN: 45 FPS, ESPCN: 52 FPS
  - Test platform: TensorFlow Lite fp16 with the TF-Lite GPU delegate on a Xiaomi Mi 11 (Qualcomm Snapdragon 888 SoC, Qualcomm Adreno 660 GPU, 8 GB RAM)
13. Fast Online Video Super-Resolution with Deformable Attention Pyramid
  - recurrent VSR architecture based on a deformable attention pyramid (DAP)
  - Compared against RRN (mobile_rrn, the official MAI VSR example, is very slow) -> not suitable for MAI VSR
    - Run[ms] / fps[1/s] / FLOPs[G] / MACs[G]
      - 28 / 35.7 / 387.5 / 193.6
      - 38 / 26.3 / 330.0 / 164.8
2022 challenge methods (ranked)
1. MVideoSR (no code)
  - paper title: ELSR: Extreme Low-Power Super Resolution Network For Mobile Devices
  - affiliation: Video Algorithm Group, Camera Department, Xiaomi Inc., China
  - methods:
    1. core idea: a mobile-friendly network that consumes as little energy as possible; discard complex operations such as optical flow and multi-frame feature alignment, and start from single-frame baselines.
    2. multi-branch distillation structures show a significant increase in energy consumption for only a slight increase in PSNR compared with a plain convolutional network of similar parameter count, so the team abandoned multi-branch architectures and focused on plain convolutional SR networks.
    3. though attention modules (ESA, CCA and PA) bring performance improvements, the extra energy consumption they introduce is still unacceptable
    4. architecture
      - description: single-frame input; only 6 layers, of which 5 have learnable parameters: 4 Conv layers and a PReLU activation layer. A Pixel-Shuffle operation (also known as depth2space) is used at the end to upscale the output without introducing more computation. The intermediate feature channels are all set to 6.
2. ZX VIP (no code)
  - paper title: RCBSR: Re-parameterization Convolution Block for Super-Resolution
  - affiliation: Audio & Video Technology Platform Department, ZTE Corp., China
  - methods:
    1. core idea: trade off SR quality against energy consumption, with ECBSR as the baseline. To keep power consumption low, the baseline is optimized in three respects: network architecture, NAS, and training strategy.
    2. network architecture: the re-parameterization technique is applied at the deploy stage, and the PReLU activation is replaced with ReLU. The tflite model with ReLU consumes less power than the one with PReLU, with no apparent discrepancy in PSNR. Finally, to further reduce power consumption, the output of the first CNN layer is added to the backbone output instead of the original input, because the original input would need to be copied to match the number of channels. Sub-pixel convolution is used to upsample the image in the network.
    3. NAS: the objective function of FGNAS is the task-specific loss plus a regularizer penalizing FLOPs. FGNAS -> Kim, H., Hong, S., Han, B., Myeong, H., Lee, K.M.: Fine-grained neural architecture search. arXiv preprint arXiv:1911.07478 (2019)
    4. training strategy: replace the L1 loss with the Charbonnier loss, because L1 makes the restored image too smooth and lacking in realism.
    5. architecture
3. Fighter (no code)
  - title: Fast Real-Time Video Super-Resolution
  - affiliation: None, China
  - methods:
    1. shallow CNN model with depthwise separable convolutions and one residual connection. The number of convolution channels in the model was set to 8, the depth-to-space op was used at the end of the model to produce the final output.
    2. architecture
4. XJTU-MIGU SUPER (no code)
  - title: Light and Fast On-Mobile VSR
  - affiliation: School of Computer Science and Technology, Xi’an Jiaotong University, China MIGU Video Co. Ltd, China
  - methods:
    1. small CNN-based model. Diagram below; trained for 2600 epochs in total :(
    2. architecture
5. BOE-IOT-AIBD (no code)
  - title: Lightweight Quantization CNN-Net for Mobile Video Super-Resolution
  - affiliation: BOE Technology Group Co., Ltd., China
  - methods:
    1. based on the CNN-Net architecture, its structure is illustrated in Fig 6. The authors applied model distillation, and used the RFDN CNN as a teacher model.
    2. architecture
6. GenMedia Group (no code)
  - title: SkipSkip Video Super-Resolution
  - affiliation: GenGenAI, South Korea
  - methods:
    1. inspired by last year's top solution from the MAI image super-resolution challenge: added one extra skip connection to the anchor-based plain net (ABPN) model.
    2. architecture
7. NCUT VGroup (no code)
  - title: EESRNet: A Network for Energy Efficient Super Resolution
  - affiliation: North China University of Technology, China Institute of Automation, Chinese Academy of Sciences, China
  - methods:
    1. also based on the ABPN model.
    2. architecture
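Several of the entries above finish with a depth-to-space (pixel-shuffle) op, so it helps to see exactly what it rearranges. Below is a pure-Python sketch for a single H x W x C image stored as nested lists. It follows the channel ordering documented for tf.nn.depth_to_space with NHWC data; the function itself is an illustration written for this log, not code from any of the teams.

```python
def depth_to_space(image, block):
    """Rearrange H x W x (C*block*block) nested lists into (H*block) x (W*block) x C."""
    h, w, d = len(image), len(image[0]), len(image[0][0])
    assert d % (block * block) == 0, "depth must be divisible by block**2"
    c = d // (block * block)
    out = [[[0] * c for _ in range(w * block)] for _ in range(h * block)]
    for y in range(h):
        for x in range(w):
            for i in range(block):          # row offset inside the upscaled block
                for j in range(block):      # column offset inside the upscaled block
                    for k in range(c):      # output channel
                        out[y * block + i][x * block + j][k] = \
                            image[y][x][(i * block + j) * c + k]
    return out


# One pixel with 4 channels becomes a 2x2 patch with 1 channel.
print(depth_to_space([[[1, 2, 3, 4]]], 2))  # [[[1], [2]], [[3], [4]]]
```

The op only moves data, which is why it adds resolution without adding multiply-accumulates. Note that torch.nn.PixelShuffle groups the input channels in a different order, so weights ported between the two frameworks may need a channel permutation; that is worth verifying whenever a torch model is rebuilt in TensorFlow.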

ideas

Try a lightweight version of BasicVSR++
Improve ABPN by adding the main ideas of BasicVSR++
Try applying Pynet_v2 to video super-resolution -> relatively complicated and tailored for ISP, so halted

First train the MRNN baseline

Environment

Python==3.8.10
Tensorflow-gpu==2.9.0
- TensorFlow / CUDA / cuDNN / Python version compatibility table: https://www.tensorflow.org/install/source_windows
Cuda==11.2
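- CUDA: a computing platform and programming model for accelerating applications on GPUs. The CUDA version refers to the version of the CUDA software.
- CUDA Toolkit: the software package containing the CUDA libraries and the CUDA toolchain, used to develop and compile CUDA applications.
  - CUDA libraries: the core libraries needed for CUDA programming, such as the CUDA Runtime and cuBLAS. They provide the basic GPU-accelerated functions and algorithms. (cuDNN is downloaded and installed separately, as described below, and the CUDA Driver library ships with the GPU driver.)
  - CUDA toolchain: tools that help develop and debug CUDA programs, such as the nvcc compiler, the CUDA-GDB debugger, and the Visual Profiler.
- note: check the highest CUDA version supported by the installed GPU driver with nvidia-smi
- note: check the CUDA toolchain version with nvcc --version
- CUDA Toolkit vs. driver version table: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html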

Cudnn==v8.7.0

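Official site: https://developer.nvidia.com/cudnn
cat /etc/os-release shows the Linux distribution and version
uname -m shows the CPU architecture; cuDNN ships separate builds for x86_64, PPC, and SBSA
After extracting with tar -xvf, install with the commands below, which also give all users read permission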

#!/bin/bash
sudo cp path_to_cudnn/include/cudnn*    /usr/local/cuda-11.2/include
sudo cp path_to_cudnn/lib/libcudnn*    /usr/local/cuda-11.2/lib64
sudo chmod a+r /usr/local/cuda-11.2/include/cudnn*   /usr/local/cuda-11.2/lib64/libcudnn*

After installing cuDNN and CUDA, set the environment variables PATH and LD_LIBRARY_PATH in /etc/profile

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
export PATH=$PATH:/usr/local/cuda/bin
export CUDA_HOME=/usr/local/cuda

The folder /usr/local/cuda-11.2 can be symlinked to /usr/local/cuda

sudo ln -s /usr/local/cuda-11.2 /usr/local/cuda

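CUDA versions can also be managed with the Linux update-alternatives tool. First register each installed version with sudo update-alternatives --install /usr/local/cuda (link path) cuda (alternative group name) /usr/local/cuda-xx (actual path) x (priority), then switch the default CUDA version with sudo update-alternatives --config cuda. Under the hood this only changes the symlink chain: /usr/local/cuda -> /etc/alternatives/cuda -> /usr/local/cuda-xx.x
Check the cuDNN version with the command below (newer cuDNN releases keep the version in cudnn_version.h):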

cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2 # -A 2 also prints the 2 lines after each matching line

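Results
1. Training with the default config.yml was too slow (about 1 week), so it was stopped midway
2. Training with the modified config.yml finished in about 8 hours, but the loss was large
3. Based on previous years' reports for this track, gave up training the provided mobilernn baseline and started considering other CNN-based models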

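A baseline could be picked from the NTIRE 2022 efficient super-resolution challenge and adapted to mobile with pruning, distillation, and similar techniques
- course project for NCSU’s Computer Science 791-025: Real-Time AI & High-Performance Machine Learning. The go-to techniques: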
  1. Pruning via NNI
  2. Quantization via NNI
  3. Hyper Parameter Optimization via NNI
  4. Color Optimization: RGB -> YCbCr
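- Picked RLFN (ByteDance), the 2022 NTIRE ESR winner, as the baseline; the plan was to convert the model to TensorFlow and test VSR directly on the REDS dataset -> abandoned the RLFN torch -> onnx -> tensorflow route because of dependency compatibility problems along the way
- Rebuilt RLFN directly in TensorFlow -> training succeeded but accuracy fell short ('psnr': 25.574987, 'ssim': 0.69084775); needs debugging and improvement
- The first priority now is to determine whether my TensorFlow RLFN matches the original torch RLFN -> ceased
- Could first convert other models to TensorFlow with torch_to_tensorflow and inspect the results visually -> feasible, and the source code is not complicated; the hard part is matching the torch / onnx / onnx-tf / tensorflow-gpu versions. Wait for the official scripts once the competition starts
- The urgent task now is not version matching but getting a previous year's baseline running and modifying it -> ran this project to see pruning, quantization, and hyperparameter tuning in practice: https://github.com/briancpark/video-super-resolution.git -> it is all calls into the NNI library
- Train baseline SWRN: https://github.com/shermanlian/swrn
  - Structural re-parameterization: convert one structure's set of parameters into another set, and use the converted parameters to parameterize a different structure. As long as the parameter conversion is equivalent, swapping the two structures is equivalent.
  - First test the provided ckpt-98 -> result 'psnr': 27.931335, 'ssim': 0.7803563
  - Reduce block_num of recon_trunk_forward / recon_trunk_backward / recon_trunk to 2 and train from scratch to see the result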
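The structural re-parameterization idea noted under the SWRN item can be checked on a toy case. The sketch below is a 1-D illustration of my own, not code from SWRN or ECBSR: a training-time block with a 3-tap branch, a 1-tap branch, and an identity shortcut is folded into a single 3-tap kernel, and both versions give the same output.

```python
def conv1d_same(x, kernel):
    """Zero-padded 'same' 1-D correlation with an odd-length kernel."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(x))]


def merge_branches(k3, k1):
    """Fold a 3-tap branch, a 1-tap branch and an identity shortcut into one 3-tap kernel."""
    # The 1-tap kernel and the identity (weight 1.0) both act on the centre tap only.
    return [k3[0], k3[1] + k1[0] + 1.0, k3[2]]


x = [1.0, 2.0, -1.0, 0.5]
k3, k1 = [0.2, -0.5, 0.1], [0.3]

# Training-time structure: three parallel branches summed.
multi_branch = [a + b + c for a, b, c in
                zip(conv1d_same(x, k3), conv1d_same(x, k1), x)]
# Deploy-time structure: one plain convolution with the merged kernel.
single_branch = conv1d_same(x, merge_branches(k3, k1))
print(max(abs(a - b) for a, b in zip(multi_branch, single_branch)))
```

The same folding works for 2-D kernels by zero-padding the 1x1 kernel to 3x3 before adding, which is why the deployed model pays only for a plain convolution.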
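- Build the pipeline following ELSR from last year's track winner, MVideoSR
  - L1 loss (Mean Absolute Error, MAE) -> the mean absolute difference between predictions and labels. Less sensitive to outliers, hence more robust; near zero error the gradient magnitude stays constant instead of shrinking, which tends to cause oscillation
  - L2 loss (Mean Squared Error, MSE) -> the mean squared difference between predictions and labels. Sensitive to outliers, hence less robust; the gradient shrinks as the error shrinks, which avoids oscillation.
  - TensorFlow provides the common losses in the tf.keras.losses module; Charbonnier is the exception and has to be written by hand:
    - L1 loss: tf.keras.losses.mean_absolute_error(y_true, y_pred)
    - L2 loss (MSE): tf.keras.losses.mean_squared_error(y_true, y_pred)
    - L1 Charbonnier loss: not part of tf.keras.losses; implement it as a custom loss.
    - "M2" loss: the nearest built-in is tf.keras.losses.mean_absolute_percentage_error(y_true, y_pred), which computes an absolute (not squared) relative error, so it does not exactly match the M2 formula below.
    - note: all of these functions take y_true and y_pred, the ground truth and the prediction.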
  - L1 Loss: $L_1 = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$
  - L2 Loss (MSE): $L_2 = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2$
  - L1 Charbonnier Loss: $L_{Charbonnier} = \frac{1}{N} \sum_{i=1}^{N} \sqrt{ \left( y_i - \hat{y}_i \right)^2 + \epsilon^2 }$
  - M2 Loss: $L_{M2} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i - \hat{y}_i}{y_i + \epsilon} \right)^2$, where $\epsilon$ is a small constant such as $10^{-6}$ that prevents division by zero.
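The first three formulas above translate directly into code. This is a framework-free sketch on plain Python lists so the arithmetic is easy to check by hand; the function names and the default eps are my own choices, and a real training loss would be written with tensor ops instead.

```python
import math


def l1_loss(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def l2_loss(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def charbonnier_loss(y_true, y_pred, eps=1e-6):
    """Smooth L1 variant: mean of sqrt(diff^2 + eps^2)."""
    return sum(math.sqrt((t - p) ** 2 + eps ** 2)
               for t, p in zip(y_true, y_pred)) / len(y_true)


y_true, y_pred = [0.0, 0.0], [1.0, 3.0]
print(l1_loss(y_true, y_pred), l2_loss(y_true, y_pred),
      charbonnier_loss(y_true, y_pred))
```

Charbonnier behaves like L1 for large errors but stays differentiable at zero, where its value is eps rather than 0, which is the property that motivates using it in place of plain L1.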
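Could look at which operations TFLite accelerates and change the model accordingly
Try migrating the torch VapSR from single-image SR to video SR
- A small pitfall in the TensorFlow model code: tf.keras.Sequential([upconv1,pixel_shuffle,lrelu,upconv2,pixel_shuffle]). When the same pixel_shuffle module is used twice in a tf.keras.Sequential, the two must be separate instances with different names; otherwise the Sequential always ends up with only one pixel_shuffle module
- 'from .XXX import YYY' is a relative import: the Python interpreter resolves the module or package relative to the current package, so current.py must live inside a Python package (creating an empty __init__.py makes the folder a package)
- [B,H,W,48] - conv1X1 expands channels -> [B,H,W,64]; the conv1X1 weight is a tensor of size [64,48,1,1] (torch layout; the equivalent TensorFlow kernel shape is [1,1,48,64])
- blueprint conv (M: input_channels, N: out_channels); BSConvU: for each output channel, first aggregate the input channels with an M X 1 weight vector into a single feature map, then apply one K X K convolution to it; N such pairs give the N output feature maps.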
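To see why swapping standard convolutions for BSConvU shrinks the model, here is a small parameter-count sketch. It assumes the BSConvU layout described above (a pointwise M -> N projection followed by one K x K kernel per output channel) and ignores biases; the helper names are mine.

```python
def standard_conv_params(m: int, n: int, k: int) -> int:
    """Weights of a standard K x K convolution from M to N channels (no bias)."""
    return m * n * k * k


def bsconvu_params(m: int, n: int, k: int) -> int:
    """Weights of BSConvU: 1x1 pointwise (M*N) plus depthwise K x K (N*K*K), no bias."""
    return m * n + n * k * k


m, n, k = 48, 64, 3
print(standard_conv_params(m, n, k), bsconvu_params(m, n, k))  # 27648 3648
```

For the 48 -> 64 channel, 3x3 case the blueprint version needs roughly 13% of the weights of the standard convolution.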
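Pruning and quantizing VapSR_2
- List comprehension: result = [expression for item in iterable], where expression is what gets added to the list, item is each element of the iterable, and iterable is the object being iterated. Example: metric_list = [func for name, func in self.metric_functions.items()]
- Wrap every tf.keras.layers.Conv2D in vapsr_3 with tfmot.sparsity.keras.prune_low_magnitude() for pruning
- Methods defined in a class take self as the first parameter by default; one method calls another with self.method(), not a bare method()
- Keras networks can be built in three ways: keras.models.Sequential(), keras.models.Model(), and subclassing. Note: from TensorFlow 2.x on, the tf.keras.Sequential() and tf.keras.Model() classes can be used directly, without going through the keras.models API
  - Keras offers two APIs: the Sequential API and the Functional API. The Sequential API is a simple linear stack of layers, suitable for many simple models. More complex models, for example with multiple inputs or outputs, need the Functional API.
  - The Functional API is implemented through tf.keras.Model() and offers a more flexible way to define the model structure and the connections between layers: models with multiple inputs and outputs, shared layers, arbitrary computation graphs, and so on. The Sequential API supports none of these.
  - So the Functional API is the more flexible and powerful choice for complex models, and tf.keras.Model() gives a convenient and consistent way to define and build them.
- / is true division: 5 / 2 gives 2.5, and the result is a float even when both operands are integers. // is floor division: 5 // 2 gives 2.
- Rebinding a tf.Variable with = or + is a bad idea, because it simply creates a new tensor instead of updating the existing variable. assign() and assign_add() are TensorFlow's in-place operations: they write the result into the existing variable rather than creating a new one.
- Multi-line comments in shell scripts (.sh): << COMMENT ... COMMENT. In shell, << is the here-document syntax, which embeds a block of text or code in the script.
- Model type changes during pruning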
  1. initial: type(self.model) == <class ‘VapSR_3.vapsr_3’> (i.e. Keras Subclass Model)
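    - A Keras subclassed model is a way of building a custom model that is more flexible than the Sequential and Functional APIs: the model is defined as a Python class inheriting from tf.keras.Model. The advantage is that non-linear, complex structures can be built freely and model code is easy to reuse.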
  2. apply tensorflow.keras.Model() method -> type(functional_model) == <class ‘keras.engine.functional.Functional’>
  3. add tfmot.sparsity.keras.prune_low_magnitude() wrapper -> type(pruned_model) == <class 'keras.engine.functional.Functional'>; calling tfmot.sparsity.keras.prune_low_magnitude(functional_model, **pruning_params) directly still raises: ValueError: Subclassed models are not supported currently. :(
  4. add tfmot.sparsity.keras.prune_low_magnitude() wrapper with another method -> type(pruned_model_1) == <class ‘keras.engine.functional.Functional’>
  5. pruning -> type(pruned_model) == <class ‘keras.engine.functional.Functional’>
  6. Although type(pruned_model) == <class 'keras.engine.functional.Functional'>, passing it to stripped_pruned_model = tfmot.sparsity.keras.strip_pruning(pruned_model) raises: ValueError: Expected model argument to be a functional Model instance, but got a subclassed model instead: <keras.saving.saved_model.load.BSConvU object at 0x7f66f06fa520>
  7. pruned_model.layers == [<keras.engine.input_layer.InputLayer object at 0x7f66f06fa580>, <keras.saving.saved_model.load.BSConvU object at 0x7f66f06fa520>, <keras.engine.sequential.Sequential object at 0x7f66f06fa4f0>, <keras.layers.convolutional.conv2d.Conv2D object at 0x7f66f0717c40>, <keras.layers.core.tf_op_layer.TFOpLambda object at 0x7f66f80a7fd0>, <keras.engine.sequential.Sequential object at 0x7f66f80a7cd0>]
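- The errors from stripping the tfmot.sparsity.keras.prune_low_magnitude() wrapper never stopped -> build VapSR directly as a Functional model
- Layer Norm / Batch Norm / Instance Norm / Pixel Norm in detail
  - Batch Norm: normalizes each feature channel (C), using the mean and variance over all samples in the batch (N) and all spatial positions (H, W). One mean/variance pair per channel.
  - Layer Norm: normalizes each sample (N), using the mean and variance over all feature channels (C) and spatial dimensions (H, W). One pair per sample.
  - Instance Norm: normalizes each sample (N) and each feature channel (C), using the mean and variance over the spatial dimensions (H, W). One pair per sample and channel.
  - Pixel Norm: normalizes each sample (N) and each pixel position (H, W), using the mean and variance over all feature channels (C). One pair per sample and pixel.
  - The concrete implementation could be discussed at the group meeting (as a short final part)
  - The original VapSR authors' torch implementation of pixel norm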
  - ```
   import torch.nn as nn
   # Assumption: Attention and default_init_weights are defined elsewhere in the
   # VapSR code base; they are not shown in this excerpt.

   class VAB(nn.Module):
       def __init__(self, d_model, d_atten):
           super().__init__()
           self.proj_1 = nn.Conv2d(d_model, d_atten, 1)
           self.activation = nn.GELU()
           self.atten_branch = Attention(d_atten)
           self.proj_2 = nn.Conv2d(d_atten, d_model, 1)
           # "Pixel norm" here is a LayerNorm over the channel axis only
           self.pixel_norm = nn.LayerNorm(d_model)
           default_init_weights([self.pixel_norm], 0.1)

       def forward(self, x):
           shortcut = x.clone()
           x = self.proj_1(x)
           x = self.activation(x)
           x = self.atten_branch(x)
           x = self.proj_2(x)
           x = x + shortcut
           x = x.permute(0, 2, 3, 1)  # (B, H, W, C): channels last, so LayerNorm reduces over C
           x = self.pixel_norm(x)
           x = x.permute(0, 3, 1, 2).contiguous()  # back to (B, C, H, W)

           return x
   # Reference: https://blog.csdn.net/weixin_39228381/article/details/107939602
```
- x = tf.constant([[1.,2.,4.,5.,7.,8.],[6.,7.,9.,10.,11.,12.],[2.,3.,5.,6.,8.,9.],[4.,5.,7.,8.,10.,11.]])
- mean, variance = tf.nn.moments(x, axes, shift=None, keepdims=False, name=None). The mean and variance are calculated by aggregating the contents of x across axes. Example: with x reshaped to [4,2,3], tf.nn.moments(x, 1) gives mean.shape == [4,3]; with keepdims=True it gives mean.shape == [4,1,3]
- Still need to spend time working out how many means and variances tf LayerNormalization / GroupNormalization actually compute when axis is a list/tuple of several axes, in other words how to control the normalization granularity at will with these two built-in layers. A workaround, I think, is to use transpose to swap axes (the equivalent of torch permute) and get the behavior indirectly.
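One way to reason about normalization granularity without any framework is to count how many (mean, variance) pairs each scheme computes for an NHWC tensor. The sketch below encodes the four definitions listed earlier; the table of reduced axes and the helper names are my own, and pixel_norm omits the learnable scale and offset that a real LayerNormalization layer also applies.

```python
from math import prod, sqrt

# Axes reduced by each scheme for an NHWC tensor (N=0, H=1, W=2, C=3).
REDUCED_AXES = {
    "batch_norm": (0, 1, 2),   # one statistic per channel
    "layer_norm": (1, 2, 3),   # one statistic per sample
    "instance_norm": (1, 2),   # one statistic per sample and channel
    "pixel_norm": (3,),        # one statistic per sample and pixel
}


def stat_count(shape, scheme):
    """Number of (mean, variance) pairs computed for the given NHWC shape."""
    reduced = REDUCED_AXES[scheme]
    return prod(dim for axis, dim in enumerate(shape) if axis not in reduced)


def pixel_norm(channels, eps=1e-5):
    """Normalize one pixel's channel vector to zero mean and (nearly) unit variance."""
    mean = sum(channels) / len(channels)
    var = sum((c - mean) ** 2 for c in channels) / len(channels)
    return [(c - mean) / sqrt(var + eps) for c in channels]


shape = (4, 8, 8, 16)
for scheme in REDUCED_AXES:
    print(scheme, stat_count(shape, scheme))
```

For a (4, 8, 8, 16) tensor this gives 16, 4, 64 and 256 statistics respectively, so pixel norm is the finest-grained of the four.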
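- *args and **kwargs are both special Python syntax for passing a variable number of arguments. The main difference:
  - *args collects a variable number of positional arguments, which the function receives as a tuple;
  - **kwargs collects a variable number of keyword arguments, which the function receives as a dictionary.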
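A short illustration of both directions: collecting arguments in a definition and unpacking them at a call site. The function name describe is just a placeholder for this example.

```python
def describe(*args, **kwargs):
    """Return the collected positional and keyword arguments with their container types."""
    return type(args).__name__, type(kwargs).__name__, args, kwargs


# Collecting: extra positional arguments land in a tuple, keyword arguments in a dict.
print(describe(1, 2, scale=4))

# Unpacking: * spreads a sequence into positional arguments, ** spreads a dict into keywords.
positional = [1, 2]
keywords = {"scale": 4}
print(describe(*positional, **keywords))
```

Both calls return the same result, which is why pruning_params can be built as a dict once and passed on as **pruning_params.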
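- Next, finish the pruning / clustering / quantization pipeline as soon as possible and bring the runtime down to about 10 ms
- A return in a recursive function does not return a value and end the program; it returns the value to the caller one level up, and so on until the outermost call returns
- add_pruning_wrapper():
  - Rebuilding the model with Sequential.add() works when the original model is already a Sequential, but the original model's call() method cannot be carried over
  - Difficulties with in-place replacement via setattr(object, name, new_model):
    1. while recursing, the current tf.keras.layers.Conv2D does not know its owning module (object) or its attribute name
    2. pruned_model = copy.deepcopy(model), then applying the pruning wrapper to the copied pruned_model: a subclassed tf.keras.Model() class is a custom object, so get_config() and from_config() all need to be overridden
  - model.__dict__ vs. dir(model)
    1. model.__dict__ returns a dict whose keys are the instance's attribute names (model.__dict__.keys()) and whose values are the corresponding attribute values (model.__dict__.values()). dir(model) returns a list of attribute names.
    2. Specifically, model.__dict__ holds only the attributes set on the instance itself, not those it gets from its class or base classes. dir(model) lists all attribute names, including class-level and inherited ones.
    3. model.__dict__ holds only instance data attributes. The list from dir(model) also includes methods, properties, and other class-level names.
  - pruned_model.layers[3] == <keras.layers.convolutional.conv2d.Conv2D object at 0x7ff557c57a60>: this layer is Keras's built-in Conv2D, not a custom layer defined by subclassing tf.keras.layers.Layer, and it did not show up in __dict__ here.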
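The difference between __dict__ and dir() is easy to confirm on a tiny class hierarchy. The classes below are placeholders for this example and have nothing to do with Keras internals.

```python
class Base:
    shared = "class attribute"

    def method(self):
        return "inherited method"


class Model(Base):
    def __init__(self):
        self.units = 8  # instance attribute, stored in the instance __dict__


m = Model()
print(m.__dict__)            # only what was set on the instance
print("shared" in dir(m))    # class-level and inherited names appear in dir()
print("method" in dir(m))
```

So a search over __dict__ finds only objects assigned as attributes on that particular instance, while dir() is the place to look for everything reachable by attribute access.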
- strip_pruning_wrapper():
  - tfmot.sparsity.keras.strip_pruning(): Only sequential and functional models are supported for now.
  - recursively strip pruning wrapper -> success
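The recursive wrap/strip approach can be sketched without TensorFlow. In the toy version below, Conv, Wrapper, and Block are hypothetical stand-ins for tf.keras.layers.Conv2D, the pruning wrapper, and a subclassed model or custom layer. The point is that iterating over the parent's attributes gives both the owning object and the attribute name, which is exactly what setattr needs, and that each recursive call returns its count to the caller one level up.

```python
class Conv:
    """Stand-in for a convolution layer."""
    def __init__(self, name):
        self.name = name


class Wrapper:
    """Stand-in for a pruning wrapper around a layer."""
    def __init__(self, layer):
        self.layer = layer


class Block:
    """Stand-in for a subclassed model: children are plain attributes."""
    def __init__(self, **children):
        for attr_name, child in children.items():
            setattr(self, attr_name, child)


def wrap_convs(module):
    """Replace every Conv attribute with Wrapper(Conv) in place; return how many were wrapped."""
    count = 0
    for attr_name, child in list(vars(module).items()):
        if isinstance(child, Conv):
            setattr(module, attr_name, Wrapper(child))  # parent and name are both known here
            count += 1
        elif isinstance(child, Block):
            count += wrap_convs(child)  # the nested count is returned one level up
    return count


def strip_wrappers(module):
    """Undo wrap_convs in place; return how many wrappers were removed."""
    count = 0
    for attr_name, child in list(vars(module).items()):
        if isinstance(child, Wrapper):
            setattr(module, attr_name, child.layer)
            count += 1
        elif isinstance(child, Block):
            count += strip_wrappers(child)
    return count


model = Block(head=Conv("head"), body=Block(c1=Conv("c1"), c2=Conv("c2")))
print(wrap_convs(model), type(model.body.c1).__name__)
print(strip_wrappers(model), type(model.body.c1).__name__)
```

A real Keras model also keeps internal layer-tracking state, so the actual implementation has more to handle than this sketch; the traversal pattern is the part that carries over.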
- lr_scheduler: CosineDecayRestarts
- pruning_train and clustering_train loss differ a lot from the pretraining loss (50+ vs 10+); something is off
- quantization
  1. tensorflow quantize:
    - def quantize_scope(*args)
    - def quantize_model(to_quantize, quantized_layer_name_prefix='quant_')
    - def quantize_annotate_model(to_annotate)
    - def _add_quant_wrapper(layer)
    - def quantize_annotate_layer(to_annotate, quantize_config=None)
    - def quantize_apply(model, scheme=default_8bit_quantize_scheme.Default8BitQuantizeScheme(), quantized_layer_name_prefix='quant_')
    - def _extract_original_model(model_to_unwrap)
    - def _quantize(layer)
    - def _unwrap_first_input_name(inbound_nodes)
    - def _wrap_fixed_range(quantize_config, num_bits, init_min, init_max, narrow_range)
    - def _is_serialized_node_data(nested)
    - def _nested_to_flatten_node_data_list(nested)
    - def fix_input_output_range(model, num_bits=8, input_min=0.0, input_max=1.0, output_min=0.0, output_max=1.0, narrow_range=False)
    - def _is_functional_model(model)
    - def remove_input_range(model)
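  2. Difference between * and **, and how they differ from C++ pointers:
    - * and ** are special Python symbols used for argument passing and for unpacking tuples and dicts. Despite the shared spelling, they have nothing to do with C++ pointers.
    - * unpacks a tuple (or any iterable) into individual elements
    - ** unpacks a dict into individual key/value pairs
    - In a function definition, * collects a variable number of positional arguments and ** collects a variable number of keyword arguments, e.g. def foo(*args, **kwargs): …
    - In C++, * declares a pointer variable and ** a pointer to a pointer. Python has no such declarations: Python names are references to objects rather than memory addresses, so there is no pointer arithmetic and no pointer type casting.
      - Unlike in C++, an object reference in Python is a high-level abstraction that hides the object's actual memory address, so a Python reference and a pointer are not the same concept. Memory is not managed explicitly; the Python interpreter handles those details automatically. A Python reference is more like a symbol that maps indirectly to the actual memory address.
  3. self vs. cls:
    - cls is the conventional name for the first parameter of a class method. It refers to the class itself rather than an instance, analogous to self in instance methods.
    - In a class method, cls is used to access class-level attributes and methods and to create new instances of the class.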
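A compact example of self versus cls. The Counter class is a placeholder written for this note: increment works on one instance through self, while from_string is a class method that uses cls as an alternative constructor.

```python
class Counter:
    created = 0  # class-level attribute shared by all instances

    def __init__(self, start=0):
        self.value = start        # instance state, reached through self
        Counter.created += 1

    def increment(self):
        """Instance method: self is the specific object being updated."""
        self.value += 1
        return self.value

    @classmethod
    def from_string(cls, text):
        """Class method: cls is the class itself, used here to build a new instance."""
        return cls(int(text))


counter = Counter.from_string("5")
print(counter.increment(), Counter.created)
```

Because from_string calls cls(...) instead of Counter(...), a subclass that inherits it gets instances of the subclass back.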
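  4. Fix the bugs and measure runtime on the phone; targets: PSNR -> 28, SSIM -> 0.8, runtime -> 30ms
    - From VapSR_3_2 onward, the runtime test no longer runs on the phone at all
    - Create a converter with tf.lite.TFLiteConverter.from_saved_model('path_to_model'); after converting to a tflite model, the structure can be inspected in netron to look for possible errors
    - Creating the converter with tf.lite.TFLiteConverter.from_keras_model() or tf.lite.TFLiteConverter.from_saved_model() always ran into two problems
      1. model input_size: [1,1,1,3] output_size[1,1,1,3] (abnormal)
      2. Make sure you apply/link the Flex delegate before inference.
      3. So the recommended route is to save with model.save('path_to_model') in SavedModel format, define concrete_func = model.signatures[tf.saved_model.DEFAULT_SERVING_SIGNATURE_DEF_KEY], and use tf.lite.TFLiteConverter.from_concrete_functions([concrete_func]) to avoid both problems
  5. Custom quantization strategies for layer activations and weights are implemented with QuantizeConfig together with Quantizer
  6. A context manager manages the context of a block of code
    - Common context managers in Python include open() in with open() as f, and tf.Session() in with tf.Session() as sess (tf.compat.v1.Session in TensorFlow 2).
    - When the with block ends, Python automatically calls the context manager's __exit__ method, which ensures resources are released and cleaned up. The __enter__ method can do initialization work. Inside the with block, the object returned by __enter__ is used to work with the managed resources.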
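The __enter__ / __exit__ protocol described above, as a minimal hand-written context manager. ManagedResource is a made-up class that just records the order of events.

```python
class ManagedResource:
    """Records when the with block is entered and left."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")   # set-up work goes here
        return self                   # this is the object bound by "as"

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append("exit")    # clean-up runs even if the body raised
        return False                  # do not swallow exceptions


with ManagedResource() as resource:
    resource.events.append("body")

print(resource.events)  # ['enter', 'body', 'exit']
```

Returning False from __exit__ lets any exception from the body propagate after the clean-up has run.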
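Reproducing Qualcomm's torch QuickSRNet 8-bit quantization in TensorFlow
- android_aarch64 denotes Android devices on the 64-bit ARM architecture, also known as ARMv8-A. Typically used in high-end devices such as smartphones and tablets.
- android_arm denotes Android devices on the 32-bit ARM architecture. Typically used in low-end devices such as budget smartphones, tablets, and IoT devices
A bug that plagued me for at least 3 weeks: TFlite GPU Delegate init Batch size mismatch -> solved
- Following this link, pre-emptively worked around the tflite gpu delegate not supporting fully connected layers by replacing them with 1*1 convolutions
- Tested the suspect modules one by one from scratch and finally traced the problem to the pixel norm module (built from tf.reshape and tf.keras.layers.LayerNormalization); replacing it with a plain LayerNormalization solved it, and PSNR even improved slightly :)
A strange problem: when converting the Functional model to tflite, with import tensorflow.keras.backend as K, using K.clip() in the model kept reporting that K is undefined -> solved by switching to tf.keras.backend.clip()
Low GPU utilization when training the small Mobile VSR model
1. Not caused by slow CPU data loading or preprocessing; adding threads did not help
2. Not caused by a large batch size either; reducing the batch size did not help
3. There are probably two ways to raise GPU utilization: use a larger model, or use the NVIDIA DALI data-loading library
Receptive field calculation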
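- Let the convolution kernel be $k\times k$ with stride $s$, and let the receptive field before the current layer be $F_{in}$. When all earlier layers have stride 1, the receptive field after the layer, $F_{out}$, is:
  
  $F_{out} = F_{in} + (k - 1) \times \text{dilation rate}$
  
  Here $\text{dilation rate}$ is the dilation of the kernel; without dilated convolution, $\text{dilation rate} = 1$. If earlier layers have stride greater than 1, the increment $(k - 1) \times \text{dilation rate}$ is additionally multiplied by the product of those earlier strides. A pooling layer usually has $s = k$ and no dilation.
  
  Example: for a $224\times 224$ input, a first convolution layer with a $3\times 3$ kernel, stride 1, and no dilation has a receptive field of $3$ (starting from $F_{in} = 1$ for a single input pixel).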
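A small calculator for stacked layers makes the receptive-field recurrence easy to apply. It tracks the cumulative stride of earlier layers (often called the jump), which matters as soon as any layer has stride greater than 1. The function name and the (kernel, stride, dilation) tuple layout are my own conventions.

```python
def receptive_field(layers):
    """Receptive field of a stack of layers given as (kernel, stride, dilation) tuples."""
    field, jump = 1, 1  # a single input pixel; jump = product of earlier strides
    for kernel, stride, dilation in layers:
        field += (kernel - 1) * dilation * jump
        jump *= stride
    return field


print(receptive_field([(3, 1, 1)]))               # one 3x3 conv
print(receptive_field([(3, 1, 1), (3, 1, 1)]))    # two stacked 3x3 convs
print(receptive_field([(3, 2, 1), (3, 1, 1)]))    # stride-2 conv followed by a 3x3 conv
```

These print 3, 5 and 7: with stride 1 everywhere each 3x3 layer adds 2, while a layer placed after a stride-2 layer adds 4.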

results

milestone_0:

Model	Description	Dataset	Val PSNR	Val SSIM	Params	Runtime on oneplus7T [ms]
SWRN_0	Origin	REDS	27.931335	0.7803562	43,472	25.6
SWRN_1	recon_trunk block num=2	REDS	27.820051	0.77666414	36,512	26.9
ELSR_0(vsr 22 winner)	Origin	REDS	26.716854	0.73988235	3,468	19.3
RLFN_0(esr 22 winner)	Origin	REDS	26.78721	0.7389487	306,992	-
VapSR_0	Origin	REDS	28.103758	0.7864979	154,252	5191.0
VapSR_1	Replace feature extraction conv and VAB’s 2 conv1X1 with blueprint conv	REDS	28.02941	0.7845887	155,916	5798.0
VapSR_2	Replace feature extraction conv with blueprint conv and reduce Attention’s kernel size=3X3	REDS	28.021387	0.7831156	131,276	2694.0
VapSR_3	Correct custom realization of pixel normalization	REDS	28.018507	0.7836466	131,276	-
VapSR_3_1	Reduce VAB blocks from 11 to 5	REDS	27.826998	0.7771207	73,484	1222.0
VapSR_3_2	Realize Pixel Normalization with tf.reshape() and tf.keras.layers.LayerNormalization(); Reduce VAB blocks from 5 to 4	REDS	27.550034	0.7687168	64,108	error
VapSR_4	apply pruning, weights clustering to conv kernels	REDS	27.833515(suspect)	0.7771123(suspect)	32,054(64,108)
VapSR_4_2_0	Functional VapSR_4 with pixel norm realized by layer normalization, VAB activation: GELU	REDS	27.666351	0.77187574	64,108
VapSR_4_2_1	Functional VapSR_4 with pixel norm realized by layer normalization, VAB activation: RELU	REDS	27.539206	0.7669671	64,108
VapSR_4_3	Functional VapSR_4 with custom pixel normalization instead of layer normalization	REDS	27.651005	0.7715401	63,852

AI benchmark setting for Runtime test:

Input Values range(min,max): 0,255
Inference Mode: FP16
Acceleration: TFLite GPU Delegate

milestone_1:

Model	Description	Dataset	Val PSNR	Val SSIM	Params	Runtime on oneplus7T [ms]	FLOPs [G]
VapSR_4_1	Functional VapSR_4 with pixel norm realized by layer normalization, VAB activation: RELU, Attention using Partial conv	REDS	27.790268	0.77721727	59,468	654.0 (INT8_CPU)	7.462
SWAT_0	Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=4)	REDS	27.842232	0.77754354	50,624	271.0 (FP16_CPU)	5.803
SWAT_1	Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=2), replace fc with 1*1 conv	REDS	27.759375	0.77492595	33,984	252.0 (FP16_CPU)	3.900
SWAT_2	Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=1), replace fc with 1*1 conv	REDS	27.760305	0.77487457	25,664	-	-
SWAT_3	Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=1), replace fc with 1*1 conv, replace pixel normalization with layer normalization	REDS	27.761642	0.7748446	25,664	27.8 (FP16_TFLite GPU Delegate)	2.949
SWAT_3_1	Sliding Window, VAB Attention(large receptive field=17), Partial Conv(point_wise: standard, depth_wise: groups=out_dim), Channel Shuffle(mix_ratio=1), replace fc with 1*1 conv, replace pixel normalization with layer normalization	REDS	27.754656	0.77461684	24,160	30.0 (FP16_TFLite GPU Delegate)	2.776
SWAT_3_2	Sliding Window, VAB Attention(receptive field=17), Partial Conv(feature fusion maintains standard 1*1 conv), Channel Shuffle(mix_ratio=2), replace fc with 1*1 conv, replace pixel normalization with layer normalization	REDS	27.74189	0.7742521	26,016	32.4 (FP16_TFLite GPU Delegate)	2.996
SWAT_4	Sliding Window, VAB Attention, Replace partial conv with standard convolution, Remove Channel Shuffle, replace pixel normalization with layer normalization	REDS	27.785185	0.77523285	53,696	38.5 (FP16_TFLite GPU Delegate)	6.202
SWAT_5	Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=1), replace fc with 1*1 conv, replace pixel normalization with layer normalization, enlarge train step numbers to 250,000	REDS	27.811176	0.7763541	25,664	27.6 (FP16_TFLite GPU Delegate)	2.949
SWAT_6	Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=1), replace fc with 1*1 conv, replace pixel normalization with layer normalization, enlarge train step numbers to 150,000, Remove convs of hidden forward/backward	REDS	27.738842	0.7743317	21,056	23.6 (FP16_TFLite GPU Delegate)	2.417

AI benchmark setting for Runtime test:

Input Values range(min,max): 0,255
Inference Mode: INT8/FP16
Acceleration: CPU/TFLite GPU Delegate