Mobile Video Super-Resolution Work Log
Time: 2023.2.7-2023.4.15
Paper Reading
PD-Quant: Post-Training Quantization based on Prediction Difference Metric(2022.12)
- Analyzes the local metrics used to optimize the quantization parameters S (scale) and Z (zero point): MSE or cosine distance of a layer's activation before and after quantization
- PD Loss: introduces the Prediction Difference to determine the activation scaling factors
- Distribution Correction (DC): adjusts the intermediate activation distribution on the calibration dataset
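The local metrics named above (MSE and cosine distance of an activation before and after quantization) can be sketched in NumPy. This is only an illustration under stated assumptions: plain asymmetric uniform quantization with scale S and zero point Z, min/max calibration, and hypothetical helper names (`quantize_dequantize`, `local_metrics`). It is not PD-Quant's own procedure, which argues for a prediction-difference criterion instead.

```python
import numpy as np

def quantize_dequantize(x, scale, zero_point, n_bits=8):
    """Fake-quantize x with uniform asymmetric quantization (scale S, zero point Z)."""
    qmin, qmax = 0, 2 ** n_bits - 1
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

def local_metrics(x, x_q):
    """Return (MSE, cosine distance) between activations before/after quantization."""
    x = np.asarray(x, dtype=np.float64).ravel()
    x_q = np.asarray(x_q, dtype=np.float64).ravel()
    mse = float(np.mean((x - x_q) ** 2))
    cos = float(np.dot(x, x_q) / (np.linalg.norm(x) * np.linalg.norm(x_q) + 1e-12))
    return mse, 1.0 - cos

rng = np.random.default_rng(0)
act = rng.normal(size=(4, 16)).astype(np.float32)  # stand-in for a layer activation

# Min/max calibration: one simple (assumed) way to pick S and Z.
scale = float(act.max() - act.min()) / 255.0
zero_point = int(round(-float(act.min()) / scale))

mse, cos_dist = local_metrics(act, quantize_dequantize(act, scale, zero_point))
print(f"MSE={mse:.3e}, cosine distance={cos_dist:.3e}")
```

Both numbers only describe how well one layer's activation is preserved; they say nothing about the final prediction, which is the gap the PD Loss bullet addresses.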
Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation
Data-Free Network Compression via Parametric Non-uniform Mixed Precision Quantization(2022CVPR)
Computer Vision – ECCV 2022 Workshops
- Learning Multiple Probabilistic Degradation Generators for Unsupervised Real World Image Super Resolution (unsupervised image SR)
- Evaluating Image Super-Resolution Performance on Mobile Devices: An Online Benchmark (benchmark for directly deployed SR models)
- Efficient Image Super-Resolution Using Vast-Receptive-Field Attention (image SR with large-receptive-field attention)
- DSR: Towards Drone Image Super-Resolution (drone image SR)
- Image Super-Resolution with Deep Variational Autoencoders (variational autoencoders for SISR)
- Light Field Angular Super-Resolution via Dense Correspondence Field Reconstruction (light-field angular SR)
- CIDBNet: A Consecutively-Interactive Dual-Branch Network for JPEG Compressed Image Super-Resolution (JPEG-compressed image SR)
- XCAT - Lightweight Quantized Single Image Super-Resolution Using Heterogeneous Group Convolutions and Cross Concatenation (single-image SR)
- RCBSR: Re-parameterization Convolution Block for Super-Resolution (video SR with structural re-parameterization)
- Multi-patch Learning: Looking More Pixels in the Training Phase (multi-patch training strategy for SISR)
- Fast Nearest Convolution for Real-Time Efficient Image Super-Resolution (Nearest Convolution replaces copying the original image for the depth_to_space operation)
- Real-Time Channel Mixing Net for Mobile Image Super-Resolution (single-image SR: channel mixing using 1x1 conv)
- Sliding Window Recurrent Network for Efficient Video Super-Resolution (video SR)
- EESRNet: A Network for Energy Efficient Super-Resolution (video SR)
- HST: Hierarchical Swin Transformer for Compressed Image Super-Resolution (compressed image SR)
- Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration (compressed image SR)
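For the Fast Nearest Convolution entry in the list above, the following NumPy sketch shows why a fixed-weight 1x1 convolution followed by depth_to_space can reproduce nearest-neighbor upsampling. It is a minimal illustration of the idea as these notes describe it, assuming TensorFlow's NHWC channel ordering for depth_to_space; the function names are hypothetical and the paper's actual implementation may differ.

```python
import numpy as np

def depth_to_space(x, r):
    """Rearrange (H, W, r*r*C) -> (H*r, W*r, C), using TensorFlow's NHWC channel order."""
    h, w, c = x.shape
    c_out = c // (r * r)
    x = x.reshape(h, w, r, r, c_out)   # split channels into (row offset, col offset, C)
    x = x.transpose(0, 2, 1, 3, 4)     # interleave offsets with the spatial axes
    return x.reshape(h * r, w * r, c_out)

def nearest_conv_upsample(x, r):
    """Nearest upsampling as a fixed 1x1 conv (channel replication) + depth_to_space."""
    c = x.shape[-1]
    # 1x1 conv weight (C, r*r*C): r*r identity blocks, i.e. every channel is copied r*r times.
    weight = np.tile(np.eye(c, dtype=x.dtype), (1, r * r))
    return depth_to_space(x @ weight, r)

lr = np.arange(2 * 3 * 3, dtype=np.float32).reshape(2, 3, 3)  # toy LR frame (H=2, W=3, C=3)
up = nearest_conv_upsample(lr, r=2)

# Reference: plain nearest-neighbor upsampling by repeating rows and columns.
reference = np.repeat(np.repeat(lr, 2, axis=0), 2, axis=1)
print(up.shape, np.array_equal(up, reference))
```

Because the replication is expressed as an ordinary convolution, the upsampled residual branch can stay inside operators that mobile delegates typically accelerate.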
Video Super-Resolution With Convolutional Neural Networks(2016)
- Simply concatenates the current frame with its neighboring frames to improve SR quality
Frame-Recurrent Video Super-Resolution(2017)
- Uses the HR result predicted for the previous frame to compensate the super-resolution of the current frame
Enhanced Deep Residual Networks for Single Image Super-Resolution(2017)
- ResBlock: uses fewer activations such as ReLU than earlier work
- Upsample: conv + pixel shuffle
TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution(CVPR2020)
- Temporally deformable convolution alignment network, used to mitigate artifacts in SR
Efficient Reference-based Video Super-Resolution (ERVSR): Single Reference Image Is All You Need(WACV2023)
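- Super-resolves the whole low-resolution video sequence with a single reference frame: instead of using the LR frame of every time step as a reference, only the frame at the center time step is used
- Similarity estimation and alignment are based on an attention mechanism
- Motivation: faster inference, lower memory consumption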
BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond(CVPR2021)
BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment(CVPR2022)
Multi-scale attention network for image super-resolution(ECCV2018)
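- Multi-scale cross block (MSCB): 3 parallel convolutions with different dilation rates extract features, which are then fused
- Multi-path wide-activated attention block (MWAB): 3 parallel branches (convolution + spatial attention + channel attention), concatenated
- Drawback: the global average pooling used in conventional channel attention does not necessarily account for inter-channel correlation correctly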
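One concrete way to see the limitation of global average pooling noted above is a small SE-style channel attention sketch in NumPy. The layer sizes, the random weights, and the function name `channel_attention` are assumptions for illustration only, not the block from the paper. Two channels with very different spatial structure but the same mean yield the same pooled descriptor, so the gate cannot tell them apart from the descriptor alone.

```python
import numpy as np

def channel_attention(x, w1, w2):
    """SE-style channel attention on x with shape (H, W, C)."""
    squeeze = x.mean(axis=(0, 1))                # global average pooling -> (C,)
    hidden = np.maximum(squeeze @ w1, 0.0)       # bottleneck projection + ReLU
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))  # sigmoid gate in (0, 1), one per channel
    return x * gate, squeeze

rng = np.random.default_rng(0)
flat = np.full((8, 8), 0.5)                       # constant channel, mean 0.5
checker = np.indices((8, 8)).sum(axis=0) % 2 * 1.0  # 0/1 checkerboard, mean 0.5
x = np.stack([flat, checker], axis=-1)            # (8, 8, 2)

w1 = rng.normal(size=(2, 1))  # hypothetical bottleneck weights
w2 = rng.normal(size=(1, 2))
out, squeeze = channel_attention(x, w1, w2)
print(squeeze)  # both entries are 0.5 despite the different spatial content
```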
Deep Video Super-Resolution using Hybrid Imaging System(2023)
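- Task: reconstruct an HR high-frame-rate video from an LR high-frame-rate video (main video) and an HR low-frame-rate video (auxiliary video)
- The model has 3 parts:
  - Main-video super-resolution produces the base high-resolution frames
  - Detail features are extracted from the auxiliary video and aligned
  - The information from both videos is aggregated and fused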
STDAN: Deformable Attention Network for Space-Time Video Super-Resolution(2023)
- Deformable attention network
- Long short-term feature interpolation (LSTFI)
- Spatial-temporal deformable feature aggregation (STDFA)
ShuffleMixer: An Efficient ConvNet for Image Super-Resolution(NTIRE2022)
- Large convolution kernels combined with a channel split-shuffle operation
- A Fused-MBConv is added after every two shuffle mixer layers to compensate for insufficient local feature extraction
An Implicit Alignment for Video Super-Resolution (ArXiv 2023)
- Static upsample evolution: makes static interpolation upsampling (e.g. bilinear, nearest) dynamic
- Implicit attention-based alignment, integrated with position encoding of the local-window keys and values and position encoding of the query (motion estimation/flow)
Rethinking Alignment in Video Super-Resolution Transformers (NeurIPS 2022)
- Element-wise (Hadamard) product: tf.multiply(A, B) = A * B
- Matrix product: tf.matmul(A, B) = A @ B
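The difference between the two products noted above, shown on a 2x2 example. NumPy is used here as a stand-in so the snippet runs without TensorFlow; `A * B` and `A @ B` behave the same way on TensorFlow tensors, matching `tf.multiply` and `tf.matmul` respectively.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

elementwise = A * B  # same as np.multiply(A, B); counterpart of tf.multiply
matmul = A @ B       # same as np.matmul(A, B); counterpart of tf.matmul

print(elementwise)   # [[ 5. 12.] [21. 32.]]
print(matmul)        # [[19. 22.] [43. 50.]]
```

In attention layers the score matrix comes from the matrix product (queries against keys), while gating and masking use the element-wise product.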
Idea
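Find the problems in previous papers and try to improve on or solve them -> competing on runtime alone is bound to lose
transformer PTQ -> set aside for now; focus on improving performance for the workshop
From the standpoint of getting a paper out of the first piece of work, consider applying new compression ideas to MAI video super resolution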
- dataset -> train: REDS, test: REDS4(Clips 000, 011, 015, 020 of REDS training set)
- mobile video super resolution related paper
- frontier -> Optical Flow
Try blind video super resolution -> abandoned
| Compared Solutions | Model Size, KB | PSNR | SSIM | Runtime, ms |
|---|---|---|---|---|
| MVideoSR | 17 | 27.34 | 0.7799 | 3.05 |
| ZX_VIP | 20 | 27.52 | 0.7872 | 3.04 |
| Fighter | 11 | 27.34 | 0.7816 | 3.41 |
| XJTU-MIGU SUPER | 50 | 27.77 | 0.7957 | 3.25 |
| BOE-IOT-AIBD | 40 | 27.71 | 0.7820 | 1.97 |
| GenMedia Group | 135 | 28.40 | 0.8105 | 3.10 |
| NCUT VGroup | 35 | 27.46 | 0.7822 | 1.39 |
| Mortar ICT | 75 | 22.91 | 0.7546 | 1.76 |
| RedCat AutoX | 62 | 27.71 | 0.7945 | 7.26 |
| 221B | 186 | 28.19 | 0.8093 | 10.1 |
Survey the latest research progress on the datasets REDS / Vimeo-90K / Vid4 / UDM10 / SPMCS / RealVSR
| Paper | Source | Training Set | Testing Set |
|---|---|---|---|
| Achieving on-Mobile Real-Time Super-Resolution with Neural Architecture and Pruning Search | ICCV 2021 | DIV2K | Set5, Set14, B100 and Urban100 |
| LiDeR: Lightweight Dense Residual Network for Video Super-Resolution on Mobile Devices | 2022 IEEE 14th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP) | REDS | REDS |
| Cross-Resolution Flow Propagation for Foveated Video Super-Resolution | Winter Conference on Applications of Computer Vision 2023 | REDS | REDS |
| Online Video Super-Resolution with Convolutional Kernel Bypass Graft | arXiv 2022.8 | REDS | REDS |
| Real-Time Super-Resolution for Real-World Images on Mobile Devices | 2022 IEEE 5th International Conference on Multimedia Information Processing and Retrieval (MIPR) | DIV2K | DIV2K, Set5, Set14, BSD100, Manga109, and Urban100 |
| Towards High-Quality and Efficient Video Super-Resolution via Spatial-Temporal Data Overfitting | CVPR 2023 | VSD4K | VSD4K |
| Rethinking Alignment in Video Super-Resolution Transformers | NeurIPS 2022 | REDS | REDS |
SWAT's PSNR should ideally be pushed above 28; complete pruning, weight clustering, INT8/FP16 quantization
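Test the TensorFlow model and the TFLite model after fine-tuning ->
Compared methods must be evaluated under the same settings -> set up a comparison leaderboard
Experiment: keep the overall SWRN framework unchanged and swap in a VAB built on Partial Standard Conv -> PSNR: 27.76, no noticeable improvement
Read up to understand: how the attention mechanism is implemented, how it takes effect, and whether it needs to be stacked in cascade
Apply MobileOne structural re-parameterization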
Metrics
Full-Reference
- Peak Signal to Noise Ratio (PSNR)
- Structural SIMilarity (SSIM)
- Gradient Magnitude Similarity Deviation (GMSD)
No-Reference
- Natural Image Quality Evaluator (NIQE)
- Blind/Referenceless Image Spatial QUality Evaluator (BRISQUE)
- Distortion Identification-based Image Verity and INtegrity Evaluation (DIIVINE)
- BLind Image Integrity Notator using DCT-Statistics (BLIINDS)
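As a reference point for the PSNR columns in the results tables, here is a minimal NumPy sketch of the metric. It assumes 8-bit images (peak value 255) and computes a single PSNR over the whole array; the exact evaluation protocol behind the logged numbers (per-frame averaging, color space, border cropping) is not stated in these notes, so treat the details as assumptions.

```python
import numpy as np

def psnr(reference, restored, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB between two images of identical shape."""
    ref = np.asarray(reference, dtype=np.float64)  # avoid uint8 wrap-around on subtraction
    out = np.asarray(restored, dtype=np.float64)
    mse = np.mean((ref - out) ** 2)
    if mse == 0:
        return float("inf")                        # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

hr = np.full((4, 4, 3), 100, dtype=np.uint8)  # toy ground-truth frame
sr = hr.copy()
sr[0, 0, 0] = 116                             # one value off by 16

print(psnr(hr, hr))  # inf
print(round(psnr(hr, sr), 2))
```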
Results
Milestone_0
| Model | Description | Dataset | Val PSNR | Val SSIM | Params | Runtime on oneplus7T [ms] | FLOPs [G] |
|---|---|---|---|---|---|---|---|
| VapSR_4_1 | Functional VapSR_4 with pixel norm realized by layer normalization, VAB activation: ReLU, Attention using Partial conv | REDS | 27.790268 | 0.77721727 | 59,468 | 654.0 (INT8_CPU) | 7.462 |
| SWAT_0 | Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=4) | REDS | 27.842232 | 0.77754354 | 50,624 | 271.0 (FP16_CPU) | 5.803 |
| SWAT_1 | Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=2), replace fc with 1*1 conv | REDS | 27.759375 | 0.77492595 | 33,984 | 252.0 (FP16_CPU) | 3.900 |
| SWAT_2 | Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=1), replace fc with 1*1 conv | REDS | 27.760305 | 0.77487457 | 25,664 | - | - |
| SWAT_3 | Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=1), replace fc with 1*1 conv, replace pixel normalization with layer normalization | REDS | 27.761642 | 0.7748446 | 25,664 | 27.8 (FP16_TFLite GPU Delegate) | 2.949 |
| SWAT_3_1 | Sliding Window, VAB Attention(large receptive field=17), Partial Conv(point_wise: standard conv, depth_wise: group conv), Channel Shuffle(mix_ratio=1), replace fc with 1*1 conv, replace pixel normalization with layer normalization | REDS | 27.761642 | 0.7748446 | 25,664 | 27.8 (FP16_TFLite GPU Delegate) | 2.949 |
| SWAT_3_2 | Sliding Window, VAB Attention(large receptive field=17), Partial Conv(point_wise: standard conv, depth_wise: group conv), Channel Shuffle(mix_ratio=2), replace fc with 1*1 conv, replace pixel normalization with layer normalization | REDS | 27.74189 | 0.7742521 | 26,016 | 32.4 (FP16_TFLite GPU Delegate) | 2.996 |
| SWAT_4 | Sliding Window, VAB Attention, Replace partial conv with standard convolution, Remove Channel Shuffle, replace pixel normalization with layer normalization | REDS | 27.785185 | 0.77523285 | 53,696 | 38.5 (FP16_TFLite GPU Delegate) | 6.202 |
| SWAT_5 | Sliding Window, VAB Attention, Partial Conv, Channel Shuffle(mix_ratio=1), replace fc with 1*1 conv, replace pixel normalization with layer normalization, enlarge train step numbers to 250,000 | REDS | 27.811176 | 0.7763541 | 25,664 | 27.6 (FP16_TFLite GPU Delegate) | 2.949 |
| SWAT_6 | Sliding Window, VAB Attention, Partial Conv, Modified Channel Shuffle (mix_ratio:1), Remove convs of hidden forward/backward | REDS | 27.738842 | 0.7743317 | 21,056 | - (FP16_TFLite GPU Delegate) | 2.417 |
| SWAT_7 | Sliding Window, 3-branch VAB Attention, Partial Conv, Remove Channel Shuffle, Replace pixel normalization with layer normalization | REDS | 27.645552 | 0.77121794 | 18,144 | - (FP16_TFLite GPU Delegate) | 2.090 |
| SWAT_8 | Sliding Window, VAB Attention modified 2 | REDS | 27.782675 | 0.77573705 | 45,424 | - (FP16_TFLite GPU Delegate) | 5.200 |
| SWAT_9 | Sliding Window, Non Activation Block | REDS | 27.636255 | 0.7709387 | 23,648 | 288.0 (FP16_TFLite GPU Delegate) | 2.113 |
AI benchmark setting for Runtime test:
- Input Values range(min,max): 0,255
- Inference Mode: INT8/FP16
- Acceleration: CPU/TFLite GPU Delegate
Milestone_1
| Model | Description | Dataset | Val PSNR | Val SSIM | Params | Runtime on oneplus7T [ms] | FLOPs [G] |
|---|---|---|---|---|---|---|---|
| SWAT_3_3 | Sliding Window, VAB Attention(large receptive field=13 with channel shuffle[Dense(units)]), Partial Conv(standard conv), Replace pixel normalization with layer normalization | REDS | 27.761633 | 0.7752705 | 27,472 | 30.3 (FP16_TFLite GPU Delegate) | 3.165 |
| SWAT_3_4 | Sliding Window, VAB Attention(large receptive field=13 without channel shuffle[Dense(units)], stack 2 blocks), Partial Conv(standard conv), Replace pixel normalization with layer normalization | REDS | 27.80347 | 0.77701694 | 32,832 | 40.9 (FP16_TFLite GPU Delegate) | 3.798 |
| SWAT_3_5 | Sliding Window, VAB Attention(large receptive field=17 without channel shuffle[Dense(units)], stack 2 blocks, pointwise conv for channel fusion without PConv), Partial Conv(standard conv), Replace pixel normalization with layer normalization | REDS | 27.840628 | 0.7774375 | 37,312 | 39.4 (FP16_TFLite GPU Delegate) | 4.302 |
| SWAT_3_6 | Sliding Window, VAB Attention(large receptive field=17 without channel shuffle[Dense(units)], stack 2 blocks, pointwise conv for channel fusion without PConv), Partial Conv(standard conv), Replace pixel normalization with layer normalization, Shallow feature extraction using standard conv | REDS | 27.8165 | 0.7774126 | 42,624 | 40.7 (FP16_TFLite GPU Delegate) | 4.916 |
| SWAT_3_7 | Sliding Window, VAB Attention(large receptive field=17 without channel shuffle[Dense(units)], stack 2 blocks, pointwise conv for channel fusion without PConv), Partial Conv(standard conv), Replace pixel normalization with layer normalization, Remove concat and unpack of hidden state | REDS | 27.182861 | 0.7562948 | 29,136 | 29.0 (FP16_TFLite GPU Delegate) | 3.357 |
| SWAT_3_8 | Sliding Window, VAB Attention(large receptive field=17 without channel shuffle[Dense(units)], stack 2 blocks, pointwise conv for channel fusion without PConv), Partial Conv(standard conv), Replace pixel normalization with layer normalization, Remove concat and unpack of hidden state, Increase channels of fusion attention | REDS | 27.564552 | 0.7688081 | 56,032 | 39.1 (FP16_TFLite GPU Delegate) | 6.456 |
| SWAT_3_9 | Sliding Window, VAB Attention and IMDB hybrid | REDS | 27.95189 | 0.7806478 | 53,312 | 42.8 (FP16_TFLite GPU Delegate) | 6.367 |
| SWAT_3_10 | Sliding Window, Finetuned VAB Attention | REDS | 27.846352 | 0.77762717 | 53,512 | 49.0 (FP16_TFLite GPU Delegate) | 6.170 |
| ABPN_0 | Origin | REDS | 27.92307 | 0.779504 | 62,048 | 38.1/35.7 (INT8/FP16_TFLite GPU Delegate) | 7.137 |
| ABPN_1 | GenMedia Group Modified(L1 Charbonnier loss; crop_size:64) | REDS | 27.858198 | 0.7780704 | 58,304 | 37.1/33.0 (INT8/FP16_TFLite GPU Delegate) | 6.699 |
| ABPN_2 | GenMedia Group Modified(MAE loss; crop_size:96) | REDS | 27.875465 | 0.7783027 | 58,304 | 37.1/33.0 (INT8/FP16_TFLite GPU Delegate) | 6.699 |
| AFAVSR_0 | Multiple frames aggregation attention (num_feat=48, d_atten=64, num_blocks=2) | REDS | 27.837406 | 0.77741796 | 68,368 | 44.5 (FP16_TFLite GPU Delegate) | 7.872 |
| AFAVSR_1 | Multiple frames aggregation attention (num_feat=16, d_atten=32, num_blocks=8) | REDS | 27.829765 | 0.7763255 | 44,016 | 36.6 (FP16_TFLite GPU Delegate) | 5.069 |
| AFAVSR_2 | Multiple frames aggregation attention (num_feat=16, d_atten=32, num_blocks=2) | REDS | - | - | - | - (FP16_TFLite GPU Delegate) | - |
| AFAVSR_3 | All batch frames aggregation attention (num_feat=32, d_atten=64, num_blocks=2) | REDS | - | - | - | - | - |
| SORT_0 | Sliding Window, IMDB | REDS | 27.738451 | 0.77409536 | 17,356 | 20.6 (FP16_TFLite GPU Delegate) | 2.084 |
| SORT_1 | Sliding Window, IMDB, ConvTail num_out_channel=48 | REDS | 27.75588 | 0.7749552 | 19,660 | 21.6 (FP16_TFLite GPU Delegate) | 2.351 |
| SORT_2 | Sliding Window, IMDB, multi-branch distillation channel num hyperparameter tuning | REDS | 27.93981 | 0.7808094 | 45,264 | 35.6 (FP16_TFLite GPU Delegate) | 5.385 |
| SORT_3 | Sliding Window, IMDB, multi-branch distillation channel num hyperparameter tuning, Replace SEL with CCA (Contrast-Aware Channel Attention) | REDS | 27.867216 | 0.7790734 | 39,144 | 35.3 (FP16_TFLite GPU Delegate) | 4.414 |
| SORT_4 | Sliding Window, Modified IMDB equipped with channel attention mechanism | REDS | 27.769545 | 0.7755401 | 39,120 | 41.3 (FP16_TFLite GPU Delegate) | 5.725 |
| SORT_5 | Sliding Window, Modified IMDB equipped with larger channel width and channel reduction/aggregation using 1*1 convs | REDS | 28.13419 | 0.78656757 | 166,944 | 85.9 (FP16_TFLite GPU Delegate) | 19.566 |
| SORT_6 | Sliding Window, Modified IMDB equipped with dynamic channel width | REDS | 27.944357 | 0.7809873 | 48,216 | - (FP16_TFLite GPU Delegate) | - |
AI benchmark setting for Runtime test:
- Input Values range(min,max): 0,255
- Inference Mode: INT8/FP16
- Acceleration: CPU/TFLite GPU Delegate
Milestone_2
| Model | Description | Dataset | Val PSNR | Val SSIM | Params | Runtime on oneplus7T [ms] | FLOPs [G] |
|---|---|---|---|---|---|---|---|
| VSR_0 | Sliding Window, Non Activation Block | REDS | 27.673386 | 0.7725643 | 26,368 | 57.9 (FP16_TFLite GPU Delegate) | 2.417 |
| VSR_1 | Attention Alignment_0, Non Activation Block | REDS | 27.508242 | 0.76671493 | 17,440 | 42.0 (FP16_TFLite GPU Delegate) | 1.677 |
| VSR_2 | Attention Alignment_1, Non Activation Block,Rectify BSConvolution | REDS | 27.53437 | 0.7678055 | 17,776 | error (FP16_TFLite GPU Delegate) | 2.035 |
| VSR_3 | VSR_2 Ablation: Attention Alignment_1 | REDS | 27.414068 | 0.76361054 | 17,413 | 60.3 (FP16_TFLite GPU Delegate) | 1.793 |
| VSR_4 | VSR_2 -> modify Non Activation Block using partial conv | REDS | 27.784992 | 0.7769825 | 43,120 | error (FP16_TFLite GPU Delegate) | 4.958 |
| VSR_5 | VSR_4 Ablation: RGB out channels sharing upsample result | REDS | 27.835686 | 0.7776796 | 47,728 | - (FP16_TFLite GPU Delegate) | 5.491 |
| VSR_6 | VSR_5 Finetune: Non Activation Block channel numbers modify | REDS | 27.783165 | 0.7768693 | 28,976 | error (FP16_TFLite GPU Delegate) | 3.699 |
| VSR_7 | Light weight hidden states attention alignment; Blue Print convolution for shallow feature extraction; Multi-Stage ExcavatoR(MSER) combined with partial convolution and simplified channel attention | REDS | 27.470276 | 0.7664948 | 81,806 | 66.1 (FP16_TFLite GPU Delegate) | 7.938 |
| VSR_8 | Light weight hidden states attention alignment; Blue Print convolution for shallow feature extraction; Nonlinear activation free block | REDS | 27.91092 | 0.77971315 | 66,312 | 64.7 (FP16_TFLite GPU Delegate) | 7.269 |
| VSR_9 | vsr_9 ablation: feature alignment | REDS | 27.91092 | 0.77971315 | 39,792 | 44.5 (FP16_TFLite GPU Delegate) | 4.218 |
| VSR_10 | motivation: IMDB + PartialConv + VapSR + BSConv | REDS | 27.963232 | 0.780958 | 44,256 | 48.6 (FP16_TFLite GPU Delegate) | 5.103 |
| VSR_11 | VSR_10 ablation: hidden state conv using bias | REDS | 27.948818 | 0.7809571 | 44,288 | 47.2 (FP16_TFLite GPU Delegate) | 5.103 |
| VSR_12 | VSR_10 ablation: hidden state process using modified IMDB | REDS | 27.953104 | 0.7807622 | 57,696 | 62.8 (FP16_TFLite GPU Delegate) | 6.649 |
AI benchmark setting for Runtime test:
- Input Values range(min,max): 0,255
- Inference Mode: INT8/FP16
- Acceleration: CPU/TFLite GPU Delegate
Milestone_3
| Model | Description | Dataset | Val PSNR | Val SSIM | Params | Runtime on oneplus7T [ms] | FLOPs [G] |
|---|---|---|---|---|---|---|---|
| MVSR_0 | Modified IMDB: IMDB + PartialConv + VapSR + BSConv; deprecate hidden state forward and backward; light weight feature alignment | REDS | 27.915539 | 0.7799377 | 35,777 | 38.8 (FP16_TFLite GPU Delegate) | 4.068 |
| MVSR_1 | Modified IMDB: IMDB + PartialConv + VapSR + BSConv; deprecate hidden state forward and backward; light weight frame alignment | REDS | 27.932716 | 0.7810435 | 34,473 | 44.3 (FP16_TFLite GPU Delegate) | 3.976 |
| MVSR_2 | MVSR_1 Ablation: light weight frame alignment | REDS | 27.929586 | 0.78039753 | 34,208 | 35.4 (FP16_TFLite GPU Delegate) | 3.944 |
| MVSR_3 | MVSR_1 Ablation: large receptive field in SMDB -> reduce: 3x3 + 3x3 dilated | REDS | 27.892586 | 0.7790079 | 32,169 | 41.1 (FP16_TFLite GPU Delegate) | 3.711 |
| MVSR_4 | MVSR_2 Ablation: large receptive field in SMDB -> increase: 7x7 + 7x7 dilated | REDS | 27.958328 | 0.78145003 | 37,664 | 42.4 (FP16_TFLite GPU Delegate) | 4.343 |
| MVSR_5 | MVSR_1 Ablation: large receptive field in SMDB -> increase: 7x7 + 7x7 dilated | REDS | 27.936714 | 0.7809204 | 37,929 | 49.8 (FP16_TFLite GPU Delegate) | 4.375 |
| MVSR_6 | Modified IMDB: IMDB + PartialConv based pixel attention version_0 + VapSR + BSConv; light weight frame alignment | REDS | 27.884369 | 0.7790964 | 34,473 | 44.4 (FP16_TFLite GPU Delegate) | 4.246 |
| MVSR_7 | Modified IMDB: IMDB + PartialConv based pixel attention version_1 + VapSR + BSConv; light weight frame alignment | REDS | 27.858534 | 0.77831227 | 35,769 | 44.5 (FP16_TFLite GPU Delegate) | 4.387 |
| MVSR_8 | MVSR_1 Ablation: SEL -> Channel Attention | REDS | 27.610485 | 0.7696045 | 29,145 | 40.5 (FP16_TFLite GPU Delegate) | 3.001 |
| MVSR_9 | MVSR_1 Ablation: Channel fuse + SEL -> FlashModule + Channel fuse | REDS | 28.043566 | 0.7842476 | 96,249 | 72.8 (FP16_TFLite GPU Delegate) | 10.684 |
| MVSR_10 | Partial conv idea applied to MSDB and Attention(i.e. SEL) | REDS | 27.86422 | 0.7783118 | 27,081 | 41.2 (FP16_TFLite GPU Delegate) | 3.031 |
| MVSR_11 | MVSR_10 finetune: deprecate MSDB's channel fuse; add MSDB blocks | REDS | 27.90566 | 0.7793118 | 32,553 | 48.7 (FP16_TFLite GPU Delegate) | 3.634 |
| MVSR_12 | MVSR_11 ablation: MSDB’s group convolution -> standard convolution | REDS | 27.953104 | 0.7807622 | 68,169 | 38.4 (FP16_TFLite GPU Delegate) | 7.737 |
| MVSR_13 | MVSR_12 AttentionAlign module evolution | REDS | 27.966156 | 0.7809557 | 68,157 | 39.8 (FP16_TFLite GPU Delegate) | 7.735 |
| MVSR_13_1 | MVSR_13 evolution: ConvTail used for increasing dimension -> BSConv | REDS | 27.879667 | 0.7790071 | 62,541 | 40.6 (FP16_TFLite GPU Delegate) | 7.080 |
| MVSR_13_2 | MVSR_13 ablation: fractional/partial ratio 1/2 -> 1/4 | REDS | 27.877321 | 0.7783568 | 37,517 | 37.1 (FP16_TFLite GPU Delegate) | 4.152 |
| MVSR_13_3 | MVSR_13 ablation: fractional/partial ratio 1/2 -> 1/8 | REDS | 27.79014 | 0.77567685 | 29,829 | 35.5 (FP16_TFLite GPU Delegate) | 3.240 |
| MVSR_13_4 | MVSR_13 ablation: fractional/partial ratio 1/2 -> 3/4 | REDS | 27.955465 | 0.78150684 | 119,149 | 84.1 (FP16_TFLite GPU Delegate) | 13.663 |
| MVSR_13_4_revalid | MVSR_13 ablation: fractional/partial ratio 1/2 -> 3/4 | REDS | 27.956861 | 0.7814712 | 119,149 | 84.1 (FP16_TFLite GPU Delegate) | 13.663 |
| MVSR_13_5 | MVSR_13 ablation: fractional/partial ratio 1/2 -> 7/8 | REDS | 27.993414 | 0.7823691 | 152,277 | 103.0 (FP16_TFLite GPU Delegate) | 17.506 |
| MVSR_13_6 | MVSR_13 ablation: fractional/partial ratio 1/2 -> 3/8 | REDS | 27.948065 | 0.78071 | 50,293 | 39.0 (FP16_TFLite GPU Delegate) | 5.651 |
| MVSR_13_7 | MVSR_13 ablation: fractional/partial ratio 1/2 -> 5/8 | REDS | 27.983498 | 0.7823881 | 91,109 | 86.4 (FP16_TFLite GPU Delegate) | 10.406 |
| MVSR_14 | MVSR_13 ablation: - frame attention align -> standard conv 1 x 1 acts as frame information propagation operator | REDS | 27.930904 | 0.78043896 | 68,272 | 29.7 (FP16_TFLite GPU Delegate) | 7.746 |
| MVSR_15 | MVSR_13 ablation: MSDB block number 4 -> 3 | REDS | 27.91523 | 0.77966064 | 52,965 | 33.4 (FP16_TFLite GPU Delegate) | 6.016 |
| MVSR_16 | MVSR_13 ablation: No partial/fractional; No BSconv (Blueprint Separable conv); No receptive field decomposition | REDS | 27.928417 | 0.7801167 | 920,381 | 399.0 (FP16_TFLite GPU Delegate) | 106.045 |
| MVSR_17 | MVSR_13 evolution: MSDB using standard conv 3 x 3, PPA using split large receptive field conv 5 x 5 + 5 x 5 dilated | REDS | 27.902325 | 0.7794845 | 47,101 | 36.8 (FP16_TFLite GPU Delegate) | 5.313 |
| MVSR_18 | MVSR_17 ablation: BSconv | REDS | 27.893446 | 0.77926654 | 47,325 | 34.4 (FP16_TFLite GPU Delegate) | 5.340 |
| MVSR_19 | MVSR_13 evolution: MSDB blocks 4 -> 3; Enlarge receptive field of PPA 3 -> 17 | REDS | 27.914143 | 0.7799854 | 60,861 | 35.9 (FP16_TFLite GPU Delegate) | 6.924 |
| MVSR_20 | MVSR_13 ablation: No receptive field decomposition | REDS | 27.93524 | 0.78071207 | 251,613 | 119.0 (FP16_TFLite GPU Delegate) | 28.875 |
| MVSR_21 | MVSR_13 ablation: No frame align; No fractional/partial; No BSconv; No receptive field decomposition | REDS | 27.941408 | 0.7807615 | 920,128 | 400.0 (FP16_TFLite GPU Delegate) | 106.014 |
| MVSR_21_1 | MVSR_13 ablation: No frame align (directly extraction from 3 consecutive frames); No fractional/partial; No BSconv; No receptive field decomposition | REDS | 27.913836 | 0.779473 | 920,992 | 399.0 (FP16_TFLite GPU Delegate) | 106.114 |
| MVSR_22 | MVSR_13 ablation: No BSconv; No receptive field decomposition | REDS | 27.936152 | 0.7799803 | 251,837 | 118.0 (FP16_TFLite GPU Delegate) | 28.902 |
| MVSR_23 | MVSR_13 ablation: PFE PPA Standard conv -> Depthwise conv | REDS | 27.867388 | 0.7784562 | 32,541 | 49.5 (FP16_TFLite GPU Delegate) | 3.632 |
| MVSR_24 | MVSR_13 ablation: - Partial/Fractional Extraction | REDS | 27.952333 | 0.78090274 | 186,141 | 94.3 (FP16_TFLite GPU Delegate) | 21.449 |
| MVSR_24_revalid | MVSR_13 ablation: - Partial/Fractional Extraction (keep fc) | REDS | 27.940563 | 0.7804851 | 190,493 | 102.0 (FP16_TFLite GPU Delegate) | 21.935 |
| MVSR_25 | MVSR_13 ablation: - BSConv | REDS | 27.929697 | 0.780183 | 68,381 | 38.8 (FP16_TFLite GPU Delegate) | 7.762 |
| MVSR_26 | MVSR_13 ablation: - Large Receptive Field Decomposition | REDS | 27.972654 | 0.7818956 | 251,613 | 119.0 (FP16_TFLite GPU Delegate) | 28.875 |
| MVSR_27 | MVSR_13 ablation: - FC in PFE, PPA | REDS | 27.945955 | 0.78067327 | 63,805 | 37.8 (FP16_TFLite GPU Delegate) | 7.249 |
AI benchmark setting for Runtime test:
- Input Values range(min,max): 0,255
- Inference Mode: INT8/FP16
- Acceleration: CPU/TFLite GPU Delegate
Benchmark_0
| Rank | Model | Source | Dataset | Test PSNR | Test SSIM | Params | Runtime on oneplus7T [ms] |
|---|---|---|---|---|---|---|---|
| 1 | Diggers | Real-Time Video Super-Resolution based on Bidirectional RNNs(2021 SOTA) | REDS(train_videos: 240, test_videos: 30) | 27.98 | - | 39,640 | - |
| 2 | VSR_12 | Ours | REDS(train_videos: 240, test_videos: 30) | 27.981062 | 0.7824855 | 57,696 | 62.8 |
| 3 | MVSR_4 | Ours | REDS(train_videos: 240, test_videos: 30) | 27.958328 | 0.78145003 | 37,664 | 42.4 |
| 4 | MVSR_12 | Ours | REDS(train_videos: 240, test_videos: 30) | 27.953104 | 0.7807622 | 68,169 | 38.4 |
| 5 | SORT_2 | Ours | REDS(train_videos: 240, test_videos: 30) | 27.93981 | 0.7808094 | 45,264 | 35.6 |
| 6 | SWRN | Sliding Window Recurrent Network for Efficient Video Super-Resolution (2022 SOTA) | REDS(train_videos: 240, test_videos: 30) | 27.92 | 0.77 | 43,472 | 31.0 |
| 7 | MVSR_11 | Ours | REDS(train_videos: 240, test_videos: 30) | 27.90566 | 0.7793118 | 32,553 | 48.7 |
| 8 | SWAT_3_5 | Ours | REDS(train_videos: 240, test_videos: 30) | 27.840628 | 0.7774375 | 37,312 | 39.4 |
| 9 | EESRNet | EESRNet: A Network for Energy Efficient Super-Resolution(2022) | REDS(train_videos: 240, test_videos: 30) | 27.84 | - | 62,550 | - |
| 10 | LiDeR | LiDeR: Lightweight Dense Residual Network for Video Super-Resolution on Mobile Devices (2022) | REDS(train_videos: 240, test_videos: 30) | 27.51 | 0.76 | - | - |
| 11 | EVSRNet | EVSRNet:Efficient Video Super-Resolution with Neural Architecture Search(2021) | REDS(train_videos: 240, test_videos: 30) | 27.42 | - | - | - |
| 12 | RCBSR | RCBSR: Re-parameterization Convolution Block for Super-Resolution(2022) | REDS(train_videos: 240, test_videos: 30) | 27.28 | 0.775 | - | - |
Benchmark_1
| Model | Source | Dataset | Test PSNR | Test SSIM | Params |
|---|---|---|---|---|---|
| SSL-uni | Structured Sparsity Learning for Efficient Video Super-Resolution (CVPR2023) | REDS(train:266 test:4) | 30.24 | 0.86 | 500,000 |
PaperWriting
No.1
- BSConvU for shallow feature extraction
- Recurrent neural network that lets feature information flow freely across frames
- Multi-distillation module through dynamic-routing large-ERF attention
- RGB channels share the same bilinear upsampling result
- Nearest conv for the residual branch: shorter inference time than a bilinear residual
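To make the motivation for BSConvU concrete, the parameter counts of a standard convolution and of a BSConv-U layer (a 1x1 pointwise convolution followed by a KxK depthwise convolution) can be compared directly. The channel sizes below are hypothetical, and the bias placement (none on the pointwise conv, one on the depthwise conv) is an assumption about the layer layout rather than something recorded in this log.

```python
def standard_conv_params(c_in, c_out, k, bias=True):
    """Parameters of a standard KxK convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)

def bsconv_u_params(c_in, c_out, k, bias=True):
    """Parameters of BSConv-U: 1x1 pointwise conv, then KxK depthwise conv."""
    pointwise = c_in * c_out                             # 1x1 conv, assumed bias-free
    depthwise = c_out * k * k + (c_out if bias else 0)   # one KxK filter per output channel
    return pointwise + depthwise

# Shallow feature extraction from an RGB frame into 16 feature channels (sizes are illustrative).
std = standard_conv_params(3, 16, 3)  # 3*16*9 + 16 = 448
bsu = bsconv_u_params(3, 16, 3)       # 3*16 + 16*9 + 16 = 208
print(std, bsu)

# The gap widens quickly for wider layers, e.g. 32 -> 32 channels.
print(standard_conv_params(32, 32, 3), bsconv_u_params(32, 32, 3))
```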
No.2
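- Motivation: mobile video super-resolution, Inference Time ↓, PSNR ↑, SSIM ↑
- Use only the HR frame predicted for the previous LR frame as the reference to compensate the current frame -> super-resolve in real time while recording, without being limited to videos that have already been recorded
- Assume the intermediate feature maps do not contribute equally to the output; how should the high-contribution feature maps be aggregated? -> use Partial Convolution to accelerate inference (analysis)
- Reduce the activations in the model -> exploit the ability of Multiply to produce a nonlinear mapping
- Share the upsampling compensation across the three RGB channels -> check whether the upsampling compensation of the RGB channels in conventional models is highly consistent; if so, it can be shared to reduce computation and speed up inference (analysis)
- Blueprint convolution for shallow feature extraction -> the final result is actually better than with standard convolution
- Attention-based fusion of multi-scale features (downsampled to different scales) <- motivation: in the primate visual cortex, neurons in the same area have different receptive fields; by analogy, different scales / receptive fields within the same layer of the model capture more precise spatial information or more texture information
- Fusion through short-distance shortcuts -> faster inference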
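The Partial Convolution item above can be sketched in NumPy as follows. The sketch assumes that "Partial Convolution" here means the FasterNet-style PConv (a regular convolution applied to only a fraction of the channels, with the remaining channels passed through untouched), which is my reading of the log rather than something it states; the helper names and tensor sizes are hypothetical.

```python
import numpy as np

def conv3x3_same(x, w):
    """3x3 convolution with zero 'same' padding. x: (H, W, Cin), w: (3, 3, Cin, Cout)."""
    h, wd, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, wd, w.shape[-1]), dtype=x.dtype)
    for i in range(3):
        for j in range(3):
            # Accumulate the contribution of kernel tap (i, j) for every pixel at once.
            out += xp[i:i + h, j:j + wd, :] @ w[i, j]
    return out

def partial_conv(x, w):
    """Convolve only the first c_p channels; the remaining channels pass through unchanged."""
    c_p = w.shape[2]
    return np.concatenate([conv3x3_same(x[..., :c_p], w), x[..., c_p:]], axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 16))        # toy feature map with 16 channels
ratio = 0.25                           # partial ratio, cf. the 1/2, 1/4, 1/8 ablations
c_p = int(16 * ratio)
w = rng.normal(size=(3, 3, c_p, c_p))  # hypothetical weights
y = partial_conv(x, w)

full_macs = 8 * 8 * 9 * 16 * 16        # multiply-accumulates of a full 3x3 conv
partial_macs = 8 * 8 * 9 * c_p * c_p   # multiply-accumulates of the partial conv
print(y.shape, full_macs // partial_macs)
```

With a ratio of 1/4 the convolution cost drops by a factor of 16, which is consistent in direction with the FLOPs differences between the partial-ratio ablation rows in Milestone_3.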
No.3
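- Motivation: mobile video super-resolution, Inference Time ↓, PSNR ↑, SSIM ↑
- Auxiliary forward/backward hidden states for feature alignment -> higher SR PSNR
- Assume the intermediate feature maps do not contribute equally to the output; how should the high-contribution feature maps be aggregated? -> use Partial Convolution to accelerate inference (analysis)
- Reduce the activations in the model -> exploit the ability of Multiply to produce a nonlinear mapping, which speeds up inference
- Consider dynamic depth (adaptive exiting) -> faster inference -> deprecated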
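The idea of replacing activations with Multiply, listed above, can be illustrated with a gate that splits the channels in half and multiplies the halves. This assumes the "Non Activation Block" rows in the results tables follow the NAFNet-style SimpleGate; that link is an inference from the naming, and `simple_gate` is an illustrative name.

```python
import numpy as np

def simple_gate(x):
    """Split channels into two halves and multiply them element-wise (no ReLU/GELU)."""
    a, b = np.split(x, 2, axis=-1)
    return a * b

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4, 8))  # toy feature map with 8 channels
y = simple_gate(x)              # output has 4 channels

# The mapping is nonlinear: doubling the input quadruples the output.
print(y.shape, np.allclose(simple_gate(2 * x), 4 * y))
```

The gate halves the channel count, so the preceding convolution usually has to produce twice as many channels as the block needs afterwards.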
PaperReference
- Rethinking Alignment in Video Super-Resolution Transformers (NeurIPS 2022) -> in ViT-based video super-resolution (VSR), frame/feature alignment is not a necessary operation
- An Implicit Alignment for Video Super-Resolution (ArXiv 2023) -> improves bilinear interpolation/resampling
- Video Super-Resolution Transformer
- Efficient Reference-based Video Super-Resolution (ERVSR): Single Reference Image Is All You Need (WACV 2023) -> the middle frame of the sequence serves as the reference frame that assists super-resolution of the current frame
- Multi-Stage Feature Alignment Network for Video Super-Resolution
- ELSR: Extreme Low-Power Super Resolution Network For Mobile Devices
- LiDeR: Lightweight Dense Residual Network for Video Super-Resolution on Mobile Devices
- Look Back and Forth: Video Super-Resolution with Explicit Temporal Difference Modeling
- Collapsible Linear Blocks for Super-Efficient Super Resolution
- Revisiting Temporal Alignment for Video Restoration
- BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment
- EVSRNet: Efficient Video Super-Resolution with Neural Architecture Search
- BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond
- Revisiting Temporal Modeling for Video Super-resolution -> official baseline of the first MAI VSR challenge
- TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution (CVPR 2020)
- Video Super-resolution with Temporal Group Attention (CVPR 2020)
- 3DSRnet: Video Super-resolution using 3D Convolutional Neural Networks
- Frame-Recurrent Video Super-Resolution
- Video Super-Resolution With Convolutional Neural Networks