Papers on attention
1. Short walk-through on attention papers
1.1 Attention for CNNs
Squeeze-and-Excitation Networks, 2017, Fig. 2
Generates a single-value attention weight for each channel.
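A minimal PyTorch sketch of the squeeze-and-excitation idea (the layer sizes and reduction ratio here are illustrative, not necessarily the paper's exact configuration):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: one attention scalar per channel."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                  # squeeze: global average pool -> (B, C)
        w = self.fc(w)                          # excitation: per-channel weight in (0, 1)
        return x * w.view(x.size(0), -1, 1, 1)  # rescale each channel by its weight
```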
CBAM: Convolutional Block Attention Module, 2018, Figs. 1 and 2
For CNNs. Combines channel attention and spatial attention: channel attention is applied first, then spatial attention.
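A rough sketch of the CBAM ordering (channel attention first, then spatial attention); the module internals below are simplified rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SimpleCBAM(nn.Module):
    """Simplified CBAM-style block: channel attention, then spatial attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                                     # x: (B, C, H, W)
        # channel attention from avg- and max-pooled descriptors
        avg = self.channel_mlp(x.mean(dim=(2, 3)))
        mx = self.channel_mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        # spatial attention from channel-wise avg and max maps
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)   # (B, 2, H, W)
        return x * torch.sigmoid(self.spatial_conv(s))
```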
Non-local neural networks, 2017, Fig. 2
The output of a convolutional layer consists of T*H*W*C (time, height, width, channels) units.
For each unit, calculate the effect of all other units on it. This results in T*H*W*C values.
Add these values to the original output units (a residual connection).
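A minimal sketch of the non-local operation in the embedded-Gaussian spirit; the 1x1-convolution embeddings for query/key/value and the output projection used in the paper are omitted for brevity:

```python
import torch
import torch.nn.functional as F

def nonlocal_block(x):
    """x: (B, C, T, H, W). Every unit aggregates information from all other units."""
    B, C, T, H, W = x.shape
    flat = x.reshape(B, C, -1)                                        # (B, C, THW)
    # pairwise affinities between all THW positions, normalized per query position
    attn = F.softmax(torch.bmm(flat.transpose(1, 2), flat), dim=-1)   # (B, THW, THW)
    out = torch.bmm(flat, attn.transpose(1, 2))                       # (B, C, THW)
    return x + out.reshape(B, C, T, H, W)                             # residual connection
```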
CCNet: Criss-Cross Attention for Semantic Segmentation, 2018, Fig. 1
A work based on Non-local neural networks. It optimizes the step “for each unit, calculate the effect of all other units on it” to be:
i) GPU-memory friendly; ii) computationally efficient; iii) state-of-the-art in performance.
It first calculates the effect of the units that lie in the same row or the same column as the target unit.
Then it repeats this row-and-column step once more.
After that one repeat, each unit has received the effect of all other units.
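A rough sketch of one criss-cross pass (learned projections, the duplicated center position, and other details from the paper are glossed over); applying the pass twice lets every position receive information from every other position:

```python
import torch
import torch.nn.functional as F

def criss_cross_pass(q, k, v):
    """q, k, v: (B, C, H, W). Every position attends over its own row and column."""
    B, C, H, W = q.shape
    # row energies: each position against the W positions in its row
    e_row = torch.matmul(q.permute(0, 2, 3, 1), k.permute(0, 2, 1, 3))        # (B, H, W, W)
    # column energies: each position against the H positions in its column
    e_col = torch.matmul(q.permute(0, 3, 2, 1), k.permute(0, 3, 1, 2))        # (B, W, H, H)
    # joint softmax over the row + column neighborhood of each position
    attn = F.softmax(torch.cat([e_row, e_col.permute(0, 2, 1, 3)], dim=-1), dim=-1)
    a_row, a_col = attn[..., :W], attn[..., W:]                   # (B, H, W, W) / (B, H, W, H)
    out_row = torch.matmul(a_row, v.permute(0, 2, 3, 1))                      # (B, H, W, C)
    out_col = torch.matmul(a_col.permute(0, 2, 1, 3), v.permute(0, 3, 2, 1))  # (B, W, H, C)
    return (out_row + out_col.permute(0, 2, 1, 3)).permute(0, 3, 1, 2)        # (B, C, H, W)
```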
Dynamic Convolution: Attention over Convolution Kernels
https://www.youtube.com/watch?v=FNkY7I2R_zM Let a traditional layer (compared to the work in this paper) have N kernels.
This work aggregates K kernels into each of the N kernels, where the attention over the K kernels is computed from the input
data (or input feature maps). The trick gives the model far more parameters (so the experiments are
run on MobileNet), but the computation cost stays almost the same, because computing the attention (pool
the input, then apply an MLP) and aggregating the kernels (a weighted sum over small kernels) is not computationally expensive.
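A minimal sketch of the dynamic-convolution idea for a single layer: K candidate kernels are aggregated into one kernel per input, with attention computed from the pooled input (the attention head and initialization here are illustrative, not the paper's exact design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Aggregate K candidate kernels with input-dependent attention weights."""
    def __init__(self, in_ch, out_ch, kernel_size=3, K=4):
        super().__init__()
        self.padding = kernel_size // 2
        self.kernels = nn.Parameter(
            0.01 * torch.randn(K, out_ch, in_ch, kernel_size, kernel_size))
        self.attn = nn.Linear(in_ch, K)   # tiny attention head (illustrative)

    def forward(self, x):                 # x: (B, in_ch, H, W)
        B = x.size(0)
        a = F.softmax(self.attn(x.mean(dim=(2, 3))), dim=-1)    # (B, K) kernel weights
        w = torch.einsum('bk,koihw->boihw', a, self.kernels)    # per-sample aggregated kernel
        # grouped-conv trick: apply a different aggregated kernel to each sample
        out = F.conv2d(x.reshape(1, -1, *x.shape[2:]),
                       w.reshape(-1, *w.shape[2:]),
                       padding=self.padding, groups=B)
        return out.reshape(B, -1, *out.shape[2:])
```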
Exploring Self-attention for Image Recognition
https://blog.csdn.net/gitcat/article/details/106984692, https://www.pianshen.com/article/58031385858/
Axial Attention in Multidimensional Transformers, 2019
Very similar to CCNet. It calculates the effect of the units that lie in the same row or the same column as the target unit.
Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation
https://www.youtube.com/watch?v=-iAXF-vibdE https://www.youtube.com/watch?v=hv3UO3G0Ofo
This work is based on the paper “Non-local neural networks”. It has two novelties: i) the non-local attention is computed
in a much cheaper way; each time attention is computed, only the neighbors in the same row or column are used.
ii) position embeddings: for each position pair (i, j), three embedding vectors are computed, to be queried by the query, key,
and value vectors respectively.
One highlight of this paper is that the axial attention can be used in a stand-alone way with very good performance. In other
words, remove the traditional convolutional layers and use only this kind of forward operation.
First I watched the short explanation video by the paper's authors and learned the details of the paper. Then I watched the
explanation video by Yannic Kilcher. He said that attention has already replaced the LSTM, and it may replace the convolution
operation in the next year, month, day or minute.
1.2 Attention for NLP
Fast Transformers with Clustered Attention, 2020 July
Computing attention has quadratic complexity N*N in the sequence length N, making it prohibitively
expensive for long sequences. The queries are therefore grouped into C clusters to reduce the complexity to N*C.
While computing attention, each query is replaced by the centroid of its cluster.
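A toy sketch of the clustered-attention trick: group the N queries into C clusters, attend with the C centroids only, then broadcast each centroid's output back to the queries in its cluster (the crude k-means below stands in for the paper's grouping procedure):

```python
import torch
import torch.nn.functional as F

def clustered_attention(Q, K, V, C=16, iters=5):
    """Q, K, V: (N, d). Attention cost drops from N*N to roughly C*N."""
    N, d = Q.shape
    # crude k-means over the queries (the paper uses a more careful grouping)
    centroids = Q[torch.randperm(N)[:C]].clone()
    for _ in range(iters):
        assign = torch.cdist(Q, centroids).argmin(dim=1)       # (N,) cluster id per query
        for c in range(C):
            members = Q[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(dim=0)
    # full attention, but only for the C centroids instead of the N queries
    attn = F.softmax(centroids @ K.t() / d ** 0.5, dim=-1)     # (C, N)
    centroid_out = attn @ V                                    # (C, d)
    # every query reuses the output computed for its cluster's centroid
    return centroid_out[assign]                                # (N, d)
```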
Efficient Content-Based Sparse Attention with Routing Transformers, 2019
The reviews are very informative: https://openreview.net/forum?id=B1gjs6EtDr. Rejected by ICLR 2020.
This paper learns the attention sparsity in a data-driven fashion. It divides the tokens into several clusters by clustering
the sum of the queries multiplied by a projection matrix and the keys multiplied by the same projection matrix.
That is a mouthful; in math, we cluster the rows of R = Q W_R + K W_R, where W_R is the projection matrix.
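A hedged sketch of the routing idea as summarized above: tokens are projected with a shared matrix W_R, clustered, and each query attends only to the keys routed to the same cluster (the online k-means that maintains the centroids, causal masking, and balanced cluster sizes are all omitted):

```python
import torch
import torch.nn.functional as F

def routed_attention(Q, K, V, W_R, centroids):
    """Q, K, V: (N, d); W_R: (d, d_r); centroids: (C, d_r)."""
    N, d = Q.shape
    R = Q @ W_R + K @ W_R                               # route tokens via shared projection
    assign = torch.cdist(R, centroids).argmin(dim=1)    # (N,) cluster id per token
    out = torch.zeros_like(V)
    for c in assign.unique():
        idx = (assign == c).nonzero(as_tuple=True)[0]   # tokens routed to cluster c
        attn = F.softmax(Q[idx] @ K[idx].t() / d ** 0.5, dim=-1)
        out[idx] = attn @ V[idx]
    return out
```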
2. Taxonomizing
2.1 Review/Surveys
Efficient Transformers: A Survey, Sep 14 2020. A good paper. *****, five stars.
2.2 Papers to reduce the computation complexity
Fast Transformers with Clustered Attention, 2020 July
Reformer: The Efficient Transformer, 2020 Feb
Large Memory Layers with Product Keys, 2019, hot on GitHub
Generating Long Sequences with Sparse Transformers, 2019, hot on GitHub
Efficient Content-Based Sparse Attention with Routing Transformers, 2019
2.3 Papers using “weird” connections to reduce computation complexity
CCNet: Criss-Cross Attention for Semantic Segmentation 2018, figure. 1
Axial Attention in Multidimensional Transformers, 2019
Star-Transformer, 2019
Interlaced Sparse Self-Attention for Semantic Segmentation, 2019
SCRAM: Spatially Coherent Randomized Attention Maps, 2019
Transformer on a Diet, 2020 Feb
SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection, 2020, April
2.4 Papers to have a better performance on long text sequences
BP-Transformer: Modelling Long-Range Context via Binary Partitioning, 2019
Compressive Transformers for Long-Range Sequence Modelling, 2019
Longformer: The Long-Document Transformer, April 2020, hot on GitHub
2.5 Applications
Jukebox: A Generative Model for Music, 2020 April, Explained by Yannic
2.6 TODO papers
SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks, 2020 Jun
Equivariant Transformer Networks, 2019
Improving the Robustness of Capsule Networks to Image Affine Transformations
Capsule-Transformer for Neural Machine Translation
Visual-textual Capsule Routing for Text-based Video Segmentation
Group Feedback Capsule Network
Capsules for Object Segmentation (SegCaps)
Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks, 2019
3. Resources/credits of this article
https://www.pragmatic.ml/a-survey-of-methods-for-incorporating-long-term-context/ A blog post covering many attention variants.
https://twitter.com/ilyasut/status/1111057995650072576?lang=en little known fact, the transformer is a close cousin of the capsule network, because soft attention is "routing by agreement".
http://www.cs.toronto.edu/~hinton/absps/object-based81.pdf A paper in 1981 by Hinton related to CNNs and capsule networks
https://www.cnblogs.com/xiximayou/p/13192378.html, some papers on improving the convolution operation (in Chinese)
https://github.com/cszn/KAIR
https://mp.weixin.qq.com/s?__biz=MzI5MDUyMDIxNA==&mid=2247505981&idx=3&sn=eebcfd42b1f6eaeb265c983c05f657d5&chksm=ec1c35c4db6bbcd25bee7fc27c5380da0ab502a3ffcb6767588f199c8b5d15574ebace68b18a&mpshare=1&scene=1&srcid=08287QPmzCUFem64oVejWEFA&sharer_sharetime=1598570878555&sharer_shareid=76cd298123cfe939784e14c0c20d80be&key=10b5f81a6836622319d9c3cf6c2ba88426e39fb9095f0d3aa64bf812d60c27e9492df3c2aa1eeeb1c8c5b9a7b5fd943019926b4228dd1a6e2775f9908b54c07a573c7c298dd6746d2fc3f197c7a476eadf47cafabb41924bf4fbba372f6a46ae6e1821d5c41ae116245ec10e9afe1d776b848d2bd816edadee2a01e6c7e26489&ascene=1&uin=MjAwOTEzMzgzNA%3D%3D&devicetype=Windows+10+x64&version=62090529&lang=en&exportkey=AzryKHsZgMlDLj0wVXHGWrE%3D&pass_ticket=fwwWHvq2CqMcdTKhhkZmShqwghX3ZRjk2WAlC80EsasaDDe3zwRGGRlWEFOX1Ioh&wx_header=0 A big collection of modified attention mechanisms (WeChat article, in Chinese)