Papers on attentions
1. Short walk-through on attention papers
1.1 Attention for CNNs
Squeeze-and-Excitation Networks, 2017, Figure 2
Generates a single attention value for each channel.
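A minimal sketch of this kind of channel attention (my own PyTorch rendering, not the authors' code; the reduction ratio r=16 is an assumption based on the paper's default):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation style channel attention (sketch)."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                           # x: (N, C, H, W)
        s = x.mean(dim=(2, 3))                      # squeeze: one value per channel
        w = self.fc(s).unsqueeze(-1).unsqueeze(-1)  # excitation: per-channel weight in (0, 1)
        return x * w                                # rescale each channel
```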
CBAM: Convolutional Block Attention Module, 2018, Figures 1 and 2
For CNNs. Combines channel attention and spatial attention: first apply channel attention, then spatial attention.
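The spatial half of CBAM can be sketched like this (the channel half is close to the SE block above; the 7x7 kernel is an assumption from the paper's default):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention (sketch)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                           # x: (N, C, H, W)
        avg = x.mean(dim=1, keepdim=True)           # (N, 1, H, W) channel-wise average
        mx = x.max(dim=1, keepdim=True).values      # (N, 1, H, W) channel-wise max
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                             # reweight each spatial location
```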
Non-local Neural Networks, 2017, Figure 2
The output of a convolutional layer consists of T*H*W*C (time, height, width, channels) units.
For each unit, calculate the effect of all other units on it; this results in T*H*W*C values.
Add these values to the original output units.
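A rough sketch of such a non-local block for 2D feature maps (embedded-Gaussian form; the 1x1 convs and the halved inner channel count are assumptions following the paper's general recipe):

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Non-local block: every unit attends to every other unit (sketch)."""
    def __init__(self, channels):
        super().__init__()
        inner = channels // 2
        self.theta = nn.Conv2d(channels, inner, 1)
        self.phi = nn.Conv2d(channels, inner, 1)
        self.g = nn.Conv2d(channels, inner, 1)
        self.out = nn.Conv2d(inner, channels, 1)

    def forward(self, x):                               # x: (N, C, H, W)
        n, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)    # (N, HW, C')
        k = self.phi(x).flatten(2)                      # (N, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)        # (N, HW, C')
        attn = torch.softmax(q @ k, dim=-1)             # effect of every unit on every unit
        y = (attn @ v).transpose(1, 2).reshape(n, -1, h, w)
        return x + self.out(y)                          # add back to the original output
```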
CCNet: Criss-Cross Attention for Semantic Segmentation, 2018, Figure 1
A work based on Non-local neural networks. It optimizes “for each unit, calculate the effect of all other units on it” to be:
i) GPU-memory friendly; ii) computationally efficient; iii) state-of-the-art in performance.
It first calculates the effect of the units that are in the same row or the same column as the target unit.
Then it repeats “calculate the effect of the units that are in the same row or the same column as the target unit”.
After one repetition, every unit has received the effect of all other units.
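A very simplified sketch of the criss-cross connection pattern (row and column attention are computed separately here, and the softmax handling differs from the real implementation, so treat this only as an illustration):

```python
import torch
import torch.nn as nn

class CrissCrossAttention(nn.Module):
    """Each position attends only to its own row and column (sketch)."""
    def __init__(self, channels):
        super().__init__()
        inner = max(channels // 8, 1)
        self.q = nn.Conv2d(channels, inner, 1)
        self.k = nn.Conv2d(channels, inner, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                                    # x: (N, C, H, W)
        q, k, v = self.q(x), self.k(x), self.v(x)
        row = torch.softmax(                                  # (N, H, W, W): same-row affinities
            torch.einsum('nchw,nchv->nhwv', q, k), dim=-1)
        col = torch.softmax(                                  # (N, W, H, H): same-column affinities
            torch.einsum('nchw,ncgw->nwhg', q, k), dim=-1)
        out = (torch.einsum('nhwv,nchv->nchw', row, v)
               + torch.einsum('nwhg,ncgw->nchw', col, v))
        return x + out                                        # applying the module twice covers all positions
```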
Dynamic Convolution: Attention over Convolution Kernels
https://www.youtube.com/watch?v=FNkY7I2R_zM Let a traditional layer (relative to the one in this paper) have N kernels.
For each of those N kernels, this work aggregates K kernels into one, with the attention over the K kernels computed from the
input data (or input feature maps). This trick gives the model many more parameters (so the experiments are
run on MobileNet), but the computation cost stays almost the same, because computing the attention (pool
the input, then apply an MLP) and aggregating the kernels (a weighted sum of small kernels) is not computationally expensive.
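A sketch of the kernel-aggregation trick (K=4, the squeeze ratio in the attention branch, and the use of average pooling are illustrative assumptions; this is not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Aggregate K candidate kernels with input-dependent attention (sketch)."""
    def __init__(self, in_ch, out_ch, kernel_size, K=4, padding=0):
        super().__init__()
        self.K, self.padding = K, padding
        self.weight = nn.Parameter(
            torch.randn(K, out_ch, in_ch, kernel_size, kernel_size) * 0.02)
        hidden = max(in_ch // 4, K)
        self.attn = nn.Sequential(                   # pool the input, then an MLP
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, K))

    def forward(self, x):                            # x: (N, C_in, H, W)
        n = x.size(0)
        pi = torch.softmax(self.attn(x), dim=1)      # (N, K) mixing weights
        w = torch.einsum('nk,koihw->noihw', pi, self.weight)  # cheap weighted sum of small kernels
        w = w.reshape(-1, *self.weight.shape[2:])    # (N*out_ch, in_ch, k, k)
        # grouped-conv trick: apply a different aggregated kernel to each sample
        y = F.conv2d(x.reshape(1, -1, *x.shape[2:]), w, padding=self.padding, groups=n)
        return y.reshape(n, -1, *y.shape[2:])
```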
Exploring Self-attention for Image Recognition
https://blog.csdn.net/gitcat/article/details/106984692, https://www.pianshen.com/article/58031385858/
Axial Attention in Multidimensional Transformers, 2019
Very similar to CCNet. It calculates the effect of the units that are in the same row or the same column as the target unit.
Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation
https://www.youtube.com/watch?v=-iAXF-vibdE https://www.youtube.com/watch?v=hv3UO3G0Ofo
This work is based on the paper “Non-local neural networks”. It has two novelties: i) the non-local attention is computed
in a much cheaper way: each time the attention is computed, only the neighbors in the same row or column are used;
ii) position embeddings: for each position pair (i, j), three embedding vectors are computed, to be queried by the query, key,
and value vectors respectively.
One highlight of this paper is that the axial attention can be used in a stand-alone way with very good performance. In other
words, remove the traditional convolutional layers and only use this kind of forward operation.
First I watched the short explanation video by the paper authors and learned the details of the paper. Then I watched the
explanation video by Yannic Kilcher. He said attention has already replaced the LSTM, and it may replace the convolution
operation in the next year, month, day, or minute.
1.2 Attention for NLP
Fast Transformers with Clustered Attention, 2020 July
Computing attention has quadratic complexity N*N with respect to the sequence length N, making it prohibitively
expensive for long sequences. So cluster the queries into C clusters to reduce the complexity to N*C.
While computing attention, each query is replaced by the centroid of its cluster.
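A toy sketch of the idea (the plain k-means and the single-head shapes here are my simplifications, not the paper's implementation, which initializes the clustering with LSH and refines the result):

```python
import torch

def clustered_attention(Q, K, V, n_clusters, iters=10):
    """Attend with C query centroids instead of all N queries (sketch)."""
    N, d = Q.shape                                      # Q: (N, d); K, V: (M, d)
    centroids = Q[torch.randperm(N)[:n_clusters]].clone()
    for _ in range(iters):                              # crude k-means on the queries
        assign = torch.cdist(Q, centroids).argmin(dim=1)
        for c in range(n_clusters):
            members = Q[assign == c]
            if len(members):
                centroids[c] = members.mean(dim=0)
    attn = torch.softmax(centroids @ K.t() / d ** 0.5, dim=-1)  # (C, M): O(C*M), not O(N*M)
    centroid_out = attn @ V                             # (C, d)
    return centroid_out[assign]                         # each query gets its centroid's output
```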
Efficient Content-Based Sparse Attention with Routing Transformers, 2019
The reviews are very informative, https://openreview.net/forum?id=B1gjs6EtDr. Rejected by ICLR 2020.
This paper learns the attention sparsity in a data-driven fashion. It divides the tokens into several clusters by clustering
the sum of the queries multiplied by a projection matrix and the keys multiplied by the same projection matrix.
That is a mouthful; in math, we cluster the rows of R = Q*W_R + K*W_R, where W_R is the shared projection matrix.
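A tiny sketch of the routing idea as I read it (a one-shot nearest-centroid assignment stands in for the paper's online k-means, and the dense N*N score matrix is kept only for clarity; a real implementation gathers per-cluster blocks to get the speedup):

```python
import torch

def routing_attention(Q, K, V, W_R, n_clusters):
    """Tokens attend only within their cluster of R = Q*W_R + K*W_R (sketch)."""
    R = Q @ W_R + K @ W_R                                   # rows of R get clustered
    centroids = R[torch.randperm(R.size(0))[:n_clusters]]   # stand-in for learned/online clustering
    assign = torch.cdist(R, centroids).argmin(dim=1)        # cluster id per token
    scores = Q @ K.t() / Q.size(-1) ** 0.5                  # (N, N), dense only for clarity
    mask = assign.unsqueeze(0) != assign.unsqueeze(1)       # drop pairs from different clusters
    scores = scores.masked_fill(mask, float('-inf'))
    return torch.softmax(scores, dim=-1) @ V
```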
2. Taxonomizing
2.1 Review/Surveys
Efficient Transformers: A Survey, Sep 14 2020. A good paper. *****, five stars.
2.2 Papers to reduce the computation complexity
Fast Transformers with Clustered Attention, 2020 July
Reformer: The Efficient Transformer, 2020 Feb
Large Memory Layers with Product Keys, 2019, hot on GitHub
Generating Long Sequences with Sparse Transformers, 2019, hot on GitHub
Efficient Content-Based Sparse Attention with Routing Transformers, 2019
2.3 Papers using “weird” connections to reduce computation complexity
CCNet: Criss-Cross Attention for Semantic Segmentation, 2018, Figure 1
Axial Attention in Multidimensional Transformers, 2019
Star-Transformer, 2019
Interlaced Sparse Self-Attention for Semantic Segmentation, 2019
SCRAM: Spatially Coherent Randomized Attention Maps, 2019
Transformer on a Diet, 2020 Feb
SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection, 2020 April
2.4 Papers with better performance on long text sequences
BP-Transformer: Modelling Long-Range Context via Binary Partitioning, 2019
Compressive Transformers for Long-Range Sequence Modelling, 2019
Longformer: The Long-Document Transformer, April 2020, hot on GitHub
2.5 Applications
Jukebox: A Generative Model for Music, 2020 April, explained by Yannic
2.6 TODO papers
SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks, 2020 Jun
Equivariant Transformer Networks, 2019
Improving the Robustness of Capsule Networks to Image Affine Transformations
Capsule-Transformer for Neural Machine Translation
Visual-textual Capsule Routing for Text-based Video Segmentation
Group Feedback Capsule Network
Capsules for Object Segmentation (SegCaps)
Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks, 2019
3. Resources/credits of this article
https://www.pragmatic.ml/a-survey-of-methods-for-incorporating-long-term-context/ A blog post on many attention variants.
https://twitter.com/ilyasut/status/1111057995650072576?lang=en “Little known fact: the transformer is a close cousin of the capsule network, because soft attention is ‘routing by agreement’.”
http://www.cs.toronto.edu/~hinton/absps/object-based81.pdf A 1981 paper by Hinton related to CNNs and capsule networks.
https://www.cnblogs.com/xiximayou/p/13192378.html Some papers on improving the convolution operation.
https://github.com/cszn/KAIR
https://mp.weixin.qq.com/s?__biz=MzI5MDUyMDIxNA==&mid=2247505981&idx=3&sn=eebcfd42b1f6eaeb265c983c05f657d5&chksm=ec1c35c4db6bbcd25bee7fc27c5380da0ab502a3ffcb6767588f199c8b5d15574ebace68b18a&mpshare=1&scene=1&srcid=08287QPmzCUFem64oVejWEFA&sharer_sharetime=1598570878555&sharer_shareid=76cd298123cfe939784e14c0c20d80be&key=10b5f81a6836622319d9c3cf6c2ba88426e39fb9095f0d3aa64bf812d60c27e9492df3c2aa1eeeb1c8c5b9a7b5fd943019926b4228dd1a6e2775f9908b54c07a573c7c298dd6746d2fc3f197c7a476eadf47cafabb41924bf4fbba372f6a46ae6e1821d5c41ae116245ec10e9afe1d776b848d2bd816edadee2a01e6c7e26489&ascene=1&uin=MjAwOTEzMzgzNA%3D%3D&devicetype=Windows+10+x64&version=62090529&lang=en&exportkey=AzryKHsZgMlDLj0wVXHGWrE%3D&pass_ticket=fwwWHvq2CqMcdTKhhkZmShqwghX3ZRjk2WAlC80EsasaDDe3zwRGGRlWEFOX1Ioh&wx_header=0 A big collection of modified attention variants.