Papers on attentions
1. Short walk-through on attention papers
1.1 Attention for CNNs
Squeeze-and-Excitation Networks, 2017, Figure 2
Generates a single attention value for each channel.
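A minimal sketch of this kind of channel attention (my own PyTorch rendering, not the authors' code; the reduction ratio r=16 is an assumption based on the paper's default):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation style channel attention (sketch)."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                           # x: (N, C, H, W)
        s = x.mean(dim=(2, 3))                      # squeeze: one value per channel
        w = self.fc(s).unsqueeze(-1).unsqueeze(-1)  # excitation: per-channel weight in (0, 1)
        return x * w                                # rescale each channel
```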
CBAM: Convolutional Block Attention Module, 2018, Figures 1 and 2
For CNNs. Combines channel attention and spatial attention: first apply channel attention, then spatial attention.
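The spatial half of CBAM can be sketched like this (the channel half is close to the SE block above; the 7x7 kernel is an assumption from the paper's default):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention (sketch)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                           # x: (N, C, H, W)
        avg = x.mean(dim=1, keepdim=True)           # (N, 1, H, W) channel-wise average
        mx = x.max(dim=1, keepdim=True).values      # (N, 1, H, W) channel-wise max
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                             # reweight each spatial location
```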
Non-local Neural Networks, 2017, Figure 2
The output of a convolutional layer consists of T*H*W*C (time, height, width, channels) units.
For each unit, calculate the effect of all other units on it; this results in T*H*W*C values.
Add these values to the original output units.
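A rough sketch of such a non-local block for 2D feature maps (embedded-Gaussian form; the 1x1 convs and the halved inner channel count are assumptions following the paper's general recipe):

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Non-local block: every unit attends to every other unit (sketch)."""
    def __init__(self, channels):
        super().__init__()
        inner = channels // 2
        self.theta = nn.Conv2d(channels, inner, 1)
        self.phi = nn.Conv2d(channels, inner, 1)
        self.g = nn.Conv2d(channels, inner, 1)
        self.out = nn.Conv2d(inner, channels, 1)

    def forward(self, x):                               # x: (N, C, H, W)
        n, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)    # (N, HW, C')
        k = self.phi(x).flatten(2)                      # (N, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)        # (N, HW, C')
        attn = torch.softmax(q @ k, dim=-1)             # effect of every unit on every unit
        y = (attn @ v).transpose(1, 2).reshape(n, -1, h, w)
        return x + self.out(y)                          # add back to the original output
```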
CCNet: Criss-Cross Attention for Semantic Segmentation, 2018, Figure 1
A work based on Non-local neural networks. It optimizes “for each unit, calculate the effect of all other units on it” to be:
i) GPU-memory friendly; ii) computationally efficient; iii) state-of-the-art in performance.
It first calculates the effect of the units that are in the same row or the same column as the target unit.
Then it repeats “calculate the effect of the units that are in the same row or the same column as the target unit”.
After one repetition, every unit has received the effect of all other units.
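A very simplified sketch of the criss-cross connection pattern (row and column attention are computed separately here, and the softmax handling differs from the real implementation, so treat this only as an illustration):

```python
import torch
import torch.nn as nn

class CrissCrossAttention(nn.Module):
    """Each position attends only to its own row and column (sketch)."""
    def __init__(self, channels):
        super().__init__()
        inner = max(channels // 8, 1)
        self.q = nn.Conv2d(channels, inner, 1)
        self.k = nn.Conv2d(channels, inner, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                                    # x: (N, C, H, W)
        q, k, v = self.q(x), self.k(x), self.v(x)
        row = torch.softmax(                                  # (N, H, W, W): same-row affinities
            torch.einsum('nchw,nchv->nhwv', q, k), dim=-1)
        col = torch.softmax(                                  # (N, W, H, H): same-column affinities
            torch.einsum('nchw,ncgw->nwhg', q, k), dim=-1)
        out = (torch.einsum('nhwv,nchv->nchw', row, v)
               + torch.einsum('nwhg,ncgw->nchw', col, v))
        return x + out                                        # applying the module twice covers all positions
```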
Dynamic Convolution: Attention over Convolution Kernels
https://www.youtube.com/watch?v=FNkY7I2R_zM Let a traditional layer (relative to the one in this paper) have N kernels.
For each of those N kernels, this work aggregates K kernels into one, with the attention over the K kernels computed from the
input data (or input feature maps). This trick gives the model many more parameters (so the experiments are
run on MobileNet), but the computation cost stays almost the same, because computing the attention (pool
the input, then apply an MLP) and aggregating the kernels (a weighted sum of small kernels) is not computationally expensive.
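A sketch of the kernel-aggregation trick (K=4, the squeeze ratio in the attention branch, and the use of average pooling are illustrative assumptions; this is not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Aggregate K candidate kernels with input-dependent attention (sketch)."""
    def __init__(self, in_ch, out_ch, kernel_size, K=4, padding=0):
        super().__init__()
        self.K, self.padding = K, padding
        self.weight = nn.Parameter(
            torch.randn(K, out_ch, in_ch, kernel_size, kernel_size) * 0.02)
        hidden = max(in_ch // 4, K)
        self.attn = nn.Sequential(                   # pool the input, then an MLP
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, K))

    def forward(self, x):                            # x: (N, C_in, H, W)
        n = x.size(0)
        pi = torch.softmax(self.attn(x), dim=1)      # (N, K) mixing weights
        w = torch.einsum('nk,koihw->noihw', pi, self.weight)  # cheap weighted sum of small kernels
        w = w.reshape(-1, *self.weight.shape[2:])    # (N*out_ch, in_ch, k, k)
        # grouped-conv trick: apply a different aggregated kernel to each sample
        y = F.conv2d(x.reshape(1, -1, *x.shape[2:]), w, padding=self.padding, groups=n)
        return y.reshape(n, -1, *y.shape[2:])
```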
Exploring Self-attention for Image Recognition
https://blog.csdn.net/gitcat/article/details/106984692, https://www.pianshen.com/article/58031385858/
Axial Attention in Multidimensional Transformers, 2019
Very similar to CCNet. It calculates the effect of the units that are in the same row or the same column as the target unit.
Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation
https://www.youtube.com/watch?v=-iAXF-vibdE https://www.youtube.com/watch?v=hv3UO3G0Ofo
This work is based on the paper “Non-local neural networks”. It has two novelties: i) the non-local attention is computed
in a much cheaper way: each time the attention is computed, only the neighbors in the same row or column are used;
ii) position embeddings: for each position pair (i, j), three embedding vectors are computed, to be queried by the query, key,
and value vectors respectively.
One highlight of this paper is that the axial attention can be used in a stand-alone way with very good performance. In other
words, remove the traditional convolutional layers and only use this kind of forward operation.
First I watched the short explanation video by the paper authors and learned the details of the paper. Then I watched the
explanation video by Yannic Kilcher. He said attention has already replaced the LSTM, and it may replace the convolution
operation in the next year, month, day, or minute.
1.2 Attention for NLP
Fast Transformers with Clustered Attention, 2020 July
Computing attention has quadratic complexity N*N with respect to the sequence length N, making it prohibitively
expensive for long sequences. So cluster the queries into C clusters to reduce the complexity to N*C.
While computing attention, each query is replaced by the centroid of its cluster.
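A toy sketch of the idea (the plain k-means and the single-head shapes here are my simplifications, not the paper's implementation, which initializes the clustering with LSH and refines the result):

```python
import torch

def clustered_attention(Q, K, V, n_clusters, iters=10):
    """Attend with C query centroids instead of all N queries (sketch)."""
    N, d = Q.shape                                      # Q: (N, d); K, V: (M, d)
    centroids = Q[torch.randperm(N)[:n_clusters]].clone()
    for _ in range(iters):                              # crude k-means on the queries
        assign = torch.cdist(Q, centroids).argmin(dim=1)
        for c in range(n_clusters):
            members = Q[assign == c]
            if len(members):
                centroids[c] = members.mean(dim=0)
    attn = torch.softmax(centroids @ K.t() / d ** 0.5, dim=-1)  # (C, M): O(C*M), not O(N*M)
    centroid_out = attn @ V                             # (C, d)
    return centroid_out[assign]                         # each query gets its centroid's output
```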
Efficient Content-Based Sparse Attention with Routing Transformers, 2019
The reviews are very informative, https://openreview.net/forum?id=B1gjs6EtDr. Rejected by ICLR 2020.
This paper learns the attention sparsity in a data-driven fashion. It divides the tokens into several clusters by clustering
the sum of the queries multiplied by a projection matrix and the keys multiplied by the same projection matrix.
That is a mouthful; in math, we cluster the rows of R = Q*W_R + K*W_R, where W_R is the shared projection matrix.
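A tiny sketch of the routing idea as I read it (a one-shot nearest-centroid assignment stands in for the paper's online k-means, and the dense N*N score matrix is kept only for clarity; a real implementation gathers per-cluster blocks to get the speedup):

```python
import torch

def routing_attention(Q, K, V, W_R, n_clusters):
    """Tokens attend only within their cluster of R = Q*W_R + K*W_R (sketch)."""
    R = Q @ W_R + K @ W_R                                   # rows of R get clustered
    centroids = R[torch.randperm(R.size(0))[:n_clusters]]   # stand-in for learned/online clustering
    assign = torch.cdist(R, centroids).argmin(dim=1)        # cluster id per token
    scores = Q @ K.t() / Q.size(-1) ** 0.5                  # (N, N), dense only for clarity
    mask = assign.unsqueeze(0) != assign.unsqueeze(1)       # drop pairs from different clusters
    scores = scores.masked_fill(mask, float('-inf'))
    return torch.softmax(scores, dim=-1) @ V
```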
2. Taxonomizing
2.1 Review/Surveys
Efficient Transformers: A Survey, Sep 14 2020. A good paper. *****, five stars.
2.2 Papers to reduce the computation complexity
Fast Transformers with Clustered Attention, 2020 July
Reformer: The Efficient Transformer, 2020 Feb
Large Memory Layers with Product Keys, 2019, hot on GitHub
Generating Long Sequences with Sparse Transformers, 2019, hot on GitHub
Efficient Content-Based Sparse Attention with Routing Transformers, 2019
2.3 Papers using “weird” connections to reduce computation complexity
CCNet: Criss-Cross Attention for Semantic Segmentation, 2018, Figure 1
Axial Attention in Multidimensional Transformers, 2019
Star-Transformer, 2019
Interlaced Sparse Self-Attention for Semantic Segmentation, 2019
SCRAM: Spatially Coherent Randomized Attention Maps, 2019
Transformer on a Diet, 2020 Feb
SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection, 2020 April
2.4 Papers with better performance on long text sequences
BP-Transformer: Modelling Long-Range Context via Binary Partitioning, 2019
Compressive Transformers for Long-Range Sequence Modelling, 2019
Longformer: The Long-Document Transformer, April 2020, hot on GitHub
2.5 Applications
Jukebox: A Generative Model for Music, 2020 April, explained by Yannic
2.6 TODO papers
SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks, 2020 Jun
Equivariant Transformer Networks, 2019
Improving the Robustness of Capsule Networks to Image Affine Transformations
Capsule-Transformer for Neural Machine Translation
Visual-textual Capsule Routing for Text-based Video Segmentation
Group Feedback Capsule Network
Capsules for Object Segmentation (SegCaps)
Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks, 2019
3. Resources/credits of this article
https://www.pragmatic.ml/a-survey-of-methods-for-incorporating-long-term-context/ A blog post on many attention variants.
https://twitter.com/ilyasut/status/1111057995650072576?lang=en “Little known fact: the transformer is a close cousin of the capsule network, because soft attention is ‘routing by agreement’.”
http://www.cs.toronto.edu/~hinton/absps/object-based81.pdf A 1981 paper by Hinton related to CNNs and capsule networks.
https://www.cnblogs.com/xiximayou/p/13192378.html Some papers on improving the convolution operation.
https://github.com/cszn/KAIR
https://mp.weixin.qq.com/s?__biz=MzI5MDUyMDIxNA==&mid=2247505981&idx=3&sn=eebcfd42b1f6eaeb265c983c05f657d5&chksm=ec1c35c4db6bbcd25bee7fc27c5380da0ab502a3ffcb6767588f199c8b5d15574ebace68b18a&mpshare=1&scene=1&srcid=08287QPmzCUFem64oVejWEFA&sharer_sharetime=1598570878555&sharer_shareid=76cd298123cfe939784e14c0c20d80be&key=10b5f81a6836622319d9c3cf6c2ba88426e39fb9095f0d3aa64bf812d60c27e9492df3c2aa1eeeb1c8c5b9a7b5fd943019926b4228dd1a6e2775f9908b54c07a573c7c298dd6746d2fc3f197c7a476eadf47cafabb41924bf4fbba372f6a46ae6e1821d5c41ae116245ec10e9afe1d776b848d2bd816edadee2a01e6c7e26489&ascene=1&uin=MjAwOTEzMzgzNA%3D%3D&devicetype=Windows+10+x64&version=62090529&lang=en&exportkey=AzryKHsZgMlDLj0wVXHGWrE%3D&pass_ticket=fwwWHvq2CqMcdTKhhkZmShqwghX3ZRjk2WAlC80EsasaDDe3zwRGGRlWEFOX1Ioh&wx_header=0 A big collection of modified attention variants.