2016-CVPRW-Semantic Segmentation of Small Objects and Modeling of Uncertainty in Urban Remote Sensing Images Using Deep Convolutional Neural Networks

Semantic segmentation of remote sensing images suffers from a small-object problem, for example cars.

Overall

The data come from the ISPRS 2D Semantic Labeling Contest. The input has 5 channels, which I assume are RGB + NIR + DSM.

Patch-based Pixel Classification

Data

Classification is done on 65 × 65 pixel patches; the class of the center pixel is the class of the patch.
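The patch labeling rule above can be sketched in a few lines. This is only an illustration using plain nested lists: the function name `extract_patch`, the toy image, and the choice to reject pixels closer than 32 pixels to the border are all assumptions, not details taken from the paper.

```python
PATCH = 65           # patch side length from the notes
HALF = PATCH // 2    # 32 pixels on each side of the center pixel

def extract_patch(image, labels, row, col):
    """Return (patch, label) for the pixel at (row, col).

    image  : list of rows, each row a list of per-pixel channel tuples
    labels : list of rows of integer class ids, same size as image
    Assumption: pixels too close to the border are simply rejected.
    """
    height, width = len(image), len(image[0])
    if not (HALF <= row < height - HALF and HALF <= col < width - HALF):
        raise ValueError("patch would extend past the image border")
    patch = [r[col - HALF: col + HALF + 1]
             for r in image[row - HALF: row + HALF + 1]]
    # The patch inherits the class of its center pixel.
    return patch, labels[row][col]

# Toy 5-channel 70 x 70 image; each pixel stores its own coordinates.
image = [[(r, c, 0, 0, 0) for c in range(70)] for r in range(70)]
labels = [[(r + c) % 5 for c in range(70)] for r in range(70)]
patch, label = extract_patch(image, labels, 35, 34)
print(len(patch), len(patch[0]), label)
```

The center of the returned patch sits at index `[32][32]`, which is the pixel the label was read from.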

Conv 1: In_Channel = 5, Spatial = 5 x 5, Out_Channel = 32. Writing a layer's shape as k_w * k_h * c_in * c_out, this is 5 * 5 * 5 * 32.

Conv 2: Out_Channel = 64, 5×5×32x64

Conv 3: Out_Channel = 96, 5×5×64x96

Conv 4: Out_Channel = 128, 5×5×96x128
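The k_w * k_h * c_in * c_out notation is also the number of kernel weights in each layer, so the four shapes above can be turned into a quick size estimate. The helper name `conv_weights` is hypothetical, and biases and BN parameters are deliberately left out.

```python
def conv_weights(k_w, k_h, c_in, c_out):
    """Number of kernel weights in one convolution layer (biases excluded)."""
    return k_w * k_h * c_in * c_out

# (k_w, k_h, c_in, c_out) for Conv 1 to Conv 4 as listed in the notes.
layers = {
    "Conv 1": (5, 5, 5, 32),
    "Conv 2": (5, 5, 32, 64),
    "Conv 3": (5, 5, 64, 96),
    "Conv 4": (5, 5, 96, 128),
}

counts = {name: conv_weights(*shape) for name, shape in layers.items()}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name}: {n} weights")
print("total:", total)
```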

Conv 1 (5 × 5 × 5 x 32) + ReLU + BN + 3 × 3 max-pooling layer (stride = 1)
Conv 2 (5 × 5 × 32 x 64) + ReLU + BN + 3 × 3 max-pooling layer (stride = 1)
Conv 3 (5 × 5 × 64 x 96) + ReLU + BN + 3 × 3 max-pooling layer (stride = 1)
Conv 4 (5 × 5 × 96 x 128) + ReLU + BN + 3 × 3 max-pooling layer (stride = 1)
FC (128) + Dropout (0.5)
FC (5) + Dropout (0.5)
Softmax (5)

Note that the max-pooling stride is set to 1 to avoid down-sampling.
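To see why the pooling stride matters, the sketch below pushes the 65-pixel patch side length through four (5 × 5 conv, 3 × 3 max-pool) stages using the standard output-size formula. The padding values are assumptions, since the notes do not record them; `run_stack` is a hypothetical helper.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

def run_stack(size, pool_stride, conv_padding=2, pool_padding=1):
    """Side length after 4 x (5x5 conv with stride 1, 3x3 max-pool)."""
    for _ in range(4):
        size = out_size(size, kernel=5, stride=1, padding=conv_padding)
        size = out_size(size, kernel=3, stride=pool_stride, padding=pool_padding)
    return size

same_res = run_stack(65, pool_stride=1)          # stride-1 pooling, as in the notes
reduced = run_stack(65, pool_stride=2)           # conventional stride-2 pooling
unpadded = run_stack(65, pool_stride=1,
                     conv_padding=0, pool_padding=0)  # no padding at all

print(same_res, reduced, unpadded)
```

With padding, stride-1 pooling keeps the full 65 × 65 resolution, whereas stride-2 pooling would shrink it to 5 × 5. Without padding the map still shrinks, but only through border effects.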

Pixel-to-pixel Segmentation

Input image size: 256 x 256

Layer 1
- Conv1_1 (3 x 3, stride = 2) + ReLU + BN
- Conv1_2 (3 x 3, stride = 1) + ReLU + BN
- 2 × 2 max pooling (stride = 2)
Layer 2
- Conv2_1 (3 x 3, stride = 1) + ReLU + BN
- Conv2_2 (3 x 3, stride = 1) + ReLU + BN
- 2 × 2 max pooling (stride = 2)
Layer 3
- Conv3_1 (3 x 3, stride = 1) + ReLU + BN
- Conv3_2 (3 x 3, stride = 1) + ReLU + BN
- 2 × 2 max pooling (stride = 2)
Layer 4
- Conv4_1 (3 x 3, stride = 1) + ReLU + BN
- Conv4_2 (3 x 3, stride = 1) + ReLU + BN
- Conv4_3 (1 x 1, Out = nclass) + ReLU + BN
Trans Conv 1
Trans Conv 2
Softmax

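This network still down-samples by a factor of 16 overall: the stride-2 Conv1_1 and the three stride-2 max-pooling layers each halve the resolution, giving 2^4 = 16.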
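The factor of 16 can be checked by walking a 256-pixel side length through the encoder layers listed above. The padding values (1 for the 3 × 3 convolutions, 0 elsewhere) are assumptions made so that stride-1 layers preserve the size; they are not stated in the notes.

```python
def out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding); padding values are assumptions.
encoder = [
    ("Conv1_1", 3, 2, 1), ("Conv1_2", 3, 1, 1), ("Pool1", 2, 2, 0),
    ("Conv2_1", 3, 1, 1), ("Conv2_2", 3, 1, 1), ("Pool2", 2, 2, 0),
    ("Conv3_1", 3, 1, 1), ("Conv3_2", 3, 1, 1), ("Pool3", 2, 2, 0),
    ("Conv4_1", 3, 1, 1), ("Conv4_2", 3, 1, 1), ("Conv4_3", 1, 1, 0),
]

size = 256
total_stride = 1
for name, kernel, stride, padding in encoder:
    size = out_size(size, kernel, stride, padding)
    total_stride *= stride
    print(f"{name}: {size} x {size}")

print("overall down-sampling factor:", total_stride)
```

Under these assumptions the feature map entering the transposed convolutions is 16 × 16, so the two Trans Conv layers together have to up-sample by 16 to recover the 256 × 256 input size.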

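Data Augmentation: 50 % overlap, flips (left to right and up down), and rotations at 90 degree intervals (3 of them). How does this give 8 augmentations? Counting the unchanged image plus the left-to-right and up-down flips gives 3 variants, and rotation gives 3 angles (90, 180, 270), so I would have expected 9 in total.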
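One possible explanation for the count of 8 is that several flip and rotation combinations coincide: an up-down flip equals a left-to-right flip followed by a 180 degree rotation, so only 8 distinct images can be produced. The sketch below enumerates the combinations on a small toy tile; it is a check of this reading, not the authors' augmentation code.

```python
def rot90(img):
    """Rotate a grid (tuple of row tuples) by 90 degrees counter-clockwise."""
    return tuple(zip(*img))[::-1]

def flip_lr(img):
    """Mirror left to right."""
    return tuple(row[::-1] for row in img)

def flip_ud(img):
    """Mirror top to bottom."""
    return img[::-1]

def variants(img):
    """All images reachable via rotation by 0/90/180/270 and an optional flip."""
    rotations = [img]
    for _ in range(3):
        rotations.append(rot90(rotations[-1]))
    out = set()
    for r in rotations:
        out.update({r, flip_lr(r), flip_ud(r)})  # 12 candidates in total
    return out

tile = ((1, 2, 3), (4, 5, 6), (7, 8, 9))
distinct = variants(tile)
print(len(distinct))
```

The 12 candidates collapse to 8 distinct images, which is the size of the symmetry group of a square.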

The loss is simply a weighted cross-entropy loss:

$$ L=-\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} l_{c}^{(n)} \log \left(\hat{p}_{c}^{(n)}\right) w_{c} $$

where the weights are

$$ w_{c}=\frac{\mathrm{median}\left(\left\{f_{c} \mid c \in C\right\}\right)}{f_{c}} $$

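$f_c$ is the fraction of all pixels that belong to class $c$. This has the same effect as setting $w_c$ directly to the inverse pixel fraction $1/f_c$: the only thing the formula above adds is that the median class gets $w_c = 1$. That amounts to multiplying the loss by a constant, which rescales the gradients but does not change the minimizer, so it should not materially affect the optimization or the final result.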
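A small numeric sketch of the weighting and the constant-factor argument follows. The pixel counts, class names, and predicted probabilities are made-up toy values, and the function names are hypothetical; the code only mirrors the two formulas above.

```python
import math
from statistics import median

def median_frequency_weights(pixel_counts):
    """w_c = median(f) / f_c, where f_c is the pixel fraction of class c."""
    total = sum(pixel_counts.values())
    freq = {c: n / total for c, n in pixel_counts.items()}
    med = median(freq.values())
    return {c: med / f for c, f in freq.items()}, freq, med

def weighted_cross_entropy(probs, labels, weights):
    """Mean over pixels of -w_y * log(p_y); one-hot labels keep only the true class."""
    loss = 0.0
    for p, y in zip(probs, labels):
        loss -= weights[y] * math.log(p[y])
    return loss / len(labels)

# Hypothetical pixel counts for 5 classes.
counts = {"road": 400, "building": 300, "vegetation": 200, "tree": 90, "car": 10}
weights, freq, med = median_frequency_weights(counts)

# Plain inverse-frequency weights for comparison.
inv_weights = {c: 1.0 / f for c, f in freq.items()}

# Toy predictions: probability assigned to the true class of three pixels.
labels = ["car", "road", "tree"]
probs = [{"car": 0.3}, {"road": 0.9}, {"tree": 0.6}]

loss_median = weighted_cross_entropy(probs, labels, weights)
loss_inverse = weighted_cross_entropy(probs, labels, inv_weights)
print(weights)
print(loss_median, med * loss_inverse)
```

The two losses differ only by the factor `med`, which is the constant-scaling point made above; the rare class (here `car`) receives the largest weight either way.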

