## DCNNs应用在语义分割任务中问题

stride>1

## 文中对应的解决办法

In order to overcome this hurdle and efficiently produce denser feature maps, we remove the downsampling operator from the last few max pooling layers of DCNNs and instead upsample the filters in subsequent convolutional layers, resulting in feature maps computed at a higher sampling rate. Filter upsampling amounts to inserting holes (‘trous’ in French) between nonzero filter taps.

A standard way to deal with this is to present to the DCNN rescaled versions of the same image and then aggregate the feature or score maps. We show that this approach indeed increases the performance of our system, but comes at the cost of computing feature responses at all DCNN layers for multiple scaled versions of the input image. Instead, motivated by spatial pyramid pooling, we propose a computationally efficient scheme of resampling a given feature layer at multiple rates prior to convolution. This amounts to probing the original image with multiple filters that have complementary effective fields of view, thus capturing objects as well as useful image context at multiple scales. Rather than actually resampling features, we efficiently implement this mapping using multiple parallel atrous convolutional layers with different sampling rates; we call the proposed technique “atrous spatial pyramid pooling” (ASPP).

Our work explores an alternative approach which we show to be highly effective. In particular, we boost our model’s ability to capture fine details by employing a fully-connected Conditional Random Field (CRF) [22]. CRFs have been broadly used in semantic segmentation to combine class scores computed by multi-way classifiers with the low-level information captured by the local interactions of pixels and edges [23], [24] or superpixels [25]. Even though works of increased sophistication have been proposed to model the hierarchical dependency [26], [27], [28] and/or high-order dependencies of segments [29], [30], [31], [32], [33], we use the fully connected pairwise CRF proposed by [22] for its efficient computation, and ability to capture fine edge details while also catering for long range dependencies.

## DeepLab V2的优势

state-of-art

From a practical standpoint, the three main advantages of our DeepLab system are: (1) Speed: by virtue of atrous convolution, our dense DCNN operates at 8 FPS on an NVidia Titan X GPU, while Mean Field Inference for the fully-connected CRF requires 0.5 secs on a CPU. (2) Accuracy: we obtain state-of-art results on several challenging datasets, including the PASCAL VOC 2012 semantic segmentation benchmark [34], PASCAL-Context [35], PASCAL-Person-Part [36], and Cityscapes [37]. (3) Simplicity: our system is composed of a cascade of two very well-established modules, DCNNs and CRFs.

## Learning rate policy

l r × ( 1 − i t e r m a x _ i t e r ) p o w e r lr \times (1 – \frac{iter}{max\_iter})^{power} l r × ( 1 − m a x _ i t e r i t e r ​ ) p o w e r

l r lr l r 为初始学习率，
i t e r iter i t e r

m a x _ i t e r max\_iter m a x _ i t e r

## 消融实验

MSC表示多尺度输入，即先将图像缩放到0.5、0.7和1.0三个尺度，然后分别送入网络预测得到score maps，最后融合这三个score maps（对每个位置取三个score maps的最大值）。

Multi-scale inputs: We separately feed to the DCNN images at scale = { 0.5, 0.75, 1 } , fusing their score maps by taking the maximum response across scales for each position separately.

COCO就代表在COCO数据集上进行预训练

Models pretrained on MS-COCO.

Aug代表数据增强，这里就是对输入的图片在0.5到1.5之间随机缩放。

Data augmentation by randomly scaling the input images (from 0.5 to 1.5) during training.

LargeFOV是在DeepLab V1中讲到过的结构。
ASPP前面讲过了
CRF前面也提到过了

we evaluate how each of these factors, along with LargeFOV and atrous spatial pyramid pooling (ASPP), affects val set performance. Adopting ResNet-101 instead of VGG-16 significantly improves DeepLab performance (e.g., our simplest ResNet-101 based model attains 68.72%, compared to 65.76% of our DeepLab-LargeFOV VGG-16 based variant, both before CRF). Multiscale fusion [17] brings extra 2.55% improvement, while pretraining the model on MS-COCO gives another 2.01% gain. Data augmentation during training is effective (about 1.6% improvement). Employing LargeFOV (adding an atrous convolutional layer on top of ResNet, with 3×3 kernel and rate = 12) is beneficial (about 0.6% improvement). Further 0.8% improvement is achieved by atrous spatial pyramid pooling(ASPP). Post-processing our best model by dense CRF yields performance of 77.69%.