Cloud removal in Sentinel-2 imagery using a deep residual neural network and SAR-optical data fusion

Andrea Meraner (a,1), Patrick Ebel (a), Xiao Xiang Zhu (a,b,*), Michael Schmitt (a,*)

(a) Signal Processing in Earth Observation, Technical University of Munich, Arcisstraße 21, 80333 Munich, Germany
(b) Remote Sensing Technology Institute, German Aerospace Center (DLR), Münchener Straße 20, 82234 Weßling-Oberpfaffenhofen, Germany
(*) Corresponding authors. E-mail addresses: [email protected] (A. Meraner), [email protected] (P. Ebel), [email protected] (X.X. Zhu), [email protected] (M. Schmitt).
(1) Present address: EUMETSAT, Eumetsat Allee 1, 64295 Darmstadt, Germany.

ISPRS Journal of Photogrammetry and Remote Sensing 166 (2020) 333–346
https://doi.org/10.1016/j.isprsjprs.2020.05.013
Received 14 January 2020; Received in revised form 15 May 2020; Accepted 18 May 2020
Keywords: Cloud removal, Optical imagery, SAR-optical, Data fusion, Deep learning, Residual network

ABSTRACT

Optical remote sensing imagery is at the core of many Earth observation activities. The regular, consistent and global-scale nature of the satellite data is exploited in many applications, such as cropland monitoring, climate change assessment, land-cover and land-use classification, and disaster assessment. However, one main problem severely affects the temporal and spatial availability of surface observations, namely cloud cover. The task of removing clouds from optical images has been the subject of studies for decades. The advent of the Big Data era in satellite remote sensing opens new possibilities for tackling the problem with powerful data-driven deep learning methods.

In this paper, a deep residual neural network architecture is designed to remove clouds from multispectral Sentinel-2 imagery. SAR-optical data fusion is used to exploit the synergistic properties of the two imaging systems to guide the image reconstruction. Additionally, a novel cloud-adaptive loss is proposed to maximize the retainment of original information. The network is trained and tested on a globally sampled dataset comprising real cloudy and cloud-free images. The proposed setup allows the removal of even optically thick clouds by reconstructing an optical representation of the underlying land surface structure.
1. Introduction

1.1. Motivation

While the quality and quantity of satellite observations have dramatically increased in recent years, one common problem has persisted for remote sensing in the optical domain from the first observations until today: cloud cover. As thick clouds appear opaque in all optical frequency bands, their presence completely corrupts the reflectance signal and obstructs the view of the surface underneath. This causes considerable data gaps in both the spatial and temporal domains. For applications where consistent time series are needed, e.g. agricultural monitoring, or where a certain scene must be observed at a specific time, e.g. disaster monitoring, cloud cover represents a serious hindrance.

The problem of cloud cover becomes even more apparent when considering the amount of cloud coverage the Earth's surface experiences every day. An analysis of 12 years of observations by the Moderate Resolution Imaging Spectroradiometer (MODIS) instrument aboard the satellites Terra and Aqua showed that 67% of the Earth's surface is covered by clouds on average (King et al., 2013). Over land surfaces, the cloud fraction averages 55%, featuring distinctive seasonal patterns. Considering these cloud occlusion percentages, it becomes clear that a successful cloud removal algorithm would greatly increase the availability of useful data. The task of detecting and removing clouds from satellite images has been tackled since the beginning of Earth observation activities and is still an area of active research today. In this work, we present a deep learning model capable of removing clouds from Sentinel-2 images. The network design and the integration of additional Sentinel-1 SAR data make it robust to extensive cloud coverage. The model is trained on a large dataset containing scenes acquired globally, ensuring its general applicability to any land cover type.
1.2. Related works

The reconstruction of missing information in remote sensing data is a long-studied problem. In Shen et al. (2015), a comprehensive review
of traditional techniques is provided. In recent decades, a multitude of approaches has been proposed for the specific task of cloud removal in optical imagery. Methods that follow traditional approaches can be categorized into three major clusters, namely multispectral, multitemporal and inpainting techniques. Many methods are hybrid combinations of these categories. Multispectral approaches are applied in the case of haze and thin cirrus clouds, where optical signals are not
case of haze and thin cirrus clouds, where optical signals are not | |
completely blocked but experience partial wavelength-dependent absorption and reflection. In such cases, surface information is partly | |
present and can be restored, e.g. using mathematical (Xu et al., 2019; | |
Hu et al., 2015) or physical models (Xu et al., 2016; Lv et al., 2016). | |
Multispectral methods have the advantage of exploiting information | |
from the original scene without requiring additional data, but are | |
limited to filmy, semi-transparent clouds. Multitemporal approaches | |
restore cloudy scenes by integrating information from reference images | |
acquired with clear sky conditions (Lin et al., 2013; Li et al., 2015; | |
Ramoino et al., 2017; Ji et al., 2018). For this, multitemporal dictionary-learning techniques can also be used (Li et al., 2014). The multitemporal data may also come from different sensors on different satellites (Li et al., 2019). Multitemporal methods are the most popular as
they substitute corrupted pixels with real cloud-free observations. | |
However, problems arise when reconstructing scenes with rapidly | |
changing surface conditions (e.g. due to phenological events) because | |
of the time difference between the scene to be reconstructed and the | |
reference acquisition. Inpainting approaches fill corrupted regions by | |
exploiting surface information from clear parts of the same cloud-affected image (Meng et al., 2017). Such direct inpainting methods do not require additional images, but achieve good results only for small clouds. To mitigate this problem, the process of selecting the most suitable similar pixel to be cloned is often guided by auxiliary data, e.g. multitemporal (Cheng et al., 2014) or SAR images (Eckardt et al., 2013). Such methods deliver good results but come with increased complexity due to the requirement of additional multitemporal or multisensor data.
In parallel to traditional approaches for cloud removal, data-driven | |
methods using deep learning have been gaining attention recently. | |
Many of the problems arising from traditional algorithms can be potentially solved by the end-to-end learning of deep neural networks | |
(DNN). For example, the detection and segmentation of clouds as a | |
preliminary step is often not required, as it can be learned implicitly by | |
the networks. In the case of multisensor data fusion, the translation | |
between different sensor domains can also be learned. Moreover, DNNs | |
can be trained to cope with any type of cloud and residual atmospheric | |
conditions. A first paper exploiting the potential of DNNs for restoring | |
missing information in remote sensing imagery was published in Zhang | |
et al. (2018). The method uses a spatial–temporal-spectral convolutional neural network (CNN) to restore data gaps in Landsat TM data. In | |
the case of clouds, an additional multitemporal image of the same scene | |
is used to support the reconstruction. Recent papers have been focusing | |
on using a modern CNN architecture called conditional generative adversarial network (cGAN) (Mirza and Osindero, 2014). In Enomoto | |
et al. (2017), a cGAN is trained to remove simulated clouds from | |
Worldview-2 RGB images using NIR images as auxiliary data, while in | |
Grohnfeldt et al. (2018) a cGAN removes simulated clouds from Sentinel-2 imagery using SAR data as additional information. An evolution of the cGAN, called CycleGAN, can be used to avoid the need for paired cloudy/cloud-free images during training (Singh and Komodakis, 2018). A
different approach for generating cloud-free images is to perform a | |
direct translation from SAR to optical using cGANs (Bermudez et al., | |
2018; Bermudez et al., 2019; He and Yokoya, 2018; Fuentes Reyes | |
et al., 2019). Besides their powerful generative capabilities, cGANs can | |
suffer from training and prediction instabilities when fed with bad input | |
data (e.g. large cloud coverage), as reported in some of the referenced | |
studies and in Mescheder et al. (2018). Based on these experiences, the work presented in this paper develops a model architecture that is robust to the presence of large and optically thick clouds in the input data.

In addition to these conceptual considerations, the need for large datasets is also a prominent problem in deep learning for cloud removal. The studies cited above achieve promising results, but the datasets used are very limited and the performance is evaluated on non-independent data. An assessment of the generalization capability of the networks, i.e. their ability to remove clouds from previously unseen scenes, is therefore not directly possible. In contrast, we present and use a large dataset that is suited for a deterministic separation of images for training and testing purposes and thus provides a sound idea of how well the network will generalize to unseen Sentinel-2 data.
1.3. Paper structure

This paper is structured as follows. After this introductory section, the characteristics of the used dataset are presented in Section 2. The proposed methodology, including the designed neural network architecture and the custom loss, is explained in Section 3. The conducted experiments and obtained results are then presented in Section 4 and further discussed in Section 5. Finally, a summary and conclusions are given in Section 6.
2. Data | |
While the data-driven method proposed in this paper is of generic | |
nature and sensor-agnostic, the specific model we train and our experiments focus on satellite imagery provided by the Sentinel satellites | |
of the European Copernicus Earth observation program (Desnos et al., | |
2014), as these data are globally and freely available in a user-friendly | |
manner. | |
2.1. Sentinel-1 and Sentinel-2 missions | |
The cloud removal algorithm developed in this work is applied on | |
optical data from the Copernicus Sentinel-2 mission (Drusch et al., | |
2012). The mission provides data for risk management, land use/land | |
cover and environmental monitoring, as well as urban and terrestrial | |
mapping for humanitarian and development aid. Imagery is available | |
over all main land areas from −56° to 84° of latitude with a global | |
revisit time of 5 days at the equator. The optical payload is called Multi | |
Spectral Instrument (MSI) and comprises 13 spectral bands. Four 10 m | |
high-resolution bands are placed in the visible and NIR domain for core | |
mapping applications. Six 20 m resolution bands are used for environmental monitoring and high-level products. Three 60 m bands are | |
used for detection and correction of atmospheric effects. The swath | |
width is 290 km. | |
The SAR data used in this work originate from the Copernicus Sentinel-1 mission (Torres et al., 2012). The C-band radar instrument (5.4 GHz center frequency) on board the two constellation satellites can operate in various modes depending on the position of the satellite and the scope of the observations. The main operational mode, called Interferometric Wide Swath (IW), is used over land surfaces and features a swath of 250 km and a resolution of 5 m in range and 20 m in azimuth. The combined revisit time is 6 days. The Sentinel-1 mission was designed to provide data in all weather situations for maritime and land monitoring, emergency response, climate change and security.
2.2. SEN12MS-CR Dataset | |
The dataset presented and used in this work, called SEN12MS-CR, is | |
an evolution of the SEN12MS dataset (Schmitt et al., 2019b). SEN12MS | |
is publicly available and contains triplets of cloud-free Sentinel-2 optical images, Sentinel-1 SAR images and MODIS land cover maps. It was | |
developed for common remote sensing applications, such as scene | |
classification or semantic segmentation for land cover mapping. Using | |
the same procedure as described in the original paper, SEN12MS-CR | |
was created specifically as a dataset for training deep learning models | |
for cloud removal. | |
SEN12MS-CR contains 169 non-overlapping regions of interest (ROIs) sampled across all inhabited continents during all meteorological seasons. The scene locations are randomly drawn from two uniform distributions, namely one over all landmasses and one over urban areas only. This introduces a bias towards urban landscapes, which are often the focus of remote sensing studies and contain more complex patterns. The ROIs have an average size of approx. 5200 × 4000 px, which corresponds to 52 × 40 km ground coverage at the 10 m ground sampling distance of the pixels. Each ROI is composed of a triplet of orthorectified, geo-referenced cloudy and cloud-free Sentinel-2 images, as well as the corresponding Sentinel-1 image. All three images were acquired within the same meteorological season to limit surface changes. To assess the cloud coverage of the optical images, the cloud detector described in Schmitt et al. (2019a) was used. The cloud-free Sentinel-2 images were selected with a threshold of 10% cloud coverage, while the cloudy images have between 20% and 70% cloud coverage.

The Sentinel-2 data come from the Level-1C top-of-atmosphere reflectance product and have values in the range [0, 10,000]. All 13 original bands were included. The Sentinel-1 data come from the Level-1 GRD product acquired in IW mode with two polarization channels (VV and VH). The values are σ0 backscatter coefficients that have been transformed to dB scale.
To adapt the images for ingestion into a CNN, the ROIs were cut into small 256 × 256 px patches with a 128 px stride. The amount of overlap between neighboring patches is therefore 50%. This was chosen to maximize the number of patches extractable from an image while still ensuring acceptable independence. An automated and manual check of the generated patches was performed to eliminate mosaicking artifacts and other corrupted regions. The final quality-controlled SEN12MS-CR dataset contains 157,521 patch triplets with a total of 28 layers, amounting to around 620 GB of storage. Fig. 1 shows examples of patch triplets from the dataset.
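For illustration, a minimal numpy sketch of this tiling scheme (function and variable names are ours; the actual dataset-generation code is not part of the paper):

```python
import numpy as np

def tile_patches(roi, patch=256, stride=128):
    """Cut an ROI array of shape (H, W, C) into patch x patch tiles with 50% overlap."""
    h, w, _ = roi.shape
    tiles = [roi[r:r + patch, c:c + patch, :]
             for r in range(0, h - patch + 1, stride)
             for c in range(0, w - patch + 1, stride)]
    return np.stack(tiles)

# For an ROI of about 5200 x 4000 px this yields
# ((5200 - 256)//128 + 1) * ((4000 - 256)//128 + 1) = 39 * 30 = 1170 patches.
```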
In the deep learning-based cloud removal algorithms cited in the related works, the networks are trained on datasets with clear limitations. For example, in Enomoto et al. (2017), Grohnfeldt et al. (2018) and Zhang et al. (2018) the networks are trained exclusively on simulated clouds, created with simple Perlin noise or by manually introducing gaps into the imagery. In Singh and Komodakis (2018), a dataset of real unpaired cloudy and cloud-free Sentinel-2 images is used, which however is limited to the RGB channels and comprises only 20 cloudy and 13 cloud-free scenes. In Hu et al. (2015), a dataset of ten paired cloudy and cloud-free scenes acquired by Landsat-8 is used; however, the cloud-contaminated images contain only filmy, partly transparent clouds. To the best of the authors' knowledge, SEN12MS-CR is the first dataset used for training cloud removal networks that comprises a large and representative number of scenes sampled worldwide, with full multispectral information, containing different types of real-life clouds with their characteristic signature in all channels.

2.3. Train, validation and test datasets

To properly assess the generalization capability of a network, a training, validation and test dataset split must be performed. For this, the 169 ROIs of SEN12MS-CR were split into 149 scenes for training, 10 for validation and 10 for testing, following a random global distribution. Fig. 2 shows the spatial distribution of the ROIs. Splitting according to the ROIs, rather than the patches, ensures that the three datasets are spatially and temporally completely disparate. All three datasets contain acquisitions from all meteorological seasons. A visual and automated analysis confirmed that all three datasets also have a similar distribution of cloud types and coverage amounts. When separating the patches according to this split, the training dataset amounts to 134,907 patch triplets, the validation set to 11,921 and the test set to 10,693.
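A minimal sketch of the scene-level split described above (the seed and function name are illustrative; the paper only specifies that whole ROIs, not patches, are assigned to the three subsets):

```python
import random

def split_rois(roi_ids, n_val=10, n_test=10, seed=0):
    """Assign whole ROIs to train/val/test so that no scene is shared between subsets."""
    ids = list(roi_ids)
    random.Random(seed).shuffle(ids)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]   # 149 scenes for the 169 SEN12MS-CR ROIs
    return train, val, test

# Patches inherit the subset of the ROI they were cut from, which keeps the
# training, validation and test data spatially and temporally disparate.
```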
3. A ResNet architecture for cloud removal

3.1. ResNet principle

The deep learning model used as the backbone of this work is based on the popular ResNet architecture (He et al., 2016a). ResNets make use of shortcut connections, operations that skip some layers to shuttle the information to deeper parts of the network, acting as a direct path for information flow. In the original ResNet case, the shortcut connection performs an additive identity mapping, i.e. the input state of a residual block is added to the output of the bypassed layers.
To further understand the residual learning rationale, let H(x) be the mapping that the skipped layers are supposed to learn in a traditional plain network starting from the input x. By adding the additive skip connection, we instead let the layers explicitly learn a residual function F(x):

F(x) = H(x) − x.    (1)

This is helpful since it preconditions the task: learning a residual correction to the input has proven to be easier for current optimizers than learning the entire input–output mapping from scratch. This is especially true when the optimal mapping for a residual unit is actually close to the identity, i.e. when the network merely has to reproduce the input data in the output.
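To make the additive shortcut concrete, the following minimal Keras sketch (our illustration, not code from the paper) shows a plain residual unit in which the two stacked convolutions learn F(x) and the shortcut re-adds the input, so the block outputs F(x) + x; it assumes the input tensor already has `filters` channels.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=256, kernel=3):
    """Plain residual unit: the skipped layers learn only the correction F(x)."""
    f = layers.Conv2D(filters, kernel, padding="same", activation="relu")(x)
    f = layers.Conv2D(filters, kernel, padding="same")(f)
    return layers.Add()([x, f])  # identity shortcut: output = F(x) + x
```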
3.2. Residual learning for cloud removal | |
For the task of cloud removal, the residual skip connections of a | |
ResNet are helpful in several ways: | |
• Filmy clouds correction: Residual learning offers a clear advantage
in the presence of filmy clouds. In this case, the network has to learn only an additive correction that compensates for the thin cloud disturbance in the overcast regions. Through the band concatenation, the network is able to access both the spectral and spatial features; the still partially present ground information acts as a good preconditioning for the restoration process.
• Cloud-free parts reproduction: Due to the large field of view and the comparably small size of clouds, satellite images are typically a mixture of cloudy and cloud-free regions. Over clear-sky regions, the residual connections offer a direct path to transfer unmodified surface information directly to the output.
• Stability of prediction: A ResNet architecture for cloud removal is robust to the presence of large and optically thick clouds in the input data. Even if an input cloudy image is mostly covered by opaque clouds, the network is at least able to adequately reproduce the cloud-free sections. cGAN-based methods (e.g. Singh and Komodakis, 2018) tend to suffer from prediction instabilities or complete failures with bad input data.
• Optimized learning of deep models: The high representational capacity given by a large number of layers and filters in CNNs is required to reconstruct the signal under thick clouds, where complex structures need to be restored. The ResNet architecture allows large and deep models to be optimized comparably fast and with good performance (He et al., 2016b).
3.3. DSen2-CR model | |
The proposed model, called DSen2-CR, is based on the super-resolution Deep Sentinel-2 (DSen-2) ResNet presented in Lanaras et al. (2018), which is itself derived from the state-of-the-art single-image super-resolution EDSR network (Lim et al., 2017). Similarly to super-resolution, cloud removal can be seen as an image reconstruction task, where missing spatial and spectral information has to be integrated into the image to restore the complete information content.

Fig. 1. Example 256 × 256 px patch triplets from the SEN12MS-CR dataset. (a,d,g) are the input cloudy optical images, (b,e,h) the input SAR channels, and (c,f,i) the target cloud-free optical images. Throughout the paper, the shown optical images are enhanced true-color RGB composites of the Sentinel-2 10 m resolution B4-B3-B2 bands, and the shown SAR images are composites of the two polarization channels (G = VH, B = VV, R = 0).

To guide the reconstruction process under thick, optically impenetrable clouds, where no ground information is available, DSen2-CR leverages a SAR
image as a form of prior. For this, a Sentinel-1 image of the same scene is introduced to the network as an additional input. The SAR channels are simply concatenated to the channels of the input optical image. The highly non-linear SAR-to-optical translation, as well as the cloud detection and treatment, are learned and performed implicitly inside the network. The training is done in an end-to-end setup, and a cloud-free image of the same scene is presented to the network as a target for the loss computation. Fig. 3 shows a diagram of the DSen2-CR model and the residual block design used. In the following, further properties and peculiarities of the network are described:
• Long skip connection: An additive shortcut shuttles the input cloudy image to an addition layer right before the final output, as originally proposed in Lanaras et al. (2018). This means that the entire network learns to predict a residual map containing corrections to each pixel of the input cloudy image. In the case of a clear-sky input or filmy clouds, the predicted corrections will be minor or non-existent. Conversely, for thick clouds with a bright appearance, the corrections will be larger.
• Residual blocks: The main part of the network consists of several residual units stacked in sequence. The specific number of units B in
the network is a hyperparameter that defines the depth of the network. Each residual unit contains four layers and an addition layer for the residual connection. The four skipped layers are a 2D convolution layer with subsequent ReLU activation, a second 2D convolution layer and a final residual scaling layer (see next point). Only one ReLU activation is used, after the first convolutional layer but not after the second, since the network is supposed to predict corrections that can be both positive and negative. For both convolutional layers, 3 × 3 kernels are used, following the general community trend towards smaller kernels in deeper models (Lanaras et al., 2018). The output feature dimension F, i.e. the number of different filters, is fixed for all units and is a hyperparameter. A stride of one pixel and zero padding are always used in order to maintain the spatial dimensions of the data throughout the network. Compromising between representational capacity and computational complexity, and considering our own experiments as well as the experiences reported in Lanaras et al. (2018) and Lim et al. (2017), residual units with F = 256 features were selected as the baseline for the DSen2-CR architecture.
• Residual scaling: The residual scaling layer is a custom layer that multiplies its inputs by a constant scalar. First proposed in Szegedy et al. (2017), this scaling of the activations has the effect of stabilizing the training without introducing additional parameters, as batch normalization layers would. The value 0.1 is selected for the scaling constant in this work.
• Additional convolutions: At the beginning of the network, a concatenation layer stacks the input optical and SAR layers to enable their joint processing. After this, a 3 × 3 convolution layer with ReLU activation is introduced to process the concatenation before the data is passed through the residual blocks. After the last residual unit, a final 3 × 3 convolution restores the spectral dimensions to match the number of bands of the optical image before reaching the residual addition layer. (A schematic sketch of this architecture is given below the figure captions.)

Fig. 2. Global distribution of the 169 ROIs of the SEN12MS-CR dataset. Orange markers denote ROIs selected for training, green for validation and azure for testing. Background image credits: Google Earth/Mapmaker.

Fig. 3. Left: DSen2-CR model diagram. Right: residual block design. For each part of the network, the number of layers and the two spatial dimensions are indicated in parentheses. Since the network is fully convolutional, it can accept input images of arbitrary spatial dimension m during training and prediction. F indicates the selected feature dimension and B the selected number of residual blocks included in the network.
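A schematic Keras sketch of the architecture described in the list above (our own rendering of the Fig. 3 diagram, not the authors' code): the feature dimension F = 256 and the residual-scaling factor 0.1 are taken from the text, while the value of the number of residual blocks B is only a placeholder for the tunable hyperparameter.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

F = 256              # feature dimension of the residual units (from the text)
B = 16               # number of residual blocks: placeholder for the hyperparameter B
INIT = "he_uniform"  # He-uniform weight initialization (see Section 3.5)

def dsen2cr_block(x, scale=0.1):
    """Residual unit of Fig. 3: conv + ReLU, conv, residual scaling, additive skip."""
    f = layers.Conv2D(F, 3, padding="same", activation="relu", kernel_initializer=INIT)(x)
    f = layers.Conv2D(F, 3, padding="same", kernel_initializer=INIT)(f)  # no ReLU: corrections may be negative
    f = layers.Lambda(lambda t: t * scale)(f)                            # residual scaling by 0.1
    return layers.Add()([x, f])

def build_dsen2cr(bands_opt=13, bands_sar=2):
    opt = layers.Input((None, None, bands_opt))   # cloudy Sentinel-2 input
    sar = layers.Input((None, None, bands_sar))   # Sentinel-1 VV/VH prior
    x = layers.Concatenate()([opt, sar])          # joint SAR-optical processing
    x = layers.Conv2D(F, 3, padding="same", activation="relu", kernel_initializer=INIT)(x)
    for _ in range(B):
        x = dsen2cr_block(x)
    x = layers.Conv2D(bands_opt, 3, padding="same", kernel_initializer=INIT)(x)  # back to 13 bands
    out = layers.Add()([opt, x])                  # long skip: the network predicts a residual map
    return Model([opt, sar], out)
```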
Fig. 4. Example images showing changes in surface conditions between the input cloudy acquisitions (a,c) and the target cloud-free images (b,d), taken on a different date.
Fig. 5. Flowchart of the cloud (left stream) and shadow (right stream) detectors employed for the creation of the mask used in the CARL loss.

Several experiments on the network structure and residual block design confirmed the validity and quality of the original DSen-2 architecture. The modifications in DSen2-CR with respect to the original network include the adaptations required to accommodate the two SAR input layers used for guiding the reconstruction, the different number of input and output optical channels, and the network depth as described above.
Fig. 6. Example images showing the influence of the SAR input on an agricultural and an urban scene under heavy cloud coverage. (a,f) show the cloudy input images, (b,g) the input auxiliary SAR images and (c,h) the target cloud-free images. (d,i) are the model predictions without the SAR input, and (e,j) are the predictions of the full DSen2-CR model including the SAR input.
3.4. Cloud-adaptive regularized loss

As described in the dataset section, the input cloudy image and the target cloud-free optical image were acquired on different days, but within the same meteorological season. Although the time difference is limited, changes in the surface conditions between the images can still often be observed, especially over agricultural landscapes (see Fig. 4). Since the objective of a cloud removal algorithm is to restore ground information below clouds without modifying clear parts, it is important that as much information as possible from the input image is retained in the output. To minimize the influence of ground changes in the target image, a custom training loss was developed in this work.

Following the recommendation of Lanaras et al. (2018), the L1 metric (mean absolute error) was used as the basic error function due to its robustness to large deviations and the high dynamic range of the Sentinel-2 data. Defining the predicted output image as P and the cloud-free target image as T, the classic target loss L_T based on the simple L1 distance between prediction and target can be formulated as
Table 1
Quantitative results computed on the hold-out test dataset. Results are reported for the proposed DSen2-CR network in different configurations: trained on the proposed CARL loss, trained on the plain L1 target loss L_T, and trained on CARL and L_T but without the SAR input. In the tables, Target refers to the error computed between the predicted image and the target cloud-free image (the loss optimized when using L_T). Reprod denotes the reproduction error, namely the error between the predicted image and the clear parts of the input image; this is the part of the CARL loss that is explicitly optimized. Recon is the reconstruction error, namely the error between the predicted image and the target image inside the reconstructed cloud and shadow regions.

(a) Test results on pixel-wise metrics (MAE and RMSE in units of ρ_TOA)

Method                    | MAE Target | MAE Reprod | MAE Recon | RMSE Target | PSNR Target (dB)
DSen2-CR on CARL          | 0.0290     | 0.0204     | 0.0266    | 0.0366      | 28.7
DSen2-CR on L_T           | 0.0270     | 0.0398     | 0.0266    | 0.0343      | 29.3
DSen2-CR on CARL w/o SAR  | 0.0306     | 0.0188     | 0.0282    | 0.0387      | 27.6
DSen2-CR on L_T w/o SAR   | 0.0284     | 0.0389     | 0.0281    | 0.0361      | 28.8
pix2pix                   | 0.0292     | 0.0210     | 0.0274    | 0.0424      | 28.2

(b) Test results on spectral and structural fidelity metrics

Method                    | SAM Target (°) | SAM Reprod (°) | SAM Recon (°) | SSIM Target
DSen2-CR on CARL          | 8.15           | 3.94           | 8.04          | 0.875
DSen2-CR on L_T           | 8.07           | 6.33           | 8.13          | 0.878
DSen2-CR on CARL w/o SAR  | 8.98           | 3.86           | 8.97          | 0.870
DSen2-CR on L_T w/o SAR   | 8.97           | 6.17           | 9.05          | 0.873
pix2pix                   | 13.68          | 13.93          | 12.67         | 0.844
L_T = ‖P − T‖_1 / N_tot ,    (2)
with N_tot being the total number of pixels in all channels of the optical images. Optimization on this plain L1 loss is simple and straightforward, but it has a drawback: the network is induced to learn, predict and apply unwanted surface changes, because it is trained on multitemporal data with changing ground conditions. To reduce these artifacts, a novel loss principle was developed. The idea is to incorporate a binary cloud and cloud-shadow mask (CSM) into the loss computation, and to use this information to steer the learning process towards a maximized retainment of input information. This custom loss, which we call Cloud-Adaptive Regularized Loss (L_CARL), is formulated as
L_CARL = ‖CSM ⊙ (P − T) + (1 − CSM) ⊙ (P − I)‖_1 / N_tot + λ ‖P − T‖_1 / N_tot ,    (3)

where the first summand is the cloud-adaptive part and the second summand, scaled by λ, is the target regularization part.
Here P, T and I denote respectively the predicted, target and input optical images. The CSM has the same spatial dimensions as the images, with pixel value 1 for cloud and shadow pixels and 0 for uncorrupted pixels; 1 denotes a matrix of ones with the same spatial dimensions as the images and the CSM. The multiplications marked with ⊙ between the CSM and the image differences are element-wise and applied over all channels. In the cloud-adaptive part, the mean absolute error is computed w.r.t. the target image for cloudy or shadowed pixels of the input image, and w.r.t. the input image itself for clear-sky pixels. With this, the network learns that it shall optimize the predictions to match the cloud-free parts of the input, and use the multitemporal information only where needed, i.e. for the cloud and shadow reconstruction. However, when training with this cloud-adaptive part
Fig. 7. Example images showing the influence of the CARL loss on two agricultural scenes. (a,c) are the input images and (b,d) the target images. (e,g) are the predictions obtained by training the DSen2-CR model on the plain L_T, and (i,k) are the predictions obtained using L_CARL. (f,h) and (j,l) are the respective reproduction error maps in units of top-of-atmosphere radiance. The areas within the cloud and cloud-shadow mask (CSM) are depicted in black.
only, it was observed that the network introduced artifacts in the predicted images due to an overly precise learning of the mask. To avoid this effect, an additional target regularization term in the form of a classic mean absolute error loss between prediction and target (equivalent to L_T in Eq. (2)) was added to the loss function. This additional term induces the network to produce images that still have a natural, smooth appearance similar to the target image. The regularization factor λ, which scales this target regularization term in Eq. (3), is a hyperparameter that effectively balances input information retainment against prediction artifacts. After extensive tuning, the value λ = 1 was found to provide the best trade-off.
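A minimal TensorFlow sketch of L_CARL as given in Eq. (3). The packing of the target image T, the input image I and the CSM into a single y_true tensor is purely an implementation convenience of this illustration and is not prescribed by the paper; λ = 1 follows the tuning reported above.

```python
import tensorflow as tf

LAMBDA = 1.0  # regularization factor lambda from Eq. (3)

def carl_loss(y_true, y_pred):
    """Cloud-Adaptive Regularized Loss (Eq. 3). y_true is assumed to stack the
    cloud-free target T (13 bands), the cloudy input I (13 bands) and the
    binary cloud/shadow mask CSM (1 band) along the channel axis."""
    target = y_true[..., :13]
    inputs = y_true[..., 13:26]
    csm = y_true[..., 26:27]  # 1 for cloud/shadow pixels, 0 for clear pixels
    cloud_adaptive = csm * tf.abs(y_pred - target) + (1.0 - csm) * tf.abs(y_pred - inputs)
    target_reg = tf.abs(y_pred - target)
    # reduce_mean realizes the division by N_tot (all pixels in all channels)
    return tf.reduce_mean(cloud_adaptive) + LAMBDA * tf.reduce_mean(target_reg)
```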
The authors have found that a methodically similar context-aware loss was proposed in Li et al. (2019) in a more generic image processing context. The novelty of the described L_CARL approach resides in how a cloud and cloud-shadow mask is created and used in the context of cloud removal, with the specific intent of guiding and improving the reconstruction performance.
For the implementation of the CSM, which is needed during training, a combination of the methods proposed in Schmitt et al. (2019a) (cloud detection) and Zhai et al. (2018) (cloud-shadow detection) was used. Fig. 5 shows the flowchart of the different processing steps for the mask creation. The threshold T_CL = 0.2 for the cloud binarization was selected after a visual evaluation. The thresholds for the cloud-shadow detection were computed using the parameters T_CSI = 3/4 and T_WBI = 5/6. The threshold values were chosen in a conservative manner to reduce false negative detections. We refer to the original papers for further details on the algorithm implementations.
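A sketch of how the two detector streams of Fig. 5 could be combined into the binary CSM; the internals of both detectors are left to the cited papers, and the argument names are ours.

```python
import numpy as np

T_CL = 0.2  # cloud-probability binarization threshold (selected visually, see text)

def make_csm(cloud_prob, shadow_mask):
    """Union of the cloud and cloud-shadow detections into one binary mask.

    cloud_prob: per-pixel cloud score from the detector of Schmitt et al. (2019a).
    shadow_mask: binary output of the cloud-shadow detector of Zhai et al. (2018),
    run with the parameters T_CSI and T_WBI quoted above."""
    cloud_mask = cloud_prob >= T_CL
    return (cloud_mask | shadow_mask.astype(bool)).astype(np.float32)
```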
3.5. Preprocessing and training setup

Prior to ingestion into the network, the images are value-clipped to eliminate the small number of anomalous pixels. The clipping range for the Sentinel-2 bands is [0, 10,000]; for the Sentinel-1 VV and VH polarizations it is [−25, 0] and [−32.5, 0], respectively. For the Sentinel-2 data, a division by 2000 is further applied to all bands to ensure numerical stability (Lanaras et al., 2018). Similarly, the Sentinel-1 values are shifted into the positive domain and scaled to the range [0, 2] to approximately match the optical data distribution after scaling. As a data augmentation step, random rotations and flips are applied to the images before ingestion.
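A sketch of the clipping and scaling steps just described (array layouts and function names are assumptions of this illustration):

```python
import numpy as np

def preprocess(s2, s1_vv, s1_vh):
    """Clip and rescale one Sentinel-2/Sentinel-1 patch pair before ingestion."""
    s2 = np.clip(s2, 0, 10000) / 2000.0                     # Sentinel-2 L1C values
    vv = (np.clip(s1_vv, -25.0, 0.0) + 25.0) / 25.0 * 2.0   # dB backscatter -> [0, 2]
    vh = (np.clip(s1_vh, -32.5, 0.0) + 32.5) / 32.5 * 2.0   # dB backscatter -> [0, 2]
    return s2, np.stack([vv, vh], axis=-1)

# Random rotations and flips (the augmentation step) would be applied to the
# optical and SAR arrays jointly after this rescaling.
```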
The training framework has been implemented in the Keras open-source deep learning Python library with TensorFlow (Abadi et al., 2016) as the backend, building on the code from Lanaras et al. (2018). The models were trained on an NVIDIA DGX-1 machine containing 8 P100 GPUs.

The weights of the network were initialized using a He uniform distribution (He et al., 2015), and the biases were initialized to zero. Several tests with common optimizers showed that the Adam algorithm with integrated Nesterov momentum (Dozat, 2015) delivers the best performance. After a systematic search, the optimal learning rate was found to be 7·10⁻⁵ for a batch size of 16.
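Put together, a sketch of this training configuration, reusing the build_dsen2cr() and carl_loss() sketches from Sections 3.3 and 3.4 above (these are our illustrations, not the released training code):

```python
import tensorflow as tf

model = build_dsen2cr()                                    # sketch from Section 3.3
optimizer = tf.keras.optimizers.Nadam(learning_rate=7e-5)  # Adam with Nesterov momentum
model.compile(optimizer=optimizer, loss=carl_loss)         # CARL loss sketch from Section 3.4
# Training then proceeds with batches of 16 patch triplets; the He-uniform weight
# initialization is set directly on the convolution layers of the model sketch.
```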
4. Experiments & results | |
For a quantitative evaluation, we report the error metrics obtained | |
by evaluating the results from the entire hold-out test dataset on different network configurations in the following. The used metrics are the | |
mean absolute error (MAE) and the root-mean-square error (RMSE) in | |
units of top-of-atmosphere reflectance TOA , the peak signal-to-noise | |
ratio (PSNR) in decibel units, the spectral angle mapper (SAM) (Kruse | |
et al., 1993) in degrees, and the unitless structural similarity index | |
(SSIM) (Wang et al., 2004). The MAE, RMSE, and PSNR are popular | |
evaluation metrics for pixel-wise reconstruction quality. The SAM gives | |
a measure of the spectral fidelity of the reconstructed images, while the | |
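For reference, a numpy sketch of the pixel-wise and spectral metrics (SSIM is omitted; a standard implementation such as skimage.metrics.structural_similarity can be used, and the peak value used for the PSNR is an assumption of this sketch):

```python
import numpy as np

def evaluation_metrics(pred, target, max_val=1.0):
    """MAE, RMSE, PSNR and SAM for (H, W, C) reflectance arrays."""
    mae = np.mean(np.abs(pred - target))
    rmse = np.sqrt(np.mean((pred - target) ** 2))
    psnr = 20.0 * np.log10(max_val / rmse)
    # Spectral angle mapper: mean angle between the spectral vectors of each pixel.
    dot = np.sum(pred * target, axis=-1)
    norms = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1)
    sam = np.degrees(np.mean(np.arccos(np.clip(dot / (norms + 1e-12), -1.0, 1.0))))
    return mae, rmse, psnr, sam
```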
Fig. 8. Example images comparing the cloud removal results of our model with the pix2pix baseline network, both models receiving cloudy optical and SAR data as input. (a,f,k) show the cloudy input images, (b,g,l) the input auxiliary SAR images and (c,h,m) the target cloud-free images. (d,i,n) are the predictions of our DSen2-CR model, and (e,j,o) are the predictions of the pix2pix baseline. The results show that our model achieves higher-fidelity results, removes cloud shadows better and is less prone to artifacts.
4.1. Influence of SAR-optical data fusion

Several experiments were dedicated to verifying the usefulness of the SAR-optical data fusion setup used in DSen2-CR. For this, we performed a full network training with and without the SAR auxiliary input. In Fig. 6, example results obtained on the hold-out test dataset are visually compared. For better comparability, both networks were trained using the plain L1 loss L_T. It can clearly be seen that the results which make use of SAR-optical data fusion contain much more structure than the results relying on pure optical-to-optical image translation. Especially large structures that have regular shapes and a distinctive appearance in the SAR image, e.g. the large fields in the agricultural example scene, are correctly included in the predicted image. Complex objects, e.g. in cityscapes, are harder to integrate due to their more complicated patterns. Here, the model is able to reconstruct the scene only on a coarse scale. For example, the urban example area, with the core town and the river entering from the south, is at least roughly recognizable in the predicted image generated using the SAR information, whereas it is not reconstructed at all if no SAR data is used.

Considerations about the effectiveness of the SAR input can also be made by evaluating the test results reported in Table 1. Here, results from experimental training runs without SAR are provided alongside the full configurations. Comparing the numbers, the network with the SAR input achieves better results for most evaluated metrics. Interestingly, however, the networks without SAR achieve lower MAE and SAM reproduction errors. This indicates that the network partly integrates SAR information also when reproducing cloud-free regions of the input image. Since such artifacts have no correspondence in the original optical image, this leads to a higher reproduction error (for MAE and SAM, respectively 2% and 3% using L_T, and 9% and 2% using L_CARL). However, the benefit in terms of reconstruction error (approx. 6% for MAE and 11% for SAM for both losses) outweighs this problem, making the SAR-optical data fusion concept beneficial for the overall cloud removal task.

This also becomes clear from a qualitative analysis of the produced images. In Fig. 6, exemplary detail patches under thick cloud cover are presented. Comparing the predicted images with and without the SAR prior, the gain in structural content provided by the SAR fusion is clear.
4.2. Influence of the cloud-adaptive regularized loss

One of the main contributions of this work is the design of the so-called cloud-adaptive regularized loss L_CARL. This custom loss is cloud- and shadow-aware and introduces an optimization w.r.t. the input image, in order to retain as much information as possible from the uncorrupted input regions. To assess the effectiveness of the proposed loss, we compare the predictions of DSen2-CR models trained on L_CARL to models trained only on the plain L_T. Fig. 7 shows example images from the test dataset containing two different agricultural landscapes subject to substantial surface changes between the input and the target images. By comparing the RGB composites of the results obtained using L_T and L_CARL with the input and the target images, it becomes clear how the network optimized on L_CARL is able to optimally retain input information and limit artifact generation in the predicted images. In the left image series, for example, the blooming rapeseed fields captured in the input image are kept in a bright yellow color by L_CARL, while they are changed to green by L_T.
Fig. 9. Example results from the final setup of DSen2-CR using the L_CARL loss. (a,d,g,j) are the input cloudy images, (b,e,h,k) the predicted images, and (c,f,i,l) the target images.
The shown error maps are the pixel-wise mean absolute error between the predicted image and the cloud-free parts of the input image. In the following, we call this measure the reproduction error, i.e. the error introduced by the network while reproducing the already cloud-free parts of the input image in the prediction. A low reproduction error indicates an optimal retainment of useful input information. Moreover, it signifies low artifact generation caused by the training on multitemporal images with differing ground conditions.

Observing the reproduction error maps shown in the figure, the influence of the adaptive loss is evident, with predictions from L_T showing much higher reproduction errors in the clear-sky pixels. An evaluation of the final test results in Table 1 shows that the model trained on the L_CARL loss achieves a 49% lower MAE reproduction error and a 38% lower SAM reproduction error compared to the network optimized on L_T. The reconstruction errors of the two models are comparable, showing
Fig. 10. Left column: channel-wise normalized root-mean-square error (nRMSE), in percent, for each image shown in Fig. 9. The normalization was performed using the value range of each band. Right column: spectra of the central pixel in the respective input, predicted and target images. The point markers denote the band resolution: circles for 10 m, triangles for 20 m and squares for 60 m resolution. (a) additionally contains labels for each band following the Sentinel-2 band naming convention.
that L_CARL does not negatively affect the cloud reconstruction performance of the network while improving its information retainment capabilities. Considering these observations, we conclude that the use of L_CARL in the optimization process is beneficial for the cloud removal task. This is particularly true for agricultural areas, which exhibit phenological changes even within the limited time span between the acquisition of the cloud-affected image and the acquisition of the cloud-free target image. It may be noted, however, that using L_T naturally leads to better results in target-only metrics (here RMSE, PSNR and SSIM), since the optimization and the evaluation are performed on the same objective. This does not necessarily signify an improvement in the overall cloud removal performance, due to the artifact generation in cloud-free parts discussed above.
4.3. Comparison against baseline model

In order to compare our model against a standard baseline, we utilized the popular pix2pix architecture (Isola et al., 2017) that was likewise
adapted in previous studies on cloud removal (Grohnfeldt et al., 2018; Bermudez et al., 2018). The architecture of our baseline consists of a U-Net (Ronneberger et al., 2015) generator and a PatchGAN discriminator (Karacan et al., 2016). The generator takes 13-channel multispectral optical and dual-polarimetric SAR patches as input, both of size 256 × 256 pixels. The discriminator takes as input a concatenation of the dual-polarimetric SAR patches, the 13-channel multispectral cloudy patches and the real or generated cloud-free patches. SAR patches are clipped to values [−25, 0] and rescaled to the range [−1, 1]. Optical patches are clipped to values [0, 10,000] and rescaled to the range [−1, 1].

The network weights are initialized with a normal initialization and the biases are set to zero. The network is trained on the complete training set via ADAM (Karacan et al., 2016) (momentum 0.5) for a total of 10 epochs with the original GAN loss (Isola et al., 2017) and an L1 loss, weighted with λ_GAN = 1 and λ_L1 = 100 as in the original study (Goodfellow et al., 2014). Batch normalization (Ioffe and Szegedy, 2015) is applied to the generator. The initial N_iter_init = 5 epochs are trained at a learning rate of lr_init = 2·10⁻⁴, followed by N_iter_decay = 5 epochs with a lambda learning-rate schedule that decays lr_init by the multiplicative factor λ_decay = 1.0 − max(0, 2 + epoch − N_iter_init)/(N_iter_decay + 1), where epoch denotes the number of the current epoch. Both the quantitative results presented in Table 1 and the example images shown in Fig. 8 illustrate the superiority of our DSen2-CR approach, especially in terms of spectral and structural fidelity.
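The learning-rate schedule of the baseline, as we read the reconstructed decay factor above (the epoch indexing convention is an assumption of this sketch):

```python
def pix2pix_learning_rate(epoch, lr_init=2e-4, n_init=5, n_decay=5):
    """Lambda learning-rate schedule used for the pix2pix baseline (see text)."""
    decay = 1.0 - max(0, 2 + epoch - n_init) / (n_decay + 1)
    return lr_init * decay
```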
Fig. 11. Average of channel-wise nRMSE over all test images. | |
4.4. Application of the full model on large scenes | |
For a qualitative evaluation of the operational performance of the full DSen2-CR model trained on the L_CARL loss and including the SAR input, Fig. 9 shows a selection of large reconstructed scenes, i.e. images larger than the 256 × 256 pixel patches the model was trained and validated on. These scenes were concatenated from patches belonging to the hold-out test dataset. To assess the reconstruction performance in all optical channels, Fig. 10 shows the normalized root-mean-square errors (nRMSE) averaged over each optical channel for the pictures shown in Fig. 9. The normalized representation was chosen for better
Fig. 12. 60 m resolution channels (B1, B9, B10) for the second image in Fig. 9. Left column: input image. Central column: prediction. Right column: target image.
interpretability, since the absolute RMSE spectra have been observed to correlate with the reflectance spectra. Additionally, in Fig. 10 we also show the spectra of the central pixel of each image. To assess the overall band-wise reconstruction quality, averages of the band-wise normalized RMSE over all test images are shown in Fig. 11. It can be seen that the channels with the overall worst reconstruction quality are B10, followed by B9 and B1 – all of which observe the atmosphere rather than the land surface (see Fig. 12).

Therefore, the performance of the model in reconstructing ground information even below large and thick clouds can still be appreciated on a large scale. The central pixel of the last image (Fig. 10h) is a cloudy pixel, which can be recognized by the high reflectance values of the input image. Here it can be seen that the model successfully reconstructs the entire cloud-free pixel spectrum. For the third image (Fig. 10f) the reconstructed spectrum is also very close to the target, while for the first two images (Figs. 10d and 10b) the reconstruction lies between input and target, either due to prediction inaccuracy or due to the partial retainment of input information induced by the L_CARL loss.
5. Discussion
As the results summarized in Section 4 show, the DSen2-CR network is generally capable of removing clouds from Sentinel-2 imagery. This is not limited to a purely visual RGB representation of the declouded input image, but includes the reconstruction of the whole pixel spectrum with an average normalized RMSE between 2% and 20%, depending on the band. It should be noted, however, that the worst reconstruction results are obtained for the 60 m bands, which are not meant to observe the surface of the Earth, but rather the atmosphere: B10, which shows the worst normalized RMSE values, is dedicated to the measurement of cirrus clouds at a short-wave infrared wavelength; B9 is dedicated to measuring water vapor; and B1 is supposed to deliver information about coastal aerosols (cf. Fig. 11). Since the auxiliary SAR image uses a C-band signal with a much longer wavelength, it is not affected by those atmospheric parameters at all and only provides information about the geometrical structure of the Earth's surface. This, of course, distorts the reconstruction of the atmosphere-related Sentinel-2 bands, as can be seen in Fig. 11. However, most classical Earth observation tasks that benefit from a cloud-removal pre-processing step do not employ those bands anyway and restrict their analyses to the 10 m and 20 m bands, which provide actual measurements of the Earth's surface. Thus, the inclusion of the auxiliary SAR image can definitely be deemed helpful, which is also confirmed by the numerical results listed in Table 1 and the qualitative examples shown in Fig. 6. The overall best result with respect to pure numbers is achieved when the classic loss L_T and SAR-optical data fusion are used. The new cloud-adaptive loss L_CARL, however, leads to a much better retainment of the original input and introduces fewer image translation artifacts, which are usually caused by training on images with a temporal offset. In summary, the combination of SAR-optical data fusion and the cloud-adaptive loss L_CARL provides the results that generalize best to different situations and also provides reliable cloud removal for both rather thick clouds and vegetated areas that exhibit phenological changes. In the worst case, i.e. when the scene is composed of complex patterns and the cloud cover is optically very thick, the network fails to provide a detailed and fully accurate reconstruction (cf. the urban example in Fig. 6). It has to be stressed again, however, that the dataset used for training the DSen2-CR model is globally sampled, which means that the network needs to learn a highly complex mapping from SAR to optical imagery for virtually every existing land cover type. By restricting the dataset or fine-tuning the model to a specific region or land cover type, it is expected that the SAR-to-optical translation results would improve significantly.
6. Summary and conclusion

In this paper, we have presented a deep residual neural network for cloud removal in single-temporal Sentinel-2 satellite imagery. The main features of the proposed approach are threefold: First, we have incorporated a data fusion strategy into the cloud removal process in order to provide additional information about the surface characteristics of the target scene based on Sentinel-1 SAR imagery. Second, we have proposed a cloud-adaptive loss to circumvent the problem that cloud-affected and cloud-free training images can never be acquired at the same time. Finally, we have trained our model on a dataset sampled across the globe and over all meteorological seasons. Based on a deterministic split of training and test data, our experiments confirm the generic applicability of the final cloud-removal model. Both qualitative and quantitative results show that both the SAR-optical data fusion component and the cloud-adaptive training loss help significantly to predict reasonable cloud-free image content. In many cases, the pixel spectra are also improved. Due to the free availability of both Sentinel-2 and Sentinel-1 satellite imagery for all regions of the Earth, it is expected that the presented cloud-removal approach will be beneficial to a more temporally seamless monitoring of our environment.

Declaration of Competing Interest

None.
Acknowledgments

This work was partially supported by the Federal Ministry for Economic Affairs and Energy of Germany in the project "AI4Sentinels – Deep Learning for the Enrichment of Sentinel Satellite Imagery" (FKZ 50EE1910). The work of X. Zhu is jointly supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. ERC-2016-StG-714087, Acronym: So2Sat), the Helmholtz Artificial Intelligence Cooperation Unit (HAICU) – Local Unit "Munich Unit @Aeronautics, Space and Transport (MASTr)" and the Helmholtz Excellent Professorship "Data Science in Earth Observation – Big Data Fusion for Urban Research".