Recently, the semantic classification (SC) algorithm for remote sensing images (RSI) has been greatly improved by deep learning (DL) techniques, e.g., deep convolutional neural networks (CNNs). However, too many methods employ complex procedures (e.g., multi-stages), excessive hardware budgets (e.g., multi-models), and an extreme reliance on domain knowledge (e.g., handcrafted features) for the pure purpose of improving accuracy. It obviously goes against the superiority of DL, i.e., simplicity and automation. Meanwhile, these algorithms come with unnecessarily expensive overhead on parameters and hardware costs. As a solution, the author proposed a fast and simple training algorithm based on the smallest architecture of EfficientNet version 2, which is called FST-EfficientNet. The approach employs a routine transfer learning strategy and has fast training characteristics. It outperforms all the former methods by a 0.8%–2.7% increase in accuracy. It does, however, use a higher testing resolution of 5122 and 6002, which results in high consumption of graphics processing units (GPUs). As an upgrade option, the author proposes a novel and more efficient method named FST-EfficientNetV2 as the successor. The new algorithm still employs a routine transfer learning strategy and maintains fast training characteristics. But a set of crucial algorithmic tweaks and hyperparameter re-optimizations have been updated. As a result, it achieves a noticeable increase in accuracy of 0.3%–1.1% over its predecessor. More importantly, the algorithm’s GPU costs are reduced by 75%–81%, with a significant reduction in training time costs of 60%–80%. The results demonstrate that an efficient training optimization strategy can significantly boost the CNN algorithm's performance for RSI-SC. More crucially, the results prove that the distribution shift introduced by data augmentation (DA) techniques is vital to the method’s performance for RSI-SC, which has been ignored to date. These findings may help us gain a correct understanding of the CNN algorithm for RSI-SC.

Remote sensing is the primary method for humans to observe the Earth from space, and the interpretation of RSI is crucial in data analysis [

There are different applications of DL methods for RSI, i.e., SC [

Firstly, the CNN model is only treated as a fixed deep feature extractor for traditional ML classifiers owing to the lack of large-scale RSI datasets [

The success of DL mainly results from the automatic feature extraction and the end-to-end training procedure. It replaces the labor-intensive process of feature engineering and reduces the reliance of traditional ML on domain-expert knowledge. Moreover, an abundance of datasets and computational resources for DL is vital, too. Now, computation acceleration in DL is commonly implemented through GPUs, where accessible hardware (e.g., video memory) is always fixed and expensive. To some extent, algorithms do not need to consider hardware budgets for academic purposes. Conversely, algorithms with higher GPU costs, e.g., the ensemble and other combinations of CNNs, are not suitable for routine tasks, especially for industrial implementations. Meanwhile, the fusion types still have too much domain-specific experience, which limits the development of DL. Furthermore, all the aforementioned methods are not completely end-to-end, which corresponds to the lack of an automation advantage. In conclusion, all the previous methods have lost the key advantages of DL, i.e., automation and simplicity.

There is a way to achieve higher accuracy with a simpler procedure and a controllable hardware cost. As the author's prior study concluded [^{2} and 600^{2}, which limit industrial implementations within a restricted hardware budget. Secondly, its performance can be noticeably improved with very limited growth in GPU budgets if some algorithmic modifications and hyper-parameter re-optimizations are made.

Motivated by the above ideas, the author proposes a novel CNN-based framework named FST-EfficientNetV2. The algorithm has fast, simple, and more lightweight characteristics. It is more efficient in training and has better testing accuracy than its predecessor. Meanwhile, the time and hardware costs of the GPU are reduced by 4–5 times compared to its predecessor. In addition, the automation, simplicity, and lower reliance on specific knowledge of its predecessor are retained. That is, the FST-EfficientNetV2 is still a standard and simple transfer learning strategy, with the base EfficientNetV2-S model unchanged [

Firstly, FST-EfficientNetV2 is a novel CNN-based approach that achieves SOTA performance with a 0.3%–1.1% increase in accuracy over its predecessor. The algorithm still employs a routine transfer learning strategy. The model has fewer parameters, at about 22 megabytes (M) with 8.8 billion (B) floating-point operations (FLOPs), and the whole algorithm flowchart is understandable.

Secondly, FST-EfficientNetV2 is more efficient than its predecessor, with higher accuracy and 4–5 times lower GPU and time costs. In particular, the FST-EfficientNetV2 method has achieved an amazing performance with a 1.7%–6.2% increase in accuracy over the other previous SOTA approaches for RSI-SC. Furthermore, the whole framework nearly requires no domain-expert knowledge in remote sensing. This novel algorithm is more suitable for industrial implementations within a restricted hardware budget.

Finally, and most importantly, the FST-EfficientNetV2 practice shares a novel idea for improving the performance and efficiency of CNN methods for RSI-SSC. On the one hand, the results suggest that it is unnecessary to give up the advantages of DL (i.e., simplicity and automation) for the pure purpose of higher accuracy. On the other hand, the author argues that the distribution shift introduced by DA is vital to the method’s performance. It should not be ignored in our feature work.

The remainder of this paper is organized as follows: Section 2 gives a brief review of related works. Section 3 describes the proposed method in detail. Section 4 introduces experimental designs and results. Experimental results are discussed in Section 5. Section 6 presents a conclusion and future work.

Firstly, the fusion of deep CNN features with handcrafted ones is the dominant technical route, which makes the algorithms very complex. Liu et al. fuse deep features of CNNs with handcrafted ones (i.e., the Wasserstein distance, WD) for SC tasks [

Secondly, multi-models or ensembles of CNNs have been widely tested for the RSI-SC accuracy contest. Zhang et al. propose a cascade algorithm that employs two CNNs (i.e., a VGGNet-16 and an Inception-V3) as feature extractors for the feed into a capsule network (CapsNet) [

Ample data is essential for DL-based algorithms. However, acquiring a larger-scale dataset for specific-domain tasks is always difficult. Hence, DA and transfer learning strategies are commonly employed in the training optimization process of DL [

As a pioneer, Howard proposes a “progressive resizing” method by increasing image size in successive training epochs [

Similarly, Hoffer et al. propose a “mix-and-match” method by using stochastic images and batch sizes through random sampling to improve the training speed [

Finally, Touvron et al. demonstrate the existence of a distribution shift between training and testing data, which comes from the random size crop (RSC) transformation, and then propose the “FixRes” method as a solution [

Additionally, some of the latest ideals concerned with CNN’s training optimization have been demonstrated in the “ConvNeXt” work, which can systematically boost the model’s accuracy in image classification [

Methods | Published year | Advantage | Disadvantage |
---|---|---|---|

WD [ |
2018 | Early exploration | Complex handcrafted features |

HWD [ |
2018 | Early exploration | Complex handcrafted features |

ADSSM [ |
2018 | Early exploration | Complex handcrafted features |

CapsNet [ |
2018 | Early exploration | Redundant multi-models |

MCNN [ |
2018 | Early exploration | Redundant multi-models |

ADFF [ |
2019 | Early exploration | Redundant multi-models |

Hydra [ |
2019 | Early exploration | Redundant fourteen CNN-models |

FST-EfficientNet [ |
2022 | Lower training time cost | Higher testing hardware costs |

Progressive learning [ |
2014 | Lower training time cost | Poor accuracy |

Progressive resizing [ |
2021 | Data distribution solved | Higher training time cost |

Mix-and-match [ |
2019 | Lower training time cost | Data distribution unsolved |

FixRes [ |
2020 | Competitive accuracy | Data distribution unsolved |

ConvNeXt [ |
2022 | Competitive accuracy | More parameters (29 M) |

Motivated by all the above ideas, the author first proposed the former STOA FST-EfficientNet for RSI-SC in the DL community. In more detail, a fixed empirical ratio of image size is uniformly employed during the training and testing phases for all different testing resolutions. However, this one-ratio-for-all strategy is not ideal somehow. To be precise, the optimal size of training images transformed by the RSC should be re-determined if the testing resolution changes. Meanwhile, the FST-EfficientNet employs bigger testing resolutions of 500^{2} and 600^{2}, which correspond to higher GPU budgets and training time costs. To be honest, the testing resolution of 256^{2} should be more common and less expensive in terms of routine GPU budgets.

Therefore, in this paper, the author proposes a novel algorithm for acquiring the empirical ratio more accurately. First of all, the Adam-W algorithm and LS technique are also employed in the optimization strategy for hyperparameters. In more detail, previous ideals like feature fusion, multi-models, or ensembles of CNNs are totally excluded in this paper. Furthermore, the “progressive resizing,” “mix-and-match,” and “progressive learning” methods are excluded in view of their time cost and unsolved data distribution. As the ConvNeXt architectures have more parameters, the EfficientNetV2-S model is unchanged in this paper. Finally, only part of the “FixRes” ideal is employed in this paper. That is, only the RSC transformation is conducted in the DA strategies, but no layer is frozen as the RSI dataset is smaller compared to the natural image ones. As the most obvious improvement, a customized empirical ratio for the RSI dataset is included in this paper. As a result, this study achieves a lightweight framework that is both efficient and easy to implement.

The proposed FST-EfficientNetV2 is a one-stage and end-to-end framework that uses a classical transfer-learning strategy without any handcrafted features. At the start of the pipeline, data augmentation consisting of routine transformations is applied to the input image, and then the transformed image is fed into the base model (i.e., the EfficientNetV2-S). In 2021, the EfficientNetV2 architectures outperformed the competition on ImageNet2012. The EfficientNetV2-S is the baseline model, and its architecture consists of two different conv blocks, i.e., the MobileNet conv (MBC) [

The framework of FST-EfficientNetV2 and the structural differences between MBC and FMBC are shown in

As shown on the top of

Algorithm 1 presents the procedures for acquiring the empirical ratio, at which the images generated by the RSC have a more similar distribution to the original ones. It is well known that an image with a higher resolution has more information. Since the original resolutions are varied, the characteristics of the data distribution in every RSI dataset are different. Meanwhile, the “FixRes” study also proves that the empirical ratio results for different testing sizes are different even for the same dataset. To solve this problem, the empirical ratios of FST-EfficientNetV2 have been re-determined for all testing resolutions in every dataset.

Additionally, some hyperparameters are reset for the FST-EfficientNetV2 training strategy. Firstly, the initial learning rate (LR) is reduced by 10 times. The former LR of 0.001 is slightly large for transfer learning. In transfer learning, the iterative process is typically smoother with a lower LR. Secondly, the LS technique is employed for the cross-entropy loss function. The LS value is now 0.1 instead of the former zero. Thirdly, the Adam-W is used instead of the former stochastic gradient descent as the optimizer of the error back-propagation algorithm. The weight decay value is 0.01. Fourthly, a differential training epoch of 120–180 is used in Step 1. The benchmark datasets have a distinct difference in the volume of samples. Commonly, the smaller ones need more iterative steps until model convergence. Finally, a training batch size of 30 is used for all testing resolutions to validate the method’s generalization. A gradient accumulation trick is employed for the training of larger images, if necessary.

The algorithm of the FST-EfficientNetV2 framework is shown in Algorithm 2. In detail, there are several updates to the algorithm compared to its predecessor. To begin, without any testing procedure or frozen layer, the Step 1 training epoch is 120. The 180 epochs of Step 1 are only used on the smaller dataset. Secondly, the size of the training image transformed by the RSC is re-obtained and notably smaller. Thirdly, the initial learning rate is 0.0001 for Step 1 and 0.0001 for Step 2, both with cosine decay. Fourthly, the batch size is a consistent 30 compared to the former's halved one. That is, the total number of iterations is basically the same as its predecessor. Fifthly, the testing epoch is set at an empirical 240. Extensive, repeated experiments have shown the model usually achieves the best accuracy in the interval of 160 to 220. Finally, and more crucially, the testing resolution is reduced from 512 and 600 to 256. It corresponds to 3–5 times the savings in the GPU and time costs.

The DA in this study employs the same strategy as the former FST-EfficientNet, which consists of six kinds of image transformations in a cascade combination. That is, the resize is followed by the RSC, and then there is the color jitter, horizontal flip, vertical flip, and rotation in turn. The RSC is still excluded from Step 2. The implementation of data augmentation is performed on the central processing unit (CPU) via the default code in the PyTorch libraries.

There are two benchmark RSI datasets used in this study, i.e., the Aerial Image Dataset (AID) [^{2} pixels, while NWPU45D has 45 subclasses and 31,500 images with a fixed resolution of 256^{2} pixels. Additionally, AID has 220–420 images per subclass, while NWPU45D has 700 images per subclass. The typical samples from the two datasets are shown in

AID and NWPU45D have been widely used as benchmarks in previous studies. The overall accuracy (OA) and confusion matrix are commonly used as criteria for methods' performance evaluation. In detail, the OA is as described in

The OA is defined as the total number of accurately classified samples (

The confusion matrix is a specific table of the detailed classification result, which visualizes the algorithm's performance. Each row of the matrix represents the sample in an actual class, while each column represents the sample in a predicted class. That is, for each element

Additionally, the training ratios of the two datasets are the same with FST-EfficientNet, i.e., 20% and 50% for AID and 10% and 20% for NWPU45D. The training and testing subsets are completely the same, too.

The experiments were performed on three personal computers equipped with an AMD 5700X CPU, a RTX 2060 GPU with 12 gigabytes (GB) of video memory, and 32 GB of system memory, running PyTorch 1.11.0 with the Compute Unified Device Architecture 11.5 on Win10. The base model is pre-trained on ImageNet 2012. All the experimental results were also averaged over three runs.

The OA results from different RSC sizes of AID are shown in ^{2}) when the RSC size is above 208. However, the result in Step 1 is not a guarantee of better accuracy in Step 2. As shown in ^{2}. The results prove the “one-size-fits-all” strategy used in the author’s prior study has led to a suboptimal starting point for Step 2, except for the testing resolution of 384^{2}. Therefore, the empirical ratios (i.e., the RSC size for Step 1) of AID are re-determined in the interval of 96–240. Finally, the RSC size is reset to 176 for the testing resolution of 256^{2}.

In addition, the empirical ratios of NWPU45D are similar to the AID results, and the experiential figure is not presented to save space. The RSC size for Step 1 of NWPU45D is reset at 176 for the testing resolution of 256^{2}, too. Generally, as the author concludes in FST-EfficientNet, a minor improvement in accuracy is observed when the testing resolution is increased above 256^{2}. Therefore, the FST-EfficientNetV2 is only designed for the testing resolution of 256^{2} on AID and NWPU45D. In other words, the testing resolution is reduced from 600^{2} to 256^{2} on AID, which is an 81.8% decrease in GPU budget. On the other hand, a significant 75% decrease is shown on NWPU45D for the reduction of testing resolution from 512^{2} to 256^{2}.

The OA results of the FST-EfficientNetV2 and other methods on AID are shown in ^{2} and 600^{2} are listed for the FST-EfficientNetV2 and its predecessor. In the column of training ratio (20%), the OA of FST-EfficientNetV2 shows a noticeable advantage over other methods with a 3.1% increase. If compared to its predecessor, FST-EfficientNetV2 still shows a remarkable increase in OA of 0.3%–0.9%. In the column of training ratio (50%), the OA of FST-EfficientNetV2 shows a noticeable advantage over other methods with a 0.9%–6.2% increase. However, FST-EfficientNetV2 only shows a minor increase of 0.6% over its predecessor at the testing resolution of 256^{2}. Nonetheless, when the following factors are considered, the FST-EfficientNetV2 does outperform its predecessor:

Methods | Training ratio | ||
---|---|---|---|

20% | 50 (%) | ||

WD [ |
– | 97.24 ± 0.32 | |

HWD [ |
– | 96.98 ± 0.33 | |

CapsNet [ |
93.79 ± 0.13 | 96.32 ± 0.12 | |

MCNN [ |
– | 91.80 ± 0.22 | |

ADFF [ |
93.68 ± 0.29 | 94.75 ± 0.24 | |

FST-EfficientNet [ |
256^{2} |
95.74 ± 0.02 | 97.18 ± 0.20 |

600^{2} |
96.37 ± 0.03 | 98.01 ± 0.22 | |

^{2} |

Firstly, there is always artificial noise and a covariate shift in the dataset. The OA of 97.79% of FST-EfficientNetV2 is very close to the 98.01% of its predecessor at the testing ratio of 50%. Secondly, the FST-EfficientNetV2 employs a noticeably smaller resolution of 256^{2} than its predecessor, which is one of 600^{2}. The FST-EfficientNetV2 consumes 81.8% less GPU memory, and its training is five times faster than that of its predecessor. More crucially, even if tested at a resolution of 256^{2}, the FST-EfficientNetV2 still performs better with fewer training samples than its predecessor, which is one of 600^{2}. It indicates that FST-EfficientNetV2 has better representation ability.

Therefore, FST-EfficientNetV2 is a novel SOTA method for the classification of AID. In terms of GPU budget and time cost, the algorithm is faster and less expensive.

The confusion matrixes of AID at 20% training ratios are shown in

The confusion matrixes of AID at 50% training ratios are shown in

In brief, the confusion results of FST-EfficientNetV2 are consistent with its predecessor and other prior studies. Unlikely, the OA results of FST-EfficientNetV2 are much better than the other prior methods. Meanwhile, the FST-EfficientNetV2 has shown a significant improvement in OA by using fewer training samples.

The OA results of the FST-EfficientNetV2 and other methods on NWPU45D are shown in ^{2} and 512^{2} are listed for the FST-EfficientNetV2 and its predecessor. In the column of 10% training ratio, the OA of FST-EfficientNetV2 shows a noticeable advantage over other methods with a 1.8%–5.2% increase. Over its predecessor, FST-EfficientNetV2 still shows a remarkable increase in OA of 0.9%–1.1%.

Methods | Training ratio | ||
---|---|---|---|

10% | 20 (%) | ||

HWD [ |
– | 93.27 ± 0.17 | |

ADSSM [ |
91.69 ± 0.22 | 94.29 ± 0.14 | |

CapsNet [ |
89.03 ± 0.21 | 92.60 ± 0.11 | |

ADFF [ |
90.58 ± 0.19 | 91.91 ± 0.23 | |

Hydra [ |
92.44 ± 0.34 | 94.51 ± 0.21 | |

FST-EfficientNet [ |
256^{2} |
93.16 ± 0.14 | 95.28 ± 0.28 |

512^{2} |
93.74 ± 0.04 | 95.60 ± 0.08 | |

^{2} |

In the column of 50% training ratio, the OA of FST-EfficientNetV2 shows a noticeable advantage over other methods with a 1.2%–3.8% increase. In particular, FST-EfficientNetV2 shows a minor increase of 0.15% over its predecessor at the testing resolution of 512^{2}. In other words, the FST-EfficientNetV2 does outperform its predecessor with a significant four-fold decrease in GPU budget. As a result, the savings in time cost are 55% for the 10% training ratio and 60% for the 20% training ratio.

Additionally, the advancement of FST-EfficientNetV2 is more remarkable if compared to the other prior methods. The ADSSM achieves an acceptable performance with complex handcrafted features. Moreover, the Hydra achieves an SOTA performance before 2020 by using an ensemble of 14 CNN models. However, these typical ideals have now been proven to be suboptimal options. That is, it is unnecessary to abandon the superiority of DL, i.e., simplicity and automation.

The confusion matrixes of NWPU45D at 10% training ratios are shown in ^{2} compared to the one in AID at 600^{2}, though there are more samples in NWPU45D. Additionally, there are too many samples that consist of features from several subclasses, even if only considering the subclass name.

The confusion matrixes of NWPU45D at 20% training ratios are shown in

In brief, the confusion results of FST-EfficientNetV2 are consistent with its predecessor and other prior studies. More importantly, the OA and confusion results of FST-EfficientNetV2 on the two benchmark datasets have shown a greatly improved performance over the other prior methods. Meanwhile, with fewer training samples, the FST-EfficientNetV2 improves more significantly. It implies better representational ability.

DA is now widely employed in DL-based methods for RSI-SC tasks. However, the problems of distribution shift arising from the indispensable technique of DL have seldom been investigated before the author’s work. More crucially, the author’s work only scratches the surface of dilemmas that have long existed in prior works. The OA results at different epochs in the interval of 60 to 240 of Step 1 on AID are shown in

Training ratio | OA (%) | |||
---|---|---|---|---|

60 epochs | 120 epochs | 180 epochs | 240 epochs | |

20% | 96.48 ± 0.12 | 96.56 ± 0.11 | 96.66 ± 0.09 | 96.21 ± 0.08 |

To the best of the author’s knowledge, these problems of distribution shift introduced by data augmentation have not been noticed in the previous DL-based works for RSI-SC to date. These issues may result in a suboptimal or unfair outcome for the previous methods.

Nevertheless, there are still some limitations in our study. Firstly, the author’s work has not been tested in the other STOA CNN architecture, and the aforementioned problems have not been proven yet. Secondly, the spatial attention mechanism has not been involved in the base model of FST-EfficientNetV2. It means an optimal successor only with some low-cost modification.

The interpretation of RSI has been a labor-intensive task owing to the explosion of on-board sensors in the past two decades. Hence, the automatic techniques for RSI’s interpretation are becoming more crucial than ever. The DL-based methods, particularly the CNN methods, have played a key role to date. However, too many CNN methods employ complex, incomprehensible, and costly strategies to achieve better accuracy for RSI-SC. In fact, it is unnecessary.

To address this problem, the author proposes a novel and efficient method named FST-EfficientNetV2. The algorithm employs a routine transfer learning strategy corresponding to fast training characteristics. Some crucial algorithmic tweaks and hyperparameter re-optimizations are updated. It achieves a noticeable increase in accuracy of 0.3%–1.1% over its predecessor. More importantly, GPU hardware costs are reduced by 75%–81%, with training time costs reduced by 60%–80%. It achieves an amazing performance with a 1.7%–6.2% increase in accuracy compared to the other former SOTA methods. More interestingly, the new algorithm only differs from its predecessor by a few algorithmic tweaks and hyper-parameter re-optimizations.

The results demonstrate that an efficient CNN method for RSI-SC can be achieved with simplicity and automation. On the other hand, this study also proves the importance of training optimization strategies for RSI-SC. More crucially, the distribution shift introduced by DA techniques has been proven to be vital to the method’s performance for RSI-SC. In other words, the data distribution shift may cause previous studies to be incorrect to some extent.

In addition, the FST-EfficientNetV2 ideals may have good generality in different CNN architectures for achieving optimal performance for RSI-SC. We will investigate all similar questions in the future.

Thanks to the anonymous reviewers for their valuable suggestions.

The author declares that he has no conflicts of interest to report regarding the present study.