Danxu Liu1,4 *, Di Wang2,4 *, Hebaixu Wang2,4 *, Haoyang Chen2,4 *, Wentao Jiang2, Yilin Cheng3,4, Haonan Guo2,4, Wei Cui1 β , Jing Zhang2,4 β .
1 Beijing Institute of Technology, 2 Wuhan University, 3 Fudan University, 4 Zhongguancun Academy.
* Equal contribution. β Corresponding authors.
Update | Abstract | Datasets | Models | Usage | Statement
2025.12.19
- The paper is posted on arXiv! (arXiv: SARMAE)
Synthetic Aperture Radar (SAR) imagery plays a critical role in all-weather, day-and-night remote sensing applications. However, existing SAR-oriented deep learning is constrained by data scarcity, while the physically grounded speckle noise in SAR imagery further hampers fine-grained semantic representation learning. To address these challenges, we propose SARMAE, a Noise-Aware Masked Autoencoder for self-supervised SAR representation learning. Specifically, we construct SAR-1M, the first million-scale SAR dataset, with additional paired optical images, to enable large-scale pre-training. Building upon this, we design Speckle-Aware Representation Enhancement (SARE), which injects SAR-specific speckle noise into masked autoencoders to facilitate noise-aware and robust representation learning. Furthermore, we introduce Semantic Anchor Representation Constraint (SARC), which leverages paired optical priors to align SAR features and ensure semantic consistency. Extensive experiments across multiple SAR datasets demonstrate that SARMAE achieves state-of-the-art performance on classification, detection, and segmentation tasks.
Figure 1. Overview of the SARMAE pretraining framework. The framework consists of two branches: (i) a SAR branch following the MAE architecture with Speckle-Aware Representation Enhancement (SARE) to handle inherent speckle noise, and (ii) an optical branch using a frozen DINOv3 encoder. For paired SAR-optical data, Semantic Anchor Representation Constraint (SARC) aligns SAR features with semantic-rich optical representations. Unpaired SAR images are processed solely through the SAR branch.
Figure 2. The organization of data sources in SAR-1M.
Coming Soon.
Coming Soon.
Figure 3. SARMAE outperforms SOTA methods on multiple datasets. 1: 40-shot; 2: 30% labeled; a: multi-class; b: water.
| Method | FUSAR-SHIP (40-shot) | FUSAR-SHIP (30% labeled) | MSTAR (40-shot) | MSTAR (30% labeled) | SAR-ACD (30% labeled) |
|---|---|---|---|---|---|
| ResNet-50 | - | 58.41 | - | 89.94 | 59.70 |
| Swin Transformer | - | 60.79 | - | 82.97 | 67.50 |
| BEiT | 59.70 | 71.13 | 40.70 | 69.75 | 79.77 |
| LoMaR | 82.70 | - | 77.00 | - | - |
| SAR-JEPA | 85.80 | - | 91.60 | - | - |
| SUMMIT | - | 71.91 | - | 98.39 | 84.25 |
| SARMAE(ViT-B) | 89.30 | 92.92 | 96.70 | 99.61 | 95.06 |
| SARMAE(ViT-L) | 90.86 | 92.80 | 97.24 | 98.92 | 95.63 |
Table 1. Performance comparison (Top-1 accuracy, %) of different methods on the target classification task.
| Method (horizontal) | SARDet-100k | SSDD | Method (oriented) | RSAR |
|---|---|---|---|---|
| ImageNet | 52.30 | 66.40 | RoI Transformer | 35.02 |
| Deformable DETR | 50.00 | 52.60 | Def. DETR | 46.62 |
| Swin Transformer | 53.80 | 40.70 | RetinaNet | 57.67 |
| ConvNeXt | 55.10 | - | ARS-DETR | 61.14 |
| CATNet | - | 64.66 | R3Det | 63.94 |
| MSFA | 56.40 | - | ReDet | 64.71 |
| SARAFE | 57.30 | 67.50 | O-RCNN | 64.82 |
| SARMAE(ViT-B) | 57.90 | 68.10 | SARMAE(ViT-B) | 66.80 |
| SARMAE(ViT-L) | 63.10 | 69.30 | SARMAE(ViT-L) | 72.20 |
Table 2. Performance comparison (mAP, %) of different methods on horizontal and oriented object detection tasks.
| Method | Industrial Area | Natural Area | Land Use | Water | Housing | Other | mIoU | Water IoU |
|---|---|---|---|---|---|---|---|---|
| FCN | 37.78 | 71.58 | 1.24 | 72.76 | 67.69 | 39.05 | 48.35 | 85.95 |
| ANN | 41.23 | 72.92 | 0.97 | 75.95 | 68.40 | 56.01 | 52.58 | 87.32 |
| PSPNet | 33.99 | 72.31 | 0.93 | 76.51 | 68.07 | 57.07 | 51.48 | 87.13 |
| DeepLab V3+ | 40.62 | 70.67 | 0.55 | 72.93 | 69.96 | 34.53 | 48.21 | 87.53 |
| PSANet | 40.70 | 69.46 | 1.33 | 69.46 | 68.75 | 32.68 | 47.14 | 86.18 |
| DANet | 39.56 | 72.00 | 1.00 | 74.95 | 67.79 | 56.28 | 39.56 | 89.29 |
| SARMAE(ViT-B) | 65.87 | 75.65 | 29.20 | 84.01 | 73.23 | 71.21 | 66.53 | 92.31 |
| SARMAE(ViT-L) | 65.84 | 78.04 | 29.47 | 87.12 | 75.22 | 69.34 | 67.51 | 93.06 |
Table 3. Performance comparison (IoU/mIoU, %) of semantic segmentation methods: per-class IoU and mIoU are reported for the multi-class setting, and IoU for the water-only setting.
If you find SARMAE helpful, please give the repo a ⭐ and cite it as follows:
@misc{liu2025sarmaemaskedautoencodersar,
title={SARMAE: Masked Autoencoder for SAR Representation Learning},
author={Danxu Liu and Di Wang and Hebaixu Wang and Haoyang Chen and Wentao Jiang and Yilin Cheng and Haonan Guo and Wei Cui and Jing Zhang},
year={2025},
eprint={2512.16635},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.16635},
}
For any other questions, please contact Danxu Liu at bit.edu.cn or gmail.com.


