Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation

Zhe Dong, Yuzhe Sun, Yanfeng Gu, Tianzhu Liu
Harbin Institute of Technology

arXiv: 2410.08613

🗓️ TODO

  • Release code and models of our methods.
  • [2024.10.11] We release RISBench, a large-scale Vision-Language Benchmark for Referring Remote Sensing Image Segmentation.

📖 Abstract

[Figure: Flowchart of CroBIM]

Given a natural language expression and a remote sensing image, the goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression. In contrast to natural scenarios, expressions in RRSIS often involve complex geospatial relationships, with target objects that vary significantly in scale and lack visual saliency, making precise segmentation more difficult. To address these challenges, we propose a novel RRSIS framework, termed the cross-modal bidirectional interaction model (CroBIM). Specifically, a context-aware prompt modulation (CAPM) module is designed to integrate spatial positional relationships and task-specific knowledge into the linguistic features, thereby enhancing the ability to capture the target object. Additionally, a language-guided feature aggregation (LGFA) module is introduced to integrate linguistic information into multi-scale visual features, incorporating an attention deficit compensation mechanism to enhance feature aggregation. Finally, a mutual-interaction decoder (MID) is designed to enhance cross-modal feature alignment through cascaded bidirectional cross-attention, thereby enabling precise segmentation mask prediction. To further foster research on RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets. Extensive benchmarking on RISBench and two other prevalent datasets demonstrates the superior performance of the proposed CroBIM over existing state-of-the-art (SOTA) methods.
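
The official code has not been released yet (see the TODO above). As a rough illustration of the cascaded bidirectional cross-attention described for the mutual-interaction decoder (MID), here is a minimal PyTorch sketch; the module name MutualInteractionBlock and all shapes and hyperparameters are our own assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class MutualInteractionBlock(nn.Module):
    """One bidirectional cross-attention step: vision attends to language
    and language attends to vision (illustrative sketch, not official code)."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.vis_to_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lang_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_vis = nn.LayerNorm(dim)
        self.norm_lang = nn.LayerNorm(dim)

    def forward(self, vis, lang):
        # vis: (B, N_pixels, dim) visual tokens; lang: (B, N_words, dim) linguistic tokens
        vis2, _ = self.vis_to_lang(query=vis, key=lang, value=lang)
        lang2, _ = self.lang_to_vis(query=lang, key=vis, value=vis)
        return self.norm_vis(vis + vis2), self.norm_lang(lang + lang2)

# Cascading several such blocks progressively aligns the two modalities
# before the final mask-prediction head.
vis, lang = torch.randn(2, 32 * 32, 256), torch.randn(2, 20, 256)
for blk in [MutualInteractionBlock() for _ in range(3)]:
    vis, lang = blk(vis, lang)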

📗 Datasets

VRSBench is a Versatile Vision-Language Benchmark for Remote Sensing Image Understanding.

RISBench is a large-scale Vision-Language Benchmark for Referring Remote Sensing Image Segmentation. It comprises 52,472 high-quality image-language-label triplets. Each image is uniformly sized at 512×512 pixels, and spatial resolutions span from 0.1 m to 30 m, covering a diverse range of scales and details. The semantic labels are categorized into 26 distinct classes, each annotated with 8 attributes, facilitating comprehensive and nuanced semantic segmentation analysis.

The dataset can be downloaded from Baidu Netdisk (access code: wnxg).
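
Once downloaded, the triplets can be iterated with a standard PyTorch Dataset. The sketch below is only illustrative: the directory layout (images/, masks/) and the annotation file name expressions.json are assumptions, since the official loading code has not been released.

import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset

class ReferringSegTriplets(Dataset):
    """Iterates image-language-label triplets (hypothetical directory layout)."""
    def __init__(self, root: str):
        self.root = Path(root)
        # One record per triplet: {"image": ..., "mask": ..., "expression": ...}
        with open(self.root / "expressions.json") as f:
            self.records = json.load(f)

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        image = Image.open(self.root / "images" / rec["image"]).convert("RGB")
        mask = Image.open(self.root / "masks" / rec["mask"])  # pixel-level label
        return image, rec["expression"], mask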

🍺 Visualizations

[Figure: Flowchart of CroBIM]

❤️ Licensing Information

The dataset is released under the CC-BY-4.0 license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

📜 Citation

If you find our work helpful, please cite:

@article{dong2024cross,
  title={Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation},
  author={Dong, Zhe and Sun, Yuzhe and Gu, Yanfeng and Liu, Tianzhu},
  journal={arXiv preprint arXiv:2410.08613},
  year={2024}
}

🙏 Acknowledgement

Our RISBench dataset is built upon the VRSBench, DOTA-v2, and DIOR datasets.

We are grateful to the authors of LAVT and RMSIN for releasing their models and code as open-source contributions.
