This is the code base for "CS-Net: Conv-Simpleformer Network for agricultural image segmentation", published in Pattern Recognition.
Incomplete sensor networks and long image acquisition cycles in agriculture lead to low-quality agricultural datasets, which limits the effectiveness of traditional Transformer models. We therefore propose the Conv-Simpleformer Network (CS-Net), which combines a CNN with a Simpleformer. We devise a lightweight Simple-Attention Block (SIAB) to address the quadratic dependence of Self-Attention's computational complexity on image size: SIAB consists of a channel-by-channel computation branch and a spatial computation branch, so its computational complexity scales linearly with image size. Cascading an FFN after SIAB forms the Simpleformer block, which weakens SIAB's attention bias. In addition, the introduced CNN drives SIAB to learn abstract hidden-layer connections of high-dimensional target features, reducing the Transformer's dependence on large datasets. Finally, comparison and ablation experiments on agricultural image datasets show that CS-Net achieves the best performance and strikes a good balance between model complexity, computational efficiency, and accuracy.
Although the Self-Attention Block (SEAB) effectively encodes spatial information and extracts global information, its similarity score matrix A has shape N×N, so the computational complexity of SEAB grows quadratically with the input image size.
The element-by-element computation in SEAB is redundant, so we design an efficient and lightweight attention block called the Simple-Attention Block (SIAB). SIAB converts SEAB's element-by-element computation into channel-by-channel and spatial computation branches.
SIAB thereby solves the problem of SEAB's computational complexity growing quadratically with input image size. In semantic segmentation tasks the input image is usually relatively large. Suppose the input image size is [Ci, Hi, Wi] = [3, 512, 512], kH = kW = 3, and the input features of the third CS Module in the CS-Net encoder are [C, H, W] = [256, 128, 128]. Then the computational complexity Ω_SIAB of SIAB is 0.42 G, while the computational complexity Ω_SEAB of SEAB is 141.73 G. Ω_SEAB is about 337.5 times larger than Ω_SIAB, so SIAB significantly reduces the demand for computational resources.
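As a rough, self-contained illustration of this scaling argument (the per-pixel constants below are placeholder assumptions, not the paper's exact Ω_SEAB/Ω_SIAB formulas), the cost of a global N×N similarity matrix grows quadratically with N = H·W, while a branch that does a fixed amount of work per pixel grows only linearly:

```python
# Illustration only: generic scaling comparison, not the paper's exact
# Omega_SEAB / Omega_SIAB formulas.
def quadratic_attention_cost(C, H, W):
    N = H * W
    return N * N * C                # building/applying an N x N similarity matrix

def linear_branch_cost(C, H, W, k=3):
    N = H * W
    return N * C * (k * k + 1)      # assumed: per-channel k x k filtering + channel mixing

for side in (64, 128, 256):
    q = quadratic_attention_cost(256, side, side)
    l = linear_branch_cost(256, side, side)
    print(f"{side}x{side}: quadratic/linear cost ratio = {q / l:.0f}")
```

The ratio keeps growing as the feature map gets larger, which is the motivation for replacing the quadratic term with branch-wise computation.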
Following the design idea of the classical Transformer module, the FFN is cascaded after SIAB to form a Simpleformer block, which strengthens the modeling of the target's location information and enriches the feature representation. Furthermore, we design a CS module by cascading a CNN with a Simpleformer.
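The sketch below illustrates this cascade (SIAB followed by an FFN to form a Simpleformer, then CNN plus Simpleformer to form a CS module) in PyTorch. It is a simplified reading of the description above, not the authors' implementation: the channel-by-channel branch is assumed to be a per-channel (depthwise) convolution, the spatial branch a pointwise channel-mixing convolution, and the attention is applied as a sigmoid reweighting.

```python
# A minimal PyTorch sketch of the cascade described above; the exact SIAB
# operations are simplified assumptions, not the authors' code.
import torch
import torch.nn as nn

class SIABSketch(nn.Module):
    def __init__(self, channels, k=3):
        super().__init__()
        # channel-by-channel computation branch (assumed depthwise-style conv)
        self.channel_branch = nn.Conv2d(channels, channels, k, padding=k // 2,
                                        groups=channels)
        # spatial computation branch (assumed pointwise channel mixing)
        self.spatial_branch = nn.Conv2d(channels, channels, 1)
        self.norm = nn.GroupNorm(1, channels)   # LayerNorm-like normalization
        self.act = nn.Sigmoid()

    def forward(self, x):
        attn = self.act(self.norm(self.channel_branch(x) + self.spatial_branch(x)))
        return x * attn                          # attention-style reweighting

class SimpleformerSketch(nn.Module):
    """SIAB cascaded with a feed-forward network (FFN)."""
    def __init__(self, channels, expansion=4):
        super().__init__()
        self.siab = SIABSketch(channels)
        self.ffn = nn.Sequential(
            nn.Conv2d(channels, channels * expansion, 1),
            nn.GELU(),
            nn.Conv2d(channels * expansion, channels, 1),
        )

    def forward(self, x):
        x = x + self.siab(x)
        return x + self.ffn(x)

class CSModuleSketch(nn.Module):
    """CNN block cascaded with a Simpleformer block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.simpleformer = SimpleformerSketch(out_ch)

    def forward(self, x):
        return self.simpleformer(self.cnn(x))

if __name__ == "__main__":
    x = torch.randn(1, 3, 64, 64)
    print(CSModuleSketch(3, 32)(x).shape)        # torch.Size([1, 32, 64, 64])
```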
The overall architecture of CS-Net is shaped like the letter "V" and consists of two parts, an encoder and a decoder, which generate segmentation results end-to-end.
Through the combination of encoder and decoder, CS-Net extracts the global representation and local features of the input image while retaining detailed image features, thereby improving the accuracy and efficiency of image segmentation.
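For orientation only, a schematic V-shaped encoder-decoder skeleton is sketched below. Stage depths, channel widths, and the use of U-Net-style skip connections are illustrative assumptions and are not taken from the paper; in CS-Net each stage would be a CS module rather than the plain convolutional block used here.

```python
# Schematic V-shaped encoder-decoder skeleton; configuration values are
# assumptions for illustration, not CS-Net's actual settings.
import torch
import torch.nn as nn

def stage(in_ch, out_ch):
    # placeholder stage; CS-Net would use a CS module (CNN + Simpleformer) here
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class VShapedNet(nn.Module):
    def __init__(self, in_ch=3, num_classes=2, widths=(32, 64, 128)):
        super().__init__()
        self.enc = nn.ModuleList()
        ch = in_ch
        for w in widths:                        # encoder: downsampling path
            self.enc.append(stage(ch, w))
            ch = w
        self.pool = nn.MaxPool2d(2)
        self.dec = nn.ModuleList()
        for w in reversed(widths[:-1]):         # decoder: upsampling path
            self.dec.append(stage(ch + w, w))   # concatenated with encoder feature (skip)
            ch = w
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.head = nn.Conv2d(ch, num_classes, 1)

    def forward(self, x):
        skips = []
        for i, block in enumerate(self.enc):
            x = block(x)
            if i < len(self.enc) - 1:
                skips.append(x)
                x = self.pool(x)
        for block in self.dec:
            x = self.up(x)
            x = block(torch.cat([x, skips.pop()], dim=1))
        return self.head(x)                     # per-pixel class scores

if __name__ == "__main__":
    print(VShapedNet()(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 2, 64, 64])
```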
For more details, please read our paper.
A quantitative comparison between the proposed model and models commonly applied in agricultural semantic segmentation.
Segmentation results for all models in scenario 1. Throughput was measured on an RTX 4090 GPU.
Test | Model | Input size | #params(M) | FLOPs(G) | Throughput(Image/s) | Memory usage(M) | MIoU | PA |
---|---|---|---|---|---|---|---|---|
Scenario 1 | U-Net | 512×512 | 13.4 | 124.5 | 156.02 | 34428 | 88.7% | 97.9% |
Scenario 1 | DeepLab v3+ | 512×512 | 14.8 | 295.9 | 135.60 | 34385 | 88.9% | 97.8% |
Scenario 1 | SegNet | 512×512 | 31.9 | 174.9 | 143.61 | 34405 | 83.8% | 96.5% |
Scenario 1 | Segmenter | 512×512 | 25.0 | 409.4 | 0.37 | 37219 | 17.9% | 71.6% |
Scenario 1 | Segformer | 512×512 | 18.5 | 173.1 | 0.08 | 34058 | 42.7% | 80.4% |
Scenario 1 | CPU-Net | 512×512 | 21.8 | 132.1 | 72.63 | 34514 | 82.5% | 95.1% |
Scenario 1 | MUNet | 512×512 | 14.2 | 210.4 | 89.13 | 34313 | 88.5% | 97.7% |
Scenario 1 | CS-Net | 512×512 | 17.9 | 157.1 | 123.14 | 5017 | 89.7% | 98.1% |
Segmentation results for all models in scenario 2.
Test | Model | Input size | #params(M) | FLOPs(G) | Throughput(Image/s) | Memory usage(M) | MIoU | PA |
---|---|---|---|---|---|---|---|---|
Scenario 2 | U-Net | 800×600 | 13.4 | 227.9 | 30.72 | 35231 | 83.7% | 99.0% |
Scenario 2 | DeepLab v3+ | 800×600 | 14.8 | 541.4 | 39.24 | 35136 | 83.9% | 99.1% |
Scenario 2 | SegNet | 800×600 | 31.9 | 318.8 | 28.7 | 35112 | 80.4% | 98.8% |
Scenario 2 | Segmenter | 512×512 | 25.0 | 409.4 | 1.39 | 37215 | 48.7% | 97.5% |
Scenario 2 | Segformer | 512×512 | 18.5 | 173.1 | 0.06 | 34403 | 66.8% | 97.8% |
Scenario 2 | CPU-Net | 800×600 | 21.8 | 241.5 | 10.72 | 35264 | 83.0% | 98.9% |
Scenario 2 | MUNet | 800×600 | 14.2 | 385.1 | 14.31 | 35063 | 41.9% | 99.1% |
Scenario 2 | CS-Net | 800×600 | 17.9 | 287.5 | 21.24 | 9011 | 84.2% | 99.2% |
Segmentation results for all models in scenario 3.
Test | Model | Input size | #params(M) | FLOPs(G) | Throughput(Image/s) | Memory usage(M) | MIoU | PA |
---|---|---|---|---|---|---|---|---|
Scenario 3 | U-Net | 672×376 | 13.4 | 119.9 | 169.31 | 34394 | 84.2% | 97.5% |
Scenario 3 | DeepLab v3+ | 672×376 | 14.8 | 284.9 | 140.80 | 34352 | 86.0% | 98.3% |
Scenario 3 | SegNet | 672×376 | 31.9 | 167.4 | 142.70 | 34425 | 78.9% | 96.9% |
Scenario 3 | Segmenter | 672×376 | 25.0 | 394.6 | 5.11 | 34860 | 47.2% | 94.4% |
Scenario 3 | Segformer | 512×512 | 18.5 | 173.0 | 0.03 | 34058 | 48.5% | 94.6% |
Scenario 3 | CPU-Net | 672×376 | 21.8 | 127.0 | 74.47 | 34480 | 77.6% | 96.9% |
Scenario 3 | MUNet | 672×376 | 14.2 | 202.7 | 88.41 | 34279 | 86.0% | 97.8% |
Scenario 3 | CS-Net | 672×376 | 17.9 | 151.3 | 145.35 | 4840 | 88.7% | 98.4% |
Segmentation results for all models in scenario 4.
Test | Model | Input size | #params(M) | FLOPs(G) | Throughput(Image/s) | Memory usage(M) | MIoU | PA |
---|---|---|---|---|---|---|---|---|
Scenario 4 | U-Net | 512×512 | 13.4 | 124.5 | 153.70 | 34426 | 76.1% | 91.3% |
Scenario 4 | DeepLab v3+ | 512×512 | 14.8 | 295.9 | 139.24 | 34384 | 75.8% | 91.4% |
Scenario 4 | SegNet | 512×512 | 31.9 | 174.9 | 142.25 | 34403 | 74.8% | 91.5% |
Scenario 4 | Segmenter | 512×512 | 25.0 | 409.4 | 4.57 | 37215 | 41.8% | 83.5% |
Scenario 4 | Segformer | 512×512 | 18.5 | 173.1 | 0.08 | 34058 | 57.0% | 83.5% |
Scenario 4 | CPU-Net | 512×512 | 21.8 | 132.1 | 72.39 | 34512 | 72.9% | 89.9% |
Scenario 4 | MUNet | 512×512 | 14.2 | 210.4 | 87.66 | 34311 | 75.4% | 89.9% |
Scenario 4 | CS-Net | 512×512 | 17.9 | 157.1 | 124.71 | 5015 | 81.4% | 93.9% |
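For reference, MIoU and PA in the tables above follow the standard confusion-matrix definitions. The snippet below is a generic sketch; the authors' exact evaluation protocol (class handling, ignore labels, etc.) may differ.

```python
# Standard confusion-matrix definitions of mean IoU (MIoU) and pixel accuracy (PA).
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    mask = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_and_pa(cm):
    tp = np.diag(cm)
    pa = tp.sum() / cm.sum()                           # pixel accuracy
    iou = tp / (cm.sum(0) + cm.sum(1) - tp + 1e-10)    # per-class IoU
    return iou.mean(), pa

cm = confusion_matrix(np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0]), 2)
print(miou_and_pa(cm))   # (0.583..., 0.75)
```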
For more details, please read our paper.
Cross-dataset evaluation results for all control models in scenario 4.
Model | Input size | MIoU | PA |
---|---|---|---|
U-Net | 512×512 | 76.2% | 95.1% |
DeepLab v3+ | 512×512 | 75.2% | 94.8% |
SegNet | 512×512 | 75.4% | 95.9% |
Segmenter | 512×512 | 44.9% | 89.8% |
Segformer | 512×512 | 67.5% | 89.9% |
CPU-Net | 512×512 | 57.3% | 91.7% |
MUNet | 512×512 | 76.3% | 95.6% |
CS-Net | 512×512 | 79.3% | 96.5% |
For more details, please read our paper.
The quantitative results of the detail experiments on LayerNorm placement.
Model | Normalization strategy | MIoU |
---|---|---|
Baseline 1 | No LayerNorm | 88.3% |
Model 1-1 | Add LayerNorm to the spatial computation branch | 87.2% |
Model 1-2 | Add LayerNorm to the channel-by-channel computation branch | 89.4% |
Model 1-3 | Add LayerNorm to two branches | 89.7% |
For more details, please read our paper.
The results of the block ablation experiments.
Model | CNN | SIAB | FFN | SEAB | Num_head | Patch size | #params(M) | GFLOPs | MIoU(%) |
---|---|---|---|---|---|---|---|---|---|
Model 2-1 | √ | - | - | - | - | - | 12.4 | 119.8 | 88.2 |
Model 2-2 | - | √ | √ | - | 2 | - | 11.8 | 111.4 | 88.0 |
Model 2-3 | √ | √ | √ | - | 2 | - | 17.9 | 157.1 | 89.7 |
Model 2-4 | √ | √ | - | - | 2 | - | 13.0 | 120.2 | 87.0 |
Model 2-5 | - | - | √ | √ | 2 | 16 | 13.6 | 129.3 | 34.6 |
Model 2-6 | √ | - | √ | √ | 2 | 16 | 19.7 | 175.1 | 49.9 |
Model 2-7 | √ | - | - | √ | 2 | 16 | 14.8 | 138.2 | 59.5 |
For more details, please read our paper.
The results of the branch ablation experiments.
Model | Channel-by-channel computation branch | Spatial computation branch | #params(M) | GFLOPs | MIoU(%) |
---|---|---|---|---|---|
Model 3-1 | √ | - | 17.9 | 117.8 | 46.9 |
Model 3-2 | - | √ | 17.3 | 157.1 | 89.0 |
Baseline 3 | √ | √ | 17.9 | 157.1 | 89.7 |
- The code has been written in Python (3.9.16) and requires PyTorch (version 2.0.1).
- Install the dependencies using pip (a sample command is shown after this list).
- argparse, numpy, pillow, opencv-python, torchsummary, pathlib, matplotlib, tensorboardX, thop
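A sample install command (argparse and pathlib are part of the Python 3.9 standard library, so only the remaining packages need to be installed; install PyTorch 2.0.1 separately following the instructions at https://pytorch.org for your CUDA version):

```
pip install numpy pillow opencv-python torchsummary matplotlib tensorboardX thop
```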
You have to split your data into three folders: train/val/test. Each folder contains two sub-folders, images and masks, which hold the png files for the images and their corresponding ground truths. The default working path of this project is "Data\aug"; an example layout is shown below.
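For example, assuming the three splits live directly under the default working path, the expected layout would look like this:

```
Data\aug
├── train
│   ├── images   (png images)
│   └── masks    (png ground-truth masks)
├── val
│   ├── images
│   └── masks
└── test
    ├── images
    └── masks
```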
To train CS-Net on your dataset from scratch, run `train.py`.

To launch training on Linux (the script uses torch.distributed), run:

```
python -m torch.distributed.launch --nproc_per_node=1 train_linux.py
```

Note that train_linux.py only supports single-machine, single-GPU training.
@article{liu2024cs,
title={CS-net: Conv-simpleformer network for agricultural image segmentation},
author={Liu, Lei and Li, Guorun and Du, Yuefeng and Li, Xiaoyu and Wu, Xiuheng and Qiao, Zhi and Wang, Tianyi},
journal={Pattern Recognition},
volume={147},
pages={110140},
year={2024},
publisher={Elsevier}
}