add DBNet++ support #268

Merged
merged 3 commits into from
May 9, 2023
52 changes: 40 additions & 12 deletions configs/det/dbnet/README.md
@@ -1,13 +1,16 @@
English | [中文](README_CN.md)

# DBNet
# DBNet and DBNet++

<!--- Guideline: use url linked to abstract in ArXiv instead of PDF for fast loading. -->

> [Real-time Scene Text Detection with Differentiable Binarization](https://arxiv.org/abs/1911.08947)
> DBNet: [Real-time Scene Text Detection with Differentiable Binarization](https://arxiv.org/abs/1911.08947)
> DBNet++: [Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion](https://arxiv.org/abs/2202.10304)

## 1. Introduction

### DBNet

DBNet is a segmentation-based scene text detection method. Segmentation-based methods are gaining popularity for scene
text detection purposes as they can more accurately describe scene text of various shapes, such as curved text.
The drawback of current segmentation-based SOTA methods is the post-processing of binarization (conversion of
@@ -17,7 +20,7 @@ To eliminate the problem described above, DBNet integrates an adaptive threshold
into the architecture. DB simplifies post-processing and enhances the performance of text detection. Moreover, it can be
removed in the inference stage without sacrificing performance.[[1](#references)]

![dbnet_architecture](https://user-images.githubusercontent.com/16683750/225589619-d50c506c-e903-4f59-a316-8b62586c73a9.png)
<p align="center"><img alt="Figure 1. Overall DBNet architecture" src="https://user-images.githubusercontent.com/16683750/225589619-d50c506c-e903-4f59-a316-8b62586c73a9.png" width="800"/></p>
<p align="center"><em>Figure 1. Overall DBNet architecture</em></p>

The overall architecture of DBNet is presented in _Figure 1._ It consists of multiple stages:
@@ -32,22 +35,43 @@ The overall architecture of DBNet is presented in _Figure 1._ It consists of mul
5. The probability and threshold maps are merged into one approximate binary map by the Differentiable Binarization
module. The approximate binary map is then used to generate text bounding boxes.
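
The merging in step 5 can be sketched as follows. This is a minimal illustration of the approximate step function from the DBNet paper [[1](#references)], `B = 1 / (1 + exp(-k * (P - T)))` with amplification factor `k = 50`. The function name, the nested-list map representation and the sample values are assumptions made for this example and are not part of the MindOCR API.

```python
import math


def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Merge a probability map P and a threshold map T into an
    approximate binary map B = 1 / (1 + exp(-k * (P - T))).

    Both maps are nested lists of floats in [0, 1] with the same shape.
    """
    return [
        [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prob_map, thresh_map)
    ]


# Hypothetical 2x2 maps, for illustration only.
prob = [[0.9, 0.2], [0.5, 0.35]]
thresh = [[0.3, 0.3], [0.5, 0.3]]
approx_binary = differentiable_binarization(prob, thresh)

for row in approx_binary:
    print([round(v, 4) for v in row])
```

Pixels whose probability is well above the local threshold are pushed towards 1 and those well below towards 0, while the function stays differentiable so the threshold map can be learned jointly with the probability map.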

### DBNet++

DBNet++ is an extension of DBNet and shares its architecture. The only difference is that, instead of simply
concatenating the extracted and scaled backbone features as DBNet does, DBNet++ fuses them adaptively with the
Adaptive Scale Fusion (ASF) module (Figure 2). Fusing features of different scales adaptively improves the scale
robustness of the network and distinctly strengthens DBNet++'s ability to detect text instances of diverse
scales.[[2](#references)]

<p align="center"><img alt="Figure 2. Overall DBNet++ architecture" src="https://user-images.githubusercontent.com/16683750/236786997-13823b9c-ecaa-4bc5-8037-71299b3baffe.png" width="800"/></p>
<p align="center"><em>Figure 2. Overall DBNet++ architecture</em></p>

<p align="center"><img alt="Figure 3. Detailed architecture of the Adaptive Scale Fusion module" src="https://user-images.githubusercontent.com/16683750/236787093-c0c78d8f-e4f4-4c5e-8259-7120a14b0e31.png" width="700"/></p>
<p align="center"><em>Figure 3. Detailed architecture of the Adaptive Scale Fusion module</em></p>

ASF consists of two attention modules, stage-wise attention and spatial attention, where the latter is integrated into
the former as shown in Figure 3. The stage-wise attention module learns the weights of the feature maps of
different scales, while the spatial attention module learns the attention across the spatial dimensions. The
combination of these two modules leads to scale-robust feature fusion.
DBNet++ performs better in detecting text instances of diverse scales, especially large-scale text instances, for which
DBNet may generate inaccurate or discrete bounding boxes.
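
The weighting idea behind ASF can be sketched as below. This is a deliberately simplified illustration, not the MindOCR implementation: it covers only the final re-weighting step, uses single-channel feature maps stored as nested lists, and takes the attention logits as given inputs, whereas in the network they are produced by learned attention layers. All names and values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reweight_scales(features, attention_logits):
    """Scale each feature map by its own per-pixel attention weight.

    features:          list of N maps, each H x W (nested lists of floats)
    attention_logits:  list of N maps with the same shape; assumed to be
                       given here, learned by the network in practice
    Returns N re-weighted maps, which would then be concatenated.
    """
    fused = []
    for feat, logits in zip(features, attention_logits):
        fused.append([
            [sigmoid(a) * x for x, a in zip(f_row, a_row)]
            for f_row, a_row in zip(feat, logits)
        ])
    return fused


# Two hypothetical 1x2 feature maps from different scales.
features = [[[1.0, 2.0]], [[4.0, 8.0]]]
# Logit 0 gives weight 0.5; a large positive logit gives a weight near 1.
logits = [[[0.0, 0.0]], [[10.0, -10.0]]]
fused = reweight_scales(features, logits)
print(fused)
```

Plain concatenation, as in DBNet, corresponds to every weight being fixed at 1; with ASF the weights vary by scale and by spatial position.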

## 2. Results

### ICDAR2015

<div align="center">

| **Model** | **Context** | **Backbone** | **Pretrained** | **Recall** | **Precision** | **F-score** | **Train T.** | **Throughput** | **Recipe** | **Download** |
|-------------------|----------------|--------------|----------------|------------|---------------|-------------|--------------|----------------|-----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| DBNet (ours) | D910x1-MS2.0-G | ResNet-18 | ImageNet | 80.40% | 83.71% | 82.02% | 9.3 s/epoch | 108 img/s | [yaml](db_r18_icdar15.yaml) | [ckpt](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnet_resnet18-0c0c4cfa.ckpt) \| [mindir](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnet_resnet18-0c0c4cfa-cf46eb8b.mindir) |
| DBNet (ours) | D910x1-MS2.0-G | ResNet-50 | ImageNet | 83.53% | 86.49% | 84.99% | 12.3 s/epoch | 72 img/s | [yaml](db_r50_icdar15.yaml) | [ckpt](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnet_resnet50-c3a4aa24.ckpt) \| [mindir](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnet_resnet50-c3a4aa24-fbf95c82.mindir) |
| DBNet (PaddleOCR) | - | ResNet50_vd | SynthText | 78.72% | 86.41% | 82.38% | - | - | - | - |
| DBNet++ | D910x1-MS1.9-G | ResNet-50 | ImageNet | 82.02% | 87.38% | 84.62% | - | - | - | - |
| **Model** | **Context** | **Backbone** | **Pretrained** | **Recall** | **Precision** | **F-score** | **Train T.** | **Throughput** | **Recipe** | **Download** |
|---------------------|----------------|---------------|----------------|------------|---------------|-------------|--------------|----------------|-------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| DBNet | D910x1-MS2.0-G | ResNet-18 | ImageNet | 80.40% | 83.71% | 82.02% | 9.3 s/epoch | 108 img/s | [yaml](db_r18_icdar15.yaml) | [ckpt](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnet_resnet18-0c0c4cfa.ckpt) \| [mindir](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnet_resnet18-0c0c4cfa-cf46eb8b.mindir) |
| DBNet | D910x1-MS2.0-G | ResNet-50 | ImageNet | 83.53% | 86.49% | 84.99% | 12.3 s/epoch | 72 img/s | [yaml](db_r50_icdar15.yaml) | [ckpt](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnet_resnet50-c3a4aa24.ckpt) \| [mindir](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnet_resnet50-c3a4aa24-fbf95c82.mindir) |
| DBNet (PaddleOCR) | - | ResNet50_vd | SynthText | 78.72% | 86.41% | 82.38% | - | - | - | - |
| | | | | | | | | | | |
| DBNet++ | D910x1-MS2.0-G | ResNet-50 | SynthText | 85.56% | 87.67% | 86.60% | 17.7 s/epoch | 56 img/s | [yaml](db++_r50_icdar15.yaml) | [ckpt](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnetpp_resnet50-068166c2.ckpt) \| [mindir](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnetpp_resnet50-068166c2-76fcb451.mindir) |
| DBNet++ (PaddleOCR) | - | ResNet-50_DCN | SynthText | 82.66% | 90.89% | 86.58% | - | - | - | - |
</div>
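
As a quick consistency check, the F-score column can be recomputed from the recall and precision columns as their harmonic mean. The snippet below is a small sketch that assumes this standard definition; the helper name `f_score` is chosen for this example, and the input values are copied from the table above.

```python
def f_score(recall, precision):
    """Harmonic mean of recall and precision (both in the same unit)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Recall and precision (in %) copied from the ICDAR2015 table.
rows = {
    "DBNet ResNet-18": (80.40, 83.71),
    "DBNet++ ResNet-50": (85.56, 87.67),
}

for name, (recall, precision) in rows.items():
    print(f"{name}: F-score = {f_score(recall, precision):.2f}%")
```

Because the table reports rounded recall and precision, a recomputed F-score can differ from the reported one in the last digit for some rows.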

> More information about DBNet++ is coming soon. The only difference between _DBNet_ and _DBNet++_ is the _Adaptive Scale Fusion_ module, which is controlled by the `use_asf` parameter of the `neck` module in the yaml config file.


#### Notes
- Context: Training context denoted as {device}x{pieces}-{MS version}{MS mode}, where the MindSpore mode can be G (graph mode) or F (pynative mode with ms function). For example, D910x8-G is for training on 8 pieces of Ascend 910 NPU using graph mode.
@@ -171,4 +195,8 @@ python tools/eval.py -c=configs/det/dbnet/db_r50_icdar15.yaml

<!--- Guideline: Citation format GB/T 7714 is suggested. -->

[1] Minghui Liao, Zhaoyi Wan, Cong Yao, Kai Chen, Xiang Bai. Real-time Scene Text Detection with Differentiable Binarization. arXiv:1911.08947, 2019
[1] Minghui Liao, Zhaoyi Wan, Cong Yao, Kai Chen, Xiang Bai. Real-time Scene Text Detection with Differentiable
Binarization. arXiv:1911.08947, 2019

[2] Minghui Liao, Zhisheng Zou, Zhaoyi Wan, Cong Yao, Xiang Bai. Real-Time Scene Text Detection with Differentiable
Binarization and Adaptive Scale Fusion. arXiv:2202.10304, 2022
17 changes: 9 additions & 8 deletions configs/det/dbnet/README_CN.md
@@ -12,7 +12,7 @@ DBNet is a segmentation-based scene text detection algorithm. In scene text detection

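To avoid the problem described above, DBNet integrates an adaptive threshold called Differentiable Binarization into the network architecture. Differentiable binarization simplifies post-processing and improves text detection performance. Moreover, removing it in the inference stage does not degrade performance.[[1](#references)]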

![dbnet_architecture](https://user-images.githubusercontent.com/16683750/225589619-d50c506c-e903-4f59-a316-8b62586c73a9.png)
<p align="center"><img alt="Figure 1. Overall DBNet architecture" src="https://user-images.githubusercontent.com/16683750/225589619-d50c506c-e903-4f59-a316-8b62586c73a9.png" width="800"/></p>
<p align="center"><em>Figure 1. Overall DBNet architecture</em></p>

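The overall architecture of DBNet is shown in Figure 1 and consists of the following stages:
@@ -28,16 +28,17 @@ The overall architecture of DBNet is shown in Figure 1 and consists of the following stages: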
### ICDAR2015
<div align="center">

| **Model** | **Context** | **Backbone** | **Pretrained** | **Recall** | **Precision** | **F-score** | **Train T.** | **Throughput** | **Recipe** | **Download** |
|-------------------|----------------|-------------|------------|------------|---------------|-------------|--------------|------------|-----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| DBNet (ours) | D910x1-MS2.0-G | ResNet-18 | ImageNet | 80.40% | 83.71% | 82.02% | 9.3 s/epoch | 108 img/s | [yaml](db_r18_icdar15.yaml) | [ckpt](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnet_resnet18-0c0c4cfa.ckpt) \| [mindir](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnet_resnet18-0c0c4cfa-cf46eb8b.mindir) |
| DBNet (ours) | D910x1-MS2.0-G | ResNet-50 | ImageNet | 83.53% | 86.49% | 84.99% | 12.3 s/epoch | 72 img/s | [yaml](db_r50_icdar15.yaml) | [ckpt](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnet_resnet50-c3a4aa24.ckpt) \| [mindir](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnet_resnet50-c3a4aa24-fbf95c82.mindir) |
| DBNet (PaddleOCR) | - | ResNet50_vd | SynthText | 78.72% | 86.41% | 82.38% | - | - | - | - |
| DBNet++ | D910x1-MS1.9-G | ResNet-50 | ImageNet | 82.02% | 87.38% | 84.62% | - | - | - | - |
| **Model** | **Context** | **Backbone** | **Pretrained** | **Recall** | **Precision** | **F-score** | **Train T.** | **Throughput** | **Recipe** | **Download** |
|---------------------|----------------|---------------|------------|------------|---------------|-------------|--------------|-----------|-------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| DBNet | D910x1-MS2.0-G | ResNet-18 | ImageNet | 80.40% | 83.71% | 82.02% | 9.3 s/epoch | 108 img/s | [yaml](db_r18_icdar15.yaml) | [ckpt](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnet_resnet18-0c0c4cfa.ckpt) \| [mindir](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnet_resnet18-0c0c4cfa-cf46eb8b.mindir) |
| DBNet | D910x1-MS2.0-G | ResNet-50 | ImageNet | 83.53% | 86.49% | 84.99% | 12.3 s/epoch | 72 img/s | [yaml](db_r50_icdar15.yaml) | [ckpt](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnet_resnet50-c3a4aa24.ckpt) \| [mindir](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnet_resnet50-c3a4aa24-fbf95c82.mindir) |
| DBNet (PaddleOCR) | - | ResNet50_vd | SynthText | 78.72% | 86.41% | 82.38% | - | - | - | - |
| | | | | | | | | | | |
| DBNet++ | D910x1-MS2.0-G | ResNet-50 | SynthText | 85.56% | 87.67% | 86.60% | 17.7 s/epoch | 56 img/s | [yaml](db++_r50_icdar15.yaml) | [ckpt](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnetpp_resnet50-068166c2.ckpt) \| [mindir](https://download.mindspore.cn/toolkits/mindocr/dbnet/dbnetpp_resnet50-068166c2-76fcb451.mindir) |
| DBNet++ (PaddleOCR) | - | ResNet-50_DCN | SynthText | 82.66% | 90.89% | 86.58% | - | - | - | - |

</div>

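> More information about DBNet++ is coming soon. The only difference between DBNet and DBNet++ is the _Adaptive Scale Fusion_ module, which is set through the `use_asf` parameter of the `neck` module in the yaml config file.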

#### Notes:
- Context: The training context is denoted as {device}x{pieces}-{MS mode}, where the MindSpore mode can be G (graph mode) or F (pynative mode).