
Add CRNN Readme #53

Merged: 14 commits, Mar 16, 2023
101 changes: 101 additions & 0 deletions configs/rec/crnn/README.md
@@ -0,0 +1,101 @@
# CRNN
<!--- Guideline: use url linked to abstract in ArXiv instead of PDF for fast loading. -->

> [An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition](https://arxiv.org/abs/1507.05717)

## Introduction
<!--- Guideline: Introduce the model and architectures. Cite if you use/adopt paper explanation from others. -->

Convolutional Recurrent Neural Network (CRNN) integrates CNN feature extraction, RNN sequence modeling, and transcription into a unified framework.

As shown in the architecture graph (Figure 1), CRNN first extracts a feature sequence from the input image through its convolutional layers. The image is thereby represented as a sequence of feature vectors, each associated with a receptive field on the input image. To further process these features, CRNN adopts recurrent layers to predict a label distribution for each frame. Finally, a transcription layer translates the per-frame predictions into the final label sequence [<a href="#references">1</a>]. A minimal code sketch of these three stages is given after Figure 1.

<!--- Guideline: If an architecture table/figure is available in the paper, put one here and cite for intuitive illustration. -->

<p align="center">
<img src="https://user-images.githubusercontent.com/26082447/224601239-a569a1d4-4b29-4fa8-804b-6690cb50caef.PNG" width=450 />
</p>
<p align="center">
<em> Figure 1. Architecture of CRNN [<a href="#references">1</a>] </em>
</p>
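
For intuition, the following minimal sketch shows how the three stages fit together in code. It is written against MindSpore's `nn` API and is only an illustration: the layer sizes are placeholders, and it is not the VGG7/ResNet34 backbone actually used by the configs in this folder.

```python
import mindspore.nn as nn

class CRNNSketch(nn.Cell):
    """Toy three-stage CRNN: conv features -> recurrent layers -> per-frame logits."""
    def __init__(self, in_channels=3, hidden_size=256, num_classes=37):
        super().__init__()
        # Convolutional layers: extract a feature map whose width axis
        # corresponds to horizontal positions (frames) of the input image.
        self.backbone = nn.SequentialCell(
            nn.Conv2d(in_channels, 64, 3, pad_mode="same"), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(64, 128, 3, pad_mode="same"), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        # Recurrent layers: model left/right context over the frame sequence.
        self.rnn = nn.LSTM(128 * 8, hidden_size, num_layers=2,
                           batch_first=True, bidirectional=True)
        # Per-frame label distribution, later collapsed by CTC transcription.
        self.head = nn.Dense(2 * hidden_size, num_classes)

    def construct(self, x):              # x: (N, C, 32, 100)
        feat = self.backbone(x)          # (N, 128, 8, 25)
        n, c, h, w = feat.shape
        seq = feat.transpose(0, 3, 1, 2).reshape(n, w, c * h)  # (N, T=25, C*H)
        seq, _ = self.rnn(seq)           # (N, T, 2 * hidden_size)
        return self.head(seq)            # per-frame logits for the CTC loss/decoder
```

The per-frame logits returned at the end are what the CTC loss consumes during training and what the transcription step collapses into the final text at inference time.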

## Results
<!--- Guideline:
Table Format:
- Model: model name in lower case with _ separator.
- Context: Training context denoted as {device}x{pieces}-{MS mode}, where mindspore mode can be G - graph mode or F - pynative mode with ms function. For example, D910x8-G is for training on 8 pieces of Ascend 910 NPU using graph mode.
- Top-1 and Top-5: Keep 2 digits after the decimal point.
- Params (M): # of model parameters in millions (10^6). Keep 2 digits after the decimal point
- Recipe: Training recipe/configuration linked to a yaml config file. Use absolute url path.
- Download: url of the pretrained model weights. Use absolute url path.
-->

According to our experiments, the evaluation results on the public benchmark datasets (IC03, IC13, IC15, IIIT, SVT, SVTP, CUTE) are as follows:

<div align="center">

| Model| Backbone | Config | Avg Accuracy | Download |
|------|----------|--------|--------------|----------|
| CRNN | VGG7 | [crnn_vgg7.yaml](./crnn_vgg7.yaml) | 82.03 | [model_weights]() |
| CRNN | ResNet34 | [crnn_resnet34.yaml](./crnn_resnet34.yaml) | 84.45 | [model_weights]() |


</div>

#### Notes
- Both VGG and ResNet models are trained from scratch without any pre-training.
- The above models are trained with the MJSynth (MJ) and SynthText (ST) datasets. For more details on the data, please refer to [Dataset Preparation](#dataset-preparation).
- Evaluations are tested individually on each benchmark dataset, and Avg Accuracy is the average of accuracies across all sub-datasets.
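
To make the averaging explicit, the snippet below shows how such a value can be computed, assuming a simple unweighted mean over the per-benchmark accuracies (the zeros are placeholders, not the reported results, and the actual reporting pipeline may differ):

```python
# Hypothetical per-benchmark accuracies; replace with the values produced by eval.py.
per_dataset_acc = {"IC03": 0.0, "IC13": 0.0, "IC15": 0.0,
                   "IIIT": 0.0, "SVT": 0.0, "SVTP": 0.0, "CUTE": 0.0}
avg_accuracy = sum(per_dataset_acc.values()) / len(per_dataset_acc)  # unweighted mean (assumed)
print(f"Avg Accuracy: {avg_accuracy:.2f}")
```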


## Quick Start
### Preparation

#### Installation
Please refer to the [installation instruction](https://github.com/mindspore-lab/mindocr#installation) in MindOCR.

#### Dataset Preparation
Please download the LMDB datasets for training and evaluation from [here](https://www.dropbox.com/sh/i39abvnefllx2si/AAAbAYRvxzRp3cIE5HzqUw3ra?dl=0) (ref: [deep-text-recognition-benchmark](https://github.com/clovaai/deep-text-recognition-benchmark#download-lmdb-dataset-for-traininig-and-evaluation-from-here)). There are several zip files:
- `data_lmdb_release.zip` contains the complete set of datasets: training, validation and evaluation.
- `validation.zip` is the union of the validation sets.
- `evaluation.zip` contains several benchmark evaluation datasets.
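
After extracting the archives, a quick sanity check of an LMDB folder can look like the sketch below. It assumes the `lmdb` Python package and the key layout used by deep-text-recognition-benchmark (`num-samples`, `label-%09d`, `image-%09d`); the example path is illustrative only.

```python
import lmdb

def count_samples(lmdb_dir):
    """Return the sample count and the first label stored in an LMDB folder."""
    env = lmdb.open(lmdb_dir, readonly=True, lock=False, readahead=False)
    with env.begin(write=False) as txn:
        num = int(txn.get("num-samples".encode()))
        first_label = txn.get("label-000000001".encode()).decode()
    env.close()
    return num, first_label

# Example (hypothetical path): count_samples("data_lmdb_release/evaluation/IIIT5k_3000")
```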

### Training
<!--- Guideline: Avoid using shell script in the command line. Python script preferred. -->

* Distributed Training

It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please set the configuration parameter **distribute** to **True** and run:

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python tools/train.py --config configs/rec/crnn/crnn_resnet34.yaml
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.

**Note:** As the global batch size (batch_size x num_devices) is an important hyper-parameter, it is recommended to keep the global batch size unchanged for reproduction, or to adjust the learning rate linearly when moving to a new global batch size.
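
As a worked example of the linear scaling rule (the per-device batch size of 64 below is an assumption for illustration, not necessarily the value shipped in the recipes):

```python
base_lr = 0.0005          # lr used in the provided recipes
per_device_batch = 64     # assumed per-device batch size
base_devices, new_devices = 8, 4

base_global = per_device_batch * base_devices   # 512
new_global = per_device_batch * new_devices     # 256
new_lr = base_lr * new_global / base_global     # 0.00025
print(new_lr)
```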

* Standalone Training

If you want to train or finetune the model on a smaller dataset without distributed training, please set the configuration parameter **distribute** to **False** and run:

```shell
# standalone training on a CPU/GPU/Ascend device
python tools/train.py --config configs/rec/crnn/crnn_resnet34.yaml
```

### Evaluation

To evaluate the accuracy of the trained model, you can use `eval.py`. Please add the configuration parameter **ckpt_load_path** to the `eval` section, set it to the path of the model checkpoint, and then run:

```shell
python tools/eval.py --config configs/rec/crnn/crnn_vgg7.yaml
```
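
Optionally, a small pre-flight check like the one below (plain PyYAML, not a MindOCR utility) can confirm that the parameter is set before launching evaluation:

```python
import os
import yaml

with open("configs/rec/crnn/crnn_vgg7.yaml") as f:
    cfg = yaml.safe_load(f)

ckpt = (cfg.get("eval") or {}).get("ckpt_load_path")
assert ckpt and os.path.isfile(ckpt), (
    f"eval.ckpt_load_path must point to an existing checkpoint, got: {ckpt}"
)
print(f"Evaluating with checkpoint: {ckpt}")
```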

## References
<!--- Guideline: Citation format GB/T 7714 is suggested. -->

[1] Baoguang Shi, Xiang Bai, Cong Yao. An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition. arXiv preprint arXiv:1507.05717, 2015.
101 changes: 101 additions & 0 deletions configs/rec/crnn/README_CN.md
@@ -0,0 +1,101 @@
# CRNN
<!--- Guideline: use url linked to abstract in ArXiv instead of PDF for fast loading. -->

> [An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition](https://arxiv.org/abs/1507.05717)

## Model Description
<!--- Guideline: Introduce the model and architectures. Cite if you use/adopt paper explanation from others. -->

Convolutional Recurrent Neural Network (CRNN) integrates CNN feature extraction, RNN sequence modeling, and transcription into a unified framework.

As shown in the architecture graph (Figure 1), CRNN first extracts a feature sequence from the input image through its convolutional layers. The image is thereby represented as a sequence of feature vectors, each associated with a receptive field on the input image. To further process these features, CRNN adopts recurrent layers to predict a label distribution for each frame. Finally, a transcription layer translates the per-frame predictions into the final label sequence. [<a href="#references">1</a>]

<!--- Guideline: If an architecture table/figure is available in the paper, put one here and cite for intuitive illustration. -->

<p align="center">
<img src="https://user-images.githubusercontent.com/26082447/224601239-a569a1d4-4b29-4fa8-804b-6690cb50caef.PNG" width=450 />
</p>
<p align="center">
<em> Figure 1. Architecture of CRNN [<a href="#references">1</a>] </em>
</p>

## Evaluation Results
<!--- Guideline:
Table Format:
- Model: model name in lower case with _ separator.
- Context: Training context denoted as {device}x{pieces}-{MS mode}, where mindspore mode can be G - graph mode or F - pynative mode with ms function. For example, D910x8-G is for training on 8 pieces of Ascend 910 NPU using graph mode.
- Top-1 and Top-5: Keep 2 digits after the decimal point.
- Params (M): # of model parameters in millions (10^6). Keep 2 digits after the decimal point
- Recipe: Training recipe/configuration linked to a yaml config file. Use absolute url path.
- Download: url of the pretrained model weights. Use absolute url path.
-->

According to our experiments, the evaluation results on the public benchmark datasets (IC03, IC13, IC15, IIIT, SVT, SVTP, CUTE) are as follows:

<div align="center">

| Model | Backbone | Config | Avg Accuracy | Download |
|------|----------|--------|--------------|----------|
| CRNN | VGG7 | [crnn_vgg7.yaml](./crnn_vgg7.yaml) | 82.03 | [model_weights]() |
| CRNN | ResNet34 | [crnn_resnet34.yaml](./crnn_resnet34.yaml) | 84.45 | [model_weights]() |


</div>

#### Notes
- Both the VGG and ResNet models are trained from scratch without any pre-training.
- The above models are trained with the MJSynth (MJ) and SynthText (ST) datasets. For more details on the data, please refer to [Dataset Preparation](#dataset-preparation).
- Evaluations are run individually on each benchmark dataset, and Avg Accuracy is the average of the accuracies across all sub-datasets.


## Quick Start
### Preparation

#### Installation
Please refer to the [installation instruction](https://github.com/mindspore-lab/mindocr#installation) in MindOCR.

#### Dataset Preparation
Please download the LMDB datasets for training and evaluation from [here](https://www.dropbox.com/sh/i39abvnefllx2si/AAAbAYRvxzRp3cIE5HzqUw3ra?dl=0) (ref: [deep-text-recognition-benchmark](https://github.com/clovaai/deep-text-recognition-benchmark#download-lmdb-dataset-for-traininig-and-evaluation-from-here)). The link contains several zip files:
- `data_lmdb_release.zip` contains the complete set of datasets: training, validation and evaluation.
- `validation.zip` is the union of the validation sets.
- `evaluation.zip` contains several benchmark evaluation datasets.

### Training
<!--- Guideline: Avoid using shell script in the command line. Python script preferred. -->

* Distributed Training

It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please set the configuration parameter **distribute** to **True** and run:

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python tools/train.py --config configs/rec/crnn/crnn_resnet34.yaml
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.

**Note:** As the global batch size (batch_size x num_devices) is an important hyper-parameter, it is recommended to keep the global batch size unchanged for reproduction, or to adjust the learning rate linearly when moving to a new global batch size.

* Standalone Training

If you want to train or finetune the model on a smaller dataset without distributed training, please set the configuration parameter **distribute** to **False** and run:

```shell
# standalone training on a CPU/GPU/Ascend device
python tools/train.py --config configs/rec/crnn/crnn_resnet34.yaml
```

### Evaluation

To evaluate the accuracy of the trained model, you can use `eval.py`. Please add the configuration parameter **ckpt_load_path** to the `eval` section, set it to the path of the model checkpoint, and then run:

```shell
python tools/eval.py --config configs/rec/crnn/crnn_vgg7.yaml
```

## References
<!--- Guideline: Citation format GB/T 7714 is suggested. -->

[1] Baoguang Shi, Xiang Bai, Cong Yao. An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition. arXiv preprint arXiv:1507.05717, 2015.
File renamed without changes.
@@ -1,11 +1,10 @@
system:
mode: 0 # 0 for graph mode, 1 for pynative mode in MindSpore
-distribute: False
+distribute: True
amp_level: 'O3'
seed: 42
val_while_train: True
-ckpt_save_dir: './tmp_rec'
-drop_overflow_update: True
+drop_overflow_update: False

common:
character_dict_path: &character_dict_path #mindocr/utils/dict/en_dict.txt
@@ -39,7 +38,8 @@ metric:
name: RecMetric
main_indicator: acc
character_dict_path: *character_dict_path
-ignore_space: True
+ignore_space: True
+print_flag: False

loss:
name: CTCLoss
@@ -51,10 +51,9 @@ scheduler:
scheduler: warmup_cosine_decay
min_lr: 0.0
lr: 0.0005
-num_epochs: 40
+num_epochs: 30
warmup_epochs: 1
-# warmup_steps: 500
-decay_epochs: 39
+decay_epochs: 29

optimizer:
opt: adamw
@@ -66,10 +65,11 @@ optimizer:
#use_nesterov: True

train:
+ckpt_save_dir: './tmp_rec'
dataset_sink_mode: False
dataset:
type: LMDBDataset
-data_dir: /old/katekong/crnn/datasets/ocr-datasets/data_lmdb_release/training/MJ/MJ_train/
+data_dir: path/to/datadir/train/
# label_files: /data/ocr_datasets/ic15/word_recognition/rec_gt_train.txt
sample_ratios: [1.0]
shuffle: True
@@ -81,11 +81,12 @@ train:
max_text_len: *max_text_len
character_dict_path: *character_dict_path
use_space_char: *use_space_char
+lower: True
- RecResizeImg: # different from paddle (paddle converts image from HWC to CHW and rescale to [-1, 1] after resize.
image_shape: [32, 100] # H, W
infer_mode: *infer_mode
character_dict_path: *character_dict_path
-padding: True # aspect ratio will be preserved if true.
+padding: False # aspect ratio will be preserved if true.
- NormalizeImage: # different from paddle (paddle wrongly normalize BGR image with RGB mean/std from ImageNet for det, and simple rescale to [-1, 1] in rec.
bgr_to_rgb: True
is_hwc: True
@@ -105,10 +106,11 @@ train:
num_workers: 8

eval:
+ckpt_load_path: './tmp_rec/best.ckpt'
dataset_sink_mode: False
dataset:
type: LMDBDataset
-data_dir: /old/katekong/crnn/datasets/ocr-datasets/validation/
+data_dir: path/to/datadir/validation/
# label_files: /data/ocr_datasets/ic15/word_recognition/rec_gt_train.txt
sample_ratios: [1.0]
shuffle: False
@@ -120,11 +122,12 @@ eval:
max_text_len: *max_text_len
character_dict_path: *character_dict_path
use_space_char: *use_space_char
+lower: True
- RecResizeImg: # different from paddle (paddle converts image from HWC to CHW and rescale to [-1, 1] after resize.
image_shape: [32, 100] # H, W
infer_mode: *infer_mode
character_dict_path: *character_dict_path
-padding: True # aspect ratio will be preserved if true.
+padding: False # aspect ratio will be preserved if true.
- NormalizeImage: # different from paddle (paddle wrongly normalize BGR image with RGB mean/std from ImageNet for det, and simple rescale to [-1, 1] in rec.
bgr_to_rgb: True
is_hwc: True
@@ -1,10 +1,9 @@
system:
mode: 0 # 0 for graph mode, 1 for pynative mode in MindSpore
-distribute: False
+distribute: True
amp_level: 'O3'
seed: 42
val_while_train: True
-ckpt_save_dir: './tmp_rec'
drop_overflow_update: False

common:
@@ -39,7 +38,8 @@ metric:
name: RecMetric
main_indicator: acc
character_dict_path: *character_dict_path
-ignore_space: True
+ignore_space: True
+print_flag: False

loss:
name: CTCLoss
@@ -52,8 +52,8 @@ scheduler:
min_lr: 0.0
lr: 0.0005
num_epochs: 10
-warmup_epochs: 0
-decay_epochs: 10
+warmup_epochs: 1
+decay_epochs: 9

optimizer:
opt: adamw
@@ -65,10 +65,11 @@ optimizer:
#use_nesterov: True

train:
+ckpt_save_dir: './tmp_rec'
dataset_sink_mode: False
dataset:
type: LMDBDataset
-data_dir: /old/katekong/crnn/datasets/ocr-datasets/data_lmdb_release/training/MJ/MJ_train/
+data_dir: path/to/datadir/train/
# label_files: /data/ocr_datasets/ic15/word_recognition/rec_gt_train.txt
sample_ratios: [1.0]
shuffle: True
@@ -80,11 +81,12 @@ train:
max_text_len: *max_text_len
character_dict_path: *character_dict_path
use_space_char: *use_space_char
+lower: True
- RecResizeImg: # different from paddle (paddle converts image from HWC to CHW and rescale to [-1, 1] after resize.
image_shape: [32, 100] # H, W
infer_mode: *infer_mode
character_dict_path: *character_dict_path
-padding: True # aspect ratio will be preserved if true.
+padding: False # aspect ratio will be preserved if true.
- NormalizeImage: # different from paddle (paddle wrongly normalize BGR image with RGB mean/std from ImageNet for det, and simple rescale to [-1, 1] in rec.
bgr_to_rgb: True
is_hwc: True
@@ -104,10 +106,11 @@ train:
num_workers: 8

eval:
+ckpt_load_path: './tmp_rec/best.ckpt'
dataset_sink_mode: False
dataset:
type: LMDBDataset
-data_dir: /old/katekong/crnn/datasets/ocr-datasets/validation/
+data_dir: path/to/datadir/validation/
# label_files: /data/ocr_datasets/ic15/word_recognition/rec_gt_train.txt
sample_ratios: [1.0]
shuffle: False
@@ -119,11 +122,12 @@ eval:
max_text_len: *max_text_len
character_dict_path: *character_dict_path
use_space_char: *use_space_char
+lower: True
- RecResizeImg: # different from paddle (paddle converts image from HWC to CHW and rescale to [-1, 1] after resize.
image_shape: [32, 100] # H, W
infer_mode: *infer_mode
character_dict_path: *character_dict_path
-padding: True # aspect ratio will be preserved if true.
+padding: False # aspect ratio will be preserved if true.
- NormalizeImage: # different from paddle (paddle wrongly normalize BGR image with RGB mean/std from ImageNet for det, and simple rescale to [-1, 1] in rec.
bgr_to_rgb: True
is_hwc: True
@@ -136,7 +140,7 @@ eval:

loader:
shuffle: False # TODO: tbc
-batch_size: 64
+batch_size: 16
drop_remainder: True
max_rowsize: 12
num_workers: 8