# FCENet: Fourier Contour Embedding for Arbitrary-Shaped Text Detection
FCENet is a segmentation-based text detection algorithm. In text detection, segmentation-based algorithms are becoming increasingly popular because they can accurately describe text of arbitrary shape, including highly curved text. A key highlight of FCENet is its excellent performance on arbitrary-shaped text, achieved through deformable convolutions [1] and Fourier transform techniques. In addition, FCENet features simple post-processing and strong generalization, allowing it to achieve good results even with limited training data.
The idea behind deformable convolution is simple: turn the fixed sampling grid of the convolution kernel into a variable one. Starting from the regular sampling positions of a standard convolution, a deformable convolution shifts each position by a learned offset, as shown in the following figure:
Figure 1. Deformable Convolution
Figure (a) shows the sampling grid of a standard convolution kernel, figure (b) shows a deformable kernel whose sampling positions are shifted by learned offsets in arbitrary directions, and figures (c) and (d) are two special cases of (b). The advantage is an improved ability to model geometric transformations: the kernel is no longer restricted to a rigid rectangular grid and can cover much richer, irregular shapes. Deformable convolution is therefore better at extracting features of irregular shapes [1] and is well suited to text detection in natural scenes.
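The following minimal NumPy sketch illustrates the sampling step only (all names here are illustrative, not MindOCR's implementation): each tap of a 3x3 kernel reads the feature map at its regular grid position plus a learned fractional offset, via bilinear interpolation.

```python
# Minimal sketch of deformable-convolution sampling (illustrative only).
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a 2-D feature map at fractional coords (y, x)."""
    H, W = feat.shape
    y, x = np.clip(y, 0, H - 1), np.clip(x, 0, W - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0] + wy * wx * feat[y1, x1])

def deform_conv_response(feat, weight, offsets, cy, cx):
    """Response of a 3x3 deformable convolution at output position (cy, cx).

    offsets: array of shape (3, 3, 2), a learned (dy, dx) per kernel tap.
    """
    out = 0.0
    for i, dy in enumerate((-1, 0, 1)):      # regular 3x3 grid ...
        for j, dx in enumerate((-1, 0, 1)):
            oy, ox = offsets[i, j]           # ... shifted by a learned offset
            out += weight[i, j] * bilinear_sample(feat, cy + dy + oy, cx + dx + ox)
    return out
```

In the real network the offsets are predicted by an auxiliary convolution layer and trained end to end together with the kernel weights [1].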
A Fourier contour is a curve obtained by a Fourier-transform-based fitting method. As the Fourier degree k increases, more high-frequency components are introduced and the contour is described more accurately. The following figure shows how well irregular curves can be described under different Fourier degrees:
Figure 2. Fourier contour fitting with progressive approximation
It can be seen that as the Fourier degree k increases, the curves that can be depicted become very complicated.
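Formally, the FCENet paper treats a closed contour as a complex-valued periodic function and approximates it with a truncated Fourier series:

```latex
% Closed contour as a complex periodic function, truncated at Fourier degree k
f(t) \;=\; x(t) + i\,y(t) \;\approx\; \sum_{n=-k}^{k} c_n\, e^{2\pi i n t},
\qquad t \in [0, 1],
```

where $c_0$ is the center of the contour and the coefficients with larger $|n|$ contribute progressively higher-frequency detail.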
Fourier contour embedding, proposed in the paper "Fourier Contour Embedding for Arbitrary-Shaped Text Detection", converts a closed text contour into a fixed-length vector; it is the fundamental encoding step that FCENet relies on. The method samples points at equal intervals along the contour, then converts the sequence of sampled points into a Fourier feature vector. Note that even for the same contour, different sampling choices produce different Fourier feature vectors. The sampling therefore fixes the starting point, the (uniform) sampling speed, and the sampling direction, so that a given contour always maps to a unique Fourier feature vector.
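A minimal NumPy sketch of this encoding (the function names, sample counts, and resampling details below are illustrative assumptions, not MindOCR's actual implementation):

```python
# Minimal sketch of Fourier contour embedding (illustrative only).
import numpy as np

def fourier_encode(points, k=5, n_samples=400):
    """Encode a closed contour into its 2k+1 lowest-frequency coefficients.

    points: (M, 2) array of the contour's (x, y) vertices, in order.
    """
    pts = points[:, 0] + 1j * points[:, 1]       # contour as complex numbers
    pts = np.append(pts, pts[0])                 # close the curve
    # Resample at equal arc-length intervals (the "uniform speed" condition).
    arc = np.concatenate([[0.0], np.cumsum(np.abs(np.diff(pts)))])
    t = np.linspace(0.0, arc[-1], n_samples, endpoint=False)
    resampled = np.interp(t, arc, pts.real) + 1j * np.interp(t, arc, pts.imag)
    coeffs = np.fft.fft(resampled) / n_samples   # Fourier coefficients c_n
    # Keep frequencies -k..k: [c_0, c_1..c_k] followed by [c_-k..c_-1].
    return np.concatenate([coeffs[:k + 1], coeffs[-k:]])
```

Fixing the starting point and traversal direction of `points` before encoding is what makes the resulting vector unique for a given contour.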
Figure 3. FCENet framework
Like most OCR algorithms, FCENet can be roughly divided into three parts: backbone, neck, and head. The backbone is a deformable-convolution version of ResNet50 and performs feature extraction. The neck is a feature pyramid network [2], which combines feature maps at several scales and is therefore good at extracting features of different sizes from the image, improving detection accuracy; it suits scenes where one image contains text boxes of several different sizes. The head has two branches. The classification branch predicts heat maps of both text regions and text center regions; their pixel-wise product yields the classification score map, and the classification loss is the cross entropy between the predicted heat maps and the ground truth. The regression branch predicts the Fourier signature vectors, from which text contours are reconstructed via the inverse Fourier transform (IFT); the smooth-L1 loss between the reconstructed contour and the ground-truth contour in image space is the loss value of the regression branch.
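As a minimal illustration of the reconstruction step (a NumPy sketch of the math only; this helper is hypothetical, and MindOCR's actual post-processing additionally handles score thresholding and NMS):

```python
# Minimal sketch of rebuilding a contour from Fourier coefficients via IFT.
import numpy as np

def fourier_decode(coeffs, k=5, n_points=50):
    """coeffs: complex array ordered [c_0, c_1..c_k, c_-k..c_-1]."""
    t = np.linspace(0.0, 1.0, n_points, endpoint=False)
    freqs = np.concatenate([np.arange(k + 1), np.arange(-k, 0)])
    # f(t) = sum_n c_n * exp(2*pi*i*n*t), sampled at n_points values of t
    f = (coeffs[:, None] * np.exp(2j * np.pi * freqs[:, None] * t)).sum(axis=0)
    return np.stack([f.real, f.imag], axis=1)    # (n_points, 2) contour points
```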
The FCENet model in MindOCR is trained on the ICDAR 2015 dataset. The training results are as follows:
| Model  | Context        | Backbone | Pretrained | Recall | Precision | F-score | Train T.      | Throughput  | Train Step T.   | Recipe | Download       |
|--------|----------------|----------|------------|--------|-----------|---------|---------------|-------------|-----------------|--------|----------------|
| FCENet | D910x4-MS2.0-F | ResNet50 | ImageNet   | 81.51% | 86.90%    | 84.12%  | 95.59 s/epoch | 10.36 img/s | 2978.65 ms/step | yaml   | ckpt \| mindir |
- Context: training context denoted as {device}x{pieces}-{MS version}{MS mode}, where the MindSpore mode can be G (graph mode) or F (pynative mode with ms_function). For example, D910x8-G denotes training on 8 pieces of Ascend 910 NPU using graph mode.
- Note that the training time of FCENet is highly affected by data processing and varies on different machines.
- The input shape of the exported FCENet MindIR in the download link is `(1, 3, 736, 1280)`.
Please refer to the installation instructions in MindOCR.
Please download the ICDAR 2015 dataset, and convert the labels to the desired format referring to dataset_converters.
The prepared dataset file structure should be:
```text
.
├── test
│   ├── images
│   │   ├── img_1.jpg
│   │   ├── img_2.jpg
│   │   └── ...
│   └── test_det_gt.txt
└── train
    ├── images
    │   ├── img_1.jpg
    │   ├── img_2.jpg
    │   └── ...
    └── train_det_gt.txt
```
Update the configuration file `configs/det/fcenet/fce_icdar15.yaml` with the data paths, specifically the following parts. The `dataset_root` will be concatenated with `data_dir` and `label_file` respectively to form the complete dataset directory and label file path.
```yaml
...
train:
  ckpt_save_dir: './tmp_det_fcenet'
  dataset_sink_mode: False
  ema: True
  dataset:
    type: DetDataset
    dataset_root: dir/to/dataset          <--- Update
    data_dir: train/images                <--- Update
    label_file: train/train_det_gt.txt    <--- Update
...
eval:
  ckpt_load_path: '/best.ckpt'            <--- Update
  dataset_sink_mode: False
  dataset:
    type: DetDataset
    dataset_root: dir/to/dataset          <--- Update
    data_dir: test/images                 <--- Update
    label_file: test/test_det_gt.txt      <--- Update
...
```
Optionally, change `num_workers` according to the number of CPU cores, as in the snippet below.
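For example (assuming the loader section follows the usual MindOCR det config layout; the exact field location may differ in your config):

```yaml
train:
  ...
  loader:
    num_workers: 8  # adjust to the number of CPU cores available per device
```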
FCENet consists of 3 parts: `backbone`, `neck`, and `head`. Specifically:
```yaml
model:
  resume: False
  type: det
  transform: null
  backbone:
    name: det_resnet50  # Only ResNet50 is supported at the moment
    pretrained: True    # Whether to use weights pretrained on ImageNet
  neck:
    name: FCEFPN        # FPN part of the FCENet
    out_channels: 256
  head:
    name: FCEHead
    scales: [ 8, 16, 32 ]
    alpha: 1.2
    beta: 1.0
    fourier_degree: 5
    num_sample: 50
```
- Standalone training

  Please set `distribute` in the yaml config file to be `False`.

  ```shell
  python tools/train.py -c=configs/det/fcenet/fce_icdar15.yaml
  ```
- Distributed training

  Please set `distribute` in the yaml config file to be `True`.

  ```shell
  # n is the number of NPUs
  mpirun --allow-run-as-root -n 2 python tools/train.py --config configs/det/fcenet/fce_icdar15.yaml
  ```
The training result (including checkpoints, per-epoch performance and curves) will be saved in the directory parsed by the arg `ckpt_save_dir` in the yaml config file. The default directory is `./tmp_det`.
To evaluate the accuracy of the trained model, you can use `eval.py`. Please set the checkpoint path in the arg `ckpt_load_path` in the `eval` section of the yaml config file, set `distribute` to be `False`, and then run:

```shell
python tools/eval.py -c=configs/det/fcenet/fce_icdar15.yaml
```
Please refer to the tutorial MindOCR Inference for model inference based on MindSpore Lite on Ascend 310, including the following steps:
- Model Export
  Please download the exported MindIR file first, or refer to the Model Export tutorial and use the following command to export the trained ckpt model to a MindIR file:

  ```shell
  python tools/export.py --model_name_or_config fcenet_resnet50 --data_shape 736 1280 --local_ckpt_path /path/to/local_ckpt.ckpt
  # or
  python tools/export.py --model_name_or_config configs/det/fcenet/fce_icdar15.yaml --data_shape 736 1280 --local_ckpt_path /path/to/local_ckpt.ckpt
  ```
  The `data_shape` is the model input shape of height and width for the MindIR file. The shape value of the MindIR in the download link can be found in the ICDAR 2015 Notes.
- Environment Installation
  Please refer to the Environment Installation tutorial to configure the MindSpore Lite inference environment.
- Model Conversion
  Please refer to Model Conversion, and use the `converter_lite` tool for offline conversion of the MindIR file.
- Inference
  Assuming that you obtain output.mindir after model conversion, go to the `deploy/py_infer` directory and use the following command for inference:

  ```shell
  python infer.py \
      --input_images_dir=/your_path_to/test_images \
      --det_model_path=your_path_to/output.mindir \
      --det_model_name_or_config=../../configs/det/fcenet/fce_icdar15.yaml \
      --res_save_dir=results_dir
  ```
[1] Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., & Wei, Y. (2017). Deformable Convolutional Networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 764-773.

[2] Lin, T., et al. (2017). Feature Pyramid Networks for Object Detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 936-944. doi: 10.1109/CVPR.2017.106