add ctw1500 dataset convert #266

Merged
merged 1 commit on May 10, 2023
2 changes: 2 additions & 0 deletions README.md
@@ -151,6 +151,8 @@ We give instructions on how to download the following datasets.

- [x] MSRA-TD500 [paper](https://ieeexplore.ieee.org/abstract/document/6247787) [homepage](http://www.iapr-tc11.org/mediawiki/index.php/MSRA_Text_Detection_500_Database_(MSRA-TD500)) [download instruction](docs/en/datasets/td500.md)

- [x] SCUT-CTW1500 [paper](https://www.sciencedirect.com/science/article/pii/S0031320319300664) [homepage](https://github.com/Yuliang-Liu/Curve-Text-Detector) [download instruction](docs/en/datasets/ctw1500.md)

</details>

### Conversion
2 changes: 2 additions & 0 deletions README_CN.md
@@ -141,6 +141,8 @@ MindOCR supports text detection + text recognition inference using ckpt files trained with MindOCR

- [x] MSRA-TD500 [paper](https://ieeexplore.ieee.org/abstract/document/6247787) [homepage](http://www.iapr-tc11.org/mediawiki/index.php/MSRA_Text_Detection_500_Database_(MSRA-TD500)) [download instruction](docs/cn/datasets/td500_CN.md)

- [x] SCUT-CTW1500 [paper](https://www.sciencedirect.com/science/article/pii/S0031320319300664) [homepage](https://github.com/Yuliang-Liu/Curve-Text-Detector) [download instruction](docs/cn/datasets/ctw1500_CN.md)

</details>

### 转换
53 changes: 53 additions & 0 deletions docs/cn/datasets/ctw1500_CN.md
@@ -0,0 +1,53 @@
[English](../../en/datasets/ctw1500.md) | Chinese

# SCUT-CTW1500 Dataset

## Data Downloading
Text detection dataset (SCUT-CTW1500) [official website](https://github.com/Yuliang-Liu/Curve-Text-Detector)

[Download the dataset](https://github.com/Yuliang-Liu/Curve-Text-Detector)

Please download the data from the website above and unzip the files. After unzipping, the data structure should look like this:

```txt
ctw1500
├── ctw1500_train_labels
│ ├── 0001.xml
│ ├── 0002.xml
│ ├── ...
├── gt_ctw_1500
│ ├── 0001001.txt
│ ├── 0001002.txt
│ ├── ...
├── test_images
│ ├── 1001.jpg
│ ├── 1002.jpg
│ ├── ...
├── train_images
│ ├── 0001.jpg
│ ├── 0002.jpg
│ ├── ...
```

## Data Preparation

### For Detection Task

To prepare the data for text detection, you can run the following commands:

```bash
python tools/dataset_converters/convert.py \
    --dataset_name ctw1500 --task det \
    --image_dir path/to/ctw1500/train_images/ \
    --label_dir path/to/ctw1500/ctw1500_train_labels \
    --output_path path/to/ctw1500/train_det_gt.txt
```
```bash
python tools/dataset_converters/convert.py \
    --dataset_name ctw1500 --task det \
    --image_dir path/to/ctw1500/test_images/ \
    --label_dir path/to/ctw1500/gt_ctw_1500 \
    --output_path path/to/ctw1500/test_det_gt.txt
```

After running the commands, you will find two annotation files, `train_det_gt.txt` and `test_det_gt.txt`, under the folder `ctw1500/`.
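Each line of the generated annotation files stores the image name and a JSON list of labels, separated by a tab. A line can be parsed back like this (the file name and coordinates below are made-up illustrations, not real data):

```python
import json

# Hypothetical line from train_det_gt.txt: "image<TAB>json_labels"
sample = '0001.jpg\t[{"transcription": "text", "points": [[48, 84], [61, 79]]}]'

# Split on the first tab, then decode the JSON label list
img_name, raw = sample.split('\t', 1)
labels = json.loads(raw)
print(img_name)                    # 0001.jpg
print(labels[0]["transcription"])  # text
```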
54 changes: 54 additions & 0 deletions docs/en/datasets/ctw1500.md
@@ -0,0 +1,54 @@
English | [中文](../../cn/datasets/ctw1500_CN.md)

# SCUT-CTW1500 Dataset

## Data Downloading
SCUT-CTW1500 Datasets [official website](https://github.com/Yuliang-Liu/Curve-Text-Detector)

[download dataset](https://github.com/Yuliang-Liu/Curve-Text-Detector)

Please download the data from the website above and unzip the files.
After unzipping, the data structure should look like this:

```txt
ctw1500
├── ctw1500_train_labels
│ ├── 0001.xml
│ ├── 0002.xml
│ ├── ...
├── gt_ctw_1500
│ ├── 0001001.txt
│ ├── 0001002.txt
│ ├── ...
├── test_images
│ ├── 1001.jpg
│ ├── 1002.jpg
│ ├── ...
├── train_images
│ ├── 0001.jpg
│ ├── 0002.jpg
│ ├── ...
```

## Data Preparation

### For Detection Task

To prepare the data for text detection, you can run the following commands:

```bash
python tools/dataset_converters/convert.py \
    --dataset_name ctw1500 --task det \
    --image_dir path/to/ctw1500/train_images/ \
    --label_dir path/to/ctw1500/ctw1500_train_labels \
    --output_path path/to/ctw1500/train_det_gt.txt
```
```bash
python tools/dataset_converters/convert.py \
    --dataset_name ctw1500 --task det \
    --image_dir path/to/ctw1500/test_images/ \
    --label_dir path/to/ctw1500/gt_ctw_1500 \
    --output_path path/to/ctw1500/test_det_gt.txt
```

You will then have two annotation files, `train_det_gt.txt` and `test_det_gt.txt`, under the folder `ctw1500/`.
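Each line of the generated annotation files pairs an image name with a JSON list of labels, separated by a tab. As a quick sanity check, a line can be parsed like this (the sample file name and coordinates are made up for illustration):

```python
import json

# Hypothetical line from test_det_gt.txt: "image<TAB>json_labels"
sample = '1001.jpg\t[{"transcription": "text", "points": [[48, 84], [61, 79]]}]'

# Split on the first tab, then decode the JSON label list
img_name, raw = sample.split('\t', 1)
labels = json.loads(raw)
print(img_name)                 # 1001.jpg
print(labels[0]["points"])      # [[48, 84], [61, 79]]
```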
39 changes: 39 additions & 0 deletions tools/convert_datasets.sh
@@ -301,3 +301,42 @@ else
--output_path $DIR/MSRA-TD500/test_det_gt.txt
fi
fi

##########################ctw1500#########################
DIR="$DATASETS_DIR/ctw1500"
if [ ! -d "$DIR" ] || [ ! "$(ls -A "$DIR")" ]; then
  echo "ctw1500 is empty! Skipped."
else
  unzip $DIR/train_images.zip -d $DIR/
  rm $DIR/train_images.zip

  unzip $DIR/test_images.zip -d $DIR/
  rm $DIR/test_images.zip

  unzip $DIR/ctw1500_train_labels.zip -d $DIR/
  rm $DIR/ctw1500_train_labels.zip

  unzip $DIR/gt_ctw1500.zip -d $DIR/gt_ctw1500/
  rm $DIR/gt_ctw1500.zip

  if test -f "$DIR/train_det_gt.txt"; then
    echo "$DIR/train_det_gt.txt exists."
  else
    python tools/dataset_converters/convert.py \
      --dataset_name ctw1500 \
      --task det \
      --image_dir $DIR/train_images/ \
      --label_dir $DIR/ctw1500_train_labels/ \
      --output_path $DIR/train_det_gt.txt
  fi
  if test -f "$DIR/test_det_gt.txt"; then
    echo "$DIR/test_det_gt.txt exists."
  else
    python tools/dataset_converters/convert.py \
      --dataset_name ctw1500 \
      --task det \
      --image_dir $DIR/test_images/ \
      --label_dir $DIR/gt_ctw1500/ \
      --output_path $DIR/test_det_gt.txt
  fi
fi
3 changes: 2 additions & 1 deletion tools/dataset_converters/convert.py
@@ -23,8 +23,9 @@
from syntext150k import SYNTEXT150K_Converter
from svt import SVT_Converter
from td500 import TD500_Converter
from ctw1500 import CTW1500_Converter

supported_datasets = ['ic15', 'totaltext', 'mlt2017', 'syntext150k', 'svt', 'td500']
supported_datasets = ['ic15', 'totaltext', 'mlt2017', 'syntext150k', 'svt', 'td500', 'ctw1500']


def convert(dataset_name, task, image_dir, label_path, output_path=None, path_mode='relative'):
76 changes: 76 additions & 0 deletions tools/dataset_converters/ctw1500.py
@@ -0,0 +1,76 @@
import os
import json
import glob
import xml.etree.ElementTree as ET


class CTW1500_Converter(object):
    '''
    Format annotations to the standard form for the SCUT-CTW1500 dataset.
    '''
    def __init__(self, path_mode='relative'):
        self.path_mode = path_mode

    def convert(self, task='det', image_dir=None, label_path=None, output_path=None):
        self.label_path = label_path
        assert os.path.exists(label_path), f'{label_path} does not exist!'

        if task == 'det':
            self._format_det_label(image_dir, self.label_path, output_path)
        else:
            raise ValueError("ctw1500 currently only supports detection.")

    def _format_det_label(self, image_dir, label_dir, output_path):
        # Test labels (gt_ctw_1500) are plain .txt files; train labels are .xml files.
        label_paths = sorted(glob.glob(os.path.join(label_dir, '*.txt')))
        if label_paths:
            with open(output_path, 'w') as out_file:
                for label_fp in label_paths:
                    label_file_name = os.path.basename(label_fp)
                    # Drop the first 3 characters of the label name to recover the
                    # image name, e.g. '0001001.txt' -> '1001.jpg'
                    img_path = os.path.join(image_dir, label_file_name.split('.')[0][3:] + ".jpg")
                    assert os.path.exists(img_path), \
                        f'{img_path} does not exist! Please check the input image_dir {image_dir} and names in {label_fp}'
                    label = []
                    if self.path_mode == 'relative':
                        img_path = os.path.basename(img_path)
                    with open(label_fp, 'r', encoding='utf-8-sig') as f:
                        for line in f.readlines():
                            tmp = line.strip("\n\r").split(',####')
                            assert len(tmp) == 2, f"parse error for {tmp}."
                            points = tmp[0].split(',')
                            assert len(points) % 2 == 0, \
                                f'The number of coordinates should be even, but got {len(points)}'
                            s = []
                            for i in range(0, len(points), 2):
                                b = [int(points[i]), int(points[i + 1])]
                                s.append(b)
                            result = {"transcription": tmp[-1], "points": s}
                            label.append(result)

                    out_file.write(img_path + '\t' + json.dumps(label, ensure_ascii=False) + '\n')
        else:
            label_paths = sorted(glob.glob(os.path.join(label_dir, '*.xml')))
            with open(output_path, 'w') as out_file:
                for label_fp in label_paths:
                    label_file_name = os.path.basename(label_fp)
                    img_path = os.path.join(image_dir, label_file_name.split('.')[0] + ".jpg")
                    assert os.path.exists(img_path), \
                        f'{img_path} does not exist! Please check the input image_dir {image_dir} and names in {label_fp}'
                    label = []
                    if self.path_mode == 'relative':
                        img_path = os.path.basename(img_path)
                    tree = ET.parse(label_fp)
                    for obj in tree.findall('image'):
                        for tmp in obj.findall('box'):
                            annotation = tmp.find('label').text
                            points = tmp.find('segs').text.split(",")

                            # Each curved text box is annotated with 14 points (28 coordinates).
                            assert len(points) == 28, \
                                f'The number of coordinates should be 28, but got {len(points)}'
                            s = []
                            for i in range(0, len(points), 2):
                                b = [int(points[i]), int(points[i + 1])]
                                s.append(b)
                            result = {"transcription": annotation, "points": s}
                            label.append(result)

                    out_file.write(img_path + '\t' + json.dumps(label, ensure_ascii=False) + '\n')
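The test-set `.txt` branch above can be sketched stand-alone as follows. `parse_ctw_txt_line` is a hypothetical helper (not part of the actual converter) mirroring the `,####` split and the even-length coordinate check:

```python
def parse_ctw_txt_line(line):
    # Mirror of the gt_ctw_1500 .txt parsing: coordinates come first,
    # then ',####' separates them from the transcription.
    coords, transcription = line.strip("\n\r").split(',####', 1)
    points = [int(p) for p in coords.split(',')]
    assert len(points) % 2 == 0, 'coordinate list must have even length'
    # Group the flat coordinate list into [x, y] pairs
    pairs = [points[i:i + 2] for i in range(0, len(points), 2)]
    return {"transcription": transcription, "points": pairs}

print(parse_ctw_txt_line("48,84,61,79,####hello"))
```

Splitting with `maxsplit=1` keeps any commas inside the transcription intact, since only the first `,####` acts as the separator.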