add ctw1500 dataset convert #266

Merged
merged 1 commit on May 10, 2023
2 changes: 2 additions & 0 deletions README.md
@@ -151,6 +151,8 @@ We give instructions on how to download the following datasets.

- [x] MSRA-TD500 [paper](https://ieeexplore.ieee.org/abstract/document/6247787) [homepage](http://www.iapr-tc11.org/mediawiki/index.php/MSRA_Text_Detection_500_Database_(MSRA-TD500)) [download instruction](docs/en/datasets/td500.md)

- [x] SCUT-CTW1500 [paper](https://www.sciencedirect.com/science/article/pii/S0031320319300664) [homepage](https://github.com/Yuliang-Liu/Curve-Text-Detector) [download instruction](docs/en/datasets/ctw1500.md)

</details>

### Conversion
2 changes: 2 additions & 0 deletions README_CN.md
@@ -141,6 +141,8 @@ MindOCR supports text detection + text recognition inference using ckpt files trained with MindOCR

- [x] MSRA-TD500 [paper](https://ieeexplore.ieee.org/abstract/document/6247787) [homepage](http://www.iapr-tc11.org/mediawiki/index.php/MSRA_Text_Detection_500_Database_(MSRA-TD500)) [download instruction](docs/cn/datasets/td500_CN.md)

- [x] SCUT-CTW1500 [paper](https://www.sciencedirect.com/science/article/pii/S0031320319300664) [homepage](https://github.com/Yuliang-Liu/Curve-Text-Detector) [download instruction](docs/cn/datasets/ctw1500_CN.md)

</details>

### 转换
53 changes: 53 additions & 0 deletions docs/cn/datasets/ctw1500_CN.md
@@ -0,0 +1,53 @@
[English](../../en/datasets/ctw1500.md) | Chinese

# SCUT-CTW1500 Dataset

## Data Downloading
Text detection dataset (SCUT-CTW1500) [official website](https://github.com/Yuliang-Liu/Curve-Text-Detector)

[Download the dataset](https://github.com/Yuliang-Liu/Curve-Text-Detector)

Please download the data from the website above and unzip the files. After unzipping, the data structure should look like this:

```txt
ctw1500
├── ctw1500_train_labels
│ ├── 0001.xml
│ ├── 0002.xml
│ ├── ...
├── gt_ctw_1500
│ ├── 0001001.txt
│ ├── 0001002.txt
│ ├── ...
├── test_images
│ ├── 1001.jpg
│ ├── 1002.jpg
│ ├── ...
├── train_images
│ ├── 0001.jpg
│ ├── 0002.jpg
│ ├── ...
```

## Data Preparation

### For Detection Task

To prepare the data for text detection, you can run the following commands:

```bash
python tools/dataset_converters/convert.py \
    --dataset_name ctw1500 --task det \
    --image_dir path/to/ctw1500/train_images/ \
    --label_dir path/to/ctw1500/ctw1500_train_labels \
    --output_path path/to/ctw1500/train_det_gt.txt
```
```bash
python tools/dataset_converters/convert.py \
    --dataset_name ctw1500 --task det \
    --image_dir path/to/ctw1500/test_images/ \
    --label_dir path/to/ctw1500/gt_ctw_1500 \
    --output_path path/to/ctw1500/test_det_gt.txt
```

After running the commands, you will find two annotation files, `train_det_gt.txt` and `test_det_gt.txt`, under the folder `ctw1500/`.
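Each line of the generated annotation files stores the image name and a JSON list of labels, separated by a tab. A line can be parsed back like this (the file name and coordinates below are made-up illustrations, not real data):

```python
import json

# Hypothetical line from train_det_gt.txt: "image<TAB>json_labels"
sample = '0001.jpg\t[{"transcription": "text", "points": [[48, 84], [61, 79]]}]'

# Split on the first tab, then decode the JSON label list
img_name, raw = sample.split('\t', 1)
labels = json.loads(raw)
print(img_name)                    # 0001.jpg
print(labels[0]["transcription"])  # text
```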
54 changes: 54 additions & 0 deletions docs/en/datasets/ctw1500.md
@@ -0,0 +1,54 @@
English | [中文](../../cn/datasets/ctw1500_CN.md)

# SCUT-CTW1500 Dataset

## Data Downloading
SCUT-CTW1500 Datasets [official website](https://github.com/Yuliang-Liu/Curve-Text-Detector)

[download dataset](https://github.com/Yuliang-Liu/Curve-Text-Detector)

Please download the data from the website above and unzip the files.
After unzipping, the data structure should look like this:

```txt
ctw1500
├── ctw1500_train_labels
│ ├── 0001.xml
│ ├── 0002.xml
│ ├── ...
├── gt_ctw_1500
│ ├── 0001001.txt
│ ├── 0001002.txt
│ ├── ...
├── test_images
│ ├── 1001.jpg
│ ├── 1002.jpg
│ ├── ...
├── train_images
│ ├── 0001.jpg
│ ├── 0002.jpg
│ ├── ...
```

## Data Preparation

### For Detection Task

To prepare the data for text detection, you can run the following commands:

```bash
python tools/dataset_converters/convert.py \
    --dataset_name ctw1500 --task det \
    --image_dir path/to/ctw1500/train_images/ \
    --label_dir path/to/ctw1500/ctw1500_train_labels \
    --output_path path/to/ctw1500/train_det_gt.txt
```
```bash
python tools/dataset_converters/convert.py \
    --dataset_name ctw1500 --task det \
    --image_dir path/to/ctw1500/test_images/ \
    --label_dir path/to/ctw1500/gt_ctw_1500 \
    --output_path path/to/ctw1500/test_det_gt.txt
```

You will then have two annotation files, `train_det_gt.txt` and `test_det_gt.txt`, under the folder `ctw1500/`.
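Each line of the generated annotation files pairs an image name with a JSON list of labels, separated by a tab. As a quick sanity check, a line can be parsed like this (the sample file name and coordinates are made up for illustration):

```python
import json

# Hypothetical line from test_det_gt.txt: "image<TAB>json_labels"
sample = '1001.jpg\t[{"transcription": "text", "points": [[48, 84], [61, 79]]}]'

# Split on the first tab, then decode the JSON label list
img_name, raw = sample.split('\t', 1)
labels = json.loads(raw)
print(img_name)                 # 1001.jpg
print(labels[0]["points"])      # [[48, 84], [61, 79]]
```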
39 changes: 39 additions & 0 deletions tools/convert_datasets.sh
@@ -301,3 +301,42 @@ else
--output_path $DIR/MSRA-TD500/test_det_gt.txt
fi
fi

##########################ctw1500#########################
DIR="$DATASETS_DIR/ctw1500"
if [ ! -d "$DIR" ] || [ ! "$(ls -A "$DIR")" ]; then
  echo "ctw1500 is empty! Skipped."
else
  unzip $DIR/train_images.zip -d $DIR/
  rm $DIR/train_images.zip

  unzip $DIR/test_images.zip -d $DIR/
  rm $DIR/test_images.zip

  unzip $DIR/ctw1500_train_labels.zip -d $DIR/
  rm $DIR/ctw1500_train_labels.zip

  unzip $DIR/gt_ctw1500.zip -d $DIR/gt_ctw1500/
  rm $DIR/gt_ctw1500.zip

  if test -f "$DIR/train_det_gt.txt"; then
    echo "$DIR/train_det_gt.txt exists."
  else
    python tools/dataset_converters/convert.py \
      --dataset_name ctw1500 \
      --task det \
      --image_dir $DIR/train_images/ \
      --label_dir $DIR/ctw1500_train_labels/ \
      --output_path $DIR/train_det_gt.txt
  fi
  if test -f "$DIR/test_det_gt.txt"; then
    echo "$DIR/test_det_gt.txt exists."
  else
    python tools/dataset_converters/convert.py \
      --dataset_name ctw1500 \
      --task det \
      --image_dir $DIR/test_images/ \
      --label_dir $DIR/gt_ctw1500/ \
      --output_path $DIR/test_det_gt.txt
  fi
fi
3 changes: 2 additions & 1 deletion tools/dataset_converters/convert.py
@@ -23,8 +23,9 @@
from syntext150k import SYNTEXT150K_Converter
from svt import SVT_Converter
from td500 import TD500_Converter
from ctw1500 import CTW1500_Converter

supported_datasets = ['ic15', 'totaltext', 'mlt2017', 'syntext150k', 'svt', 'td500']
supported_datasets = ['ic15', 'totaltext', 'mlt2017', 'syntext150k', 'svt', 'td500', 'ctw1500']


def convert(dataset_name, task, image_dir, label_path, output_path=None, path_mode='relative'):
76 changes: 76 additions & 0 deletions tools/dataset_converters/ctw1500.py
@@ -0,0 +1,76 @@
import os
import json
import glob
import xml.etree.ElementTree as ET


class CTW1500_Converter(object):
    '''
    Format annotations to the standard form for the SCUT-CTW1500 dataset.
    '''
    def __init__(self, path_mode='relative'):
        self.path_mode = path_mode

    def convert(self, task='det', image_dir=None, label_path=None, output_path=None):
        self.label_path = label_path
        assert os.path.exists(label_path), f'{label_path} does not exist!'

        if task == 'det':
            self._format_det_label(image_dir, self.label_path, output_path)
        else:
            raise ValueError("ctw1500 currently only supports detection.")

    def _format_det_label(self, image_dir, label_dir, output_path):
        # Test labels (gt_ctw_1500) are plain .txt files; train labels are .xml files.
        label_paths = sorted(glob.glob(os.path.join(label_dir, '*.txt')))
        if label_paths:
            with open(output_path, 'w') as out_file:
                for label_fp in label_paths:
                    label_file_name = os.path.basename(label_fp)
                    # Drop the first 3 characters of the label name to recover the
                    # image name, e.g. '0001001.txt' -> '1001.jpg'
                    img_path = os.path.join(image_dir, label_file_name.split('.')[0][3:] + ".jpg")
                    assert os.path.exists(img_path), \
                        f'{img_path} does not exist! Please check the input image_dir {image_dir} and names in {label_fp}'
                    label = []
                    if self.path_mode == 'relative':
                        img_path = os.path.basename(img_path)
                    with open(label_fp, 'r', encoding='utf-8-sig') as f:
                        for line in f.readlines():
                            tmp = line.strip("\n\r").split(',####')
                            assert len(tmp) == 2, f"parse error for {tmp}."
                            points = tmp[0].split(',')
                            assert len(points) % 2 == 0, \
                                f'The number of coordinates should be even, but got {len(points)}'
                            s = []
                            for i in range(0, len(points), 2):
                                b = [int(points[i]), int(points[i + 1])]
                                s.append(b)
                            result = {"transcription": tmp[-1], "points": s}
                            label.append(result)

                    out_file.write(img_path + '\t' + json.dumps(label, ensure_ascii=False) + '\n')
        else:
            label_paths = sorted(glob.glob(os.path.join(label_dir, '*.xml')))
            with open(output_path, 'w') as out_file:
                for label_fp in label_paths:
                    label_file_name = os.path.basename(label_fp)
                    img_path = os.path.join(image_dir, label_file_name.split('.')[0] + ".jpg")
                    assert os.path.exists(img_path), \
                        f'{img_path} does not exist! Please check the input image_dir {image_dir} and names in {label_fp}'
                    label = []
                    if self.path_mode == 'relative':
                        img_path = os.path.basename(img_path)
                    tree = ET.parse(label_fp)
                    for obj in tree.findall('image'):
                        for tmp in obj.findall('box'):
                            annotation = tmp.find('label').text
                            points = tmp.find('segs').text.split(",")

                            # Each curved text box is annotated with 14 points (28 coordinates).
                            assert len(points) == 28, \
                                f'The number of coordinates should be 28, but got {len(points)}'
                            s = []
                            for i in range(0, len(points), 2):
                                b = [int(points[i]), int(points[i + 1])]
                                s.append(b)
                            result = {"transcription": annotation, "points": s}
                            label.append(result)

                    out_file.write(img_path + '\t' + json.dumps(label, ensure_ascii=False) + '\n')
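The test-set `.txt` branch above can be sketched stand-alone as follows. `parse_ctw_txt_line` is a hypothetical helper (not part of the actual converter) mirroring the `,####` split and the even-length coordinate check:

```python
def parse_ctw_txt_line(line):
    # Mirror of the gt_ctw_1500 .txt parsing: coordinates come first,
    # then ',####' separates them from the transcription.
    coords, transcription = line.strip("\n\r").split(',####', 1)
    points = [int(p) for p in coords.split(',')]
    assert len(points) % 2 == 0, 'coordinate list must have even length'
    # Group the flat coordinate list into [x, y] pairs
    pairs = [points[i:i + 2] for i in range(0, len(points), 2)]
    return {"transcription": transcription, "points": pairs}

print(parse_ctw_txt_line("48,84,61,79,####hello"))
```

Splitting with `maxsplit=1` keeps any commas inside the transcription intact, since only the first `,####` acts as the separator.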