- Problem: Building a lightweight model suitable for mobile devices to perform Vietnamese handwritten OCR in the context of Vietnamese addresses.
- Output: the text in the input image.
- Metric: a custom edit-distance score between the output and the label.
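A minimal sketch of such a metric, assuming a character-level Levenshtein distance normalized by the longer of the two strings; the exact competition formula may differ:

```python
# Sketch of a normalized character-level edit-distance metric.
# Assumption: score = 1 - distance / max(len(label), len(prediction));
# the competition's exact normalization may differ.
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (ca != cb))    # substitution / match
            prev = cur
    return dp[-1]


def score(prediction: str, label: str) -> float:
    if not prediction and not label:
        return 1.0
    return 1.0 - edit_distance(prediction, label) / max(len(prediction), len(label))
```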
- Requirements:
  - Model size <= 50 MB
  - Inference time <= 2 s
  - No pretrained OCR model and no external handwritten dataset
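A rough way to sanity-check the size and latency constraints during development; the input shape and CPU-only timing below are assumptions for illustration, not part of the pipeline:

```python
# Rough check of the two hard constraints (size <= 50 MB, latency <= 2 s).
# The input shape and CPU-only timing are assumptions for illustration.
import time
import torch

def check_constraints(model: torch.nn.Module, input_shape=(1, 3, 48, 720)):
    # Size estimate: float32 parameters only (buffers and quantization ignored).
    n_params = sum(p.numel() for p in model.parameters())
    size_mb = n_params * 4 / (1024 ** 2)

    # Latency estimate: one forward pass on CPU with a dummy input.
    model.eval()
    dummy = torch.randn(*input_shape)
    with torch.no_grad():
        start = time.perf_counter()
        model(dummy)
        latency = time.perf_counter() - start

    print(f"params: {n_params / 1e6:.1f}M (~{size_mb:.1f} MB), latency: {latency:.2f} s")
    return size_mb <= 50 and latency <= 2.0
```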
- Some issues with the data:
  - White space at the end of the image.
  - Short text lacking linguistic context.
  - Excessive use of colors.
  - Two lines of text.
  - Text not fully visible.
  - Empty images.
- Ideas:
  - Choose a very lightweight OCR model: SVTR.
  - Pretrain the model on generated data.
  - Fine-tune on the real dataset.
- Expected data directory layout:
|___data
| |___train
| | |___images
| | | |___0.jpg
| | | |___...
| | |___labels
| | | |___0.txt
| | | |___...
| |___val
| | |___images
| | | |___0.jpg
| | | |___...
| | |___labels
| | | |___0.txt
| | | |___...
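A hedged sketch of a dataset class for this layout; the class name is mine, and main.py may load the data differently:

```python
# Minimal dataset for the images/ + labels/ folder layout above.
# Assumes each images/<id>.jpg has a matching labels/<id>.txt with one line of text;
# main.py may load the data differently.
import os
from PIL import Image
from torch.utils.data import Dataset

class FolderOCRDataset(Dataset):
    def __init__(self, root, transform=None):
        self.image_dir = os.path.join(root, "images")
        self.label_dir = os.path.join(root, "labels")
        self.ids = sorted(os.path.splitext(name)[0] for name in os.listdir(self.image_dir))
        self.transform = transform

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        sample_id = self.ids[idx]
        image = Image.open(os.path.join(self.image_dir, sample_id + ".jpg")).convert("RGB")
        with open(os.path.join(self.label_dir, sample_id + ".txt"), encoding="utf-8") as f:
            label = f.read().strip()
        if self.transform is not None:
            image = self.transform(image)
        return image, label
```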
- Collect address text:
  - Extract data from an Excel file provided by the government.
  - Get text labels from other OCR datasets.
  - Crawl information on villages from Google.
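A minimal sketch of merging these sources into one corpus file; the file names and the "address" column are assumptions about how the collected sources are stored:

```python
# Merge the collected address texts into a single deduplicated corpus file.
# "addresses.xlsx", its "address" column, and the extra_corpus/*.txt files
# are assumptions about how the collected sources are stored.
import glob
import pandas as pd

lines = set()

# Government Excel file.
df = pd.read_excel("addresses.xlsx")
lines.update(str(value).strip() for value in df["address"].dropna())

# Labels from other OCR datasets and crawled village names, one entry per line.
for path in glob.glob("extra_corpus/*.txt"):
    with open(path, encoding="utf-8") as f:
        lines.update(line.strip() for line in f if line.strip())

with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(lines)))
```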
- To generate data, use some handwritten fonts together with the text corpus in my repo OCR-Handwritten-Text-Generator.
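The repository does the heavy lifting; conceptually, rendering one corpus line with a handwritten font looks roughly like this (font path, size, padding, and the sample address are placeholders):

```python
# Render one corpus line with a handwritten font, roughly what the generator does.
# The font path, size, padding, and sample address are placeholders.
from PIL import Image, ImageDraw, ImageFont

def render_line(text, font_path="fonts/handwritten.ttf", font_size=48, pad=10):
    font = ImageFont.truetype(font_path, font_size)
    left, top, right, bottom = font.getbbox(text)
    width, height = right - left + 2 * pad, bottom - top + 2 * pad
    image = Image.new("L", (width, height), color=255)  # white background
    ImageDraw.Draw(image).text((pad - left, pad - top), text, fill=0, font=font)
    return image

# Made-up example address:
render_line("Thôn 2, Xã Ea Tiêu, Huyện Cư Kuin").save("sample_0.jpg")
```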
- Then, apply some of the augmentations provided in the repository above.
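The exact augmentations live in that repository; a hedged sketch of typical ones (small rotation, occasional blur, additive noise) on grayscale images, with illustrative parameters:

```python
# Simple augmentations in the spirit of the repository's pipeline
# (small rotation, occasional blur, additive noise) on grayscale images.
# All parameters are illustrative, not the repository's exact values.
import random
import numpy as np
from PIL import Image, ImageFilter

def augment(image: Image.Image) -> Image.Image:
    # Small random rotation, filling the exposed corners with white.
    image = image.rotate(random.uniform(-3, 3), expand=True, fillcolor=255)
    # Occasional slight blur.
    if random.random() < 0.3:
        image = image.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.5, 1.2)))
    # Additive Gaussian noise.
    pixels = np.asarray(image, dtype=np.float32)
    pixels += np.random.normal(0.0, 8.0, pixels.shape)
    return Image.fromarray(np.clip(pixels, 0, 255).astype(np.uint8))
```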
- Total: 250k-350k generated images.
- Manually check the data to crop two-line images and correct their labels.
- To crop images, removing the white space at the end, and to handle empty images:
python3 main.py --scenario preprocess \
--raw_data_path "./path/to/raw/data/"
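Internally this can be thought of as column-wise ink detection; a hedged sketch of the idea (thresholds are illustrative, not the repository's exact values):

```python
# Rough idea of the trailing-white-space crop and the empty-image check:
# a column counts as "ink" if it contains any dark pixel; crop just past the
# last ink column. Thresholds are illustrative, not the repository's values.
import numpy as np
from PIL import Image

def crop_trailing_whitespace(path, ink_threshold=200, margin=10):
    gray = np.asarray(Image.open(path).convert("L"))
    ink_columns = (gray < ink_threshold).any(axis=0)
    if not ink_columns.any():
        return None  # empty image: no ink at all
    last_ink = int(np.nonzero(ink_columns)[0][-1])
    right = min(gray.shape[1], last_ink + 1 + margin)
    return Image.fromarray(gray[:, :right])
```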
- Then, create LMDB data from the raw data (a sketch of the LMDB layout follows the flag descriptions below):
python3 main.py --scenario create_lmdb_data \
--raw_data_path "./data/OCR/training_data" \
--raw_data_type "folder" \
--data_mode "train" \
--lmdb_data_path "./data/kalapa_lmdb/"
- Flags:
  - raw_data_path: path to the raw data.
  - raw_data_type: one of 3 values:
    - json: a directory containing images plus a JSON file in which each line holds an image path and its text label.
    - folder: a directory of image subdirectories plus a directory of .txt label files.
    - other: the second generation type from my repo.
  - data_mode: train data or eval data.
  - lmdb_data_path: path to the output LMDB data.
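A hedged sketch of how the LMDB data could be written, assuming the common image-%09d / label-%09d / num-samples key scheme used by many text-recognition codebases; the repository's actual keys may differ:

```python
# Write (image, label) pairs into LMDB. The image-%09d / label-%09d / num-samples
# key scheme follows the common text-recognition convention and is an assumption
# about this repository's internal format.
import lmdb

def write_lmdb(samples, lmdb_path, map_size=10 * 1024 ** 3):
    """samples: iterable of (jpeg_bytes, label_string)."""
    env = lmdb.open(lmdb_path, map_size=map_size)
    with env.begin(write=True) as txn:
        count = 0
        for image_bytes, label in samples:
            count += 1
            txn.put(f"image-{count:09d}".encode(), image_bytes)
            txn.put(f"label-{count:09d}".encode(), label.encode("utf-8"))
        txn.put(b"num-samples", str(count).encode())
    env.close()
```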
- To run training:
python3 main.py --scenario train \
--model SVTR \
--lmdb_data_path "./data/kalapa_lmdb/" \
--batch_size 16 \
--num_epoch 1000
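Under the hood the train scenario is a standard CTC recognition loop; a rough sketch, where model, loader, converter, and optimizer stand in for whatever main.py actually builds:

```python
# One epoch of a standard CTC recognition loop, as a sketch of what the train
# scenario does; `model`, `loader`, `converter`, and `optimizer` stand in for
# the objects main.py actually builds.
import torch
import torch.nn.functional as F

def train_one_epoch(model, loader, converter, optimizer, device="cpu"):
    model.train()
    for images, texts in loader:
        images = images.to(device)
        # The converter encodes label strings into target indices and lengths.
        targets, target_lengths = converter.encode(texts)

        logits = model(images)                                      # (batch, time, classes)
        log_probs = F.log_softmax(logits, dim=2).permute(1, 0, 2)   # (time, batch, classes)
        input_lengths = torch.full((images.size(0),), log_probs.size(0), dtype=torch.long)

        loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                          blank=0, zero_infinity=True)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```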
- To run an inference test:
python3 main.py --scenario infer --image_test_path "path/to/image.jpg"
- To handle cases where the text is not fully visible, the handwriting is very messy, or the model output is simply wrong, decode with beam search using an n-gram language model.
- To build the n-gram model from the text file generated in the preprocessing step, see https://github.com/kmario23/KenLM-training
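A hedged sketch of rescoring the recognizer's beam candidates with a KenLM model trained on the address corpus; the model file name and interpolation weight are illustrative:

```python
# Rescore beam-search candidates with a KenLM n-gram model trained on the
# address corpus. The model file name and the weight alpha are illustrative.
import kenlm

lm = kenlm.Model("address_ngram.bin")

def rescore(candidates, alpha=0.3):
    """candidates: list of (text, recognizer_log_prob). Returns the best text."""
    def combined(candidate):
        text, recognizer_score = candidate
        return recognizer_score + alpha * lm.score(text, bos=True, eos=True)
    return max(candidates, key=combined)[0]
```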
- To export the model to ONNX (optional):
python3 export_onnx.py
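Conceptually the export boils down to a single torch.onnx.export call; the dummy input size and dynamic width axis below are assumptions about the model, not necessarily what export_onnx.py uses:

```python
# Rough equivalent of export_onnx.py: a single torch.onnx.export call.
# The dummy input size (height 48, width 720) and the dynamic axes are
# assumptions about the model, not necessarily what the script uses.
import torch

def export_to_onnx(model, onnx_path="svtr.onnx"):
    model.eval()
    dummy = torch.randn(1, 3, 48, 720)
    torch.onnx.export(
        model, dummy, onnx_path,
        input_names=["image"], output_names=["logits"],
        dynamic_axes={"image": {0: "batch", 3: "width"}, "logits": {0: "batch"}},
        opset_version=12,
    )
```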
- To run inference on a folder:
  - Run in batch:
python3 submission.py
  - Run on each image:
python3 torch_submission.py
  - Run on each image with ONNX:
python3 onnx_submission.py
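The core of the per-image ONNX path looks roughly like this with ONNX Runtime; the input name, fixed resize, and normalization are assumptions matching the export sketch above:

```python
# Core of per-image ONNX inference: preprocess to the exported input shape,
# run the session, then decode greedily. The input name "image", the fixed
# 720x48 resize, and the normalization are assumptions matching the sketch above.
import numpy as np
import onnxruntime as ort
from PIL import Image

session = ort.InferenceSession("svtr.onnx", providers=["CPUExecutionProvider"])

def infer(image_path):
    image = Image.open(image_path).convert("RGB").resize((720, 48))
    x = np.asarray(image, dtype=np.float32).transpose(2, 0, 1)[None] / 255.0
    (logits,) = session.run(None, {"image": x})
    # Greedy decode: class index per time step; map through the charset,
    # collapsing repeats and removing the blank, to get the final text.
    return logits.argmax(axis=2)[0]
```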