SSL-OCR (Telugu Handwritten Word Recognition)

This repository is adapted for Telugu handwritten word recognition, using a CTC objective and a grapheme-level vocabulary.

Dataset Layout

images/                   # all .jpg handwritten word images
labels.csv                # columns: image_id,text
vocab.txt                 # each line: <grapheme>\t<index>
train.txt                 # each line: <image_id> <idx1> <idx2> ...
valid.txt
test.txt
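
As an illustration of the two text formats above, here is a minimal parsing sketch. It is not the repository's loader: the helper names load_vocab and load_split are hypothetical, the sample contents are made up, and it assumes that image IDs contain no whitespace and that indices are integers.

```python
import io


def load_vocab(fh):
    """Read '<grapheme>\\t<index>' lines into a {grapheme: index} dict."""
    vocab = {}
    for line in fh:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Split on the last tab so the grapheme itself is kept intact.
        grapheme, index = line.rsplit("\t", 1)
        vocab[grapheme] = int(index)
    return vocab


def load_split(fh):
    """Read '<image_id> <idx1> <idx2> ...' lines into (image_id, [indices]) pairs."""
    samples = []
    for line in fh:
        parts = line.split()
        if not parts:
            continue
        samples.append((parts[0], [int(p) for p in parts[1:]]))
    return samples


# In-memory stand-ins for vocab.txt and train.txt (hypothetical contents).
vocab = load_vocab(io.StringIO("క\t0\nమ\t1\nల\t2\n"))
samples = load_split(io.StringIO("img_0001 0 1 2\nimg_0002 2 0\n"))
```

With real files, pass open("vocab.txt", encoding="utf-8") and open("train.txt", encoding="utf-8") instead of the StringIO objects.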

Notes

image_id in labels.csv is authoritative for image lookup (image_id + '.jpg').
text holds the Telugu word label, already cleaned.
UTF-8 is required for all Telugu text files.
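
The lookup rule in the notes above can be sketched as follows. This is an assumption-laden example, not code from the repository: load_labels is a hypothetical helper and the CSV row is invented. It only uses the documented column names (image_id, text) and the image_id + '.jpg' convention.

```python
import csv
import io
from pathlib import Path


def load_labels(fh, image_dir):
    """Map each image_id to (image path, text) using the image_id + '.jpg' rule."""
    image_dir = Path(image_dir)
    labels = {}
    for row in csv.DictReader(fh):
        image_id = row["image_id"]
        labels[image_id] = (image_dir / (image_id + ".jpg"), row["text"])
    return labels


# In-memory stand-in for labels.csv (hypothetical row).
labels = load_labels(io.StringIO("image_id,text\nimg_0001,కమల\n"), "./images")
image_path, text = labels["img_0001"]
```

For the real file, open it explicitly as UTF-8, for example open("labels.csv", encoding="utf-8", newline=""), so that the Telugu text is decoded the same way on every platform.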

Install

pip install torch torchvision pillow pandas numpy tqdm editdistance

Train

export PYTHONIOENCODING=utf-8
python train.py \
  --vocab_file vocab.txt \
  --train_file train.txt \
  --val_file valid.txt \
  --labels_csv labels.csv \
  --image_dir ./images/ \
  --img_height 64 \
  --max_width 512 \
  --batch_size 32 \
  --num_workers 4

Fine-tune

python fine_tune.py \
  --vocab_file vocab.txt \
  --train_file train.txt \
  --val_file valid.txt \
  --labels_csv labels.csv \
  --image_dir ./images/ \
  --pretrained_encoder_path /path/to/encoder.pt

Test

export PYTHONIOENCODING=utf-8
python test.py \
  --vocab_file vocab.txt \
  --test_file test.txt \
  --labels_csv labels.csv \
  --image_dir ./images/ \
  --test_model ./weights/best_telugu_ctc.pt

Predictions are written to pred_logs/test_predictions.tsv.

Key Telugu Handling Rules

Tokenization is grapheme-level via TeluguVocab.encode() (greedy longest match).
CTC blank index is vocab_size.
CER is computed on grapheme token sequences, not Unicode codepoints.
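
To make these three rules concrete, here is a small self-contained sketch. It is a stand-in written for illustration, not the actual TeluguVocab.encode() implementation: greedy_encode, edit_distance, and cer are hypothetical names, the four-entry vocabulary is invented, and 0-based vocabulary indices are assumed.

```python
def greedy_encode(text, vocab):
    """Tokenize text by always taking the longest vocab entry that matches next."""
    max_len = max(len(g) for g in vocab)
    ids, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                ids.append(vocab[piece])
                i += size
                break
        else:
            raise KeyError(f"no vocab entry matches text at position {i}")
    return ids


def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Error rate: edit distance divided by the reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


# Hypothetical vocabulary: two base consonants and two consonant+vowel-sign graphemes.
vocab = {"క": 0, "కా": 1, "ల": 2, "లి": 3}
blank_index = len(vocab)  # CTC blank sits at vocab_size, assuming 0-based indices

ref = greedy_encode("కాలి", vocab)  # two graphemes, four Unicode codepoints
hyp = greedy_encode("కాల", vocab)   # last grapheme is missing its vowel sign

grapheme_cer = cer(ref, hyp)            # one of two graphemes is wrong
codepoint_cer = cer("కాలి", "కాల")      # one of four codepoints is missing
```

In this example the grapheme-level CER is 0.5 while the codepoint-level CER is 0.25, which shows why the two measures are not interchangeable when comparing results.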
