This is an implementation of an end-to-end pipeline to detect and recognize text in YouTube videos. Text detection is based on SSD (Single Shot MultiBox Detector), retrained for a single text class on the Coco-Text dataset. Text recognition is based on the Convolutional Recurrent Neural Network (CRNN) described in "An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition".
Please see the Demo notebook as a starting point. Use it to provide a YouTube URL and either:
- get text detection/recognition results in JSON format, or
- generate a new video with bounding boxes overlaid on all detected text, along with their respective transcriptions.
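The JSON output might look like the following sketch. The field names here are illustrative assumptions, not the pipeline's exact schema; see videotext.py for the real output format.

```python
import json

# Illustrative per-frame detection/recognition result.
# All field names are hypothetical; consult videotext.py for the actual schema.
result = {
    "video_url": "https://www.youtube.com/watch?v=...",
    "frames": [
        {
            "frame_index": 120,
            "detections": [
                {
                    "bbox": [0.12, 0.34, 0.40, 0.52],  # normalized [ymin, xmin, ymax, xmax]
                    "score": 0.93,
                    "text": "OPEN",
                }
            ],
        }
    ],
}

print(json.dumps(result, indent=2))
```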
All requirements are captured in requirements.txt. Activate the virtual environment of your choice and install them (pip install -r requirements.txt).
- Demo.ipynb: Demo notebook as described above
- videotext.py: Main entry point which connects various pieces of the pipeline.
- detection.py and detection/: detection.py abstracts the detection functionality. See the detection section for more details.
- recognition.py and crnn.pytorch/: recognition.py abstracts the recognition functionality. See the recognition section for more details.
- utilities.py: Holds all other helper functions required for E2E video text detection and recognition.
- data_explore_eval: utilities specific to the various datasets, plus evaluation scripts and scripts to generate submissions for the ICDAR17 Robust Reading competition on Coco-Text.
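Detection evaluation typically hinges on intersection-over-union (IoU) between predicted and ground-truth boxes. A minimal sketch, not the repository's exact evaluation code:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # → 0.142857... (1 / 7)
```

A prediction is usually counted as a true positive when its IoU with a ground-truth box exceeds a threshold (commonly 0.5).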
Our detection model is based on TensorFlow's object detection API and its detection model zoo.
We transfer-learn from an SSD MobileNet network. The original network was trained for object detection on the COCO dataset (natural objects); we retrain it for single-class text detection using the Coco-Text dataset.
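Retraining for a single class mostly comes down to editing the pipeline config. A hedged excerpt of the kind of changes made in ssd_mobilenet_v1_coco.config (paths are placeholders, and only the relevant fields are shown):

```
model {
  ssd {
    num_classes: 1  # single "text" class instead of the 90 COCO classes
    ...
  }
}
train_input_reader {
  label_map_path: "PATH_TO/text.pbtxt"
  tf_record_input_reader {
    input_path: "PATH_TO/cocotext_train.record"
  }
}
```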
detection.py loads the frozen TensorFlow inference graph and runs inference on our data.
Please follow the instructions provided by TensorFlow's object detection API, along with the scripts and configs provided in the detection/ folder.
We have also experimented with a Faster R-CNN model pretrained on COCO, for which we provide a config file as well.
- ssd_mobilenet_v1_coco.config
- faster_rcnn_resnet101_pets_coco.config
The label map is defined in text.pbtxt.
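For a single-class detector, text.pbtxt is presumably a one-entry TensorFlow object-detection label map along these lines:

```
item {
  id: 1
  name: 'text'
}
```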
The script used to generate TFRecords for use with this model is at coco-text/Coco-Text%20to%20TFRecords.ipynb.
We use a Convolutional Recurrent Neural Network (CRNN) for recognition.
recognition.py holds helper functions for the recognition task: it loads the CRNN weights file and runs inference. It is adapted from the Caffe implementation by the paper's authors (Shi et al.) and the PyTorch implementation by @meijieru. See the crnn.pytorch folder for more details, and the original implementation for training instructions.
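CRNN emits a per-timestep distribution over characters plus a CTC "blank" symbol; the final transcription is obtained by collapsing repeated labels and dropping blanks. A minimal greedy-decoding sketch (the real decoding lives in crnn.pytorch; the alphabet here is an illustrative assumption):

```python
def ctc_greedy_decode(timestep_labels, blank=0,
                      alphabet="-abcdefghijklmnopqrstuvwxyz"):
    """Collapse repeated labels, then drop CTC blanks (index 0 here)."""
    decoded = []
    prev = None
    for label in timestep_labels:
        if label != prev and label != blank:
            decoded.append(alphabet[label])
        prev = label
    return "".join(decoded)

# Raw per-timestep argmax indices, with repeats and blanks, decoding to "text":
print(ctc_greedy_decode([20, 20, 0, 5, 24, 24, 0, 20]))  # → text
```

The blank symbol is what lets CTC represent doubled letters: "oo" would appear as o, blank, o in the label sequence.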
We have a basic web server that serves video analysis requests. To start it, execute the following in this directory:
$ python flask_server.py
Files:
- flask_server.py - Contains the basic Flask app
- templates - Contains HTML templates for the Flask app
- static - Holds demo videos
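The server's skeleton presumably resembles the following minimal Flask app. The route name and handler are illustrative assumptions, not flask_server.py's actual interface:

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/analyze")
def analyze():
    # Placeholder handler: the real flask_server.py would run the video
    # text pipeline here. Route and response shape are hypothetical.
    return jsonify(status="ok")

if __name__ == "__main__":
    app.run(debug=True)
```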
See directory data_explore_eval/
Coco-text:
- coco-text: Helper functions to work with Coco-Text data. Also contains the Coco-Text Preparation notebook, which translates Coco-Text annotations to TFRecords for use with the TensorFlow detection model.
- Eval_Coco_text_val_set.ipynb: Contains code to evaluate our model on the Coco-Text benchmark.
SynthText:
- synth_utils.py: Helper script to prepare SynthText data
- SynthText Data Preparation notebook [in progress]: Scripts to translate SynthText data to TFRecords for use with the TensorFlow detection model.
The directory also contains scripts to generate submissions for ICDAR17 and run evaluations offline.
Unit tests for the video functionality are included; more need to be added. To run them:
$ python -m pytest test_utilities.py
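New tests can follow the same pytest conventions. A hedged sketch of what a test might look like; the helper here is hypothetical and defined inline so the example is self-contained, rather than imported from utilities.py:

```python
# frame_to_timestamp is a hypothetical helper, not a function from utilities.py.
def frame_to_timestamp(frame_index, fps):
    """Convert a frame index to a timestamp in seconds."""
    return frame_index / fps

# pytest discovers functions named test_* and reports failed asserts.
def test_frame_to_timestamp():
    assert frame_to_timestamp(0, 30.0) == 0.0
    assert frame_to_timestamp(90, 30.0) == 3.0
```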
Download the weights from Google Drive and place them in a folder named weights/.