Deep Speech 2 on PaddlePaddle: Plan & Task Breakdown #44

@xinghai-sun

We are planning to build Deep Speech 2 (DS2) [1], a powerful Automatic Speech Recognition (ASR) engine, on PaddlePaddle. For the first-stage plan, we have the following short-term goals:

  • Release a basic distributed implementation of DS2 on PaddlePaddle.
  • Contribute a chapter of Deep Speech to PaddlePaddle Book.

Intensive system optimization and a low-latency inference library (details in [1]) are not covered in this first-stage plan.

Tasks

We roughly break down the project into 14 tasks:

  1. Develop an audio data provider:
  2. Create a simplified DS2 model configuration:
  3. Add support for variable-shaped dense-vector (image) input batches.
  4. Develop a new lookahead-row-convolution layer (see [1] for details):
  5. Build KenLM n-gram language model for beam search decoding:
  6. Develop a beam search decoder with CTC + LM + WORDCOUNT:
  7. Develop a Word Error Rate evaluator:
    • Update the existing ctc_error_evaluator (CER) to support WER.
  8. Prepare internal dataset for Mandarin (optional):
  9. Create standard DS2 model configuration:
    • With variable-length audio sequences (need Task 3).
    • With unidirectional-GRU + row-convolution (need Task 4).
    • With CTC-LM beam search decoder (need Task 5, 6).
  10. Make DS2 training run reliably on clusters.
  11. Experiments and benchmarking (for accuracy, not efficiency):
    • With public English dataset.
    • With internal (Baidu) Mandarin dataset (optional).
  12. Time profiling and optimization.
  13. Prepare docs.
  14. Prepare PaddlePaddle Book chapter with a simplified version.
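To illustrate Task 4: the lookahead row convolution from [1] lets a unidirectional model see a small window of future time steps, avoiding the deployment latency of a bidirectional RNN. A minimal scalar-per-step sketch (the real layer applies per-feature weights to whole activation vectors; `row_convolution`, `xs`, and `ws` are hypothetical names, not PaddlePaddle APIs):

```python
def row_convolution(xs, ws):
    """Sketch of lookahead row convolution over a 1-D sequence.

    xs: activations, one scalar per time step (real layers use vectors).
    ws: tau + 1 lookahead weights; output at step t mixes steps t..t+tau.
    """
    tau = len(ws) - 1
    padded = xs + [0.0] * tau  # zero-pad so the last steps stay defined
    return [sum(ws[j] * padded[t + j] for j in range(tau + 1))
            for t in range(len(xs))]
```

For example, with `ws = [1.0, 1.0]` each output step is the sum of the current and next activation, so the layer needs only one future step of context at inference time.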
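For Task 6, the DS2 paper [1] ranks beam-search candidates by combining the CTC log-probability with a language-model term and a word-count bonus, roughly Q(y) = log p_ctc(y|x) + α log p_lm(y) + β · wordcount(y). A hedged sketch of that scoring function (the name and default weights are illustrative; in practice α and β are tuned on a dev set):

```python
def beam_score(log_p_ctc, log_p_lm, num_words, alpha=2.0, beta=1.0):
    """Combined score for one beam-search candidate transcription.

    log_p_ctc: log-probability from the CTC acoustic model.
    log_p_lm:  log-probability from the n-gram language model (e.g. KenLM).
    num_words: word count of the candidate; the bonus counteracts the
               LM's bias toward shorter transcriptions.
    """
    return log_p_ctc + alpha * log_p_lm + beta * num_words
```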
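Task 7's metric, Word Error Rate, is the word-level edit distance between reference and hypothesis divided by the number of reference words. A self-contained sketch (not the actual `ctc_error_evaluator` implementation):

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + deletions + insertions) / ref words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        d[i][0] = i
    for j in range(1, len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, one substituted word in a three-word reference gives a WER of 1/3.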

Task Dependency

Tasks are parallelizable within each phase:

| Roadmap   | Description                               | Parallelizable Tasks |
|-----------|-------------------------------------------|----------------------|
| Phase I   | Basic model & components                  | Task 1 ~ Task 8      |
| Phase II  | Standard model & benchmarking & profiling | Task 9 ~ Task 12     |
| Phase III | Documentation                             | Task 13 ~ Task 14    |

An issue for each task will be created later. Contributions, discussions, and comments are all highly appreciated and welcome!

Possible Future Work

  • Efficiency Improvement
  • Accuracy Improvement
  • Low-latency Inference Library
  • Large-scale benchmarking

References

  1. Dario Amodei et al., "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin." ICML 2016.
