# LibriCSS

Continuous speech separation (CSS) is an approach to handling overlapped speech in conversational audio signals. Most previous speech separation algorithms were tested on artificially mixed, pre-segmented speech signals; they thus bypassed overlap detection and speaker counting by implicitly assuming that the overlapped regions had already been extracted from the input audio. CSS instead processes the continuously incoming audio signal directly, in an online fashion. The concept was established, and its effectiveness evaluated on real meeting recordings, in [1]. As those recordings were proprietary, the same research group prepared a publicly available dataset called LibriCSS [2]. This repository contains the programs for LibriCSS evaluation.

[1] T. Yoshioka et al., "Advances in Online Audio-Visual Meeting Transcription," 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), SG, Singapore, 2019, pp. 276-283.

[2] Z. Chen et al., "Continuous speech separation: dataset and analysis," ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, accepted for publication.

## Requirements

We use SCTK (https://github.com/usnistgov/SCTK), the NIST Scoring Toolkit, for evaluation, and PyKaldi2 (https://github.com/jzlianglu/pykaldi2), a Python interface to Kaldi, for ASR. Both can be installed as follows.

```sh
./install.sh
```

We also use some Python packages. Assuming you are using conda, the simplest way to install all required dependencies is to create a conda environment as follows.

```sh
conda env create -f conda_env.yml
```

This creates a conda environment named `libricss`, which can be activated as follows.

```sh
source activate libricss
```

## Getting Started

### One-step script (UNDER DEVELOPMENT)

The following script executes all steps, including data preparation, ASR, and evaluation.

```sh
./run_all.sh
```

### Step-by-step execution

Alternatively, you may run each step separately, which is useful when you want to substitute your own ASR system for the default one.

1. First, download and preprocess the data:

   ```sh
   cd dataprep
   ./scripts/dataprep.sh
   ```

2. Then, run ASR:

   ```
   <ASR command>
   ```

   The exact command depends on your ASR system; see the Task section below for the input/output contract it should follow.

3. Finally, score the ASR results:

   ```sh
   cd scoring
   ./scripts/eval_noproc.sh
   python ./python/report.py --inputdir ../sample --decode_cfg 13_0.0
   ```

   The last command should print the baseline (i.e., without separation) WER for each overlap condition:

   ```
   Result Summary
   --------------
   Condition: %WER
   0S       : 15.5
   0L       : 11.5
   10       : 21.9
   20       : 27.1
   30       : 34.7
   40       : 40.8
   ```

   This corresponds to the "no separation" results of Table 2 in [2].

## Further Descriptions

### Data

LibriCSS consists of distant microphone recordings of concatenated LibriSpeech utterances played back from loudspeakers in an office room, which enables evaluation of speech separation algorithms that handle long-form audio. See [2] for details of the data.

The data can be downloaded at: https://drive.google.com/file/d/1Piioxd5G_85K9Bhcr8ebdhXx0CnaHy7l/view

The archive file contains only the original "mini-session" recordings (see Section 3.1 of [2]) as well as the source signals played back from the loudspeakers. By following the instructions in the README file included in the archive, you should be able to generate the data for both the utterance-wise evaluation (see Section 3.3.2 of [2]) and the continuous input evaluation (Section 3.3.3 of [2]).

### Task

As a result of the data preparation step, the 7-ch and 1-ch test data are created under `$EXPROOT/7ch` and `$EXPROOT/monaural`, respectively. These directories consist of subdirectories named `overlap_ratio_*_sil*_*_session*_actual*`, each containing chunked mini-session audio files `segment_*.wav` (see Section 3.3.3 of [2]).
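As a rough illustration, the chunks for the 7-ch condition can be enumerated with the directory patterns above (a minimal sketch, assuming `EXPROOT` is set in the environment as in the data preparation scripts):

```python
# Sketch: enumerate the chunked mini-session audio produced by data preparation.
# Assumes $EXPROOT points at the experiment root used by the dataprep scripts.
import glob
import os

exproot = os.environ["EXPROOT"]
pattern = os.path.join(
    exproot, "7ch",  # or "monaural" for the 1-ch data
    "overlap_ratio_*_sil*_*_session*_actual*",
    "segment_*.wav",
)
for wav in sorted(glob.glob(pattern)):
    print(wav)  # each file is one chunk to be transcribed
```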

The task is to transcribe each file and save the result in the CTM format as `segment_*.ctm`. Refer to http://my.fit.edu/~vkepuska/ece5527/sctk-2.3-rc1/doc/infmts.htm#ctm_fmt_name_0 for the CTM format specification.
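For illustration, each CTM line consists of the fields `<file> <channel> <begin_time> <duration> <word> [<confidence>]`. The sketch below writes hypothetical ASR output for one chunk in this format; the `write_ctm` helper, the word timings, and the channel value are illustrative assumptions, so check the scoring scripts for the exact conventions they expect.

```python
# Sketch: write hypothetical ASR output for one chunk as a CTM file.
# CTM fields: <file> <channel> <begin_time> <duration> <word> [<confidence>].
# The channel value "1" and all timings below are illustrative assumptions.
from pathlib import Path

def write_ctm(wav_path, hypotheses, out_dir):
    """hypotheses: list of (word, start_sec, duration_sec) tuples."""
    utt_id = Path(wav_path).stem  # e.g., "segment_0"
    lines = [
        f"{utt_id} 1 {start:.2f} {dur:.2f} {word}"
        for word, start, dur in hypotheses
    ]
    out_path = Path(out_dir) / f"{utt_id}.ctm"
    out_path.write_text("\n".join(lines) + "\n")

# Hypothetical usage with made-up hypotheses:
write_ctm(
    "segment_0.wav",
    [("hello", 0.28, 0.41), ("world", 0.74, 0.55)],
    ".",
)
```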