NISQA Corpus

The NISQA Corpus includes more than 14,000 speech samples with simulated (e.g. codecs, packet-loss, background noise) and live (e.g. mobile phone, Zoom, Skype, WhatsApp) conditions. Each file is labelled with subjective ratings of the overall quality and the quality dimensions Noisiness, Coloration, Discontinuity, and Loudness. In total, it contains more than 97,000 human ratings for each of the dimensions and the overall MOS.
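
Since each file carries one overall MOS and four dimension ratings, the per-file csv-files can be inspected directly. Below is a minimal sketch; the path and the column names (mos, noi, col, dis, loud) are assumptions for illustration, so check the actual headers inside NISQA_Corpus.zip before relying on them:

```python
import pandas as pd

# Hypothetical path and column names -- verify against the files in NISQA_Corpus.zip.
df = pd.read_csv("NISQA_TRAIN_SIM/NISQA_TRAIN_SIM_file.csv")

# One overall MOS plus the four quality dimensions per file (assumed names).
rating_cols = ["mos", "noi", "col", "dis", "loud"]
print(df[rating_cols].describe())

# Example: the ten files with the lowest overall quality.
print(df.sort_values("mos").head(10)[["filename"] + rating_cols])
```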

The NISQA Corpus contains two training, two validation and four test datasets:

  • NISQA_TRAIN_SIM and NISQA_VAL_SIM: contain simulated distortions with speech samples from four different datasets, divided into a training and a validation set.

  • NISQA_TRAIN_LIVE and NISQA_VAL_LIVE: contain live phone and Skype recordings of Librivox audiobook samples, divided into a training and a validation set.

  • NISQA_TEST_LIVETALK: contains recordings of real phone and VoIP calls.

  • NISQA_TEST_FOR: contains live and simulated conditions with speech samples from a forensic speech dataset.

  • NISQA_TEST_NSC: contains live and simulated conditions with speech samples from the NSC dataset.

  • NISQA_TEST_P501: contains live and simulated conditions with speech samples from ITU-T Rec. P.501.

The datasets are provided under the original terms of the source speech and noise samples used. Please see the individual readme and license files in each dataset folder within NISQA_Corpus.zip for more details about the datasets and their licenses. Generally, all of the files in this corpus can be used for non-commercial research purposes, and some of the datasets can also be used for commercial purposes.

If you use any of these datasets, please cite the following publication:
G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,” arXiv:2104.09494 [eess.AS], 2021.

NISQA_Corpus.zip: Download | Mirror


NISQA_TRAIN_SIM and NISQA_VAL_SIM

These datasets contain a large variety of different simulated distortions:

  • Additive white Gaussian noise
  • Signal-correlated MNRU noise
  • Randomly sampled noise clips taken from the DNS-Challenge dataset
  • Lowpass / highpass / bandpass / arbitrary filter with random cutoff frequencies
  • Amplitude clipping
  • Speech level changes
  • Codecs in all available bitrate modes: AMR-NB, AMR-WB, G.711, G.722, EVS, Opus
  • Codec tandem and triple tandem
  • Packet-loss conditions with random and bursty patterns
  • Combinations of the different distortions

The number of files equals the number of conditions because each file was processed with a different condition. The original distortion parameters used to create the dataset are stored in the per-file csv-file. The resulting speech files were then split into a training and a validation set.
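
As an illustration of the simplest entry in the distortion list above, the following sketch mixes white Gaussian noise into a clean clip at a randomly sampled SNR. This is a minimal sketch of the general technique, not the actual pipeline used to generate the corpus; the file names and the SNR range are assumptions:

```python
import numpy as np
import soundfile as sf

def add_awgn(speech: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix white Gaussian noise into a mono speech signal at the given SNR (dB)."""
    speech_power = np.mean(speech ** 2)
    noise = np.random.randn(len(speech))
    # Scale the noise so that 10 * log10(speech_power / noise_power) == snr_db.
    noise *= np.sqrt(speech_power / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return speech + noise

speech, sr = sf.read("clean_clip.wav")  # hypothetical mono input clip
snr_db = np.random.uniform(0, 40)       # assumed SNR range for illustration
sf.write("degraded_clip.wav", add_awgn(speech, snr_db), sr)
```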

Information:
NISQA_TRAIN_SIM
Files: 10,000
Individual speakers: 2,322

NISQA_VAL_SIM
Files: 2,500
Individual speakers: 938

Files per condition: 1
Votes per file: ~5
Language: English

Source speech samples:
The source speech samples are taken from four different datasets. The source of each speech file is listed in the 'source' column of the per-file csv-file. The samples were segmented into 6-12 second clips.

  1. The Librivox audiobook clips of the "DNS-Challenge" [1] dataset. The Librivox audiobooks are in the public domain (https://librivox.org/; License: https://librivox.org/pages/public-domain/).

  2. TSP speech database [2]. The files are covered by a permissive Simplified BSD license (see tsp_license.txt).

  3. Crowdsourced high-quality UK and Ireland English Dialect speech data set [3]. The dataset is covered by an "Attribution-ShareAlike 4.0 International" license (see ukire_license.txt).

  4. AusTalk [4], from which 6-12 second clips of the interview task were extracted. The AusTalk license terms can be found in AusTalk_Content_Licence_Terms.pdf. The owners of the AusTalk corpus gave us permission to make this dataset publicly available.
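
The 6-12 second segmentation mentioned above can be sketched as follows. This is a minimal illustration (random clip lengths, no sentence-boundary handling), not the actual procedure used for the corpus, and the file names are hypothetical:

```python
import numpy as np
import soundfile as sf

def segment(path: str, out_prefix: str, min_s: float = 6.0, max_s: float = 12.0) -> None:
    """Cut a long mono recording into clips of random 6-12 s length."""
    audio, sr = sf.read(path)
    pos, idx = 0, 0
    # Keep cutting while at least a minimum-length clip remains.
    while pos + int(min_s * sr) <= len(audio):
        clip_len = int(np.random.uniform(min_s, max_s) * sr)
        sf.write(f"{out_prefix}_{idx:04d}.wav", audio[pos:pos + clip_len], sr)
        pos += clip_len
        idx += 1

segment("librivox_chapter.wav", "clip")  # hypothetical file names
```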

Noise files:
Noise files are taken from the DNS-Challenge [1] dataset (https://github.com/microsoft/DNS-Challenge), which in turn are taken from these three datasets:
Audioset: https://research.google.com/audioset/index.html; License: https://creativecommons.org/licenses/by/4.0/
Freesound: https://freesound.org/ Only files with CC0 licenses were selected; License: https://creativecommons.org/publicdomain/zero/1.0/
Demand: https://zenodo.org/record/1227121#.XRKKxYhKiUk; License: https://creativecommons.org/licenses/by-sa/3.0/deed.en_CA

License:
The dataset is provided under the original terms of the source speech and noise samples used.

[1] C. K. A. Reddy, E. Beyrami, H. Dubey, V. Gopal, R. Cheng, R. Cutler, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke, “The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, subjective speech quality and testing framework,” 2020.
[2] P. Kabal, “TSP speech database,” McGill University, Quebec, Canada, Tech. Rep., Database Version 1.0, 2002.
[3] I. Demirsahin, O. Kjartansson, A. Gutkin, and C. Rivera, “Open-source multi-speaker corpora of the English accents in the British Isles,” in Proc. 12th Language Resources and Evaluation Conference (LREC), 2020.
[4] D. Burnham, D. Estival, S. Fazio, J. Viethen, F. Cox, R. Dale, S. Cassidy, J. Epps, R. Togneri, M. Wagner, Y. Kinoshita, R. Göcke, J. Arciuli, M. Onslow, T. W. Lewis, A. Butcher, and J. Hajek, “Building an audio-visual corpus of Australian English: Large corpus collection with an economical portable and replicable black box,” in Proc. Interspeech 2011, 2011.


NISQA_TRAIN_LIVE and NISQA_VAL_LIVE

For this dataset, live telephone and Skype calls were conducted in which clean speech files from the DNS-Challenge dataset were played back via a loudspeaker: the speech signal was played back from a laptop on a Fostex PM0.4n studio monitor. Two types of calls were conducted: a fixed-line to mobile phone call and a Skype call (laptop to laptop). For the first type, a call was placed from a fixed-line VoIP phone (Cisco IP Phone 9790) within the Q&U Lab to a state-of-the-art smartphone (Google Pixel 3). The VoIP handset was placed in front of the monitor to capture the speech signal acoustically, and the received signal was stored directly on the Google Pixel 3. The Skype call was conducted between two laptops, where the sending laptop was placed next to the monitor to capture the played-back speech signal; the transmitted speech signal was then stored on the receiving laptop. During the calls, several real distortions were created in the recording room, such as an open window, changes to the volume and angle of the monitor, and typing on a keyboard.

The resulting speech files were then split into a training and a validation set. The same speakers that were used in the simulated training dataset were used again for the live training dataset, and likewise for the validation sets. However, only new sentences of these speakers, which are not contained in the training dataset, were used.

Information:
NISQA_TRAIN_LIVE
Files: 1020
Individual speakers: 486

NISQA_VAL_LIVE
Files: 200
Individual speakers: 102

Files per condition: 1
Votes per file: ~5
Votes per condition: ~5
Language: English

Source speech samples:
The source speech samples are taken from the Librivox audiobook clips of the "DNS-Challenge" [1] dataset. The Librivox audiobooks are in the public domain (https://librivox.org/; License: https://librivox.org/pages/public-domain/). The samples from this dataset were segmented into 6-12 second clips.

License:
The dataset is provided under the original terms of the source speech samples used. Therefore, the files may be used for commercial and/or non-commercial research.

[1] C. K. A. Reddy, E. Beyrami, H. Dubey, V. Gopal, R. Cheng, R. Cutler, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke, “The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, subjective speech quality and testing framework,” 2020.


NISQA_TEST_LIVETALK

In this live talking dataset, the talkers spoke directly into the terminal device (i.e. a smartphone or laptop). The test participants were instructed to talk loudly or quietly, to use the loudspeaker, or to play music in the background in order to obtain different test scenarios and speech quality distortions. Depending on the condition, the talkers were located in different environments, such as a café, a car on the highway, a building with poor reception, an elevator, a shopping centre, a subway/metro station, or a busy street. Most of the talkers used their mobile phone to call either through the mobile network or with a VoIP service (Skype/Facebook). The calls were recorded on a laptop for the VoIP calls and on a Google Pixel 3 for the mobile phone calls. The conversations were either spontaneous or based on scenarios taken from ITU-T P.805. Segments of 6-12 seconds were then extracted from the conversations and rated with regard to their overall quality and speech quality dimensions.

Information:
Files: 232
Individual speakers: 8 (4 male/4 female)
Conditions: 58
Files per condition: 4
Votes per file: 24
Votes per condition: 96
Language: German

License:
This dataset is made available for research purposes under Creative Commons Attribution 4.0 International (CC BY 4.0) license.


NISQA_TEST_FOR

This dataset contains simulated distortions with different codecs, background noise, packet loss, and clipping. It also contains live conditions with WhatsApp, Zoom, and Discord (see the condition list for more information). The dataset was annotated with overall quality and speech quality dimension ratings in the crowd according to ITU-T P.808.

Information:
Files: 240
Individual speakers: 80 (40 male/40 female)
Conditions: 60
Files per condition: 4
Votes per file: ~30 (50 before filtering crowd ratings)
Votes per condition: ~117
Language: Australian English

Source speech samples:
The source speech samples are taken from the "Forensic Voice Comparison Databases - Australian English: 500+ speakers" dataset [1] [2]. The database is available for non-commercial research and forensic casework. The conversation samples from this dataset were segmented into 6-12 seconds clips.

Noise files:
Noise files are taken from the DNS-Challenge [3] dataset (https://github.com/microsoft/DNS-Challenge), which in turn are taken from these three datasets:
Audioset: https://research.google.com/audioset/index.html; License: https://creativecommons.org/licenses/by/4.0/
Freesound: https://freesound.org/ Only files with CC0 licenses were selected; License: https://creativecommons.org/publicdomain/zero/1.0/
Demand: https://zenodo.org/record/1227121#.XRKKxYhKiUk; License: https://creativecommons.org/licenses/by-sa/3.0/deed.en_CA

License:
The dataset is provided under the original terms of the source speech and noise samples used. Therefore, the files may only be used for non-commercial research and forensic casework. The owner of the original forensic speech sample dataset gave us permission to make this dataset publicly available.

[1] G. Morrison, P. Rose, and C. Zhang, “Protocol for the collection of databases of recordings for forensic-voice-comparison research and practice,” Australian Journal of Forensic Sciences, vol. 44, pp. 155-167, 2012.
[2] G. Morrison, C. Zhang, E. Enzinger, F. Ochoa, D. Bleach, M. Johnson, B. Folkes, S. De Souza, N. Cummins, and D. Chow. (2015) Forensic database of voice recordings of 500+ Australian English speakers. [Online]. Available: http://databases.forensic-voice-comparison.net/
[3] C. K. A. Reddy, E. Beyrami, H. Dubey, V. Gopal, R. Cheng, R. Cutler, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke, “The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, subjective speech quality and testing framework,” 2020.


NISQA_TEST_NSC

This dataset contains simulated distortions with different codecs, background noise, packet loss, and clipping. It also contains live conditions with Skype, Zoom, Google Meet, and mobile-to-landline calls (see the condition list). The dataset was annotated with overall quality and speech quality dimension ratings in the crowd according to ITU-T P.808.

Information:
Files: 240
Individual speakers: 240 (120 male/120 female)
Conditions: 60
Files per condition: 4
Votes per file: ~27 (50 before filtering crowd ratings)
Votes per condition: ~109
Language: German

Source speech samples:
The source speech samples are taken from the "Nautilus Speaker Characterization (NSC) Corpus" dataset [1] [2]. The database is available for non-commercial research and teaching purposes only. See the "NSC_License_CLARIN_ACA_BY_NC_NORED.pdf" document for the dataset license. The samples from this dataset were segmented into 6-12 seconds clips.

Noise files:
Noise files are taken from the DNS-Challenge [3] dataset (https://github.com/microsoft/DNS-Challenge), which in turn are taken from these three datasets:
Audioset: https://research.google.com/audioset/index.html; License: https://creativecommons.org/licenses/by/4.0/
Freesound: https://freesound.org/ Only files with CC0 licenses were selected; License: https://creativecommons.org/publicdomain/zero/1.0/
Demand: https://zenodo.org/record/1227121#.XRKKxYhKiUk; License: https://creativecommons.org/licenses/by-sa/3.0/deed.en_CA

License:
The dataset is provided under the original terms of the source speech and noise samples used. Therefore, the files may only be used for non-commercial research. The owner of the original NSC speech sample dataset gave us permission to make this dataset available for non-commercial research purposes.

[1] L. Fernández Gallardo and B. Weiss, “The Nautilus Speaker Characterization Corpus: Speech Recordings and Labels of Speaker Characteristics and Voice Descriptions,” in Proc. International Conference on Language Resources and Evaluation (LREC), 2018.
[2] https://www.qu.tu-berlin.de/?id=nsc-corpus
[3] C. K. A. Reddy, E. Beyrami, H. Dubey, V. Gopal, R. Cheng, R. Cutler, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke, “The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, subjective speech quality and testing framework,” 2020.


NISQA_TEST_P501

This dataset contains simulated distortions with different codecs, background noise, packet loss, and clipping. It also contains live conditions with Skype, Zoom, WhatsApp, and mobile network recordings (see the condition list). The dataset was annotated with overall quality and speech quality dimension ratings in the crowd according to ITU-T P.808. If you use this dataset, please cite the publication listed at the top of this page.

Information:
Files: 240
Individual speakers: 4 (2 male/2 female)
Conditions: 60
Files per condition: 4
Votes per file: ~28 (50 before filtering crowd ratings)
Votes per condition: ~113
Language: British English

Source speech samples:
The source speech samples are taken from Annex C of the "ITU-T P.501" dataset [1]. See "itut_p501_license.txt" for the dataset license. The samples from this dataset were segmented into 6-12 second clips.

Noise files:
Noise files are taken from the DNS-Challenge [2] dataset (https://github.com/microsoft/DNS-Challenge), which in turn are taken from these three datasets:
Audioset: https://research.google.com/audioset/index.html; License: https://creativecommons.org/licenses/by/4.0/
Freesound: https://freesound.org/ Only files with CC0 licenses were selected; License: https://creativecommons.org/publicdomain/zero/1.0/
Demand: https://zenodo.org/record/1227121#.XRKKxYhKiUk; License: https://creativecommons.org/licenses/by-sa/3.0/deed.en_CA

License:
The dataset is provided under the original terms of the source speech and noise samples used. The ITU is the copyright owner of the original test signals, and the use of the P.501 speech samples in this dataset is made with permission from the ITU. The speech quality dataset is made available to the public for free and shall not be included in any commercial product/service.

[1] ITU-T Rec. P.501: Test signals for use in telephony and other speech-based applications, 2020.
[2] C. K. A. Reddy, E. Beyrami, H. Dubey, V. Gopal, R. Cheng, R. Cutler, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke, “The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, subjective speech quality and testing framework,” 2020.