<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Speech Research</title>
<link>https://speechresearch.github.io/</link>
<description>Recent content on Speech Research</description>
<generator>Hugo -- gohugo.io</generator>
<language>en</language>
<lastBuildDate>Sat, 02 Apr 2022 08:02:17 +0000</lastBuildDate><atom:link href="https://speechresearch.github.io/index.xml" rel="self" type="application/rss+xml"/>
<item>
<title>BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis</title>
<link>https://speechresearch.github.io/binauralgrad/</link>
<pubDate>Sun, 29 May 2022 14:59:32 +0800</pubDate>
<guid>https://speechresearch.github.io/binauralgrad/</guid>
<description>Binaural audio plays a significant role in constructing immersive augmented and virtual realities. As it is expensive to record binaural audio from the real world, synthesizing it from mono audio has attracted increasing attention. This synthesis process involves not only the basic physical warping of the mono audio, but also room reverberations and head/ear-related filtration, which are difficult to simulate accurately with traditional digital signal processing. In this paper, we formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part that is shared by the left and right channels and a specific part that differs in each channel. Accordingly, we propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize them respectively. Specifically, in the first stage, the common information of the binaural audio is generated with a single-channel diffusion model conditioned on the mono audio, based on which the binaural audio is generated by a two-channel diffusion model in the second stage. Combining this novel perspective of two-stage synthesis with advanced generative models (i.e., diffusion models), the proposed BinauralGrad is able to generate accurate and high-fidelity binaural audio samples. Experimental results show that on a benchmark dataset, BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics (Wave L2: 0.128 vs. 0.157, MOS: 3.80 vs. 3.61).</description>
</item>
<item>
<title>NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality</title>
<link>https://speechresearch.github.io/naturalspeech/</link>
<pubDate>Tue, 03 May 2022 19:59:32 +0800</pubDate>
<guid>https://speechresearch.github.io/naturalspeech/</guid>
<description>Abstract Text to speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge human-level quality, and how to achieve it. In this paper, we answer these questions by first defining a criterion of human-level quality based on the statistical significance of measurements and describing guidelines to judge it, and then proposing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.</description>
</item>
<item>
<title>Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech</title>
<link>https://speechresearch.github.io/mpbert/</link>
<pubDate>Sat, 02 Apr 2022 08:02:17 +0000</pubDate>
<guid>https://speechresearch.github.io/mpbert/</guid>
<description>ArXiv: arXiv:2203.17190 Authors Guangyan Zhang (The Chinese University of Hong Kong) gyzhang@link.cuhk.edu.hk Kaitao Song (Microsoft Research Asia) Xu Tan (Microsoft Research Asia) xuta@microsoft.com Daxin Tan (The Chinese University of Hong Kong) Yuzi Yan (Tsinghua University) Yanqing Liu (Microsoft Azure Speech) Gang Wang (Microsoft Azure Speech) Wei Zhou (Zhejiang University) Tao Qin (Microsoft Research Asia) Tan Lee (The Chinese University of Hong Kong) Sheng Zhao (Microsoft Azure Speech) Abstract Recently, leveraging BERT pre-training to improve the phoneme encoder in text to speech (TTS) has drawn increasing attention.</description>
</item>
<item>
<title>AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios</title>
<link>https://speechresearch.github.io/adaspeech4/</link>
<pubDate>Sun, 06 Mar 2022 15:30:00 +0900</pubDate>
<guid>https://speechresearch.github.io/adaspeech4/</guid>
<description>Authors Yihan Wu (Gaoling School of Artificial Intelligence, Renmin University of China) yihanwu@ruc.edu.cn Xu Tan^ (Microsoft Research Asia) xuta@microsoft.com Bohan Li (Microsoft Azure Speech) bohli@microsoft.com Lei He (Microsoft Azure Speech) helei@microsoft.com Sheng Zhao (Microsoft Azure Speech) szhao@microsoft.com Ruihua Song^ (Gaoling School of Artificial Intelligence, Renmin University of China) rsong@ruc.edu.cn Tao Qin (Microsoft Research Asia) taoqin@microsoft.com Tie-Yan Liu (Microsoft Research Asia) tyliu@microsoft.com ^ Corresponding authors.
Abstract Adaptive text to speech (TTS) can synthesize new voices in zero-shot scenarios efficiently by using a well-trained source TTS model without adapting it on the speech data of new speakers.</description>
</item>
<item>
<title>Speech-T: Transducer for Text to Speech and Beyond</title>
<link>https://speechresearch.github.io/speechtransducer/</link>
<pubDate>Wed, 06 Oct 2021 21:35:07 +0800</pubDate>
<guid>https://speechresearch.github.io/speechtransducer/</guid>
<description>Authors Jiawei Chen (South China University of Technology) csjiaweichen@mail.scut.edu.cn Xu Tan (Microsoft Research Asia) xuta@microsoft.com Yichong Leng (University of Science and Technology of China) lyc123go@mail.ustc.edu.cn Jin Xu (Tsinghua University) j-xu18@mails.tsinghua.edu.cn Guihua Wen (South China University of Technology) crghwen@scut.edu.cn Tao Qin (Microsoft Research Asia) taoqin@microsoft.com Tie-Yan Liu (Microsoft Research Asia) tyliu@microsoft.com Abstract Neural Transducer (e.g., RNN-T) has been widely used in automatic speech recognition (ASR) due to its capabilities of efficiently modeling monotonic alignments between input and output sequences and naturally supporting streaming inputs.</description>
</item>
<item>
<title>TeleMelody: Lyric-to-Melody Generation with a Template-Based Two-Stage Method</title>
<link>https://speechresearch.github.io/telemelody/</link>
<pubDate>Tue, 21 Sep 2021 12:30:00 +0900</pubDate>
<guid>https://speechresearch.github.io/telemelody/</guid>
<description>ArXiv: arXiv:2109.09617
Authors Zeqian Ju (University of Science and Technology of China) juzeqian@mail.ustc.edu.cn Peiling Lu (Microsoft Research Asia) peil@microsoft.com Xu Tan^ (Microsoft Research Asia) xuta@microsoft.com Rui Wang (Microsoft Research Asia) ruiwa@microsoft.com Chen Zhang (Zhejiang University) zc99@zju.edu.cn Songruoyao Wu (Zhejiang University) 22021296@zju.edu.cn Kejun Zhang (Zhejiang University) zhangkejun@zju.edu.cn Xiangyang Li (University of Science and Technology of China) xiangyangli@ustc.edu.cn Tao Qin (Microsoft Research Asia) taoqin@microsoft.com</description>
</item>
<item>
<title>DeepRapper: Neural Rap Generation with Rhyme and Rhythm Modeling</title>
<link>https://speechresearch.github.io/deeprapper/</link>
<pubDate>Mon, 16 Aug 2021 12:25:00 +0900</pubDate>
<guid>https://speechresearch.github.io/deeprapper/</guid>
<description>ArXiv: arXiv:2107.01875
Authors Lanqing Xue (Hong Kong University of Science and Technology) lxueaa@cse.ust.hk Kaitao Song (Nanjing University of Science and Technology) kt.song@njust.edu.cn Duocai Wu (Fudan University) dcwu18@fudan.edu.cn Xu Tan^ (Microsoft Research) xuta@microsoft.com Nevin L. Zhang (Hong Kong University of Science and Technology) lzhang@cse.ust.hk Wei-Qiang Zhang (Tsinghua University) wqzhang@tsinghua.edu.cn Tao Qin (Microsoft Research) taoqin@microsoft.com Tie-Yan Liu (Microsoft Research) tyliu@microsoft.com ^ Corresponding author.
Abstract Rap generation, which aims to produce lyrics and corresponding singing beats, needs to model both rhymes and rhythms.</description>
</item>
<item>
<title>PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Driven Adaptive Prior</title>
<link>https://speechresearch.github.io/priorgrad/</link>
<pubDate>Fri, 11 Jun 2021 15:30:00 +0900</pubDate>
<guid>https://speechresearch.github.io/priorgrad/</guid>
<description>ArXiv: arXiv:2106.06406
Authors Sang-gil Lee (Data Science &amp; AI Lab., Seoul National University) tkdrlf9202@snu.ac.kr Heeseung Kim (Data Science &amp; AI Lab., Seoul National University) gmltmd789@snu.ac.kr Chaehun Shin (Data Science &amp; AI Lab., Seoul National University) chaehuny@snu.ac.kr Xu Tan^ (Microsoft Research Asia) xuta@microsoft.com Chang Liu (Microsoft Research Asia) changliu@microsoft.com Qi Meng (Microsoft Research Asia) meq@microsoft.com Tao Qin (Microsoft Research Asia) taoqin@microsoft.com Wei Chen (Microsoft Research Asia) wche@microsoft.com Sungroh Yoon^ (Data Science &amp; AI Lab., Seoul National University)</description>
</item>
<item>
<title>AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style</title>
<link>https://speechresearch.github.io/adaspeech3/</link>
<pubDate>Wed, 02 Jun 2021 11:35:23 +0800</pubDate>
<guid>https://speechresearch.github.io/adaspeech3/</guid>
<description>Authors Yuzi Yan (EE, Tsinghua University) yan-yz17@mails.tsinghua.edu.cn Xu Tan (Microsoft Research Asia) xuta@microsoft.com Bohan Li (Microsoft Azure Speech) bohan.li@microsoft.com Guangyan Zhang (EE, The Chinese University of Hong Kong) gyzhang@link.cuhk.edu.hk Tao Qin (Microsoft Research Asia) taoqin@microsoft.com Sheng Zhao (Microsoft Azure Speech) sheng.zhao@microsoft.com Yuan Shen (EE, Tsinghua University) shenyuan_ee@tsinghua.edu.cn Wei-Qiang Zhang (EE, Tsinghua University) wqzhang@tsinghua.edu.cn Tie-Yan Liu (Microsoft Research Asia) tie-yan.liu@microsoft.com Audio Samples All of the audio samples use MelGAN as the vocoder.</description>
</item>
<item>
<title>AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data</title>
<link>https://speechresearch.github.io/adaspeech2/</link>
<pubDate>Fri, 05 Mar 2021 11:35:23 +0800</pubDate>
<guid>https://speechresearch.github.io/adaspeech2/</guid>
<description>Authors Yuzi Yan (EE, Tsinghua University) yan-yz17@mails.tsinghua.edu.cn Xu Tan (Microsoft Research Asia) xuta@microsoft.com Bohan Li (Microsoft Azure Speech) bohan.li@microsoft.com Tao Qin (Microsoft Research Asia) taoqin@microsoft.com Sheng Zhao (Microsoft Azure Speech) szhao@microsoft.com Yuan Shen (EE, Tsinghua University) shenyuan_ee@tsinghua.edu.cn Tie-Yan Liu (Microsoft Research Asia) tie-yan.liu@microsoft.com Audio Samples All of the audio samples use MelGAN as the vocoder.
Audio Quality When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow.</description>
</item>
<item>
<title>AdaSpeech: Adaptive Text to Speech for Custom Voice</title>
<link>https://speechresearch.github.io/adaspeech/</link>
<pubDate>Mon, 01 Mar 2021 15:30:00 +0900</pubDate>
<guid>https://speechresearch.github.io/adaspeech/</guid>
<description>ArXiv: arXiv:2103.00993
Authors Mingjian Chen* (Microsoft Azure Speech) t-miche@microsoft.com Xu Tan^* (Microsoft Research Asia) xuta@microsoft.com Bohan Li (Microsoft Azure Speech) bohan.li@microsoft.com Yanqing Liu (Microsoft Azure Speech) yanqliu@microsoft.com Tao Qin (Microsoft Research Asia) taoqin@microsoft.com Sheng Zhao (Microsoft Azure Speech) szhao@microsoft.com Tie-Yan Liu (Microsoft Research Asia) tyliu@microsoft.com * Equal contribution. ^ Corresponding author.
Abstract Custom voice, a specific text to speech (TTS) service in commercial speech platforms, aims to adapt a source TTS model to synthesize a personal voice for a target speaker using only a small amount of speech from him or her.</description>
</item>
<item>
<title>FastSpeech 2: Fast and High-Quality End-to-End Text to Speech</title>
<link>https://speechresearch.github.io/fastspeech2/</link>
<pubDate>Wed, 10 Feb 2021 15:30:00 +0900</pubDate>
<guid>https://speechresearch.github.io/fastspeech2/</guid>
<description>ArXiv: arXiv:2006.04558
Authors Yi Ren* (Zhejiang University) rayeren@zju.edu.cn Chenxu Hu* (Zhejiang University) chenxuhu@zju.edu.cn Xu Tan (Microsoft Research) xuta@microsoft.com Tao Qin (Microsoft Research) taoqin@microsoft.com Sheng Zhao (Microsoft Azure Speech) Sheng.Zhao@microsoft.com Zhou Zhao (Zhejiang University) zhaozhou@zju.edu.cn Tie-Yan Liu (Microsoft Research) tyliu@microsoft.com * Equal contribution.
Abstract Non-autoregressive text to speech (TTS) models such as FastSpeech (Ren et al., 2019) can synthesize speech significantly faster than previous autoregressive models with comparable quality. The training of the FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide more information as input) and knowledge distillation (to simplify the data distribution in output), which can ease the one-to-many mapping problem.</description>
</item>
<item>
<title>SongMASS: Automatic Song Writing with Pre-training and Alignment Constraint</title>
<link>https://speechresearch.github.io/songmass/</link>
<pubDate>Mon, 14 Dec 2020 15:30:00 +0900</pubDate>
<guid>https://speechresearch.github.io/songmass/</guid>
<description>ArXiv: arXiv:2012.05168
Authors Zhonghao Sheng* (Peking University) zhonghao.sheng@pku.edu.cn Kaitao Song* (Nanjing University of Science and Technology) kt.song@njust.edu.cn Xu Tan^ (Microsoft Research) xuta@microsoft.com Yi Ren (Zhejiang University) rayeren@zju.edu.cn Wei Ye (Peking University) wye@pku.edu.cn Shikun Zhang (Peking University) zhangsk@pku.edu.cn Tao Qin (Microsoft Research) taoqin@microsoft.com * Equal contribution. ^ Corresponding author.
Abstract Automatic song writing aims to compose a song (lyric and/or melody) by machine, which is an interesting topic in both academia and industry.</description>
</item>
<item>
<title>LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search</title>
<link>https://speechresearch.github.io/lightspeech/</link>
<pubDate>Tue, 03 Nov 2020 12:00:00 +0800</pubDate>
<guid>https://speechresearch.github.io/lightspeech/</guid>
<description>Authors Renqian Luo (University of Science and Technology of China) lrq@mail.ustc.edu.cn Xu Tan (Microsoft Research) xuta@microsoft.com Rui Wang (Microsoft Research) ruiwa@microsoft.com Tao Qin (Microsoft Research) taoqin@microsoft.com Enhong Chen (University of Science and Technology of China) cheneh@ustc.edu.cn Tie-Yan Liu (Microsoft Research) tyliu@microsoft.com Abstract Text to speech (TTS) has been broadly used to synthesize natural and intelligible speech in different scenarios. Deploying TTS in various end devices such as mobile phones or embedded devices requires extremely low memory usage and inference latency.</description>
</item>
<item>
<title>DenoiSpeech: Denoising Text to Speech with Frame-Level Noise Modeling</title>
<link>https://speechresearch.github.io/denoispeech/</link>
<pubDate>Wed, 14 Oct 2020 15:30:00 +0900</pubDate>
<guid>https://speechresearch.github.io/denoispeech/</guid>
<description>ArXiv: arXiv:2012.09547 (Accepted by ICASSP2021)
Authors Chen Zhang (Zhejiang University) zc99@zju.edu.cn Yi Ren (Zhejiang University) rayeren@zju.edu.cn Xu Tan (Microsoft Research) xuta@microsoft.com Jinglin Liu (Zhejiang University) jinglinliu@zju.edu.cn Kejun Zhang (Zhejiang University) zhangkejun@zju.edu.cn Tao Qin (Microsoft Research) taoqin@microsoft.com Sheng Zhao (Microsoft STC Asia) Sheng.Zhao@microsoft.com Tie-Yan Liu (Microsoft Research) tyliu@microsoft.com Abstract While neural-based text to speech (TTS) models can synthesize natural and intelligible voice, they usually require high-quality speech data, which is costly to collect.</description>
</item>
<item>
<title>HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis</title>
<link>https://speechresearch.github.io/hifisinger/</link>
<pubDate>Wed, 02 Sep 2020 00:02:05 +0800</pubDate>
<guid>https://speechresearch.github.io/hifisinger/</guid>
<description>ArXiv: arXiv:2009.01776
Authors
Microsoft STC Asia &amp; Microsoft Research Asia
Jiawei Chen (Microsoft STC Asia) t-jiawch@microsoft.com
Xu Tan* (Microsoft Research) xuta@microsoft.com
Jian Luan (Microsoft STC Asia) jianluan@microsoft.com
Tao Qin (Microsoft Research) taoqin@microsoft.com
Tie-Yan Liu (Microsoft Research) tyliu@microsoft.com
* Corresponding author.
Abstract
High-fidelity singing voices usually require a higher sampling rate (e.g., 48kHz, compared with 16kHz or 24kHz for speaking voices) with a large frequency range to convey expression and emotion.</description>
</item>
<item>
<title>PopMAG: Pop Music Accompaniment Generation</title>
<link>https://speechresearch.github.io/popmag/</link>
<pubDate>Sat, 01 Aug 2020 15:30:00 +0900</pubDate>
<guid>https://speechresearch.github.io/popmag/</guid>
<description>ArXiv: https://arxiv.org/abs/2008.07703
Audio samples (Sample 1, 2, and 3) comparing the melody input, the PopMAG output, and the ground truth are available on the project page.</description>
</item>
<item>
<title>UWSpeech: Speech to Speech Translation for Unwritten Languages</title>
<link>https://speechresearch.github.io/uwspeech/</link>
<pubDate>Fri, 12 Jun 2020 15:30:00 +0900</pubDate>
<guid>https://speechresearch.github.io/uwspeech/</guid>
<description>ArXiv: arXiv:2006.07926 (Accepted by AAAI2021)
Authors Chen Zhang (Zhejiang University) zc99@zju.edu.cn Xu Tan (Microsoft Research) xuta@microsoft.com Yi Ren (Zhejiang University) rayeren@zju.edu.cn Tao Qin (Microsoft Research) taoqin@microsoft.com Kejun Zhang (Zhejiang University) zhangkejun@zju.edu.cn Tie-Yan Liu (Microsoft Research) tyliu@microsoft.com Abstract Existing speech to speech translation systems heavily rely on the text of the target language: they usually translate the source language either to target text, from which target speech is then synthesized, or directly to target speech with target text for auxiliary training.</description>
</item>
<item>
<title>MultiSpeech: Multi-Speaker Text to Speech with Transformer</title>
<link>https://speechresearch.github.io/multispeech/</link>
<pubDate>Sat, 09 May 2020 15:30:00 +0900</pubDate>
<guid>https://speechresearch.github.io/multispeech/</guid>
<description>Authors Mingjian Chen (Peking University) milk@pku.edu.cn Xu Tan (Microsoft Research) xuta@microsoft.com Yi Ren (Zhejiang University) rayeren@zju.edu.cn Jin Xu (Tsinghua University) j-xu18@mails.tsinghua.edu.cn Hao Sun (Peking University) sigmeta@pku.edu.cn Sheng Zhao (Microsoft STC Asia) Sheng.Zhao@microsoft.com Tao Qin (Microsoft Research) taoqin@microsoft.com Tie-Yan Liu (Microsoft Research) tyliu@microsoft.com TTS Audio Samples in the Paper Experiments on VCTK and LibriTTS VCTK speaker: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.</description>
</item>
<item>
<title>Semi-Supervised Neural Architecture Search</title>
<link>https://speechresearch.github.io/seminas/</link>
<pubDate>Sun, 01 Mar 2020 18:00:00 +0800</pubDate>
<guid>https://speechresearch.github.io/seminas/</guid>
<description>ArXiv: arXiv:2002.10389
Authors Renqian Luo (University of Science and Technology of China) lrq@mail.ustc.edu.cn Xu Tan (Microsoft Research) xuta@microsoft.com Rui Wang (Microsoft Research) ruiwa@microsoft.com Tao Qin (Microsoft Research) taoqin@microsoft.com Enhong Chen (University of Science and Technology of China) cheneh@ustc.edu.cn Tie-Yan Liu (Microsoft Research) tyliu@microsoft.com Abstract Neural architecture search (NAS) relies on a good controller to generate better architectures or predict the accuracy of given architectures. However, training the controller requires both abundant and high-quality pairs of architectures and their accuracy, while it is costly to evaluate an architecture and obtain its accuracy.</description>
</item>
<item>
<title>DeepSinger: Singing Voice Synthesis with Data Mined From the Web</title>
<link>https://speechresearch.github.io/deepsinger/</link>
<pubDate>Fri, 14 Feb 2020 15:30:00 +0900</pubDate>
<guid>https://speechresearch.github.io/deepsinger/</guid>
<description>Authors Yi Ren* (Zhejiang University) rayeren@zju.edu.cn Xu Tan* (Microsoft Research Asia) xuta@microsoft.com Tao Qin (Microsoft Research Asia) taoqin@microsoft.com Jian Luan (Microsoft STCA) jianluan@microsoft.com Zhou Zhao (Zhejiang University) zhaozhou@zju.edu.cn Tie-Yan Liu (Microsoft Research Asia) tyliu@microsoft.com * Equal contribution.
Chinese samples (Sample 1, 2, and 3; data crawling): audio is available on the project page.
Lyrics: 爱从不容许人三心两意 ("Love never allows one to be half-hearted.")
Phonemes: PAD ai c ong b u r ong x v r en s an x in l iang PAD i</description>
</item>
<item>
<title>LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition</title>
<link>https://speechresearch.github.io/lrspeech/</link>
<pubDate>Sun, 02 Feb 2020 15:30:00 +0900</pubDate>
<guid>https://speechresearch.github.io/lrspeech/</guid>
<description>ArXiv: arXiv:2008.03687
Authors Jin Xu (Tsinghua University) j-xu18@mails.tsinghua.edu.cn Xu Tan (Microsoft Research) xuta@microsoft.com Yi Ren (Zhejiang University) rayeren@zju.edu.cn Tao Qin (Microsoft Research) taoqin@microsoft.com Jian Li (Tsinghua University) lijian83@mail.tsinghua.edu.cn Sheng Zhao (Microsoft STC Asia) Sheng.Zhao@microsoft.com Tie-Yan Liu (Microsoft Research) tyliu@microsoft.com Abstract Speech synthesis (text to speech, TTS) and recognition (automatic speech recognition, ASR) are important speech tasks, and require a large amount of text and speech pairs for model training.</description>
</item>
<item>
<title>FastSpeech: Fast, Robust and Controllable Text to Speech</title>
<link>https://speechresearch.github.io/fastspeech/</link>
<pubDate>Fri, 10 May 2019 15:30:00 +0900</pubDate>
<guid>https://speechresearch.github.io/fastspeech/</guid>
<description>ArXiv: arXiv:1905.09263
Reddit Discussions: linked from the project page
Authors Yi Ren* (Zhejiang University) rayeren@zju.edu.cn Yangjun Ruan* (Zhejiang University) ruanyj3107@zju.edu.cn Xu Tan (Microsoft Research) xuta@microsoft.com Tao Qin (Microsoft Research) taoqin@microsoft.com Sheng Zhao (Microsoft STC Asia) Sheng.Zhao@microsoft.com Zhou Zhao (Zhejiang University) zhaozhou@zju.edu.cn Tie-Yan Liu (Microsoft Research) tyliu@microsoft.com * Equal contribution.
Abstract Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech.</description>
</item>
<item>
<title>Almost Unsupervised Text to Speech and Automatic Speech Recognition</title>
<link>https://speechresearch.github.io/unsuper/</link>
<pubDate>Wed, 10 Apr 2019 15:30:00 +0900</pubDate>
<guid>https://speechresearch.github.io/unsuper/</guid>
<description>Paper: Almost Unsupervised Text to Speech and Automatic Speech Recognition
Authors Yi Ren* (Zhejiang University) rayeren613@gmail.com Xu Tan* (Microsoft Research) xuta@microsoft.com Tao Qin (Microsoft Research) taoqin@microsoft.com Sheng Zhao (Microsoft) Sheng.Zhao@microsoft.com Zhou Zhao (Zhejiang University) zhaozhou@zju.edu.cn Tie-Yan Liu (Microsoft Research) tyliu@microsoft.com * Equal contribution.
Abstract Text to speech (TTS) and automatic speech recognition (ASR) are two dual tasks in speech processing, and both achieve impressive performance thanks to recent advances in deep learning and large amounts of aligned speech and text data.</description>
</item>
</channel>
</rss>