SLIDE: 2021

It is 2021. Or, if you're on pandemic time, the 52nd week of March 2020.

SLIDE: 7.8 billion

There are nearly 8 billion people on the planet right now.

SLIDE: 7139

And they speak about 7,000 languages.

SLIDE: 200

But only around 200 of those languages have some form of speech recognition support - that is, only around 200 languages have technology that can transcribe them.

SLIDE: 55 of 135

And of those 200 or so languages, the majority are spoken in affluent countries. For example, of the 135 languages supported by Google speech-to-text, 55 are variants of just three languages - English, Spanish and Arabic.

SLIDE: Indigenous languages

Indigenous languages don't get much of a look-in, which is disappointing considering some Indigenous languages are spoken by millions of people - like Kinyarwanda, spoken by 12 million people in Rwanda. This is one of the reasons why the United Nations declared 2019 the Year of Indigenous Languages, and has declared 2022-2031 the Decade of Indigenous Languages.

SLIDE: What about DeepSpeech

So what does this have to do with DeepSpeech? Great question - I'm so glad you asked.

SLIDE: Two functions

DeepSpeech has two key functions. It can be used for inference, where speech recognition is performed with a trained model. And it can be used to train that model in the first place. The way DeepSpeech trains a model uses a newer type of algorithm.

SLIDE: seq2seq

DeepSpeech is a sequence-to-sequence speech recognition engine. This means it doesn't use phonemes. What are phonemes? Also a great question.

SLIDE: Phonemes

Phonemes are the building blocks of sound in a language. For example, English has 44 phonemes, and every word in English is made up of one or more of them.

SLIDE: acoustic model

Earlier types of speech recognition engines used two types of model. They would first use an acoustic model, which recognised phonemes from a spoken phrase - like this.

SLIDE: language model

And then they would use a language model to turn the phonemes into words, like this.

SLIDE: seq2seq

Sequence-to-sequence algorithms, like DeepSpeech, learn to go straight from a spoken phrase to a transcription. Like this! But why is this important for under-served languages?

SLIDE: alphabet.txt

During training, DeepSpeech is fed an alphabet.txt file that lists the characters the model is allowed to output.

SLIDE: examples of alphabet

Any Unicode character can go in the alphabet.txt file. That means we can train it on Kabyle, or Swahili, or Polish, or Welsh. Without a language model. It learns the words from transcriptions.

SLIDE: Kinyarwanda

And it means that if you can mobilise a language community, and record thousands of phrases and thousands of hours of speech, you can train a speech recognition model.

SLIDE: 201

And make a small difference in the world.
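
To make the inference side of DeepSpeech concrete, here is a minimal sketch using the `deepspeech` Python package. The model and audio file names are placeholders, and it assumes the recording is already 16 kHz, 16-bit, mono PCM, which is what the released DeepSpeech models expect.

```python
import wave

import numpy as np
from deepspeech import Model

# Placeholder paths: substitute your own model and recording.
model = Model("deepspeech-0.9.3-models.pbmm")
# Optionally, an external scorer can be enabled to improve accuracy:
# model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

with wave.open("recording.wav", "rb") as wav:
    # Assumes 16 kHz, 16-bit, mono PCM audio.
    frames = wav.readframes(wav.getnframes())

audio = np.frombuffer(frames, dtype=np.int16)
print(model.stt(audio))  # prints the transcription
```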
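The alphabet.txt format itself is simple: one character per line, with lines starting with `#` treated as comments. A minimal sketch of deriving one from a file of collected transcriptions follows; the file names `transcripts.txt` and `alphabet.txt` are hypothetical.

```python
# Collect every character that appears in the transcriptions and write an
# alphabet.txt: one character per line, '#' lines are comments.
chars = set()
with open("transcripts.txt", encoding="utf-8") as f:
    for line in f:
        chars.update(line.rstrip("\n").lower())

with open("alphabet.txt", "w", encoding="utf-8") as out:
    out.write("# One character per line; any Unicode character is allowed.\n")
    for ch in sorted(chars):
        out.write(ch + "\n")
```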
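And to sketch what a training run might look like once a community has recorded and transcribed enough speech: this wraps the DeepSpeech.py command line. The flag names follow my reading of the DeepSpeech 0.9 training docs, so verify them against the version you install; the CSV and directory paths are placeholders.

```python
import subprocess

# Placeholder paths. Each CSV lists wav_filename, wav_filesize and transcript
# columns for the recordings contributed by the language community.
subprocess.run(
    [
        "python3", "DeepSpeech.py",
        "--train_files", "data/train.csv",
        "--dev_files", "data/dev.csv",
        "--test_files", "data/test.csv",
        "--alphabet_config_path", "alphabet.txt",
        "--epochs", "30",
        "--export_dir", "exported_model/",
    ],
    check=True,
)
```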