A presentation for PyConAU 2020: https://2020.pycon.org.au/program/99XMUR/
Sat September 05, 02:30 PM–02:55 PM
By 2025, there will be over 8 billion voice assistants in use. Speech recognition, chatbots, virtual assistants and smart speakers are all types of voice assistant. But as with many other technologies, issues of bias in the intent, design, execution and evolution of voice assistants are evident.
Many voice assistants today fail to accurately recognise speakers who have accents, or who speak lesser-known languages. Synthesised voices represent well-known languages only. There is a range of reasons for this: the under-representation of minorities in technology, commercial drivers and under-resourced languages. This talk will take the audience on a tour of these issues, highlighting the open source efforts in the field that provide opportunities to redress this state of affairs.
By 2025, there will be over 8 billion voice assistants in the world. They are found on your mobile phone, in your home, in your car, and over time, will be embedded in many cyber-physical systems across the world. At the same time, there are over 7000 languages spoken in the world - "living languages".
But voice assistants support just a fraction of these languages. Moreover, accents and diversity within a spoken language are not well handled by voice assistants. For example, African American voices are much less likely to be correctly recognised by the speech recognition algorithms used within voice assistants. And as we start to interact with systems using voice, we have a human desire to listen to voices we resonate with. Voices like us. For many people, there are no synthesised voices that reflect their heritage, language, and gender expression.
There are several techno-social reasons behind this state of affairs.
- The intent of a commercial voice assistant is to make money. This drives technical development in certain ways, such as certain languages being seen as more lucrative than others, irrespective of the number of speakers of that language. For example, there is more voice assistant support for Icelandic, a language spoken by 314,000 people, than there is for Kiswahili, a language spoken by over 100,000,000 people in Eastern Africa. Why? Money.
- The big tech companies behind voice assistants typically have poor gender and racial diversity in their talent pool. Diversity in developers leads to diversity in development.
- The data used for training speech recognition and speech synthesis models often has racial and gender biases. These can stem both from selection bias and from broader systemic issues of inequality, such as the use of voice assistant technology itself to gather data, and the affordability of both that technology and its pre-requisites, such as internet access.
- Many languages are considered "low resource languages". This means they often don't have written transcriptions, which are needed to train machine learning models. Those creating transcriptions often face the "transcription bottleneck", a workflow impediment that means the creation of resources consumes significant labour time.
There are many established and emerging open source tools (many in Python) and movements that are each addressing aspects of this broader techno-social system. Together, they can effect change so that everyone, everywhere can be afforded the benefits of voice technology.
she/her
Kathy Reid works at the intersection of open source, emerging technologies and the communities that bring them to life. She has twenty years' experience in development, developer and technical leadership and management roles across education and emerging technology.
She is currently with Mozilla's Voice team, and is doing a PhD with ANU's 3A Institute on how open voice technology goes to scale.