Developers across many industries now use automatic speech recognition (ASR) to boost business productivity, application efficiency, and even digital accessibility. In a Q&A session with Naviiina, Prof. V Ramasubramanian, who spearheads “Automatic Speech Recognition (ASR) and Speech Research” at IIIT Bangalore, shares in-depth details about ASR.
What is speech recognition technology?
Speech recognition is the conversion (by humans or a machine) of a spoken signal (i.e., simply ‘speech’, in the form of an acoustic waveform) into its constituent sequence of words or other meaningful linguistic units, such as phones, syllables, or characters, of the language being spoken. This is also referred to as ‘transcription’ of the input speech signal into the corresponding words in textual form, and hence also simply as ‘speech-to-text’ conversion.
What is Automatic Speech Recognition (ASR)? Since when has ASR gained momentum?
Automatic Speech Recognition (ASR) dates back to the 1950s, starting with early work at Bell Laboratories, USA. Over the years, ASR has evolved along all dimensions of its science and technology, such as speech signal processing, time-frequency spectrographic representations and their interpretation, feature extraction, acoustic modelling of the speech signal, pattern matching algorithms, and sequence decoding algorithms. This progression is marked by greatly enhanced performance of ASR systems, typically measured in terms of word-error-rate (WER). ASR has evolved in two major leaps: the first brought WERs down to about 40% on conversational speech using the erstwhile HMM techniques over the 1980s to the early 2000s; the second, driven by breakthroughs in machine learning and deep learning with deep neural architectures, has taken ASR performance down to sub-10% WERs on conversational speech, and even to human-parity performance (~4% WER) on select tasks, as measured on standard data-sets. These state-of-the-art techniques include deep learning architectures such as DNN-HMMs, encoder-decoder pipelines with CTC and attention, RNN-Transducers, and pipelines involving Transformers and Conformers. They have made ASR deployment-ready in the field, such as in voice-assistant and voice-search applications, as typified by commercial systems like Google Voice Search, Google Home, Amazon Alexa, Apple Siri, OpenAI Whisper, Samsung Bixby etc.
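The word-error-rate used above is computed as the minimum number of word substitutions, deletions and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of reference words. A minimal sketch in Python (the function name and example sentences are our own illustration, not from any toolkit; production systems use libraries such as jiwer or the scoring tools in Kaldi):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / #reference words,
    computed as a word-level Levenshtein (edit) distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,  # substitution or match
                           dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1)        # insertion
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# 1 substitution ("sit" for "sat") + 1 deletion ("the") over 6 reference words:
print(wer("the cat sat on the mat", "the cat sit on mat"))  # → 0.333...
```

A WER of 0.10 thus corresponds to the "sub-10%" regime mentioned above: roughly one word in ten is transcribed wrongly.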
Can you shed some light on your research work in ASR at IIITB?
Speech recognition by machines (i.e., a computer, presently in the form of a server on a cloud, or on-edge devices such as mobile phones or in-car on-board computers) is termed ‘automatic speech recognition’ (ASR). ASR requires a machine to apply a suite of algorithms to the input speech signal representing the acoustics of ‘what’ has been spoken by a human. These algorithms typically comprise a sequence of processing stages (a pipeline): a) digital signal processing techniques that pre-process the digitized speech signal, acquired via a microphone and digitized by an analog-to-digital converter (ADC), to remove or reduce any background (or channel) noise; b) specialized speech-signal processing techniques that extract short-time features representing the perceptual and/or production aspects of the signal; and c) finally, algorithms, currently in the form of deep-learning neural networks, that convert the sequence of features from the feature extraction stage into a sequence of linguistic units, eventually yielding a ‘transcription’ of the input speech signal as a sequence of words from a specific vocabulary in a specific language. ASR systems are typically speaker-independent, in the sense of being capable of working on speech from an arbitrary speaker. Some of the dimensions of ASR are a) the language, b) the vocabulary, c) the performance of the ASR system, typically in terms of word-error-rate (WER), and d) in the case of systems working in real time, a measure termed the real-time factor, representing the ‘latency’ in transcribing the input speech: how long the machine takes to process 1 sec of input speech.
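Stage b) above, short-time feature extraction, can be sketched as follows. This is an illustrative toy front-end only, assuming plain log-magnitude spectra rather than the log-mel filterbank or MFCC features real systems use, with function and parameter names of our own choosing; it also computes the real-time factor mentioned in d):

```python
import time
import numpy as np

def short_time_features(signal, sample_rate=16000,
                        frame_ms=25, hop_ms=10, n_fft=512):
    """Slice the waveform into overlapping windowed frames and return
    one log-magnitude spectral feature vector per frame."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)           # 160 samples at 16 kHz
    window = np.hanning(frame_len)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.abs(np.fft.rfft(frames, n=n_fft))   # magnitude spectrum
    return np.log(spectra + 1e-8)                    # log compression

# One second of synthetic "speech": a 440 Hz tone plus noise.
t = np.arange(16000) / 16000.0
wave = np.sin(2 * np.pi * 440 * t) + 0.05 * np.random.randn(16000)

start = time.perf_counter()
feats = short_time_features(wave)
# Real-time factor: processing time divided by audio duration (1 s here);
# RTF < 1 means the system runs faster than real time.
rtf = (time.perf_counter() - start) / 1.0

print(feats.shape)  # → (98, 257): 98 frames, n_fft // 2 + 1 frequency bins
```

The decoder stage c) would then map this (frames × features) matrix to a word sequence, which is where the deep-learning architectures discussed above come in.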
Yes, IIITB has an active research program in ASR, with several former and current research students (MS and PhD) focusing on various aspects of ASR. This includes: a) multi-lingual ASR for Indian languages, b) end-to-end (E2E) ASR, c) unsupervised representation learning for downstream ASR tasks in low-resource settings, d) few-shot learning (FSL) for E2E ASR, e) application of cross-domain FSL to various cross-domain settings in ASR (such as cross-corpus, cross-lingual, cross-accent, cross-dialect etc.), f) analysis-by-synthesis formulations for ASR, and g) associative memory formulations and their application to multi-modal learning and ASR.
The relevance of our research in the current setting comes from our overarching focus on realizing ASR systems, and the underlying techniques, specifically for low-resource settings, especially Indian languages. We define a low-resource condition as one where a spoken language has at most tens to hundreds of hours of data. This contrasts with high-resource settings, where it is now common practice to use thousands to tens of thousands of hours of data to build and deploy ASR systems for real-world applications such as personal assistants (Google voice assistant, Google Home, Amazon Alexa, Apple Siri, OpenAI Whisper, Samsung Bixby etc.).
Apart from infotainment, which are the other sectors in which speech recognition is extensively used?
ASR has a far-reaching impact in all applications (beyond ‘infotainment’) requiring natural human-computer interaction, particularly in settings where voice is the main (or only) modality of information exchange or communication. Examples include:
- Voice-search (searching the WWW by speaking into the search engine)
- In-car control (for climate control, multi-media control etc. via a dash-board interface in the car)
- Call-centre automation (where a user can engage with an automatic agent to perform a wide variety of information-retrieval and task-completion dialogues such as typically done with a human call-centre agent)
- Conversational AI with speech/text modalities, in the form of chat-bots that resolve queries in specific domains (e.g. banking) and Q&A systems such as the now-emerging Google LaMDA and OpenAI ChatGPT (which are currently enabled with textual input),
- Medical and legal transcription, where the spoken records created by a physician (or legal professionals) need to be transcribed quickly into regulatory-compliant text records,
- Navigating and controlling devices such as mobile phones, TVs (e.g. the Amazon Fire TV Stick) and IoT devices (home automation units),
- Meeting capture applications, such as an on-line multi-party meeting requiring annotation of the ongoing multi-speaker conversation for diarization purposes (who said what, and when),
- Office dictation tasks,
- YouTube audio / spoken content closed-captioning etc.
Will ASR help humans and machines collaborate seamlessly in the future?
In the years and decades to come, ASR performance is bound to improve significantly, solving as-yet-challenging problems such as dealing with high background noise, channel distortions, accented speech, dialectal variations, large speaker variability, low-resource conditions etc., all of which are far easier tasks for humans, whose listening and hearing set a high performance benchmark. Breakthroughs in all-neural architectures and machine learning techniques are bound to make such improvements very likely, to the extent of making ASR systems widely deployable, in a more ubiquitous manner than at present. Another direction likely to emerge is the realization of ‘robust’ conversational AI systems, coupling ASR with highly effective back-end NLP techniques to create full-fledged speech-enabled dialog systems capable of carrying out information-retrieval tasks currently in the realm of human agents, such as in call-center settings.
In combination with ‘robust’ back-end NLP techniques (such as ChatGPT or Google LaMDA), it is only a question of time before we have ‘conversational AI’ systems that could potentially ‘pass’ the Turing test of ‘intelligence’ in machines, a kind of holy grail for researchers in AI attempting to build human-like machines capable of mastering the uniquely human cognitive domain of speech and language.
A holy grail of much ASR work lies in incorporating such human cognitive mechanisms into computational frameworks with high ‘biological realism’. This could be several decades into the future. Such understanding and incorporation of human cognitive mechanisms will also inherently attempt to solve the still-difficult problems of coping with high speaker variability, accent and dialectal variability, multi-lingual scenarios and ultra-low-resource settings, in a way mimicking how a child acquires and masters a language in a very short time with very little data.