In 2023, despite the numerous advancements in generative AI systems, the voice assistants on our mobile devices are still about as hard of hearing as they were in 2011. Siri, for example, still struggles to accurately recognize and transcribe speech. However, Meta AI, Meta’s artificial intelligence research division, which has long worked on automatic speech recognition (ASR), has developed a new dataset that promises to enhance the performance of these systems by clustering speech at the “utterance level.”
Meta AI has long been committed to improving its ASR systems. The group has trained models without the aid of transcripts, enabled them to recognize over 4,000 spoken languages, and developed ASR models that can read lips with greater proficiency than human experts. Despite these advancements, the datasets used to train ASR models have typically been organized by demographic categories such as age group, gender, nationality, and English accent. Sorting speakers into such coarse categories limits the variation a model sees within each group, hindering its ability to understand a broad cross-section of users and their unique pronunciations.
To overcome these limitations, Meta AI has developed a dataset that clusters speech at the utterance level. Rather than organizing the data by speakers’ demographic information, the dataset groups similar utterances from a diverse pool of speakers. After training models on these clusters, Meta AI can then use separate fairness datasets to measure how the approach affects outcomes across different demographic groups.
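Meta AI has not published the exact pipeline behind this clustering, but the general technique of grouping utterances by acoustic similarity can be sketched with off-the-shelf tools. In the toy example below, `embed_utterance` is a hypothetical stand-in for a real speech encoder, and k-means does the grouping; every name and number here is an assumption for illustration, not Meta AI’s implementation.

```python
# Illustrative sketch only -- not Meta AI's pipeline. Assumes each utterance
# is mapped to a fixed-size embedding, then grouped by acoustic similarity
# with k-means, ignoring speaker demographics entirely.
import numpy as np
from sklearn.cluster import KMeans

def embed_utterance(waveform: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder for a real speech encoder.

    Returns crude summary statistics so the sketch runs end to end;
    in practice this would be a pretrained embedding model.
    """
    return np.array([waveform.mean(), waveform.std(), np.abs(waveform).max()])

def cluster_utterances(waveforms: list, n_clusters: int) -> np.ndarray:
    """Assign each utterance a cluster ID based on its embedding alone."""
    embeddings = np.stack([embed_utterance(w) for w in waveforms])
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)

# Toy usage: ten fake one-second utterances grouped into three clusters.
rng = np.random.default_rng(0)
fake_audio = [rng.standard_normal(16000) for _ in range(10)]
print(cluster_utterances(fake_audio, n_clusters=3))  # e.g. [2 0 1 ...]
```

The key design point the article describes is visible even in this sketch: cluster membership is a function of the audio alone, so demographic labels can be held back purely for fairness evaluation rather than used to partition the training data.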
The resulting dataset consists of over 27,000 command utterances collected from 595 paid US volunteers. The utterances span seven themes: music, capture, utilities, notification control, messaging, calling, and dictation. The prompts given to the speakers involved tasks such as voice searching for a song or making plans with friends, and other researchers can use the dataset to train their own models and digital assistants.
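The article lists the themes but not the dataset’s actual file format, so the record layout below is purely hypothetical; it simply makes concrete what one row of such a corpus might carry.

```python
# Hypothetical record layout for one command utterance. Field names are
# assumptions for illustration; the article does not describe the schema.
from dataclasses import dataclass

THEMES = {"music", "capture", "utilities", "notification control",
          "messaging", "calling", "dictation"}

@dataclass
class CommandUtterance:
    speaker_id: str   # anonymized ID for one of the 595 volunteers
    theme: str        # one of the seven themes listed above
    prompt: str       # the task the speaker was asked to perform
    audio_path: str   # location of the recorded audio

    def __post_init__(self):
        if self.theme not in THEMES:
            raise ValueError(f"unknown theme: {self.theme!r}")

# Example record for a music-themed prompt.
utt = CommandUtterance(speaker_id="spk_0001", theme="music",
                       prompt="voice search for a song",
                       audio_path="audio/spk_0001/utt_042.wav")
print(utt.theme)  # music
```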
To evaluate the effectiveness of this approach, Meta AI first trained a model on publicly available English-language Facebook videos. The researchers then tested the model on two other datasets: Casual Conversations v1, released by Meta AI in 2021, and a de-identified dataset collected from an ASR data supplier, which contained 48,000 spoken utterances from 867 individuals.
The initial results of this evaluation were promising. The model improved across all demographic groups in the evaluation datasets, with the most significant gains in accent inclusivity. Overall, ASR performance improved by 10 percent using the utterance-clustering method. Improvements were especially notable for speakers aged 66 to 85, a group often underrepresented in the voice-command space.
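The article reports these per-group gains without describing the measurement code. A common way to quantify such gaps is to compute word error rate (WER) separately for each demographic group; the minimal pure-Python sketch below, with made-up data, shows that style of fairness evaluation. It is an assumption about methodology, not Meta AI’s actual evaluation harness.

```python
# Hedged sketch: per-group word error rate (WER) on toy data. WER is the
# word-level edit distance between reference and hypothesis transcripts,
# normalized by reference length; computing it per group exposes fairness gaps.
from collections import defaultdict

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))  # DP row: distance to each hyp prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, or substitution (free if words match)
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / max(len(ref), 1)

def wer_by_group(samples):
    """samples: iterable of (group, reference, hypothesis) triples."""
    errors, words = defaultdict(float), defaultdict(int)
    for group, ref, hyp in samples:
        n = len(ref.split())
        errors[group] += wer(ref, hyp) * n  # recover raw edit count
        words[group] += n
    return {g: errors[g] / words[g] for g in errors}

# Toy example: compare error rates across two hypothetical age groups.
samples = [
    ("18-30", "play some music", "play some music"),
    ("66-85", "call my daughter", "tall my daughter"),
]
print(wer_by_group(samples))  # e.g. {'18-30': 0.0, '66-85': 0.333...}
```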
The researchers at Meta AI emphasized that the proposed algorithm is part of a long-term commitment to responsible AI and to addressing fairness issues, and just one aspect of a holistic effort to make their AI systems inclusive and unbiased. Looking ahead, the team plans to adapt the system to other languages, further expanding the accessibility and inclusivity of its ASR models.
In conclusion, Meta AI’s new dataset and utterance-clustering method are a promising step toward better-performing automatic speech recognition tools. By focusing on the diversity of utterances rather than demographic categories, Meta AI aims to make its AI systems both more inclusive and more accurate. With further exploration and adaptation, these advances could extend voice recognition across many languages and domains, ultimately delivering a more comprehensive and inclusive user experience.