Developing an Internal Wake Word Model: My Experience in AI Audio Engineering at Miko

  • Writer: Rituraj Bhattacharya
  • Jan 23, 2024
  • 5 min read



During my college placement phase, I was faced with a choice that differed from the usual options pursued by my peers: Electrical and Electronics Engineering roles or consulting positions. My childhood had been enriched by the captivating tales of science fiction luminaries like Asimov, Philip K. Dick, and Clarke, narratives that probed the depths of consciousness and offered profound reflections on the human condition. My passion for such explorations led me towards a career in Artificial Intelligence.


My philosophy has always been to immerse myself in a problem to gain a deep understanding of it. I believe that practical experience can illuminate concepts that often remain elusive within the confines of theoretical study. Therefore, I chose to delve into the field through hands-on AI audio engineering projects.


As I sent out numerous job applications, serendipity smiled upon me when one of them reached Miko. At that juncture, Miko was in the early stages of establishing an AI Audio Team, aiming to reduce, and eventually eliminate, its dependence on external companies for audio-related work, which incurred substantial costs. Given that my projects were centered around audio models, this presented a mutually beneficial opportunity for both the company and me, and I joined Miko as an AI Audio and Speech Engineer. Since then, I have realised that delivering AI products to production is more of an art than a mechanistic procedure: it entails a great deal of trial and error, with no fixed set of rules that can simply be followed to reach the desired outcome.


Embarking on a Venture: The Internal Wake Word Model at Miko

The wake word model project at Miko marked a significant turning point in my professional trajectory. This initiative was not just about technological advancement; it was a strategic move to shift from reliance on external resources to fostering in-house innovation. Miko had previously outsourced its wake word detection models to Sensory, a renowned provider known for their work with giants like Amazon. However, the financial and strategic implications of this outsourcing necessitated a change.


Our objective was clear yet ambitious: to develop an in-house solution that would reduce our dependence on Sensory's models. This endeavor was more than a project; it was a test of our capabilities to match, and potentially surpass, state-of-the-art solutions in the market. The stakes were high, and the path was uncharted.


As one of the early members and the first in the Audio team at Miko, I was intimately involved from the inception of this project. Although I did not lead it, my consistent contributions and rapid adaptation to new challenges were crucial. The task before us involved extensive research, experimentation with new models, and constant adaptation to the latest AI developments. This project was a manifestation of my belief in learning through direct engagement with problems and in applying theoretical knowledge to practical scenarios. Building an in-house wake word model was a complex puzzle, involving the collection and preparation of vast datasets, training models with varying architectures, and a relentless pursuit of accuracy. Our goal was to achieve a level of precision that would make our model viable for production, a challenge that demanded not only technical expertise but also creative problem-solving and perseverance.


Navigating the Uncharted: Crafting the Wake Word Model

The journey of developing Miko's internal wake word model was akin to navigating uncharted waters. It began with a thorough research phase, where we delved into the realm of open-source models, repositories, and academic papers related to keyword detection. Our choice eventually fell on Google Research's Keyword Transformer (KWT), a decision that set the course for our project.
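To make this concrete, here is a minimal, hypothetical sketch of a KWT-style classifier in PyTorch: the audio is turned into a log-mel spectrogram, each time frame becomes a token, and a small transformer encoder classifies a prepended class token as wake word or not. The front-end settings and layer sizes below are illustrative assumptions, not the configuration we ultimately shipped, and positional embeddings are omitted for brevity.

```python
# Hypothetical KWT-style keyword classifier (illustrative only, not Miko's model).
# Assumes 1-second, 16 kHz mono clips; all sizes are example values.
import torch
import torch.nn as nn
import torchaudio


class KeywordTransformer(nn.Module):
    def __init__(self, n_mels=40, d_model=64, n_heads=4, n_layers=2, n_classes=2):
        super().__init__()
        # Log-mel front end: 25 ms windows with a 10 ms hop at 16 kHz.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=400, hop_length=160, n_mels=n_mels)
        self.proj = nn.Linear(n_mels, d_model)               # one token per time frame
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))  # learnable class token
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)            # wake word vs. everything else

    def forward(self, waveform):                              # waveform: (batch, samples)
        feats = self.melspec(waveform).clamp(min=1e-6).log()  # (batch, n_mels, frames)
        tokens = self.proj(feats.transpose(1, 2))             # (batch, frames, d_model)
        cls = self.cls.expand(tokens.size(0), -1, -1)
        encoded = self.encoder(torch.cat([cls, tokens], dim=1))
        return self.head(encoded[:, 0])                       # classify from the class token


logits = KeywordTransformer()(torch.randn(8, 16000))          # eight random 1-second clips
```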


A pivotal challenge in the wake word model project at Miko was the scarcity of suitable data, a common yet significant obstacle in the field of AI. In the realm of wake word detection, the quality and diversity of data are paramount. Our initial data set was starkly limited, both in volume and variety, particularly concerning positive data (actual wake word instances) and child voice data, crucial for a product designed for children.


Addressing this data scarcity required innovative approaches. We embarked on an extensive data collection and augmentation journey. For negative data (non-wake word sounds), we leveraged a wide array of sources, including open-source databases. We aimed to build not just one, but two models: a less computationally intensive model for the robot itself, and a more powerful, accuracy-focused model for cloud processing. The balancing act between computational efficiency and accuracy was a constant theme throughout our development process.


As we progressed, we faced a significant issue with a high false acceptance rate: the model was incorrectly recognizing non-wake words as the wake word. To address this, we intensified our efforts in gathering diverse negative data, including household noises, music, pet sounds, and more. This enhancement was pivotal in training the model to distinguish the wake word amidst a cacophony of household sounds. Simultaneously, we shifted towards a more lightweight model architecture, incorporating a Multilayer Perceptron (MLP) network, guided by our goal of maintaining a balance between accuracy and computational lightness.
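For the on-device path, one way to keep the footprint small is a dense network over the flattened log-mel features. The sketch below is only an illustration of that trade-off, with assumed layer widths and feature sizes rather than the network Miko actually deployed:

```python
# Illustrative lightweight wake-word classifier for the on-device path
# (layer widths and feature sizes are assumptions, not the deployed network).
import torch
import torch.nn as nn
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=400,
                                           hop_length=160, n_mels=40)


class TinyWakeWordMLP(nn.Module):
    def __init__(self, n_mels=40, frames=101, hidden=128):   # 101 frames ~= 1 s at a 10 ms hop
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                                     # flatten the n_mels x frames spectrogram
            nn.Linear(n_mels * frames, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),                             # wake word vs. background
        )

    def forward(self, waveform):                              # (batch, 16000) one-second clips
        feats = mel(waveform).clamp(min=1e-6).log()
        return self.net(feats)


model = TinyWakeWordMLP()
print(sum(p.numel() for p in model.parameters()))             # rough check of the on-device footprint
```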


Another major hurdle was the false rejection rate, where the model failed to trigger on genuine utterances of the wake word. To mitigate this, we generated over a million samples of positive data using generative AI, focusing on enhancing the diversity and volume of our training dataset. The challenge was further amplified when it came to acquiring child voice data, given the ethical and legal considerations involved. We utilized the limited child voice samples from our beta testers and supplemented them with select open-source datasets. However, the real breakthrough came with the use of generative AI: by employing advanced generative models, we created a substantial corpus of synthetic child voice data. This approach not only solved the issue of data scarcity but also ensured a high degree of diversity and realism in our training dataset.
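Since both error types drove our iterations, it is worth pinning down what they measure. A generic way to compute them from labelled evaluation clips looks roughly like this (a simple sketch, not Miko's actual evaluation harness):

```python
# Generic sketch of the two error rates discussed above (not Miko's evaluation code).
# y_true: 1 = clip contains the wake word, 0 = it does not.
# y_pred: the model's binary accept/reject decision for each clip.

def false_acceptance_rate(y_true, y_pred):
    """Fraction of non-wake-word clips the model wrongly accepted."""
    negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    return sum(negatives) / max(len(negatives), 1)


def false_rejection_rate(y_true, y_pred):
    """Fraction of genuine wake-word clips the model wrongly rejected."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(1 - p for p in positives) / max(len(positives), 1)


print(false_acceptance_rate([0, 0, 1, 1], [1, 0, 1, 0]),  # 0.5: one of two negatives accepted
      false_rejection_rate([0, 0, 1, 1], [1, 0, 1, 0]))   # 0.5: one of two positives rejected
```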


The final push towards reducing the false rejection rate involved augmenting all our data to account for the unique environmental factors the robot would encounter, such as reverberation and motor noises. This extensive data preparation set the stage for multiple rounds of training, each aimed at fine-tuning the balance of data and improving our model's metrics.
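As an illustration of that final augmentation pass, convolving a clean clip with a room impulse response and mixing in recorded motor noise at a chosen signal-to-noise ratio can be done along these lines (the impulse response, noise, and SNR value here are placeholders for the recordings we actually used):

```python
# Hedged sketch of the reverberation + motor-noise augmentation step.
# The impulse response, noise signal, and 10 dB SNR are illustrative stand-ins.
import numpy as np
from scipy.signal import fftconvolve


def augment(clean, rir, noise, snr_db=10.0):
    """Apply reverberation, then add noise at the requested signal-to-noise ratio."""
    reverberant = fftconvolve(clean, rir)[:len(clean)]        # simulate room acoustics
    noise = np.resize(noise, len(reverberant))                # loop or trim noise to length
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + scale * noise                        # noisy, reverberant clip


rng = np.random.default_rng(0)
augmented = augment(clean=rng.standard_normal(16000),          # stand-in for a wake-word clip
                    rir=rng.standard_normal(2000) * np.exp(-np.arange(2000) / 300),
                    noise=rng.standard_normal(8000))           # stand-in for motor noise
```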


Reflecting on a Journey of Growth and Learning

The journey to surpassing 98% accuracy was not just about reaching a numerical target; it was about understanding the nuances of AI in real-world applications, the significance of each incremental improvement, and the relentless pursuit of excellence, however modest the strides may be.

This project underscored the importance of data diversity and quality in AI development, especially in a field as dynamic and intricate as audio engineering. It was a reminder that in AI, the journey is as crucial as the destination. Every problem encountered was an opportunity to learn, adapt, and grow. The experience gained from this project is invaluable, extending beyond technical skills to encompass lessons in resilience, innovation, and the importance of a problem-solving mindset.


Looking forward, I am excited to carry these insights into future projects. The wake word model project has equipped me with a deeper understanding of AI's possibilities and limitations. My aspiration is to continue contributing to the field of AI audio engineering, embracing new challenges, and pushing the boundaries of what is achievable.

 
 