Artificial intelligence is rapidly reshaping how we work and interact with technology. From text-based chatbots to voice assistants, AI has become an integral part of daily life, enhancing productivity and convenience. Large models like GPT-4o have demonstrated remarkable capabilities in understanding and generating human-like text, but deploying them in real-time applications remains a challenge, especially on embedded systems. Embedded AI offers significant benefits: localised processing, reduced cloud dependency, and lower latency. These advantages are crucial for applications involving voice interaction, automation, or accessibility features. Yet challenges persist: speech recognition and text generation demand substantial computational power, making real-time execution difficult on microcontrollers like the ESP32. Recent work by Binh Pham, showcased on his YouTube channel Build With Binh, is a remarkable example of pushing the boundaries of embedded AI.
In his latest project, Pham implemented a real-time conversational AI system on an ESP32-powered device. He chose the SenseCAP Watcher from Seeed Studio to run it because of its hardware: 32 MB of flash, 8 MB of PSRAM, a built-in display, a built-in microphone, and a built-in speaker with an audio amplifier. The project integrates multiple AI technologies: Silero for voice activity detection, Whisper for speech-to-text conversion, GPT-4o for generating responses, and ElevenLabs for text-to-speech synthesis. This pipeline enables the device to hold natural conversations with users, mimicking the voice and personality of Wheatley, the well-known AI character from Portal 2. By leveraging LiveKit's real-time pipeline, Pham worked around hardware limitations, allowing smooth interaction despite the constraints of the ESP32 microcontroller.
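To make the data flow concrete, the four stages described above can be sketched as a simple chain of functions. This is only an illustrative outline, not the project's actual code: every function name below is a hypothetical stand-in for the real component (Silero VAD, Whisper, GPT-4o, ElevenLabs), and real implementations would operate on audio buffers over LiveKit rather than on the toy frame dictionaries used here.

```python
# Hypothetical sketch of the voice pipeline: VAD -> STT -> LLM -> TTS.
# All names are illustrative stand-ins, not the APIs used in the project.

def detect_speech(audio_frames):
    """Stand-in for Silero VAD: keep only frames flagged as speech."""
    return [f for f in audio_frames if f["is_speech"]]

def transcribe(speech_frames):
    """Stand-in for Whisper STT: recover a transcript from the frames."""
    return " ".join(f["text"] for f in speech_frames)

def generate_reply(transcript):
    """Stand-in for the GPT-4o step: produce the assistant's reply text."""
    return f"I heard: {transcript}"

def synthesize(reply):
    """Stand-in for ElevenLabs TTS: return audio bytes for the speaker."""
    return reply.encode("utf-8")

def run_pipeline(audio_frames):
    """Run one conversational turn through all four stages."""
    speech = detect_speech(audio_frames)
    if not speech:
        return b""  # VAD found no speech, so nothing to transcribe or say
    return synthesize(generate_reply(transcribe(speech)))

# Toy input: two speech frames separated by a silent frame.
frames = [
    {"is_speech": True, "text": "hello"},
    {"is_speech": False, "text": ""},
    {"is_speech": True, "text": "there"},
]
audio_out = run_pipeline(frames)  # b"I heard: hello there"
```

The point of the VAD stage is visible even in this toy version: the silent frame is dropped before the expensive transcription step runs, which is exactly why gating the pipeline on voice activity matters on constrained hardware.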
The project stands out not only for its technical achievement but also for its creative execution. Using the open-source SenseCAP Watcher hardware, Pham built an interactive visual display powered by the LVGL graphics library, bringing Wheatley's animated persona to life. The implementation required deep integration with WebRTC protocols and careful optimization of real-time audio streaming, demonstrating a blend of software ingenuity and embedded-systems expertise. For those who want to experiment with a similar build, Binh has made the entire project open source; the source code and a written tutorial are available in his GitHub repository.