ESP32 Offline Voice Recognition Using Edge Impulse

Published November 5, 2025

Voice recognition is becoming an essential part of the Internet of Things (IoT), allowing us to control, monitor, and interact with devices hands-free. At its core, ESP32 speech to text works by capturing audio through a microphone, processing it, and converting it into text that devices can understand. This text is then used to trigger commands, just like how Alexa or Google Assistant responds to your voice. Interestingly, many voice recognition systems use a mix of edge and cloud computing, with some processing done directly on the device and the rest handled in the cloud; a fully offline setup like the one in this article keeps all of that processing on the device itself. If you're interested in building projects that can locate where a sound is coming from, check out this detailed guide on detecting the direction of sound using Arduino. For automation enthusiasts, the Smart Desk Bot project demonstrates how to create a voice-controlled desk assistant using IoT and AI features.

That brings us to today’s project, where we’ll build a mini yet expandable ESP32 voice assistant using Edge Impulse, with the ESP32 as the main microcontroller. The goal of ESP32 speech recognition offline is to keep it simple while still leaving room for future upgrades. So, without any further delay, let’s dive right in!

ESP32 Voice Recognition Project Reference

Component | Specification | Purpose
Microcontroller | ESP32 (240 MHz) | Main processing unit for offline voice recognition
Microphone Module | INMP441 MEMS | High-precision audio capture
ML Platform | Edge Impulse | Training the ESP32 speech recognition model
Sample Rate | 16 kHz | Audio sampling frequency
Recognition Type | Offline/On-device | No internet required for operation

ESP32 Voice Recognition Project Overview: Building Your ESP32 Voice Assistant

The main objective of this project is to achieve ESP32 speech recognition offline. While most projects on the internet rely on third-party APIs like Google TTS, here we’re taking a different approach using Edge Impulse. Instead of cloud-based solutions, we're implementing an ESP32 voice recognition module using Edge Impulse for on-device machine learning.  In fact, you could even use TensorFlow Lite directly if you’d like, though it’s a bit more complex; we’ll cover that in another article. For now, let’s focus on the Edge Impulse method. This approach ensures your ESP32 voice assistant works independently without internet connectivity, making it ideal for privacy-conscious applications and remote IoT deployments.

ESP32 voice recognition offline system with Edge Impulse showing complete hardware setup with INMP441 microphone module and ESP32 development board

Key Features of This ESP32 Speech-to-Text System:

✓ Fully offline operation - No cloud dependency for ESP32 offline voice recognition
✓ Low latency response - On-device processing ensures quick recognition
✓ Customizable wake word - Train your ESP32 voice assistant with custom commands
✓ Expandable architecture - Easy to add more voice commands
✓ Privacy-focused - Audio never leaves your device
✓ Cost-effective solution - Uses affordable ESP32 hardware

You can also learn how to make your microcontroller speak with this Arduino-based Text-to-Speech Converter project, which converts written text into audible speech.

ESP32 Voice Recognition Workflow - Step by Step

In this setup, most of the heavy lifting is done by Edge Impulse, which handles the machine learning side of things. Apart from that, our tasks mainly involve collecting the dataset and assembling the circuit.

Building an ESP32 speech recognition offline system involves six streamlined steps:

  • Collect the dataset
  • Train the model in Edge Impulse
  • Export it as an Arduino library from Edge Impulse
  • Assemble the circuit
  • Do a bit of code modification and upload it to the ESP32
  • Test everything in real time

Now, let’s move into the actual project, starting with collecting the dataset.

How to Collect a Dataset for ESP32 Speech Recognition Offline Training

Successful ESP32 offline voice recognition starts with quality training data, which directly determines your ESP32 voice recognition module's accuracy. There are two primary approaches: using pre-existing open-source datasets or creating custom recordings tailored to your specific ESP32 voice assistant application. The easiest route is an open-source dataset; for common keywords it is usually easy to find what you're looking for, and there are plenty of options available.

Best Open-Source Datasets for ESP32 Speech to Text Projects

For rapid prototyping of your ESP32 speech recognition offline system, open-source keyword datasets provide an excellent starting point. There are many options out there; with that said, here is the dataset I used in this project:

Dataset used in this project: Google's Speech Command Dataset V2

Another method is to collect data yourself by manually recording short clips (1–3 seconds) of the required audio in multiple voices, tones, and speeds. The more data you collect, the better your training results will be. For specialised ESP32 voice assistant applications, custom recording ensures optimal performance. Explore the DigiKey Voice-Activated Office Assistant to see how to integrate speech recognition with office automation.

Training Your ESP32 Voice Recognition Module in Edge Impulse

Once you have collected your dataset for training, we can proceed to the Edge Impulse side. Edge Impulse provides a powerful platform for training ESP32 speech recognition offline models.  

Below you can see the step-by-step instructions, from creating a new project to exporting the trained model as an Arduino library. This step-by-step guide walks you through creating, training, and deploying your custom ESP32 voice assistant model.

⇒ Step 1: Create Your Edge Impulse Account for ESP32 Speech to Text
Searching for Edge Impulse in Google [1] gives you the site link at the top [2]. Clicking on that link takes you to the home page of Edge Impulse. This is where you'll develop your ESP32 voice recognition module intelligence.

Edge Impulse login page for ESP32 speech recognition offline model training platform

Here, click the Login [3] button in the top-right corner. For convenience, I chose the Continue with Google [4] option. After completing the necessary steps, you should reach the main page of Edge Impulse.

⇒ Step 2: Initialise Your ESP32 Voice Recognition Project
On the home page of Edge Impulse, click the Create New Project button [5] in the top-right corner to begin configuring your ESP32 speech recognition offline system.

Creating new Edge Impulse project for ESP32 voice assistant with project naming interface

Enter a name for your project [6]. This name will follow through until the final Arduino library export. The project type will default to Personal, and the setting will default to Private. Finally, click Create New Project [7].

After successful project creation, you’ll see the project’s home page.

⇒ Step 3: Configure ESP32 Target Device for Voice Recognition

In the project page, the first thing to do is configure the target device and application budget. Proper device configuration ensures optimal performance for your ESP32 speech to text model.

ESP32 device configuration in Edge Impulse showing ESP-EYE target selection for offline voice recognition

In the top-right corner, you’ll see the Target: Cortex-M4F 80 MHz button [8]. Clicking on it opens the configuration menu. The first step is selecting the target device [9]. From the dropdown menu, choose Espressif ESP-EYE (ESP32 240 MHz) [10] to optimise for ESP32 offline voice recognition. Finally, click the Save button [11] at the bottom-right to confirm ESP32 voice recognition module configuration.

⇒ Step 4: Upload Training Data for ESP32 Speech Recognition Offline
Now, let’s move to the next step in the workflow: data acquisition [12] for the ESP32 voice assistant.

Data Acquisition And Uploading Sensor Data In Edge Impulse

Click the Add Data [13] button. This opens three options: Upload Data, Import from Project, and Add Storage Bucket. This interface allows bulk upload of audio files for your ESP32 speech to text training.

Here, click Upload Data [14].

Labeling audio samples for ESP32 voice recognition module training with command categories

A new menu will appear with a few options. Since I had saved the dataset in individual folders, I used the Select Your Folder [15] option and chose the folder using Select Files [16][17][18]. After selecting the folder, a message shows the number of files selected.

The most important step here is labelling the data. Since I uploaded the datasets individually, I labelled them right away to save effort later: in the label option, I selected noise [20] for the noise dataset (the other labels in this project are "marvin", "on", and "off"). Finally, click Upload [21], and the data will be added successfully [22].

Complete dataset uploaded in Edge Impulse showing multiple labeled categories for ESP32 voice assistant

Similarly, I added data for Marvin, off, and on, as shown in the image above. In this ESP32 offline voice recognition project, we've added datasets for: noise (background audio), Marvin (wake word), off, and on (control commands).

⇒ Step 5: Design Neural Network Architecture for ESP32 Voice Assistant

On the left panel, you’ll see Impulse Design. Under this, click Create Impulse [23]. Here, select the Processing Block [24] and Learning Block [25] for your ESP32 speech recognition offline model, then click Save Impulse [26]. Don’t worry too much about the options; the default ones work fine.

Creating impulse design for ESP32 speech to text neural network architecture in Edge Impulse

After saving, you’ll get a notification confirming that the impulse was saved successfully.

⇒ Step 6: Generate Features for ESP32 Speech Recognition Offline Model

After saving the impulse, you’ll be prompted with the Parameters menu. Here, the Autotune Parameters [27] option comes in handy to automatically optimise settings for your ESP32 voice recognition module. Click it, wait a while, and then click Save Parameters [28].

Auto-tuning parameters for ESP32 offline voice recognition model optimization

Now, click Generate Features [29]. In the image below, you can see how the dataset is represented with different colours. The more distinct and separated the colour clusters, the better the results will be.

Feature generation visualization showing data separation for ESP32 voice assistant training

In my case, the clusters were a little close together, so the separation wasn’t very strong. The colour-coded clusters represent different voice commands in your ESP32 offline voice recognition system. Well-separated clusters (distinct colours with minimal overlap) indicate strong training data quality and predict higher accuracy for your ESP32 voice recognition module.

⇒ Step 7: Train Your ESP32 Voice Recognition Module Neural Network
Next, you’ll be taken to the Neural Network Settings page. Here you can adjust parameters like Training Cycles [30], Learning Rate, etc., to optimise your ESP32 speech recognition offline model performance. Most of the default settings work fine.

Neural network training settings for ESP32 voice assistant model with configuration options

I kept the default settings and clicked Save & Train [31]. Now sit back; it will take some time, depending on the size of your dataset. Once complete, you’ll see the model’s training performance.

⇒ Step 8: Validate ESP32 Speech to Text Model Accuracy

In the Test Model window, simply click the Classify All [32] button. This runs the test to automatically evaluate your ESP32 voice recognition module against reserved test data. This validation step provides unbiased accuracy metrics for your ESP32 speech recognition offline system.

Testing ESP32 voice recognition module accuracy with classify all function in Edge Impulse

As you can see, my results were acceptable, so I moved on to deployment. For production ESP32 voice assistant applications, aim for 90%+ accuracy. Scores above 85% are acceptable for prototype and development purposes.

⇒ Step 9: Deploy ESP32 Speech Recognition Offline Model as Arduino Library

Go to the Deployment window. In the search bar [33], type Arduino Library [34]. Then click the Build [35] button to start compiling your ESP32 voice recognition module into an Arduino-compatible library.

Deploying ESP32 speech to text model as Arduino library from Edge Impulse platform

Once the build finishes, you’ll be prompted to save the library [36]. The library will be downloaded as a .zip file.

⇒ Step 10: Install Library and Configure ESP32 Voice Assistant Code

Open the Arduino IDE and go to Sketch → Include Library → Add .ZIP Library [37]. Select the downloaded library [38] and click Open [39] to install the ESP32 voice recognition module framework.

Installing Edge Impulse library in Arduino IDE for ESP32 offline voice recognition development

Now, go to File → Examples → [Your Project Name]_Inferencing → esp32 → esp32_microphone to open the example program [40]. This is the code you’ll upload to the ESP32 board. This pre-configured sketch provides a complete starting point for your ESP32 voice assistant, including audio capture, preprocessing, and inference code optimised for ESP32 speech to text.

Now, let’s move on to the hardware!

ESP32 Voice Assistant Hardware Setup: Wiring Your ESP32 Voice Recognition Module

The hardware setup for this ESP32 speech recognition offline project is straightforward, requiring minimal components. The circuit consists of two LEDs, one microphone, and the main ESP32 module. Below is the circuit diagram that needs to be implemented. This setup creates a fully functional ESP32 voice assistant capable of wake word detection and command execution.

Required Components for ESP32 Voice Recognition Module

Component | Quantity | Function
ESP32 Development Board | 1 | Main processor for ESP32 offline voice recognition
INMP441 MEMS Microphone | 1 | High-precision audio input for ESP32 speech to text
LED (Indicator) | 1 | Visual feedback for wake word detection
LED (Control) | 1 | Controllable output via voice commands
Resistors (220 Ω) | 2 | Current limiting for LEDs
Breadboard & Jumper Wires | As needed | Prototyping connections

 

Complete circuit diagram for ESP32 voice recognition offline showing INMP441 microphone connections and LED wiring to ESP32 GPIO pins

INMP441 Microphone Module Pinout for ESP32 Speech to Text

Two LEDs are used here, each serving a separate purpose. One is an indicator LED [GPIO 23] that shows wake-word detection and voice recognition status; the other is a control LED [GPIO 22], which can be turned on or off via voice commands. The INMP441 MEMS microphone uses the I²S (Inter-IC Sound) protocol for high-quality digital audio transmission to your ESP32 voice assistant, and proper pin connections are critical for reliable ESP32 offline voice recognition:

The main component is the microphone module: the INMP441 MEMS High-Precision Omnidirectional Microphone Module. It has six pins (the ESP32 connection used in this project is shown in brackets):

  1. L/R [3V3]
  2. WS [25]
  3. SCK [26]
  4. SD [33]
  5. VDD [3V3]
  6. GND [GND]

That’s all for the circuit.

After a successful connection, it will look like this:

I used a slightly different approach in the actual hardware to make it minimalistic and simple. The circuit diagram is for reference and was made to guide you.

Remember:

  • The default I²S pins in the Arduino library downloaded from Edge Impulse are:

  1. L/R [3V3]
  2. WS [32]
  3. SCK [26]
  4. SD [33]
  5. VDD [3V3]
  6. GND [GND]
  • Using these default pins allows the example program to work without any code modification. The circuit above shows pins for our modified code, but changing the pins is easy if you want to adapt it.

INMP441 to ESP32 Pin Connections for Voice Recognition Module

INMP441 Pin | ESP32 GPIO | Signal Description
L/R | 3V3 | Channel selection (3.3 V = right channel)
WS (Word Select) | GPIO 25 | Left/right clock for I²S audio
SCK (Serial Clock) | GPIO 26 | Bit clock for audio data
SD (Serial Data) | GPIO 33 | Audio data input to ESP32
VDD | 3V3 | Power supply (3.3 V)
GND | GND | Ground reference

ESP32 Speech to Text Code Implementation: Programming Your ESP32 Voice Recognition Module

Testing the Example Code for ESP32 Offline Voice Recognition

Before diving into advanced features, test the base functionality of your ESP32 voice assistant using the Edge Impulse example program.

File → Examples → [Name of your project]_Inferencing → esp32 → esp32_microphone [40]. This raw example allows you to test the model in real time with your hardware.

Arduino IDE showing ESP32 speech recognition example code for testing voice recognition module

In the code, you’ll find the following snippet. You can use it to adjust your circuit accordingly or modify the pins in the code to make it work. This raw example demonstrates real-time ESP32 speech to text inference. The Serial Monitor displays confidence levels for each trained word, showing how your ESP32 offline voice recognition model interprets audio input.

 i2s_pin_config_t pin_config = {
     .bck_io_num = 26,    // IIS_SCLK
     .ws_io_num = 32,     // IIS_LCLK
     .data_out_num = -1,  // IIS_DSIN
     .data_in_num = 33,   // IIS_DOUT
 };

You can see live inference in the Serial Monitor, which displays the confidence level for each trained word. This flexibility allows you to adapt the ESP32 voice recognition module to your specific pin layout.
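If you wire the microphone as shown in the connection table above (WS on GPIO 25 instead of the library default GPIO 32), the pin block should match the one used in this project's modified code at the end of the article:

 i2s_pin_config_t pin_config = {
     .bck_io_num = 26,    // SCK → GPIO 26 (bit clock)
     .ws_io_num = 25,     // WS  → GPIO 25 (word select, changed from the default 32)
     .data_out_num = -1,  // not used (input only)
     .data_in_num = 33,   // SD  → GPIO 33 (audio data in)
 };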


Using this confidence level, you can make modifications in the code to add more advanced features.

Modified Code for the ESP32 Voice Control Project

The example code was modified to include features like wake word detection, LED control, and more. Let’s go through the code in simple terms.

Overview 
This code implements an advanced voice-controlled LED system on the ESP32, using Edge Impulse machine learning for wake word detection ("marvin") and command recognition ("on"/"off"). It uses dual confidence thresholds and transforms the basic example into a production-ready ESP32 speech recognition offline system with wake word detection, command processing, and comprehensive visual feedback. Let's explore the key components that make this ESP32 voice assistant intelligent and responsive.

Included Libraries

  1. ESP32_STT_Recognition_inferencing.h – Edge Impulse’s deployment library. In my code, it’s called ESP32_STT_Recognition_inferencing. In your case, it will be [Your Project Name]_inferencing. You can check by opening the example program from the installed library.
  2. freertos/FreeRTOS.h, freertos/task.h – Real-time operating system components that enable multitasking for parallel audio capture and processing.
  3. driver/i2s.h – I2S audio driver for the ESP32, used to interface with the digital MEMS microphone for audio input.

Data Structures, Constants & Configuration Variables

  1. Audio Inference Buffer Structure for the ESP32 Voice Recognition Module

typedef struct {
   int16_t *buffer;        // Audio sample storage
   uint8_t buf_ready;      // Flag: buffer ready for processing
   uint32_t buf_count;     // Current write position
   uint32_t n_samples;     // Total buffer capacity
} inference_t;

This structure manages a circular buffer for real-time audio capture and ML processing. Together with the FreeRTOS multitasking described above, it lets your ESP32 voice assistant record audio, process speech, and control outputs in parallel without blocking.
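To see how this structure is used, here is the buffer-filling callback, condensed from the complete code at the end of this article: each batch of I²S samples is appended to the buffer, and once a full window has been collected, buf_ready signals loop() that inference can run.

 static void audio_inference_callback(uint32_t n_bytes)
 {
     // n_bytes >> 1 because each 16-bit sample occupies two bytes
     for (int i = 0; i < n_bytes >> 1; i++) {
         inference.buffer[inference.buf_count++] = sampleBuffer[i];
         if (inference.buf_count >= inference.n_samples) {
             inference.buf_count = 0;   // wrap around to the start of the buffer
             inference.buf_ready = 1;   // a full window is ready for inference
         }
     }
 }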

2. Dual Confidence Threshold System for Accurate ESP32 Speech Recognition Offline

const float COMMAND_CONFIDENCE_THRESHOLD = 0.80; 
const float RECOGNITION_CONFIDENCE_THRESHOLD = 0.50;

COMMAND_CONFIDENCE_THRESHOLD is used for executing commands, while RECOGNITION_CONFIDENCE_THRESHOLD is for indication only. Higher thresholds prevent false command execution.
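To make the difference concrete, here is a simplified view of how the two thresholds are applied inside handle_wake_word_and_commands() (the full version in the complete code below adds a few extra conditions):

 float max_confidence_percentage = max_confidence * 100.0;

 // 50% threshold: enough to blink the indicator LED as recognition feedback
 if (max_confidence >= RECOGNITION_CONFIDENCE_THRESHOLD && !detected_word.equals("noise")) {
     show_general_recognition_pattern();
 }

 // 80% threshold: required before any command is actually executed
 if (max_confidence_percentage < (COMMAND_CONFIDENCE_THRESHOLD * 100)) {
     return;   // feedback only, no action
 }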

3. Timing Configuration

const unsigned long LISTENING_DURATION_MS = 10000;

This is the time window after the wake word during which commands are accepted.

4. Hardware Pin Definitions for ESP32 Speech to Text

const int CONTROL_LED_PIN = 22;   
const int INDICATOR_LED_PIN = 23;  

CONTROL_LED_PIN is the main LED being controlled, and INDICATOR_LED_PIN provides status feedback.

5. Audio Processing

static const uint32_t sample_buffer_size = 2048;

This is the I2S DMA buffer size for audio capture.

6. State Management for ESP32 Offline Voice Recognition

static bool wake_word_detected = false; 
static unsigned long wake_word_timestamp = 0;
static bool listening_mode = false;      
static bool led_state = false;         
static inference_t inference;       
static signed short sampleBuffer[2048];         
static bool debug_nn = false;               
static bool record_status = true;                

These variables track the complete state of your ESP32 speech recognition offline system: wake-word detection status, command timing, listening mode, LED state, recording status, the audio buffer, and debug flags.

Setup Function

1. Serial Communication

Open the Arduino IDE Serial Monitor (115200 baud) to view real-time inference data from your ESP32 speech recognition offline system.

Serial.begin(115200);
while (!Serial); // optional

Sets up the serial interface for debugging and monitoring at 115200 baud.

2. Hardware Pin Configuration

pinMode(CONTROL_LED_PIN, OUTPUT);
digitalWrite(CONTROL_LED_PIN, LOW);
pinMode(INDICATOR_LED_PIN, OUTPUT);
digitalWrite(INDICATOR_LED_PIN, LOW);

Configures GPIO pins for LED control and initialises them to OFF.

3. Audio System Initialisation

if (microphone_inference_start(EI_CLASSIFIER_RAW_SAMPLE_COUNT) == false) {
   ei_printf("ERR: Could not allocate audio buffer\n");
   return;
}

Allocates audio buffer, initialises the I2S microphone, and starts the capture task.

Main Loop Function

The loop() function runs continuously and performs three key operations:

  1. Audio Capture

bool m = microphone_inference_record();
if (!m) {
   ei_printf("ERR: Failed to record audio...\n");
   return;
}

Waits for the audio buffer to fill with new samples.

2. ML Inference

signal_t signal;
signal.total_length = EI_CLASSIFIER_RAW_SAMPLE_COUNT;
signal.get_data = &microphone_audio_signal_get_data;
EI_IMPULSE_ERROR r = run_classifier(&signal, &result, debug_nn);

Runs the Edge Impulse classifier to recognise speech and provides confidence levels for each trained word.

3. Command Processing

handle_wake_word_and_commands(result);

Processes ML results and executes appropriate actions.

4. Debug Output

ei_printf("Predictions (DSP: %d ms., Classification: %d ms.): \n", ...);
for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
   ei_printf("    %s: %.2f\n", result.classification[ix].label, ...);
}

Displays inference timing and confidence scores for all classes.

Custom Functions

There are four main categories of custom functions:

1. LED Visual Feedback System

show_wake_word_pattern()
show_listening_mode()
show_turning_on_pattern()
show_turning_off_pattern()
show_general_recognition_pattern()
stop_listening_mode()

These control the indicator LED to show what the system is doing (a condensed sketch of one of these patterns follows the list below). For example:

  • Saying "Marvin" blinks the LED twice quickly.
  • LED stays on while waiting for a command.
  • Saying "on" flashes the LED for confirmation.
  • Saying "off" produces a fading effect.
  • Other words trigger a single quick blink.
  • stop_listening_mode() turns off the indicator when the listening window ends.
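As an illustration, here is the wake-word pattern function, condensed from the complete code at the end of the article; it simply pulses the indicator LED twice:

 void show_wake_word_pattern() {
     if (INDICATOR_LED_PIN == -1) return;   // indicator LED disabled
     for (int cycle = 0; cycle < 2; cycle++) {
         digitalWrite(INDICATOR_LED_PIN, HIGH);
         delay(150);
         digitalWrite(INDICATOR_LED_PIN, LOW);
         delay(100);
     }
 }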

2. Command Processing Function

handle_wake_word_and_commands(ei_impulse_result_t &result)

This is the brain of the system. It finds the word with the highest confidence score. Words above 50% blink the LED once as feedback. Commands execute only if confidence is above 80%. Detection of "marvin" starts listening mode, then "on"/"off" commands control the main LED. This two-step process prevents accidental activations.
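A simplified view of that two-step state machine, condensed from the complete code below (the confidence checks and LED feedback are omitted here for clarity):

 if (detected_word.equals("marvin")) {
     wake_word_detected = true;
     wake_word_timestamp = millis();   // opens the 10-second listening window
     listening_mode = true;
 }

 // On every pass, close the window if no command arrived in time
 if (wake_word_detected && (millis() - wake_word_timestamp > LISTENING_DURATION_MS)) {
     wake_word_detected = false;
     stop_listening_mode();
 }

 // "on"/"off" only act while the window is still open
 if (wake_word_detected && listening_mode && detected_word.equals("on")) {
     digitalWrite(CONTROL_LED_PIN, HIGH);
 }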

3. Audio Capture & Processing System
Functions:

audio_inference_callback(uint32_t n_bytes)
capture_samples(void* arg)
microphone_inference_start(uint32_t n_samples)
microphone_inference_record(void)
microphone_audio_signal_get_data(size_t offset, size_t length, float *out_ptr)
microphone_inference_end(void)

These handle continuous audio input. capture_samples runs in the background, reading audio from the I2S microphone and amplifying it. The data is copied to a buffer and, when the buffer is full, sent for processing. The remaining functions convert the audio into the float format needed by the ML model and manage memory and the background capture task.
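For reference, the core of capture_samples looks like this in the complete code: one DMA buffer is read from the I²S peripheral, each 16-bit sample is scaled by 8 for better sensitivity, and the result is handed to the callback above.

 // Read one buffer's worth of audio from the I2S peripheral
 i2s_read((i2s_port_t)1, (void*)sampleBuffer, i2s_bytes_to_read, &bytes_read, 100);

 // Amplify the signal (8x) for better sensitivity with the INMP441
 for (int x = 0; x < i2s_bytes_to_read / 2; x++) {
     sampleBuffer[x] = (int16_t)(sampleBuffer[x]) * 8;
 }

 audio_inference_callback(i2s_bytes_to_read);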

4. I2S Hardware Interface

Functions:

i2s_init(uint32_t sampling_rate)
i2s_deinit(void)

These configure the ESP32 I2S hardware. i2s_init sets sample rate (16kHz), DMA buffers, and GPIO pins (26 for bit clock, 25 for word select, 33 for data input). Audio is captured continuously without CPU overhead. i2s_deinit shuts down I2S cleanly, though it’s not used in this always-on setup.
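The key fields of the i2s_config_t structure that i2s_init() passes to the driver are shown below (the full structure appears in the complete code at the end of the article):

 i2s_config_t i2s_config = {
     .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX | I2S_MODE_TX),
     .sample_rate = sampling_rate,                  // 16 kHz, taken from the Edge Impulse model
     .bits_per_sample = (i2s_bits_per_sample_t)16,  // 16-bit samples
     .channel_format = I2S_CHANNEL_FMT_ONLY_RIGHT,  // mono; L/R pin tied to 3V3 selects the right channel
     .communication_format = I2S_COMM_FORMAT_I2S,
     .dma_buf_count = 8,                            // 8 DMA buffers of 512 samples each
     .dma_buf_len = 512,
 };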


With this explanation, you now have a clear idea of how the code works for this project. Next, let’s see it in action.

Testing Your ESP32 Speech Recognition Offline System

After successfully uploading the code, you can start observing your ESP32 voice assistant in action. The operation is simple: the system uses a wake word, “Marvin”. When it recognises the wake word, it waits 10 seconds for a command word. The command word is then recognised and used to execute certain functions, like turning a light on or off. In other words, the system operates on a simple two-stage command structure optimised for reliable ESP32 offline voice recognition.

This can be seen in the GIF above, where a person says “Marvin” followed by “on” to turn on the blue light, and “off” to turn it off.

With this, you should have a clear idea of the whole project. You can still add any features to expand this project according to your imagination - this is just my implementation. One challenge is finding a proper dataset; if it’s suitable, everything should work fine. Finally, if you're curious about giving your Arduino the ability to recognise spoken commands, take a look at this Arduino Speech Recognition tutorial.

GitHub Repository 

Visit the GitHub repository to download the source code, make modifications, and deploy the project effortlessly.

Complete Code of ESP32 Speech Recognition Project

Frequently Asked Questions: ESP32 Voice Recognition & ESP32 Speech to Text

⇥ 1. What is the difference between ESP32 speech recognition and Text-to-Speech using ESP32?
ESP32 speech-to-text (STT) converts spoken audio into text commands for device control, while text-to-speech (TTS) generates spoken audio output from text on the ESP32. This project focuses on STT for voice control; TTS requires additional audio output hardware and synthesis libraries, which can be integrated separately for voice response capabilities.

⇥ 2. Why should I use Edge Impulse to power my ESP32 voice assistant instead of Google TTS?
Edge Impulse supports true ESP32 offline voice recognition without worrying about the internet, API costs, or privacy concerns. Google TTS requires internet access and bills for each request, while with Edge Impulse you create self-contained models that run locally on the device, providing faster response times, unlimited usage, complete privacy, and the ability to work in the offline environments typical of Internet of Things (IoT) applications.

⇥ 3. What is the best microphone to use for offline voice detection with the ESP32?
The INMP441 MEMS omnidirectional microphone is the best choice for ESP32 voice recognition module applications. It offers an I²S digital output (no analog noise), a 61 dB SNR, an omnidirectional pickup pattern, and direct 3.3 V interface compatibility. You may also consider alternatives such as the SPH0645 (a similar I²S MEMS microphone) or the MAX9814 (an analog option with AGC), depending on your project specifications and budget constraints.

⇥ 4. How much power does an ESP32 voice assistant consume?
Active ESP32 speech recognition offline systems consume about 160-260 mA during continuous listening (ESP32: 80-160 mA, INMP441: 1.4 mA, LEDs: 20-40 mA each). Wake-word detection with deep sleep between activations reduces average power to as low as 10-50 mA, making battery operation practical. In ultra-low-power applications, external wake-word detection ICs can be used to trigger ESP32 voice recognition only when needed.
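As a rough worked example (the battery capacity here is purely hypothetical): with a 2000 mAh battery and an average draw of 200 mA during continuous listening, runtime is about 2000 mAh / 200 mA ≈ 10 hours; bringing the average down to around 20 mA with wake-word duty cycling stretches that to roughly 100 hours.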

⇥ 5. Is it feasible to create a multilingual offline voice recognition system using one ESP32? 
Certainly: Edge Impulse supports multilingual training for voice recognition modules used with the ESP32. While collecting the dataset, record audio samples of the spoken words in all required languages, and the neural network will learn acoustic patterns that are not specific to a single language. Alternatively, you can build separate models for each required language and switch between them. Model performance and inference time may vary, depending on how similar the languages are and how much memory the ESP32 has available to run a larger multilingual model.

⇥ 6. What is the recognition latency of ESP32 speech-to-text systems? 
Typical ESP32 voice assistant processing latency is approximately 200-500 ms on top of the 1000 ms audio capture window: DSP preprocessing (50-100 ms), neural network inference (100-300 ms), and command execution (<50 ms). Edge Impulse optimises inference for the ESP32's 240 MHz dual-core processor. Latency rises with model complexity but is usually unnoticeable for most voice control applications, especially compared with the one-to-two-second latency typical of cloud-based services.

⇥ 7. What are some ways to enhance the ESP32 voice recognition module's accuracy in the presence of background noise?
Improve ESP32 offline speech recognition in the presence of background noise by: (1) adding realistic background noise to the training data, (2) applying noise reduction in pre-processing, (3) using directional microphone placement, (4) increasing the confidence threshold for the wake word, (5) adding acoustic isolation around the INMP441, (6) retraining with voices recorded at different distances, and (7) requiring multi-step confirmation for critical commands to prevent false positives.

Conclusion: Mastering ESP32 Speech Recognition Offline

You've now completed a working, fully self-contained ESP32 voice recognition module that operates offline without any reliance on the cloud. This ESP32 speech to text system is a proof of concept for edge computing in voice interfaces, combining machine learning, embedded systems, and real-time audio processing into a practical, valuable IoT system. The framework you've built with this ESP32 voice assistant is endlessly expandable: from controlling a simple LED to handling larger sets of voice commands for home automation, industrial control systems, or even accessibility devices, ESP32 offline voice recognition opens up many possibilities.

This project was created by the Circuit Digest engineering team. Our experts focus on creating practical, hands-on tutorials that help makers and engineers master Raspberry Pi projects, Arduino projects, ESP32 Projects and IoT development projects.

I hope you liked this article and learned something new from it. If you have any doubts, you can ask in the comments below or use our forum for a detailed discussion.
 

Projects with Voice Control System 

Previously, we have used the Voice Control System to build many interesting projects. If you want to know more about those topics, links are given below.

Voice-Controlled Smart Home Assistant

Develop a voice-controlled smart home assistant with Maixduino Kit, voice recognition, motion detection, and MQTT integration for seamless automation and enhanced user interaction with smart devices.

Building a Voice-Controlled Home Automation System with Arduino

Voice-Controlled Home Automation Using Arduino is an exciting project that aims to automate home appliances with the power of voice commands. In this project, voice instructions will be recognised, and text-to-speech conversion will be performed using an Android app.

Voice-Controlled Lights Using Raspberry Pi

In this Raspberry Pi home automation project, we will send voice commands from a smartphone to the Raspberry Pi using a Bluetooth Module, and the Raspberry Pi will receive that transmitted signal wirelessly and will perform the respective task over the hardware. 

Alexa Voice Controlled LED using Raspberry Pi and ESP12

In this DIY tutorial, we are making an IoT project to control home appliances from anywhere in the world using AlexaPi and ESP12E.

Complete Project Code

/*
* ===============================================================================
* ESP32 ENHANCED VOICE CONTROL SYSTEM WITH EDGE IMPULSE
* ===============================================================================
* 
* Author: CircuitDigest/RithikKrisna[me_RK]
* Date: August 2025
* 
* DESCRIPTION:
* Advanced voice-controlled LED system using ESP32 microcontroller with Edge Impulse
* machine learning for wake word detection and command recognition. Features dual
* confidence thresholds for reliable command execution and enhanced user feedback.
* 
* FEATURES:
* --------
* • Wake Word Detection: "marvin" activates listening mode
* • Voice Commands: "on" and "off" to control LED
* • Dual Confidence System:
*   - 80% threshold for reliable command execution
*   - 50% threshold for recognition feedback
* • Visual Feedback System:
*   - Control LED (Pin 22): Main device being controlled
*   - Indicator LED (Pin 23): Status and pattern feedback
* • Listening Window: 10-second timeout after wake word
* • Real-time Audio Processing: Continuous inference with I2S microphone
* 
* LED PATTERNS:
* ------------
* • Wake Word ("marvin"): 2 quick pulses + steady listening indicator
* • Turn On Command: Triple flash + confirmation
* • Turn Off Command: Fade pattern + confirmation  
* • General Recognition: Single quick blink for detected words
* • Listening Mode: Steady indicator light
* 
* HARDWARE REQUIREMENTS:
* ---------------------
* • ESP32 Development Board
* • I2S Digital Microphone (INMP441 or similar)
* • 2x LEDs with appropriate resistors
* • Breadboard and connecting wires
* 
* PIN CONFIGURATION:
* -----------------
* • I2S Microphone:
*   - BCK (Bit Clock): Pin 26
*   - WS (Word Select): Pin 25  
*   - Data In: Pin 33
* • LEDs:
*   - Control LED: Pin 22
*   - Indicator LED: Pin 23 (optional, set to -1 to disable)
* 
* SOFTWARE DEPENDENCIES:
* ---------------------
* • Edge Impulse Arduino Library
* • ESP32 Arduino Core
* • FreeRTOS (included with ESP32 core)
* 
* USAGE:
* ------
* 1. Say "marvin" to activate listening mode (indicator LED turns on)
* 2. Within 10 seconds, say "on" or "off" to control the main LED
* 3. System provides visual feedback for all recognized words
* 
* ===============================================================================
*/
// ===============================================================================
// PREPROCESSOR DIRECTIVES AND MEMORY OPTIMIZATION
// ===============================================================================
// Memory optimization: Remove this macro to save 10K RAM if target is memory-limited
#define EIDSP_QUANTIZE_FILTERBANK   0
/*
** MEMORY ALLOCATION NOTE:
** If you encounter TFLite arena allocation issues due to dynamic memory fragmentation,
** try defining "-DEI_CLASSIFIER_ALLOCATION_STATIC" in boards.local.txt and copy to:
** <ARDUINO_CORE_INSTALL_PATH>/arduino/hardware/<mbed_core>/<core_version>/
** 
** Reference: https://support.arduino.cc/hc/en-us/articles/360012076960-Where-are-the…-
** If issues persist, insufficient memory for this model and application.
*/
// ===============================================================================
// SYSTEM INCLUDES
// ===============================================================================
#include <"YOUR_PROJECT_TITLE"_inferencing.h>  // Edge Impulse generated library
#include "freertos/FreeRTOS.h"                   // Real-time operating system
#include "freertos/task.h"                       // Task management
#include "driver/i2s.h"                          // I2S audio driver
// ===============================================================================
// AUDIO PROCESSING DATA STRUCTURES
// ===============================================================================
/** 
* Audio inference buffer structure
* Manages circular buffer for real-time audio processing
*/
typedef struct {
   int16_t *buffer;        // Audio sample buffer
   uint8_t buf_ready;      // Buffer ready flag
   uint32_t buf_count;     // Current buffer position
   uint32_t n_samples;     // Total number of samples
} inference_t;
// ===============================================================================
// SYSTEM CONFIGURATION CONSTANTS
// ===============================================================================
// AI Model Confidence Thresholds
const float COMMAND_CONFIDENCE_THRESHOLD = 0.80;      // 80% for reliable command execution
const float RECOGNITION_CONFIDENCE_THRESHOLD = 0.50;   // 50% for recognition feedback
// Timing Configuration
const unsigned long LISTENING_DURATION_MS = 10000;     // 10-second listening window after wake word
// Hardware Pin Configuration
const int CONTROL_LED_PIN = 22;                        // Main LED being controlled
const int INDICATOR_LED_PIN = 23;                      // Status indicator LED (-1 to disable)
// Audio Processing Configuration
static const uint32_t sample_buffer_size = 2048;       // I2S sample buffer size
// ===============================================================================
// GLOBAL STATE VARIABLES
// ===============================================================================
// Wake Word System State
static bool wake_word_detected = false;                // Wake word detection flag
static unsigned long wake_word_timestamp = 0;          // Timestamp of last wake word
static bool listening_mode = false;                    // Active listening state
static bool led_state = false;                         // Current LED state
// Audio Processing State
static inference_t inference;                          // Audio inference structure
static signed short sampleBuffer[sample_buffer_size]; // Raw audio sample buffer
static bool debug_nn = false;                         // Neural network debug output
static bool record_status = true;                     // Recording status flag
// ===============================================================================
// LED VISUAL FEEDBACK SYSTEM
// ===============================================================================
/**
* @brief Create wake word detection pattern
* Shows 2 quick pulses to indicate "marvin" was detected
*/
void show_wake_word_pattern() {
   if (INDICATOR_LED_PIN == -1) return;
   
   // Wake pattern - 2 quick pulses to indicate wake word detected
   for (int cycle = 0; cycle < 2; cycle++) {
       digitalWrite(INDICATOR_LED_PIN, HIGH);
       delay(150);
       digitalWrite(INDICATOR_LED_PIN, LOW);
       delay(100);
   }
}
/**
* @brief Activate listening mode indicator
* Shows steady light during command listening period
*/
void show_listening_mode() {
   if (INDICATOR_LED_PIN == -1) return;
   digitalWrite(INDICATOR_LED_PIN, HIGH);
}
/**
* @brief Show "turning on" confirmation pattern
* Triple flash followed by confirmation sequence
*/
void show_turning_on_pattern() {
   if (INDICATOR_LED_PIN == -1) return;
   
   // Quick triple flash then steady
   for (int i = 0; i < 3; i++) {
       digitalWrite(INDICATOR_LED_PIN, HIGH);
       delay(100);
       digitalWrite(INDICATOR_LED_PIN, LOW);
       delay(100);
   }
   // Brief pause then steady on for confirmation
   delay(200);
   digitalWrite(INDICATOR_LED_PIN, HIGH);
   delay(800);
   digitalWrite(INDICATOR_LED_PIN, LOW);
}
/**
* @brief Show "turning off" confirmation pattern
* Fade-like pattern to indicate device turning off
*/
void show_turning_off_pattern() {
   if (INDICATOR_LED_PIN == -1) return;
   
   // Start bright, then fade to off with pauses
   digitalWrite(INDICATOR_LED_PIN, HIGH);
   delay(300);
   digitalWrite(INDICATOR_LED_PIN, LOW);
   delay(150);
   digitalWrite(INDICATOR_LED_PIN, HIGH);
   delay(200);
   digitalWrite(INDICATOR_LED_PIN, LOW);
   delay(150);
   digitalWrite(INDICATOR_LED_PIN, HIGH);
   delay(100);
   digitalWrite(INDICATOR_LED_PIN, LOW);
   // Final confirmation that it's off
   delay(500);
}
/**
* @brief Show general word recognition pattern
* Single blink for words detected above 50% confidence (non-commands)
*/
void show_general_recognition_pattern() {
   if (INDICATOR_LED_PIN == -1) return;
   
   // Single quick blink for general recognition
   digitalWrite(INDICATOR_LED_PIN, HIGH);
   delay(150);
   digitalWrite(INDICATOR_LED_PIN, LOW);
}
/**
* @brief Deactivate listening mode
* Turns off indicator LED and resets listening state
*/
void stop_listening_mode() {
   if (INDICATOR_LED_PIN == -1) return;
   listening_mode = false;
   digitalWrite(INDICATOR_LED_PIN, LOW);
}
// ===============================================================================
// VOICE COMMAND PROCESSING ENGINE
// ===============================================================================
/**
* @brief Enhanced wake word detection and command processing system
* 
* Processes inference results with dual confidence thresholds:
* - High threshold (80%) for reliable command execution
* - Lower threshold (50%) for user feedback and recognition
* 
* @param result Edge Impulse inference result containing classification data
*/
void handle_wake_word_and_commands(ei_impulse_result_t &result) {
   // Find the classification label with highest confidence
   float max_confidence = 0.0;
   String detected_word = "";
   
   // Iterate through all possible classifications
   for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
       if (result.classification[ix].value > max_confidence) {
           max_confidence = result.classification[ix].value;
           detected_word = String(result.classification[ix].label);
       }
   }
   
   float max_confidence_percentage = max_confidence * 100.0;
   
   // ENHANCED FEATURE: Recognition feedback for any detected word above 50%
   // Provides user feedback even for words below command threshold
   if (max_confidence >= RECOGNITION_CONFIDENCE_THRESHOLD && !detected_word.equals("noise")) {
       // Only show general recognition if not in command listening mode
       if (!listening_mode && !detected_word.equals("marvin") && 
           !detected_word.equals("on") && !detected_word.equals("off")) {
           show_general_recognition_pattern();
       }
       ei_printf("Recognized: %s (%.1f%%)\n", detected_word.c_str(), max_confidence_percentage);
   }
   
   // Check if listening window has expired (10-second timeout)
   if (wake_word_detected) {
       unsigned long current_time = millis();
       if (current_time - wake_word_timestamp > LISTENING_DURATION_MS) {
           wake_word_detected = false;
           listening_mode = false;
           stop_listening_mode();
           ei_printf("Listening window expired.\n");
           return;
       }
   }
   
   // COMMAND PROCESSING: Only execute commands above 80% confidence threshold
   if (max_confidence_percentage < (COMMAND_CONFIDENCE_THRESHOLD * 100)) {
       return;
   }
   
   // Ignore background noise classifications
   if (detected_word.equals("noise")) {
       return;
   }
   
   ei_printf("Detected: %s (%.2f%%)\n", detected_word.c_str(), max_confidence_percentage);
   
   // WAKE WORD PROCESSING: Activate listening mode
   if (detected_word.equals("marvin")) {
       wake_word_detected = true;
       wake_word_timestamp = millis();
       listening_mode = true;
       
       ei_printf("Wake word detected! Listening for commands for %d seconds...\n", LISTENING_DURATION_MS / 1000);
       
       // Visual feedback: Show wake pattern then enter listening mode
       show_wake_word_pattern();
       delay(200); // Brief pause between patterns
       show_listening_mode();
       return;
   }
   
   // COMMAND PROCESSING: Execute LED control commands during active listening window
   if (wake_word_detected && listening_mode) {
       if (detected_word.equals("on")) {
           // Execute LED ON command sequence
           stop_listening_mode();           // End listening mode
           show_turning_on_pattern();       // Visual confirmation
           digitalWrite(CONTROL_LED_PIN, HIGH); // Turn on main LED
           led_state = true;
           ei_printf("LED turned ON\n");
           wake_word_detected = false;      // Reset wake word state
       }
       else if (detected_word.equals("off")) {
           // Execute LED OFF command sequence
           stop_listening_mode();           // End listening mode
           show_turning_off_pattern();      // Visual confirmation
           digitalWrite(CONTROL_LED_PIN, LOW); // Turn off main LED
           led_state = false;
           ei_printf("LED turned OFF\n");
           wake_word_detected = false;      // Reset wake word state
       }
   }
}
// ===============================================================================
// SYSTEM INITIALIZATION
// ===============================================================================
/**
* @brief Arduino setup function - System initialization
* Configures hardware, displays system information, and starts audio processing
*/
void setup()
{
   // -------------------------------------------------------------------------
   // Serial Communication Setup
   // -------------------------------------------------------------------------
   Serial.begin(115200);
   while (!Serial); // Wait for USB connection (comment out for standalone operation)
   Serial.println("Enhanced Voice Control Demo");
   
   // -------------------------------------------------------------------------
   // Hardware Pin Initialization
   // -------------------------------------------------------------------------
   
   // Configure control LED (main device being controlled)
   pinMode(CONTROL_LED_PIN, OUTPUT);
   digitalWrite(CONTROL_LED_PIN, LOW);
   led_state = false;
   
   // Configure indicator LED if enabled (status feedback)
   if (INDICATOR_LED_PIN != -1) {
       pinMode(INDICATOR_LED_PIN, OUTPUT);
       digitalWrite(INDICATOR_LED_PIN, LOW);
   }
   
   // -------------------------------------------------------------------------
   // System Configuration Display
   // -------------------------------------------------------------------------
   ei_printf("Voice control ready. Say 'marvin' then 'on' or 'off'\n");
   ei_printf("Config: Command threshold=%.0f%%, Recognition threshold=%.0f%%\n", 
             COMMAND_CONFIDENCE_THRESHOLD*100, RECOGNITION_CONFIDENCE_THRESHOLD*100);
   ei_printf("LEDs: Control=Pin%d, Indicator=Pin%d\n", CONTROL_LED_PIN, INDICATOR_LED_PIN);
   ei_printf("LED Patterns:\n");
   ei_printf("  - Wake word 'marvin': 2 quick pulses + listening mode (steady on)\n");
   ei_printf("  - Command 'on': Triple flash + confirmation\n");
   ei_printf("  - Command 'off': Fade pattern + confirmation\n");
   ei_printf("  - Other words: Single blink (when not listening)\n");
   // -------------------------------------------------------------------------
   // Edge Impulse Model Information Display
   // -------------------------------------------------------------------------
   ei_printf("Inferencing settings:\n");
   ei_printf("\tInterval: ");
   ei_printf_float((float)EI_CLASSIFIER_INTERVAL_MS);
   ei_printf(" ms.\n");
   ei_printf("\tFrame size: %d\n", EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE);
   ei_printf("\tSample length: %d ms.\n", EI_CLASSIFIER_RAW_SAMPLE_COUNT / 16);
   ei_printf("\tNo. of classes: %d\n", sizeof(ei_classifier_inferencing_categories) / sizeof(ei_classifier_inferencing_categories[0]));
   // -------------------------------------------------------------------------
   // Audio System Initialization
   // -------------------------------------------------------------------------
   ei_printf("\nStarting continuous inference in 2 seconds...\n");
   ei_sleep(2000);
   // Initialize microphone and audio buffer
   if (microphone_inference_start(EI_CLASSIFIER_RAW_SAMPLE_COUNT) == false) {
       ei_printf("ERR: Could not allocate audio buffer (size %d), this could be due to the window length of your model\r\n", EI_CLASSIFIER_RAW_SAMPLE_COUNT);
       return;
   }
   ei_printf("Recording...\n");
}
// ===============================================================================
// MAIN PROCESSING LOOP
// ===============================================================================
/**
* @brief Arduino main loop - Continuous voice processing
* Handles real-time audio capture, ML inference, and command processing
*/
void loop()
{
   // -------------------------------------------------------------------------
   // Audio Capture
   // -------------------------------------------------------------------------
   bool m = microphone_inference_record();
   if (!m) {
       ei_printf("ERR: Failed to record audio...\n");
       return;
   }
   // -------------------------------------------------------------------------
   // Machine Learning Inference
   // -------------------------------------------------------------------------
   
   // Prepare signal structure for Edge Impulse classifier
   signal_t signal;
   signal.total_length = EI_CLASSIFIER_RAW_SAMPLE_COUNT;
   signal.get_data = &microphone_audio_signal_get_data;
   ei_impulse_result_t result = { 0 };
   // Run ML classification on audio data
   EI_IMPULSE_ERROR r = run_classifier(&signal, &result, debug_nn);
   if (r != EI_IMPULSE_OK) {
       ei_printf("ERR: Failed to run classifier (%d)\n", r);
       return;
   }
   // -------------------------------------------------------------------------
   // Command Processing and LED Control
   // -------------------------------------------------------------------------
   handle_wake_word_and_commands(result);
   // -------------------------------------------------------------------------
   // Debug Information Output
   // -------------------------------------------------------------------------
   
   // Display inference timing and classification results
   ei_printf("Predictions ");
   ei_printf("(DSP: %d ms., Classification: %d ms., Anomaly: %d ms.)",
       result.timing.dsp, result.timing.classification, result.timing.anomaly);
   ei_printf(": \n");
   
   // Print all classification probabilities
   for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
       ei_printf("    %s: ", result.classification[ix].label);
       ei_printf_float(result.classification[ix].value);
       ei_printf("\n");
   }
   
   // Print anomaly score if available
#if EI_CLASSIFIER_HAS_ANOMALY == 1
   ei_printf("    anomaly score: ");
   ei_printf_float(result.anomaly);
   ei_printf("\n");
#endif
}
// ===============================================================================
// AUDIO PROCESSING SUBSYSTEM
// ===============================================================================
/**
* @brief Audio data callback function
* Called when new audio samples are available from I2S
* 
* @param n_bytes Number of bytes received
*/
static void audio_inference_callback(uint32_t n_bytes)
{
   // Copy samples from I2S buffer to inference buffer
   for(int i = 0; i < n_bytes>>1; i++) {
       inference.buffer[inference.buf_count++] = sampleBuffer[i];
       // Check if buffer is full
       if(inference.buf_count >= inference.n_samples) {
         inference.buf_count = 0;    // Reset buffer position
         inference.buf_ready = 1;    // Signal buffer ready for processing
       }
   }
}
/**
* @brief FreeRTOS task for continuous audio sample capture
* Runs on separate core for real-time audio processing
* 
* @param arg I2S bytes to read per iteration
*/
static void capture_samples(void* arg) {
   const int32_t i2s_bytes_to_read = (uint32_t)arg;
   size_t bytes_read = i2s_bytes_to_read;
   // Continuous audio capture loop
   while (record_status) {
       // Read audio data from I2S microphone
       i2s_read((i2s_port_t)1, (void*)sampleBuffer, i2s_bytes_to_read, &bytes_read, 100);
       // Error handling for I2S read operations
       if (bytes_read <= 0) {
           ei_printf("Error in I2S read : %d", bytes_read);
       }
       else {
           // Handle partial reads
           if (bytes_read < i2s_bytes_to_read) {
               ei_printf("Partial I2S read");
           }
           // Audio signal amplification (scale by 8x for better sensitivity)
           for (int x = 0; x < i2s_bytes_to_read/2; x++) {
               sampleBuffer[x] = (int16_t)(sampleBuffer[x]) * 8;
           }
           // Forward samples to inference callback if still recording
           if (record_status) {
               audio_inference_callback(i2s_bytes_to_read);
           }
           else {
               break;
           }
       }
   }
   // Clean up FreeRTOS task when done
   vTaskDelete(NULL);
}
// ===============================================================================
// MICROPHONE CONTROL FUNCTIONS
// ===============================================================================
/**
* @brief Initialize microphone inference system
* Allocates buffers and starts audio capture task
* 
* @param n_samples Number of audio samples in buffer
* @return true if successful, false if memory allocation failed
*/
static bool microphone_inference_start(uint32_t n_samples)
{
   // Allocate memory for audio sample buffer
   inference.buffer = (int16_t *)malloc(n_samples * sizeof(int16_t));
   if(inference.buffer == NULL) {
       return false;
   }
   // Initialize inference structure
   inference.buf_count  = 0;       // Reset buffer position
   inference.n_samples  = n_samples;  // Set buffer size
   inference.buf_ready  = 0;       // Buffer not ready initially
   // Initialize I2S audio interface
   if (i2s_init(EI_CLASSIFIER_FREQUENCY)) {
       ei_printf("Failed to start I2S!");
   }
   ei_sleep(100); // Allow I2S to stabilize
   record_status = true; // Enable recording
   // Create FreeRTOS task for audio capture (high priority task)
   xTaskCreate(capture_samples, "CaptureSamples", 1024 * 32, (void*)sample_buffer_size, 10, NULL);
   return true;
}
/**
* @brief Wait for microphone inference buffer to be ready
* Blocks until audio buffer contains new samples for processing
* 
* @return true when buffer is ready
*/
static bool microphone_inference_record(void)
{
   bool ret = true;
   // Wait for buffer to be filled by capture task
   while (inference.buf_ready == 0) {
       delay(10);
   }
   inference.buf_ready = 0; // Reset ready flag
   return ret;
}
/**
* @brief Convert audio samples to float format for ML processing
* Edge Impulse callback function for accessing audio data
* 
* @param offset Starting position in buffer
* @param length Number of samples to convert
* @param out_ptr Output buffer for float samples
* @return 0 on success
*/
static int microphone_audio_signal_get_data(size_t offset, size_t length, float *out_ptr)
{
   numpy::int16_to_float(&inference.buffer[offset], out_ptr, length);
   return 0;
}
/**
* @brief Clean up microphone inference resources
* Stops I2S and frees allocated memory
*/
static void microphone_inference_end(void)
{
   i2s_deinit();
   ei_free(inference.buffer);
}
// ===============================================================================
// I2S AUDIO INTERFACE CONFIGURATION
// ===============================================================================
/**
* @brief Initialize I2S audio interface for microphone input
* Configures ESP32 I2S peripheral for digital microphone communication
* 
* @param sampling_rate Audio sampling frequency (typically 8kHz or 16kHz)
* @return 0 on success, error code on failure
*/
static int i2s_init(uint32_t sampling_rate) {
   // I2S Configuration Structure
   i2s_config_t i2s_config = {
        .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX | I2S_MODE_TX), // Master mode with RX (TX flag retained from the Edge Impulse example)
       .sample_rate = sampling_rate,                    // Set by Edge Impulse model
       .bits_per_sample = (i2s_bits_per_sample_t)16,   // 16-bit audio samples
       .channel_format = I2S_CHANNEL_FMT_ONLY_RIGHT,   // Mono audio (right channel)
       .communication_format = I2S_COMM_FORMAT_I2S,    // Standard I2S protocol
       .intr_alloc_flags = 0,                          // Default interrupt allocation
       .dma_buf_count = 8,                             // Number of DMA buffers
       .dma_buf_len = 512,                             // DMA buffer length
        .use_apll = false,                              // Don't use the audio PLL for clock generation
       .tx_desc_auto_clear = false,                    // Don't auto-clear TX descriptors
       .fixed_mclk = -1,                               // Auto-calculate master clock
   };
   
   // I2S Pin Configuration for INMP441 Digital Microphone
   i2s_pin_config_t pin_config = {
       .bck_io_num = 26,    // Bit Clock (SCLK)
       .ws_io_num = 25,     // Word Select (LRCLK)
       .data_out_num = -1,  // Data Output (not used for input-only)
       .data_in_num = 33,   // Data Input (SD pin from microphone)
   };
   
   esp_err_t ret = 0;
   // Install I2S driver
   ret = i2s_driver_install((i2s_port_t)1, &i2s_config, 0, NULL);
   if (ret != ESP_OK) {
       ei_printf("Error in i2s_driver_install");
   }
   // Configure I2S pins
   ret = i2s_set_pin((i2s_port_t)1, &pin_config);
   if (ret != ESP_OK) {
       ei_printf("Error in i2s_set_pin");
   }
   // Clear DMA buffers
   ret = i2s_zero_dma_buffer((i2s_port_t)1);
   if (ret != ESP_OK) {
       ei_printf("Error in initializing dma buffer with 0");
   }
   return int(ret);
}
/**
* @brief Deinitialize I2S audio interface
* Stops and uninstalls I2S driver
* 
* @return 0 on success
*/
static int i2s_deinit(void) {
   i2s_driver_uninstall((i2s_port_t)1); // Stop and destroy I2S driver
   return 0;
}
// ===============================================================================
// COMPILE-TIME VALIDATION
// ===============================================================================
// Ensure correct sensor type is configured in Edge Impulse model
#if !defined(EI_CLASSIFIER_SENSOR) || EI_CLASSIFIER_SENSOR != EI_CLASSIFIER_SENSOR_MICROPHONE
#error "Invalid model for current sensor."
#endif
// ===============================================================================
// END OF FILE
// ===============================================================================