Speechless Home Automation Using Custom Sound Recognition on ESP32S3 Box 3

Published January 19, 2026 0

u uploader
Author

Speechless Home Automation with ESP32-S3 Box 3

By Arun Kumar

Human–machine interaction today is heavily centered around speech.

Voice assistants, smart speakers, and voice-controlled home automation systems assume that the user can speak clearly and consistently.

However, this assumption excludes an important group of users:

People with speech impairments
Individuals who are temporarily unable to speak
Users with neurological or motor speech disorders
Situations where speaking is difficult, unsafe, or undesirable

For these users, conventional voice assistants are either unreliable or completely unusable.

This project explores an alternative form of interaction that does not require speech at all.

Instead of spoken words, the system responds to custom, non-verbal sounds such as taps, clicks, knocks, or simple vocal tones, sounds that are often still possible even when speech is not.

This project explores an alternative approach:

Can a device be controlled using sound intent instead of spoken language?

Proposed Solution:

We propose a voiceless home automation system that operates using:

Custom non-verbal sounds (taps, clicks, tones)
Lightweight on-device signal processing
No internet, no cloud, no speech recognition

The system listens for:

A wake sound to activate
A command sound to trigger an action

Each sound is trained by the user, making the system:

Language-independent
User-adaptive
Fully offline

Key Innovations

1. Voiceless Interaction Model

Instead of recognizing words, the system recognizes acoustic patterns:

Energy envelope
Spectral distribution
Temporal consistency

2. This makes the system:

Faster than speech recognition
Less computationally expensive
More robust for embedded hardware

3. On-Device Training

Users can record:

Wake sound
Multiple task sounds

These references are stored in non-volatile memory (NVS) and persist across reboots.

Components Required

ESP32S3 Box 3

Circuit Diagram

Hardware Assembly

The ESP32S3 box 3 is mounted on the GPIO dock and the components for testing are connected using the GPIO pins.

Controller:

ESP32S3 Box 3
Integrated:
- Microphone
- Audio output
- Touchscreen display

Controlled Loads

AC Bulb (220V) via relay module
DC Motor / Fan via MOSFET trigger switch

This allows demonstration of:

High-voltage AC control
Low-voltage DC control
Safe isolation between logic and load

Safety Considerations

AC loads are isolated using a relay module
No direct connection between mains voltage and ESP32
DC motor uses MOSFET switching, not direct GPIO drive
Audio and control paths are mutually exclusive via mutex locking

Safety was considered at both electrical and software levels.

Fully Assembled Speechless Home Automation

Code Explanation

The project does not rely on speech recognition or cloud-based voice assistants. Instead, it uses simple, reliable audio signal characteristics to identify intentional sounds made by the user. To separate and identify sounds, the system uses three methods together.

A sound is accepted only if it passes all three checks, which greatly reduces false triggers from background noise.

Overview of the Three Methods

Loudness Envelope (RMS over time)
Coarse Spectral Distribution
Temporal Shape Matching (Correlation)

Each recorded sound (wake sound or command sound) is stored as a reference pattern and later compared against live audio.

1. Loudness Envelope (RMS)

The first method measures how loud the sound is over time. The microphone audio is divided into short frames. For each frame, the system calculates the Root Mean Square (RMS) value, which represents loudness. This produces a loudness envelope, showing how the sound grows and fades.

Why this matters: Different intentional sounds have different loudness shapes:

A tap or click rises and falls very quickly
A hum or sustained sound rises slowly and lasts longer

/* RMS = loudness of one frame */
static float compute_rms(int16_t *buf, int samples)
{
   double sum = 0;
   for (int i = 0; i < samples; i++)
   {
       float v = buf[i] / 32768.0f;
       sum += v * v;
   }
   return sqrt(sum / samples);
}

2. Coarse Spectral Distribution

The second method looks at where the energy of the sound is concentrated.

Instead of performing a full FFT (which is expensive and unnecessary), the signal is split into four coarse frequency bands:

Very low, Low–mid, Mid–high, High. Each band accumulates energy from the audio samples.

Why this matters: Different sounds excite different frequency ranges:

Taps and clicks → more high-frequency energy
Hums and breath sounds → more low-frequency energy

This allows the system to distinguish between: A finger tap, a vocal hum or random background noise.

/* spectral centroid = where energy sits (low vs high freq) */
static float compute_centroid(float bands[SPEC_BANDS])
{
   float num = 0, den = 0;
   for (int i = 0; i < SPEC_BANDS; i++)
   {
       num += i * bands[i];
       den += bands[i];
   }
   return (den > 0) ? (num / den) : 0;
}

3. Temporal Shape Matching (Correlation)

The third method compares how the sound evolves over time, not just how loud it is. Even if two sounds have similar loudness, their time structure may differ.

To capture this, the system performs

Correlation between the recorded RMS envelope
Correlation between the live RMS envelope

Why this matters: This prevents false triggers such as

Someone speaking nearby
Random knocks or table vibrations
Environmental noise bursts

Only sounds with a similar temporal pattern to the recorded reference are accepted.

/* envelope shape similarity (NOT loudness) */
static float envelope_correlation(float *a, float *b)
{
   float num = 0, da = 0, db = 0;
   for (int i = 0; i < EVENT_FRAMES; i++)
   {
       num += a[i] * b[i];
       da += a[i] * a[i];
       db += b[i] * b[i];
   }
   return num / (sqrtf(da * db) + 1e-6f);
}

User Interface

Main Screen

Large touch-friendly buttons
Visual indication of task states
Status bar showing
- Listening
- Wake detected
- Command executed

Settings Screen

Record wake sound
Record task sounds
Minimal interaction required

The UI is designed for

Fast interaction
Clear feedback
Embedded constraints

This project demonstrates that meaningful human–machine interaction does not require speech recognition.

By combining

Embedded signal processing
State-driven software design
Safe hardware interfacing

we achieve a system that is

Efficient
Private
Robust
Innovative

This approach opens new possibilities for accessible, offline, and low-cost automation systems.

Speechless Home Automation Using ESP32S3 Box 3 GitHub Repository

Video

Start a Discussion on:

Discord

Forum

Add New Comment

Comment *

DigiKey featured products logo

	PolarFire® Core FPGAs and SoC FPGAs Experience low power, high security, and reliable performance with PolarFire® Core & SoC FPGAs.
	AIROC™ CYW55913/2/1 Connected MCU Redefining Connectivity: AIROC™ CYW55913 modules for IoT, Smart Home, Industrial
	DCM3717 High-Density 48 V DC/DC Converter Modules Innovate with 48V architectures and eliminate the risk of reengineering 12V systems
	0900AT47A0063001E Ceramic SMD Chip Antenna Compact 868/902–928 MHz SMD antenna for IoT, LoRaWAN, Zigbee®, sensors, and asset tracking
	DG11/DG12 Series IEC Inlet with Circuit Breaker SCHURTER's next-generation power entry module with IEC inlet and IP67-rated circuit breaker
	RNWF02 Plug-and-Play Wi-Fi® Modules Experience reliable, high-performance wireless connectivity with RNWF02 today.
	TMF882x Series Optical Distance Sensors TMF882x sensors are designed for high-performance proximity and distance measurement
	SMD Trimmer Potentiometers Nidec Components’ SMD J-hook or gull wing trimmers are 1 to 14 turn with side or top adjustment