Speechless Home Automation with ESP32-S3 Box 3

Published January 19, 2026

By Arun Kumar

Human–machine interaction today is heavily centered around speech.

Voice assistants, smart speakers, and voice-controlled home automation systems assume that the user can speak clearly and consistently.

However, this assumption excludes an important group of users:

  • People with speech impairments
  • Individuals who are temporarily unable to speak
  • Users with neurological or motor speech disorders
  • Situations where speaking is difficult, unsafe, or undesirable

For these users, conventional voice assistants are either unreliable or completely unusable.

This project explores an alternative form of interaction that does not require speech at all. Instead of spoken words, the system responds to custom, non-verbal sounds such as taps, clicks, knocks, or simple vocal tones: sounds that are often still possible even when speech is not.

The central question: can a device be controlled using sound intent instead of spoken language?

Proposed Solution:

We propose a voiceless home automation system that operates using:

  • Custom non-verbal sounds (taps, clicks, tones)
  • Lightweight on-device signal processing
  • No internet, no cloud, no speech recognition

The system listens for:

  1. A wake sound to activate
  2. A command sound to trigger an action

Each sound is trained by the user, making the system:

  • Language-independent
  • User-adaptive
  • Fully offline
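The wake-then-command flow amounts to a tiny two-state machine. The sketch below is purely illustrative; the state and function names are assumptions, not taken from the project's firmware.

```c
#include <stdbool.h>

/* Illustrative two-stage listener: wait for the wake sound,
   then accept exactly one command sound before re-arming. */
typedef enum { STATE_IDLE, STATE_AWAKE } listen_state_t;

static listen_state_t state = STATE_IDLE;

/* Returns true when a recognized command should trigger an action. */
bool on_sound_event(bool matches_wake, bool matches_command)
{
    switch (state)
    {
    case STATE_IDLE:
        if (matches_wake)
            state = STATE_AWAKE;  /* wake sound heard: listen for a command */
        return false;
    case STATE_AWAKE:
        state = STATE_IDLE;       /* one command per wake, then re-arm */
        return matches_command;   /* act only on a recognized command */
    }
    return false;
}
```

A real implementation would also time out of the awake state if no command arrives within a few seconds.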

Key Innovations

1. Voiceless Interaction Model

Instead of recognizing words, the system recognizes acoustic patterns:

  • Energy envelope
  • Spectral distribution
  • Temporal consistency

This makes the system:

  • Faster than speech recognition
  • Less computationally expensive
  • More robust for embedded hardware

2. On-Device Training

Users can record:

  • Wake sound
  • Multiple task sounds

These references are stored in non-volatile memory (NVS) and persist across reboots.


Components Required

  • ESP32-S3 Box 3

ESP32-S3 Box-3 with DigiKey Box

Circuit Diagram

Circuit Diagram Speechless Home Automation using ESP32 S3 Box 3

Hardware Assembly

The ESP32-S3 Box 3 is mounted on the GPIO dock, and the components for testing are connected through its GPIO pins.

Hardware Assembly Speechless Home Automation

Controller:

  • ESP32-S3 Box 3
  • Integrated:
    • Microphone
    • Audio output
    • Touchscreen display

Controlled Loads

  • AC Bulb (220V) via relay module
  • DC Motor / Fan via MOSFET trigger switch

This allows demonstration of:

  • High-voltage AC control
  • Low-voltage DC control
  • Safe isolation between logic and load

Safety Considerations

  • AC loads are isolated using a relay module
  • No direct connection between mains voltage and ESP32
  • DC motor uses MOSFET switching, not direct GPIO drive
  • Audio and control paths are mutually exclusive via mutex locking

Safety was considered at both electrical and software levels.
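On the ESP32-S3 the mutual exclusion between the audio and control paths would typically use a FreeRTOS mutex; the same pattern can be sketched portably with POSIX threads. The function names below are illustrative, not the project's code.

```c
#include <pthread.h>
#include <stdbool.h>

/* Illustrative guard: audio capture and load switching never run
   concurrently. On the ESP32 this would be a FreeRTOS mutex. */
static pthread_mutex_t io_lock = PTHREAD_MUTEX_INITIALIZER;

void audio_path(void)
{
    pthread_mutex_lock(&io_lock);
    /* ... read microphone frames, run the matching checks ... */
    pthread_mutex_unlock(&io_lock);
}

bool try_control_path(void)
{
    /* refuse to switch a load while the audio path holds the lock */
    if (pthread_mutex_trylock(&io_lock) != 0)
        return false;
    /* ... drive the relay / MOSFET GPIO here ... */
    pthread_mutex_unlock(&io_lock);
    return true;
}
```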

Fully Assembled Speechless Home Automation

Code Explanation

The project does not rely on speech recognition or cloud-based voice assistants. Instead, it uses simple, reliable audio signal characteristics to identify intentional sounds made by the user. To separate and identify sounds, the system uses three methods together.

A sound is accepted only if it passes all three checks, which greatly reduces false triggers from background noise.

Overview of the Three Methods

  • Loudness Envelope (RMS over time)
  • Coarse Spectral Distribution
  • Temporal Shape Matching (Correlation)

Each recorded sound (wake sound or command sound) is stored as a reference pattern and later compared against live audio.
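The acceptance rule can be sketched as a single predicate over the three measures. The thresholds below are placeholders for illustration, not the project's tuned values.

```c
#include <stdbool.h>
#include <math.h>

/* Placeholder thresholds; real firmware would tune these per user. */
#define RMS_TOLERANCE      0.5f   /* allowed relative loudness difference */
#define CENTROID_TOLERANCE 0.8f   /* allowed spectral-centroid difference */
#define MIN_CORRELATION    0.7f   /* required envelope-shape similarity   */

bool sound_matches(float rms_ratio, float centroid_diff, float correlation)
{
    bool loudness_ok = fabsf(1.0f - rms_ratio) < RMS_TOLERANCE;
    bool spectrum_ok = fabsf(centroid_diff) < CENTROID_TOLERANCE;
    bool shape_ok    = correlation > MIN_CORRELATION;
    /* all three checks must pass, suppressing most background noise */
    return loudness_ok && spectrum_ok && shape_ok;
}
```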

1. Loudness Envelope (RMS)

The first method measures how loud the sound is over time. The microphone audio is divided into short frames. For each frame, the system calculates the Root Mean Square (RMS) value, which represents loudness. This produces a loudness envelope, showing how the sound grows and fades.

Loudness Envelope

Why this matters: Different intentional sounds have different loudness shapes:

  • A tap or click rises and falls very quickly
  • A hum or sustained sound rises slowly and lasts longer

/* RMS = loudness of one frame (requires <math.h>) */
static float compute_rms(const int16_t *buf, int samples)
{
    double sum = 0;
    for (int i = 0; i < samples; i++)
    {
        float v = buf[i] / 32768.0f;   /* normalize sample to -1.0 .. 1.0 */
        sum += v * v;
    }
    return sqrtf(sum / samples);
}

2. Coarse Spectral Distribution

The second method looks at where the energy of the sound is concentrated.

Instead of performing a full FFT (which is expensive and unnecessary here), the signal is split into four coarse frequency bands:

  • Very low
  • Low–mid
  • Mid–high
  • High

Each band accumulates energy from the incoming audio samples.

Coarse Spectral Distribution

Why this matters: Different sounds excite different frequency ranges:

  • Taps and clicks → more high-frequency energy
  • Hums and breath sounds → more low-frequency energy

This allows the system to distinguish between a finger tap, a vocal hum, and random background noise.

/* spectral centroid = where energy sits (low vs high freq) */
static float compute_centroid(float bands[SPEC_BANDS])
{
   float num = 0, den = 0;
   for (int i = 0; i < SPEC_BANDS; i++)
   {
       num += i * bands[i];
       den += bands[i];
   }
   return (den > 0) ? (num / den) : 0;
}

3. Temporal Shape Matching (Correlation)

The third method compares how the sound evolves over time, not just how loud it is. Even if two sounds have similar loudness, their time structure may differ.

To capture this, the system computes the correlation between the recorded RMS envelope and the live RMS envelope.

Temporal Shape Matching

Why this matters: This prevents false triggers such as

  • Someone speaking nearby
  • Random knocks or table vibrations
  • Environmental noise bursts

Only sounds with a similar temporal pattern to the recorded reference are accepted.

/* envelope shape similarity (NOT loudness) */
static float envelope_correlation(float *a, float *b)
{
   float num = 0, da = 0, db = 0;
   for (int i = 0; i < EVENT_FRAMES; i++)
   {
       num += a[i] * b[i];
       da += a[i] * a[i];
       db += b[i] * b[i];
   }
   return num / (sqrtf(da * db) + 1e-6f);
}

User Interface

Main Screen

  • Large touch-friendly buttons
  • Visual indication of task states
  • Status bar showing
    • Listening
    • Wake detected
    • Command executed

Settings Screen

  • Record wake sound
  • Record task sounds
  • Minimal interaction required

The UI is designed for

  • Fast interaction
  • Clear feedback
  • Embedded constraints

Conclusion

This project demonstrates that meaningful human–machine interaction does not require speech recognition.

By combining

  • Embedded signal processing
  • State-driven software design
  • Safe hardware interfacing

we achieve a system that is

  • Efficient
  • Private
  • Robust
  • Innovative

This approach opens new possibilities for accessible, offline, and low-cost automation systems.

Speechless Home Automation Using ESP32S3 Box 3 GitHub Repository

Video
