Giving Voice to Smart Products

Published August 17, 2023
Staff

‘It is not only about what you say. It is also about how you say it.’ This age-old adage aptly sums up the human need to communicate effectively. Our reliance on voice and sound to connect with one another points to a future in which communicating with machines the same way has become inevitable.

The growing adoption of voice communication has been accelerated by the expansion of the Internet of Things (IoT) and artificial intelligence (AI). Integrating AI at the endpoint, combined with advances in voice analytics, is changing how products are delivered and experienced, giving rise to a new ecosystem of companies that participate in and enable these products. Intelligent endpoint solutions make it possible to implement both online and offline systems, reducing reliance on always-on internet or cloud connections. This, in turn, creates new opportunities to solve challenges in real-time voice analytics across consumer and industrial applications. Advances in psycholinguistic data analytics and affective computing allow emotions, attitudes, and intent to be inferred through data-driven voice modeling. As voice becomes a natural way for humans to interact with machines, measuring intent through voice recognition and voice analytics will continue to improve.

Challenges of Using VUIs

Voice user interfaces (VUIs) allow the user to interact with endpoint systems through voice or speech commands. Despite mass deployments across a wide range of applications, VUIs have some limitations.

  • Poor sound quality: Inconsistent sound quality with persistent background noise can make voice recognition a challenge. Voice controllers in IoT can only operate flawlessly if the sound is crystal clear, which is a formidable task in a noisy environment. A voice-enabled assistant can only be truly effective if it is able to support different languages and accents, as well as isolate the human voice from the background noise.
  • Power consumption: Voice command systems consume power continuously because they require at least one always-on microphone, along with a processor running wake-word detection.
  • Real-time processing: Slow or congested networks can result in command latencies that can impact the user experience. This issue may be addressed by implementing distributed intelligence at the endpoint with the ability to process the voice command in real time without any reliance on the centralized cloud system.
  • Accuracy and noise immunity: Voice recognition accuracy and background noise immunity are always major concerns when designing any VUI system. Voice recognition presents a number of challenges, as there can be multiple sound sources, including interior and exterior noise and echoes from surfaces in the room. Isolating the source of a command, canceling echoes, and reducing background noise require sophisticated techniques such as multi-microphone arrays, beamforming, echo cancellation, and noise suppression.
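The beamforming idea mentioned in the last bullet can be sketched in a few lines: with two microphones, delaying one channel so the target's wavefront aligns before averaging reinforces on-axis speech while off-axis noise averages out. This is a minimal delay-and-sum illustration, not any vendor's implementation; real arrays estimate the lag adaptively.

```python
def delay_and_sum(mic_a, mic_b, lag):
    """Two-microphone delay-and-sum beamformer sketch.
    `lag` is the number of samples by which mic_b lags mic_a for the
    target source; advancing mic_b by `lag` aligns the wavefronts so
    coherent speech adds constructively when the channels are averaged."""
    out = []
    for n in range(len(mic_a)):
        j = n + lag
        b = mic_b[j] if 0 <= j < len(mic_b) else 0.0  # zero-pad at edges
        out.append(0.5 * (mic_a[n] + b))
    return out
```

When mic_b is an exact delayed copy of mic_a, the aligned average reproduces the original signal; uncorrelated noise on the two channels is attenuated instead.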

Renesas Electronics provides general-purpose MCUs that enable VUI integration without compromising performance or power consumption.

Requirements for Robust Voice Recognition

To make the experience compelling for the user, devices need to be equipped with several components to ensure robust voice recognition.

Command Recognition

One of the most significant features of a voice-enabled device is its ability to identify speech commands from an audio input. The speech command recognition system on the device is activated by the wake word; it then takes the input, interprets it, and transcribes it to text. This text then serves as the command to perform the specific task.
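The final step above, routing transcribed text to a device action, can be sketched as a simple lookup. The handler names and phrases here are hypothetical examples, not part of any product API:

```python
def dispatch_command(transcript, handlers):
    """Map the transcribed text of a spoken command to a device action.
    `handlers` maps command phrases to callables; matching is
    case-insensitive on the normalized transcript."""
    key = transcript.strip().lower()
    action = handlers.get(key)
    return action() if action else "unrecognized command"

# Usage: after the wake word, the transcribed phrase is routed
# to the matching handler (hypothetical commands for illustration).
handlers = {
    "turn on the light": lambda: "light on",
    "turn off the light": lambda: "light off",
}
```

A production system would add fuzzy matching and slot extraction, but the text-to-action mapping is the essence of command recognition at the endpoint.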

Voice Activity Detection

Voice activity detection (VAD) is the process that distinguishes human speech from the audio signal and background noise. VAD also helps optimize overall system power consumption; otherwise, the system must remain active at all times, resulting in unnecessary power consumption. The VAD algorithm can be subdivided into four stages (Figure 1):

VAD algorithm block diagram
Figure 1: The block diagram specifies the four stages of the VAD algorithm: noise minimization, segregation, classification, and response. (Source: Renesas Electronics)
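A toy frame-energy detector illustrates the flow of Figure 1 in miniature: remove an assumed noise floor, then classify each frame as speech or silence and emit the decision. This is purely illustrative; production VADs use spectral features and ML models, and the threshold values below are assumptions.

```python
def simple_vad(frames, noise_floor=0.01, margin=3.0):
    """Energy-based VAD sketch. For each frame: compute energy,
    subtract the estimated noise floor (noise minimization), and
    classify the frame as speech when the cleaned energy exceeds
    the floor by `margin` (classification/response)."""
    decisions = []
    for frame in frames:
        energy = sum(s * s for s in frame) / len(frame)  # mean-square energy
        cleaned = max(energy - noise_floor, 0.0)         # noise minimization
        decisions.append(cleaned > margin * noise_floor) # speech vs. silence
    return decisions
```

Gating the wake-word engine on such a decision is how VAD saves power: the downstream processor only wakes when a frame is classified as speech.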

The Renesas RA voice command solution, built on the RA MCU family and partner-enabled voice recognition middleware, uses a robust noise reduction technique that helps ensure high VAD accuracy. In addition, Renesas can help address some of the key voice command features outlined below:

Keyword Spotting

Keyword spotting systems (KWS) are one of the key features of any voice-enabled device. The KWS relies on speech recognition to identify keywords and phrases. These words trigger and initiate the recognition process at the endpoint, allowing the rest of the audio to be processed as the query (Figure 2).

keyword spotting process
Figure 2: The keyword spotting process: speech recognition identifies keywords and phrases that trigger the recognition process at the endpoint, with the remaining audio handled as the query. (Source: Renesas Electronics)

To contribute to a better hands-free user experience, the KWS is required to provide highly accurate real-time responses, which places an immense constraint on the KWS power budget. Therefore, Renesas provides partner-enabled, high-performance, optimized machine learning (ML) models capable of running on advanced 32-bit RA microcontrollers. These come with pre-trained deep neural network (DNN) models that help achieve high accuracy in keyword spotting.
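The flow in Figure 2 can be sketched with decoded tokens standing in for per-frame DNN outputs: scan the stream for the wake phrase and hand everything after it to full recognition. A hypothetical illustration of the control flow, not the Renesas/partner DNN pipeline:

```python
def spot_keyword(token_stream, keyword):
    """Toy keyword spotter: scan a stream of decoded tokens for the
    wake phrase; on a match, return the tokens that follow it (the
    query audio to be recognized). Returns None if no match."""
    k = len(keyword)
    for i in range(len(token_stream) - k + 1):
        if token_stream[i:i + k] == keyword:
            return token_stream[i + k:]  # remainder is the user's query
    return None
```

In a real KWS the "tokens" are acoustic feature frames and the match is a DNN posterior crossing a threshold, but the trigger-then-forward structure is the same.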

Speaker Identification

Speaker identification, as the name suggests, is the process of identifying which registered speaker produced a given voice input (Figure 3). Speaker recognition can be classified as text dependent, text independent, or text prompted. To train the DNN for speaker identification, individual idiosyncrasies such as dialect, pronunciation, prosody (rhythmic patterns of speech), and phone usage are captured.

Speaker ID System Block Diagram
Figure 3: Block diagram of a speaker identification system, illustrating how the DNN is trained on individual speech idiosyncrasies. (Source: Renesas Electronics)
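At inference time, identification typically reduces to comparing an embedding of the test utterance against each enrolled speaker's profile and picking the closest match. The two-dimensional vectors below stand in for the DNN features (dialect, prosody, etc.) mentioned above; this is a minimal sketch under that assumption, not a production system:

```python
import math

def identify_speaker(embedding, enrolled):
    """Return the name of the enrolled speaker whose stored embedding
    has the highest cosine similarity to the test embedding."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    return max(enrolled, key=lambda name: cosine(embedding, enrolled[name]))
```

Real systems use high-dimensional embeddings and a rejection threshold so unknown voices are not forced onto the nearest enrolled speaker.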

Voice/Sound Anti-Spoofing

Spoofing is a type of scam in which an intruder attempts to gain unauthorized access to a system by pretending to be the target speaker. This can be countered by including anti-spoofing software to ensure the security of the system. Spoofing attacks are usually mounted against automatic speaker verification (ASV) systems (Figure 4). Spoofed speech samples can be generated using speech synthesis, voice conversion, or simply by replaying recorded speech. These attacks can be classified as direct or indirect, depending on how they interact with the ASV system.

  • Direct attack: Occurs at the sensor (microphone) or transmission level; also known as a physical access attack.
  • Indirect attack: An intrusion into the feature extraction, models, or decision-making processes of the ASV system software; also known as a logical access attack.
Automatic Speaker Verification Block Diagram
Figure 4: Block representation of an automatic speaker verification system. (Source: Renesas Electronics)

Multi-Language/Accent Recognition and Understanding

Accent recognition in English-speaking countries is a much smoother process because abundant training data is available, which yields accurate predictions. For organizations operating in countries where English is not the first language, speech recognition is less precise because only limited data is available. An inadequate amount of training data makes it challenging to build highly accurate conversational models.

To overcome the accent recognition issue, Renesas offers partner-enabled VUI solutions that support more than 44 languages, making them highly adaptable speech recognition solutions that can be used by organizations worldwide.

Original Source: Mouser
