<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Real-time Communication | Ziyang Lin</title><link>https://ziyanglin.netlify.app/en/tags/real-time-communication/</link><atom:link href="https://ziyanglin.netlify.app/en/tags/real-time-communication/index.xml" rel="self" type="application/rss+xml"/><description>Real-time Communication</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Thu, 26 Jun 2025 02:00:00 +0000</lastBuildDate><image><url>https://ziyanglin.netlify.app/img/icon-192.png</url><title>Real-time Communication</title><link>https://ziyanglin.netlify.app/en/tags/real-time-communication/</link></image><item><title>VAD Technical Guide: Principles and Practices of Voice Activity Detection</title><link>https://ziyanglin.netlify.app/en/post/vad-documentation/</link><pubDate>Thu, 26 Jun 2025 02:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/vad-documentation/</guid><description>&lt;h2 id="1-vad-technology-overview-a-macrolevel-understanding">1. VAD Technology Overview: A Macro-level Understanding&lt;/h2>
&lt;h3 id="11-what-is-vad">1.1 What is VAD?&lt;/h3>
&lt;p>VAD (Voice Activity Detection) is a technology designed to accurately identify the presence of human speech in audio signals. Its core task is to segment an audio stream into two parts: &lt;strong>segments containing speech&lt;/strong> and &lt;strong>silent/noise segments without speech&lt;/strong>.&lt;/p>
&lt;p>From a macro perspective, VAD serves as the &amp;ldquo;gatekeeper&amp;rdquo; or &amp;ldquo;preprocessor&amp;rdquo; in the speech processing pipeline. It is crucial and typically the first step in any system that needs to process human speech.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Raw Audio Stream&amp;quot;] --&amp;gt; B{&amp;quot;VAD Module&amp;quot;}
B --&amp;gt;|&amp;quot;Speech Detected&amp;quot;| C[&amp;quot;Speech Segments&amp;quot;]
B --&amp;gt;|&amp;quot;No Speech Detected&amp;quot;| D[&amp;quot;Silence/Noise Segments&amp;quot;]
C --&amp;gt; E[&amp;quot;Further Processing: ASR, Voice Print, etc.&amp;quot;]
D --&amp;gt; F[&amp;quot;Discard or Use for Noise Modeling&amp;quot;]
&lt;/code>&lt;/pre>
&lt;h3 id="12-why-is-vad-so-important">1.2 Why is VAD So Important?&lt;/h3>
&lt;p>The value of VAD is reflected in several key aspects:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Conserving Computational Resources&lt;/strong>: In compute-intensive tasks like automatic speech recognition (ASR), processing only detected speech segments avoids unnecessary computation on silence and background noise; since conversational audio often contains long stretches of silence, this can cut CPU/GPU usage substantially.&lt;/li>
&lt;li>&lt;strong>Improving Downstream Task Accuracy&lt;/strong>: Removing silent segments reduces interference for ASR models, voice print recognition models, or emotion analysis models, thereby improving their accuracy.&lt;/li>
&lt;li>&lt;strong>Optimizing Network Bandwidth&lt;/strong>: In real-time voice communication (like VoIP, WebRTC), silent segments can be either not transmitted or transmitted at extremely low bit rates (known as &amp;ldquo;Discontinuous Transmission&amp;rdquo;, DTX), significantly reducing network bandwidth usage.&lt;/li>
&lt;li>&lt;strong>Enhancing User Experience&lt;/strong>: In smart assistants and voice interaction scenarios, precise VAD enables more natural interaction, avoiding premature interruption of recognition during user pauses or false triggering in noisy environments.&lt;/li>
&lt;li>&lt;strong>Data Preprocessing and Annotation&lt;/strong>: When building large speech datasets, VAD can automatically segment and annotate effective speech segments, greatly improving data processing efficiency.&lt;/li>
&lt;/ul>
&lt;h2 id="2-traditional-vad-implementation-methods">2. Traditional VAD Implementation Methods&lt;/h2>
&lt;p>Before deep learning became popular, VAD primarily relied on manually designed acoustic features. These methods are computationally simple and fast but have poor robustness in complex noisy environments.&lt;/p>
&lt;p>The main methods include:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Energy-based&lt;/strong>: The simplest method, resting on the assumption that the short-time energy of speech is much greater than that of background noise. Speech and silence are distinguished by setting an energy threshold.
&lt;ul>
&lt;li>&lt;strong>Advantage&lt;/strong>: Extremely simple computation.&lt;/li>
&lt;li>&lt;strong>Disadvantage&lt;/strong>: Very sensitive to noise and volume changes, with thresholds difficult to set.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Zero-Crossing Rate (ZCR)&lt;/strong>: ZCR measures how often the signal changes sign. Unvoiced sounds (like &amp;lsquo;s&amp;rsquo;) have a high ZCR, while voiced sounds have a low ZCR.
&lt;ul>
&lt;li>&lt;strong>Advantage&lt;/strong>: Independent of absolute signal level, so robust to volume changes.&lt;/li>
&lt;li>&lt;strong>Disadvantage&lt;/strong>: Poor discrimination between certain unvoiced sounds and noise.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Spectral Features&lt;/strong>: Such as spectral entropy, spectral flatness, etc. Speech signals typically have more complex and regular spectral structures than noise, resulting in lower spectral entropy and less flat spectra.&lt;/li>
&lt;li>&lt;strong>Combined Features&lt;/strong>: In practical applications, multiple features (such as energy+ZCR) are often combined with smoothing filter techniques to enhance stability. The famous &lt;strong>WebRTC VAD&lt;/strong> is a classic example based on Gaussian Mixture Models (GMM), extracting features across multiple frequency bands with good performance and efficiency.&lt;/li>
&lt;/ul>
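&lt;p>A minimal sketch of the two classic hand-crafted features in Python (the frame values, threshold, and helper names here are illustrative, not taken from any particular library):&lt;/p>

```python
from typing import List

def short_time_energy(frame: List[float]) -> float:
    """Mean squared amplitude of one audio frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame: List[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

def is_speech(frame: List[float], energy_thresh: float = 0.01) -> bool:
    """Naive energy-threshold decision for one frame."""
    return short_time_energy(frame) > energy_thresh

# A loud alternating frame vs. near-silence:
loud = [0.5, -0.5] * 240      # high energy, sign flips every sample
quiet = [0.001, 0.0005] * 240 # low energy, never crosses zero
```

&lt;p>Real systems compute these over short overlapping windows (e.g., 20 to 30 ms) and, as noted above, combine several features with smoothing to gain stability.&lt;/p>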
&lt;h2 id="3-deep-learningbased-vad">3. Deep Learning-based VAD&lt;/h2>
&lt;p>With the development of deep learning, neural network-based VAD methods far outperform traditional methods, especially in low signal-to-noise ratio (SNR) and complex noise environments. The core idea is to &lt;strong>let the model automatically learn the distinguishing features between speech and non-speech from data&lt;/strong>, rather than relying on manually designed rules.&lt;/p>
&lt;p>The general workflow for these models is as follows:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Audio Input&amp;quot;] --&amp;gt; B[&amp;quot;Feature Extraction&amp;lt;br&amp;gt;(e.g., MFCC, Fbank)&amp;quot;]
B --&amp;gt; C[&amp;quot;Deep Neural Network&amp;lt;br&amp;gt;(CNN, RNN, Transformer, etc.)&amp;quot;]
C --&amp;gt; D[&amp;quot;Output Layer&amp;lt;br&amp;gt;(Sigmoid/Softmax)&amp;quot;]
D --&amp;gt; E[&amp;quot;Speech/Non-speech Probability&amp;quot;]
E --&amp;gt; F{&amp;quot;Post-processing&amp;lt;br&amp;gt;(Threshold, Smoothing)&amp;quot;}
F --&amp;gt; G[&amp;quot;Final Decision&amp;quot;]
&lt;/code>&lt;/pre>
&lt;h2 id="4-indepth-analysis-of-the-silero-vad-model">4. In-depth Analysis of the Silero VAD Model&lt;/h2>
&lt;p>&lt;strong>Silero VAD&lt;/strong> is one of the leading VAD models in the industry, renowned for its &lt;strong>high accuracy, remarkable computational efficiency, and universality across languages&lt;/strong>. It is developed and distributed through the &lt;code>snakers4/silero-vad&lt;/code> repository.&lt;/p>
&lt;h3 id="41-core-features">4.1 Core Features&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>High Precision&lt;/strong>: Its accuracy rivals or even surpasses many large, complex models in various noisy environments.&lt;/li>
&lt;li>&lt;strong>Extremely Lightweight&lt;/strong>: The model file is tiny (on the order of a megabyte), making it easy to deploy in browsers, on mobile devices, and even on embedded systems.&lt;/li>
&lt;li>&lt;strong>Language-Independent&lt;/strong>: It is not trained on specific languages but learns the universal acoustic characteristics of human speech, making it effective for almost all languages worldwide.&lt;/li>
&lt;li>&lt;strong>Real-time Performance&lt;/strong>: Extremely low processing latency, making it ideal for real-time communication applications.&lt;/li>
&lt;/ul>
&lt;h3 id="42-model-architecture">4.2 Model Architecture&lt;/h3>
&lt;p>The core architecture of Silero VAD is a hybrid &lt;strong>CNN + GRU&lt;/strong> model. This architecture combines the advantages of both:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>CNN (Convolutional Neural Network)&lt;/strong>: Used to extract local features with translation invariance from raw audio or spectrograms. CNNs can effectively capture the instantaneous characteristics of sound events.&lt;/li>
&lt;li>&lt;strong>GRU (Gated Recurrent Unit)&lt;/strong>: A type of RNN (Recurrent Neural Network) used to process sequential data. It can capture the contextual dependencies of audio signals in the time dimension, such as the beginning and end of a syllable.&lt;/li>
&lt;/ul>
&lt;p>Its detailed architecture can be macroscopically understood as:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph &amp;quot;Silero VAD Model&amp;quot;
A[&amp;quot;Input Audio Chunk&amp;lt;br&amp;gt; (e.g., 30ms, 16kHz)&amp;quot;] --&amp;gt; B(&amp;quot;Single-layer CNN&amp;quot;)
B --&amp;gt; C(&amp;quot;Multi-layer GRU&amp;quot;)
C --&amp;gt; D(&amp;quot;Fully Connected Layer&amp;quot;)
D --&amp;gt; E[&amp;quot;Output&amp;lt;br&amp;gt;(Sigmoid Activation)&amp;quot;]
end
E --&amp;gt; F[&amp;quot;Speech Probability (0-1)&amp;quot;]
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Diving into the details&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Input&lt;/strong>: The model receives a small segment of audio as input, such as a 512-sample chunk (32 milliseconds at a 16kHz sampling rate). The model processes &lt;strong>chunk-by-chunk&lt;/strong>.&lt;/li>
&lt;li>&lt;strong>Feature Extraction&lt;/strong>: Unlike many models, Silero VAD may operate directly on raw waveforms or very low-level features, with the first CNN layer automatically learning effective acoustic features, rather than relying on manually designed features like MFCC.&lt;/li>
&lt;li>&lt;strong>CNN Layer&lt;/strong>: This layer acts like a filter bank, scanning the input audio chunk to capture phoneme-level micro-patterns.&lt;/li>
&lt;li>&lt;strong>GRU Layer&lt;/strong>: This is the memory core of the model. The feature vector of each audio chunk after CNN processing is fed into the GRU. The internal state of the GRU is updated based on the current input and the previous state. This allows the model to understand &amp;ldquo;whether the sound I'm hearing now is a continuation of the previous sound or the beginning of a completely new sound event.&amp;rdquo; This is crucial for accurately judging the first word after a long silence or brief pauses in the middle of a sentence.&lt;/li>
&lt;li>&lt;strong>Fully Connected Layer &amp;amp; Output&lt;/strong>: The output of the GRU goes through one or more fully connected layers for integration, and finally through a &lt;code>Sigmoid&lt;/code> function, outputting a floating-point number between 0 and 1. This number represents &lt;strong>the probability that the current input audio chunk contains speech&lt;/strong>.&lt;/li>
&lt;/ol>
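&lt;p>The chunk-by-chunk, stateful flow described above can be sketched as follows. &lt;code>toy_model&lt;/code> is a deliberately simplified stand-in for the real CNN + GRU network (a decaying energy average plays the role of the GRU hidden state); only the pattern of threading state between chunks mirrors Silero VAD:&lt;/p>

```python
from typing import List, Tuple

def toy_model(chunk: List[float], state: float) -> Tuple[float, float]:
    """Stand-in for CNN+GRU inference: returns (speech_prob, new_state).
    The real model carries a GRU hidden state between chunks the same way."""
    energy = sum(s * s for s in chunk) / len(chunk)
    new_state = 0.7 * state + 0.3 * energy   # crude "memory" of recent energy
    prob = min(1.0, new_state * 50)          # squash into [0, 1]
    return prob, new_state

def stream_probs(chunks: List[List[float]]) -> List[float]:
    """Process one audio stream chunk-by-chunk, carrying state across calls."""
    state = 0.0                              # one state per independent stream
    probs = []
    for chunk in chunks:
        p, state = toy_model(chunk, state)   # keep state for the next chunk
        probs.append(p)
    return probs

silence = [0.0] * 512
speech = [0.3, -0.3] * 256                   # 512 samples of a loud waveform
probs = stream_probs([silence, speech, speech, silence])
```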
&lt;h3 id="43-technical-implementation-details">4.3 Technical Implementation Details&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>State Maintenance (Stateful)&lt;/strong>: To process continuous audio streams, Silero VAD is a stateful model. You need to maintain an internal state of the model (mainly the hidden state of the GRU) for each independent audio stream. After processing an audio chunk, the model's hidden state needs to be saved and used as input for processing the next audio chunk. This enables uninterrupted real-time detection.&lt;/li>
&lt;li>&lt;strong>Sampling Rate Support&lt;/strong>: Typically supports 8kHz and 16kHz, which are the most common sampling rates in voice communication.&lt;/li>
&lt;li>&lt;strong>Audio Chunk Size&lt;/strong>: The model has strict requirements for the size of input audio chunks, such as 256, 512, 768 (8kHz) or 512, 1024, 1536 (16kHz) samples. Developers need to buffer and segment the audio stream from microphones or networks into these fixed-size chunks.&lt;/li>
&lt;li>&lt;strong>Post-processing&lt;/strong>: The model only outputs the speech probability for a single chunk. In practical applications, a simple post-processing logic is also needed. For example:
&lt;ul>
&lt;li>&lt;code>trigger_level&lt;/code>: Speech activation threshold (e.g., 0.5).&lt;/li>
&lt;li>&lt;code>speech_pad_ms&lt;/code>: Extra audio padded around each detected speech segment, to avoid clipping word onsets and tails.&lt;/li>
&lt;li>&lt;code>min_silence_duration_ms&lt;/code>: Minimum duration required to be classified as a silence segment.&lt;/li>
&lt;li>&lt;code>min_speech_duration_ms&lt;/code>: Minimum duration required to be classified as a speech segment, preventing brief noises (like coughs) from being misclassified as speech.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
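&lt;p>A minimal version of such post-processing, using the parameter names listed above (this is a hysteresis sketch, not the exact logic of Silero's bundled utilities; &lt;code>speech_pad_ms&lt;/code> is omitted for brevity):&lt;/p>

```python
def probs_to_segments(probs, chunk_ms=32, trigger_level=0.5,
                      min_speech_ms=96, min_silence_ms=96):
    """Convert per-chunk speech probabilities into (start_ms, end_ms) segments.
    A segment opens only after enough consecutive high-probability chunks and
    closes only after enough consecutive low-probability chunks."""
    min_speech = min_speech_ms // chunk_ms    # chunks needed to confirm speech
    min_silence = min_silence_ms // chunk_ms  # chunks needed to confirm silence
    segments, run, silence_run, start = [], 0, 0, None
    for i, p in enumerate(probs):
        if p >= trigger_level:
            run += 1
            silence_run = 0
            if start is None and run >= min_speech:
                start = (i - run + 1) * chunk_ms   # backdate to first hot chunk
        else:
            silence_run += 1
            run = 0
            if start is not None and silence_run >= min_silence:
                segments.append((start, (i - silence_run + 1) * chunk_ms))
                start = None
    if start is not None:                          # stream ended mid-speech
        segments.append((start, len(probs) * chunk_ms))
    return segments

# Four hot chunks surrounded by silence -> one segment, 32 ms to 160 ms:
probs = [0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1]
```

&lt;p>Requiring several consecutive chunks in each direction is what keeps a cough from opening a segment and a mid-sentence pause from closing one.&lt;/p>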
&lt;h2 id="5-application-of-vad-in-realtime-voice-communication">5. Application of VAD in Real-time Voice Communication&lt;/h2>
&lt;h3 id="51-frontend-applications-browserclient">5.1 Frontend Applications (Browser/Client)&lt;/h3>
&lt;p>Running VAD on the frontend allows processing of voice data before it leaves the user's device, achieving maximum bandwidth savings and minimal latency.&lt;/p>
&lt;p>&lt;strong>Typical Scenarios&lt;/strong>: Web-based online meetings, browser-embedded customer service dialogue systems.&lt;/p>
&lt;p>&lt;strong>Implementation Process&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant User
participant Mic as Microphone
participant Browser
participant VAD as &amp;quot;Silero VAD (WASM/ONNX.js)&amp;quot;
participant Network as Network Module
User-&amp;gt;&amp;gt;Mic: Start speaking
Mic-&amp;gt;&amp;gt;Browser: Capture raw audio stream
Browser-&amp;gt;&amp;gt;Browser: Get audio via WebAudio API
Note right of Browser: &amp;quot;Create AudioContext and&amp;lt;br&amp;gt;ScriptProcessorNode/AudioWorklet&amp;quot;
loop Real-time Processing
Browser-&amp;gt;&amp;gt;VAD: Pass fixed-size audio chunk
VAD-&amp;gt;&amp;gt;VAD: Calculate speech probability
VAD--&amp;gt;&amp;gt;Browser: &amp;quot;Return speech probability (e.g., 0.9)&amp;quot;
end
Browser-&amp;gt;&amp;gt;Browser: Judge based on probability and post-processing logic
alt Speech Detected
Browser-&amp;gt;&amp;gt;Network: Encode and send the audio chunk
else No Speech Detected
Browser-&amp;gt;&amp;gt;Network: Discard audio chunk or send DTX signal
end
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Technology Stack&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Audio Capture&lt;/strong>: &lt;code>navigator.mediaDevices.getUserMedia()&lt;/code>&lt;/li>
&lt;li>&lt;strong>Audio Processing&lt;/strong>: Web Audio API (&lt;code>AudioContext&lt;/code>, &lt;code>AudioWorkletNode&lt;/code>)&lt;/li>
&lt;li>&lt;strong>VAD Model Running&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>WebAssembly (WASM)&lt;/strong>: Compile a VAD inference engine implemented in C++/Rust into WASM for near-native performance; ONNX Runtime Web and community wrappers take this route.&lt;/li>
&lt;li>&lt;strong>ONNX.js / TensorFlow.js&lt;/strong>: Convert the VAD model to ONNX or TF.js format to run directly in JavaScript, simpler to deploy but slightly lower performance than WASM.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="52-backend-applications-server">5.2 Backend Applications (Server)&lt;/h3>
&lt;p>Running VAD on the backend allows centralized processing of all incoming audio streams, suitable for scenarios where client behavior cannot be controlled, or server-side recording and analysis are needed.&lt;/p>
&lt;p>&lt;strong>Typical Scenarios&lt;/strong>: ASR as a service, mixing and recording of multi-party calls, intelligent voice monitoring.&lt;/p>
&lt;p>&lt;strong>Implementation Process&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant Client
participant Server as &amp;quot;Voice Server (e.g., WebRTC SFU)&amp;quot;
participant VAD as Backend VAD Module
participant ASR as ASR Service
Client-&amp;gt;&amp;gt;Server: &amp;quot;Send continuous audio stream (RTP packets)&amp;quot;
Server-&amp;gt;&amp;gt;VAD: Feed decoded audio stream into VAD module
Note right of VAD: &amp;quot;Maintain an independent VAD state&amp;lt;br&amp;gt;for each client connection&amp;quot;
loop Real-time Processing
VAD-&amp;gt;&amp;gt;VAD: Process chunk by chunk, calculate speech probability
VAD--&amp;gt;&amp;gt;Server: &amp;quot;Return 'Speech Start' / 'Speech Continue' / 'Speech End' events&amp;quot;
end
alt &amp;quot;Speech Start&amp;quot; Event
Server-&amp;gt;&amp;gt;ASR: Create a new ASR task, start sending subsequent speech data
else &amp;quot;Speech End&amp;quot; Event
Server-&amp;gt;&amp;gt;ASR: End the ASR task, get recognition results
end
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Technology Stack&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Voice Server&lt;/strong>: Open-source projects like &lt;code>livekit&lt;/code>, &lt;code>ion-sfu&lt;/code>, or self-developed media servers.&lt;/li>
&lt;li>&lt;strong>VAD Module&lt;/strong>: Typically implemented in Python, C++, or Go, directly calling Silero's PyTorch model or its ONNX/C++ implementation.&lt;/li>
&lt;li>&lt;strong>Inter-service Communication&lt;/strong>: If VAD is an independent microservice, gRPC or message queues can be used to communicate with the main business server.&lt;/li>
&lt;/ul>
&lt;h2 id="6-summary-and-outlook">6. Summary and Outlook&lt;/h2>
&lt;p>Although VAD seems like a simple task, it is the cornerstone of building efficient, intelligent voice applications.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Traditional VAD&lt;/strong> is simple and fast but struggles in complex scenarios.&lt;/li>
&lt;li>&lt;strong>Modern deep learning VAD, represented by Silero VAD&lt;/strong>, achieves an excellent balance of &lt;strong>accuracy, efficiency, and universality&lt;/strong> through careful model design, making high-quality VAD easy to deploy on virtually any device, from cloud to edge.&lt;/li>
&lt;/ul>
&lt;p>In the future, VAD technology may evolve in more refined directions, such as:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Deeper integration with noise suppression&lt;/strong>: Not just detecting speech, but directly outputting clean speech.&lt;/li>
&lt;li>&lt;strong>Multimodal detection&lt;/strong>: Combining lip movement information from video (Lip-VAD) to achieve even greater accuracy.&lt;/li>
&lt;li>&lt;strong>More complex acoustic scene understanding&lt;/strong>: Not only distinguishing between speech and non-speech but also differentiating between different types of non-speech (such as music, applause, environmental noise), providing richer contextual information for downstream tasks.&lt;/li>
&lt;/ul></description></item><item><title>WebRTC Technical Guide: Web-Based Real-Time Communication Framework</title><link>https://ziyanglin.netlify.app/en/post/webrtc-documentation/</link><pubDate>Thu, 26 Jun 2025 01:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/webrtc-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>WebRTC (Web Real-Time Communication) is an open-source technology that enables real-time voice and video communication in web browsers. It allows direct peer-to-peer (P2P) audio, video, and data sharing between browsers without requiring any plugins or third-party software.&lt;/p>
&lt;p>The main goal of WebRTC is to provide high-quality, low-latency real-time communication, making it easy for developers to build rich communication features into web applications.&lt;/p>
&lt;h3 id="core-advantages">Core Advantages&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Cross-platform and browser compatibility&lt;/strong>: WebRTC is an open standard by W3C and IETF, widely supported by major browsers (Chrome, Firefox, Safari, Edge).&lt;/li>
&lt;li>&lt;strong>No plugins required&lt;/strong>: Users can use real-time communication features directly in their browsers without downloading or installing any extensions.&lt;/li>
&lt;li>&lt;strong>Peer-to-peer communication&lt;/strong>: When possible, data is transmitted directly between users, reducing server bandwidth pressure and latency.&lt;/li>
&lt;li>&lt;strong>High security&lt;/strong>: All WebRTC communications are mandatorily encrypted (via SRTP and DTLS), ensuring data confidentiality and integrity.&lt;/li>
&lt;li>&lt;strong>High-quality audio and video&lt;/strong>: WebRTC includes advanced signal processing components like echo cancellation, noise suppression, and automatic gain control to provide excellent audio/video quality.&lt;/li>
&lt;/ul>
&lt;h2 id="2-core-concepts">2. Core Concepts&lt;/h2>
&lt;p>WebRTC consists of several key JavaScript APIs that work together to enable real-time communication.&lt;/p>
&lt;h3 id="21-rtcpeerconnection">2.1. &lt;code>RTCPeerConnection&lt;/code>&lt;/h3>
&lt;p>&lt;code>RTCPeerConnection&lt;/code> is the core interface of WebRTC, responsible for establishing and managing connections between two peers. Its main responsibilities include:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Media negotiation&lt;/strong>: Handling parameters for audio/video codecs, resolution, etc.&lt;/li>
&lt;li>&lt;strong>Network path discovery&lt;/strong>: Finding the best connection path through the ICE framework.&lt;/li>
&lt;li>&lt;strong>Connection maintenance&lt;/strong>: Managing the connection lifecycle, including establishment, maintenance, and closure.&lt;/li>
&lt;li>&lt;strong>Data transmission&lt;/strong>: Handling the actual transmission of audio/video streams (SRTP) and data channels (SCTP/DTLS).&lt;/li>
&lt;/ul>
&lt;p>An &lt;code>RTCPeerConnection&lt;/code> object represents a WebRTC connection from the local computer to a remote peer.&lt;/p>
&lt;h3 id="22-mediastream">2.2. &lt;code>MediaStream&lt;/code>&lt;/h3>
&lt;p>The &lt;code>MediaStream&lt;/code> API represents streams of media content. A &lt;code>MediaStream&lt;/code> object can contain one or more media tracks (&lt;code>MediaStreamTrack&lt;/code>), which can be:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Audio tracks (&lt;code>AudioTrack&lt;/code>)&lt;/strong>: Audio data from a microphone.&lt;/li>
&lt;li>&lt;strong>Video tracks (&lt;code>VideoTrack&lt;/code>)&lt;/strong>: Video data from a camera.&lt;/li>
&lt;/ul>
&lt;p>Developers typically use the &lt;code>navigator.mediaDevices.getUserMedia()&lt;/code> method to obtain a local &lt;code>MediaStream&lt;/code>, which prompts the user to authorize access to their camera and microphone. The obtained stream can then be added to an &lt;code>RTCPeerConnection&lt;/code> for transmission to the remote peer.&lt;/p>
&lt;h3 id="23-rtcdatachannel">2.3. &lt;code>RTCDataChannel&lt;/code>&lt;/h3>
&lt;p>In addition to audio and video, WebRTC supports the transmission of arbitrary binary data between peers through the &lt;code>RTCDataChannel&lt;/code> API. This provides powerful functionality for:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>File sharing&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Real-time text chat&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Online game state synchronization&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Remote desktop control&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>The &lt;code>RTCDataChannel&lt;/code> API is designed similarly to WebSockets, offering reliable and unreliable, ordered and unordered transmission modes that developers can choose based on application requirements. It uses the SCTP protocol (Stream Control Transmission Protocol) for transmission and is encrypted via DTLS.&lt;/p>
&lt;h2 id="3-connection-process-in-detail">3. Connection Process in Detail&lt;/h2>
&lt;p>Establishing a WebRTC connection is a complex multi-stage process involving signaling, session description, and network path discovery.&lt;/p>
&lt;h3 id="31-signaling">3.1. Signaling&lt;/h3>
&lt;p>Interestingly, the WebRTC API itself does not include a signaling mechanism. Signaling is the process of exchanging metadata between peers before establishing communication. Developers must choose or implement their own signaling channel. Common technologies include WebSocket or XMLHttpRequest.&lt;/p>
&lt;p>The signaling server acts as an intermediary, helping two clients who want to communicate exchange three types of information:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Session control messages&lt;/strong>: Used to open or close communication.&lt;/li>
&lt;li>&lt;strong>Network configuration&lt;/strong>: Information about the client's IP address and port.&lt;/li>
&lt;li>&lt;strong>Media capabilities&lt;/strong>: Codecs and resolutions supported by the client.&lt;/li>
&lt;/ol>
&lt;p>This process typically follows these steps:&lt;/p>
&lt;ol>
&lt;li>Client A sends a &amp;ldquo;request call&amp;rdquo; message to the signaling server.&lt;/li>
&lt;li>The signaling server forwards this request to client B.&lt;/li>
&lt;li>Client B agrees to the call.&lt;/li>
&lt;li>Afterward, clients A and B exchange SDP and ICE candidates through the signaling server until they find a viable connection path.&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant ClientA as Client A
participant SignalingServer as Signaling Server
participant ClientB as Client B
ClientA-&amp;gt;&amp;gt;SignalingServer: Initiate call request (join room)
SignalingServer-&amp;gt;&amp;gt;ClientB: Forward call request
ClientB--&amp;gt;&amp;gt;SignalingServer: Accept call
SignalingServer--&amp;gt;&amp;gt;ClientA: B has joined
loop Offer/Answer &amp;amp; ICE Exchange
ClientA-&amp;gt;&amp;gt;SignalingServer: Send SDP Offer / ICE Candidate
SignalingServer-&amp;gt;&amp;gt;ClientB: Forward SDP Offer / ICE Candidate
ClientB-&amp;gt;&amp;gt;SignalingServer: Send SDP Answer / ICE Candidate
SignalingServer-&amp;gt;&amp;gt;ClientA: Forward SDP Answer / ICE Candidate
end
&lt;/code>&lt;/pre>
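&lt;p>Since WebRTC leaves signaling to the application, the relay role shown above can be sketched as a toy in-memory server (a hypothetical class; a real deployment would expose the same behavior over secure WebSocket connections):&lt;/p>

```python
from collections import defaultdict

class SignalingServer:
    """Toy in-memory relay: forwards opaque messages between peers in a room."""

    def __init__(self):
        self.rooms = defaultdict(dict)   # room_id -> {peer_id: inbox list}

    def join(self, room_id, peer_id):
        self.rooms[room_id][peer_id] = []

    def send(self, room_id, sender, message):
        # The server never inspects SDP/ICE payloads; it only relays them.
        for peer_id, inbox in self.rooms[room_id].items():
            if peer_id != sender:
                inbox.append((sender, message))

    def poll(self, room_id, peer_id):
        """Drain and return the peer's pending messages."""
        inbox = self.rooms[room_id][peer_id]
        self.rooms[room_id][peer_id] = []
        return inbox

server = SignalingServer()
server.join("room1", "A")
server.join("room1", "B")
server.send("room1", "A", {"type": "offer", "sdp": "..."})
msgs = server.poll("room1", "B")
```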
&lt;h3 id="32-session-description-protocol-sdp">3.2. Session Description Protocol (SDP)&lt;/h3>
&lt;p>SDP (Session Description Protocol) is a standard format for describing multimedia connection content. It doesn't transmit media data itself but describes the connection parameters. An SDP object includes:&lt;/p>
&lt;ul>
&lt;li>Session unique identifier and version.&lt;/li>
&lt;li>Media types (audio, video, data).&lt;/li>
&lt;li>Codecs used (e.g., VP8, H.264, Opus).&lt;/li>
&lt;li>Network transport information (IP addresses and ports).&lt;/li>
&lt;li>Bandwidth information.&lt;/li>
&lt;/ul>
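&lt;p>These fields appear as terse &lt;code>type=value&lt;/code> lines. Below is an illustrative fragment of an audio/video offer and a minimal parse of its media and codec lines (the SDP text is a simplified example, not a complete offer):&lt;/p>

```python
SDP_FRAGMENT = """\
v=0
o=- 4611731400430051336 2 IN IP4 127.0.0.1
m=audio 9 UDP/TLS/RTP/SAVPF 111
a=rtpmap:111 opus/48000/2
m=video 9 UDP/TLS/RTP/SAVPF 96
a=rtpmap:96 VP8/90000
"""

def parse_media_sections(sdp: str):
    """Collect media kinds and codec names from m= and a=rtpmap lines."""
    media, codecs = [], []
    for line in sdp.splitlines():
        if line.startswith("m="):
            media.append(line[2:].split()[0])             # "audio" / "video"
        elif line.startswith("a=rtpmap:"):
            codecs.append(line.split()[1].split("/")[0])  # "opus" / "VP8"
    return media, codecs

media, codecs = parse_media_sections(SDP_FRAGMENT)
```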
&lt;p>WebRTC uses the &lt;strong>Offer/Answer model&lt;/strong> to exchange SDP information:&lt;/p>
&lt;ol>
&lt;li>The &lt;strong>Caller&lt;/strong> creates an &lt;strong>Offer&lt;/strong> SDP describing the communication parameters it desires and sends it to the receiver through the signaling server.&lt;/li>
&lt;li>The &lt;strong>Callee&lt;/strong> receives the Offer and creates an &lt;strong>Answer&lt;/strong> SDP describing the communication parameters it can support, sending it back to the caller through the signaling server.&lt;/li>
&lt;li>Once both parties accept each other's SDP, they have reached a consensus on the session parameters.&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant Caller
participant SignalingServer as Signaling Server
participant Callee
Caller-&amp;gt;&amp;gt;Caller: createOffer()
Caller-&amp;gt;&amp;gt;Caller: setLocalDescription(offer)
Caller-&amp;gt;&amp;gt;SignalingServer: Send Offer
SignalingServer-&amp;gt;&amp;gt;Callee: Forward Offer
Callee-&amp;gt;&amp;gt;Callee: setRemoteDescription(offer)
Callee-&amp;gt;&amp;gt;Callee: createAnswer()
Callee-&amp;gt;&amp;gt;Callee: setLocalDescription(answer)
Callee-&amp;gt;&amp;gt;SignalingServer: Send Answer
SignalingServer-&amp;gt;&amp;gt;Caller: Forward Answer
Caller-&amp;gt;&amp;gt;Caller: setRemoteDescription(answer)
&lt;/code>&lt;/pre>
&lt;h3 id="33-interactive-connectivity-establishment-ice">3.3. Interactive Connectivity Establishment (ICE)&lt;/h3>
&lt;p>Since most devices are behind NAT (Network Address Translation) or firewalls and don't have public IP addresses, establishing direct P2P connections becomes challenging. ICE (Interactive Connectivity Establishment) is a framework specifically designed to solve this problem.&lt;/p>
&lt;p>The ICE workflow is as follows:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Gather candidate addresses&lt;/strong>: Each client collects its network address candidates from different sources:
&lt;ul>
&lt;li>&lt;strong>Local addresses&lt;/strong>: The device's IP address within the local network.&lt;/li>
&lt;li>&lt;strong>Server Reflexive Address&lt;/strong>: The device's public IP address and port discovered through a STUN server.&lt;/li>
&lt;li>&lt;strong>Relayed Address&lt;/strong>: A relay address obtained through a TURN server. When P2P direct connection fails, all data will be forwarded through the TURN server.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Exchange candidates&lt;/strong>: Clients exchange their collected ICE candidate lists through the signaling server.&lt;/li>
&lt;li>&lt;strong>Connectivity checks&lt;/strong>: Clients pair up the received candidate addresses and send STUN requests for connectivity checks (called &amp;ldquo;pings&amp;rdquo;) to determine which paths are available.&lt;/li>
&lt;li>&lt;strong>Select the best path&lt;/strong>: Once a viable address pair is found, the ICE agent selects it as the communication path and begins transmitting media data. P2P direct connection paths are typically prioritized because they have the lowest latency.&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph Client A
A1(Start) --&amp;gt; A2{Gather Candidates};
A2 --&amp;gt; A3[Local Address];
A2 --&amp;gt; A4[STUN Address];
A2 --&amp;gt; A5[TURN Address];
end
subgraph Client B
B1(Start) --&amp;gt; B2{Gather Candidates};
B2 --&amp;gt; B3[Local Address];
B2 --&amp;gt; B4[STUN Address];
B2 --&amp;gt; B5[TURN Address];
end
A2 --&amp;gt; C1((Signaling Server));
B2 --&amp;gt; C1;
C1 --&amp;gt; A6(Exchange Candidates);
C1 --&amp;gt; B6(Exchange Candidates);
A6 --&amp;gt; A7{Connectivity Checks};
B6 --&amp;gt; B7{Connectivity Checks};
A7 -- STUN Request --&amp;gt; B7;
B7 -- STUN Response --&amp;gt; A7;
A7 --&amp;gt; A8(Select Best Path);
B7 --&amp;gt; B8(Select Best Path);
A8 --&amp;gt; A9((P2P Connection Established));
B8 --&amp;gt; B9((P2P Connection Established));
&lt;/code>&lt;/pre>
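&lt;p>The preference for direct paths in step 4 can be illustrated by ranking checked candidate pairs by type. This is a simplification: the real ICE priority formula (RFC 8445) also weighs local preference and component IDs:&lt;/p>

```python
# Smaller rank = preferred: direct host routes beat reflexive; relay is last.
TYPE_RANK = {"host": 0, "srflx": 1, "relay": 2}

def best_pair(local_candidates, remote_candidates, works):
    """Pick the most-preferred (local, remote) candidate pair that passed its
    connectivity check. `works` is the set of pairs whose STUN ping/response
    succeeded."""
    pairs = [
        (l, r)
        for l in local_candidates
        for r in remote_candidates
        if (l, r) in works
    ]
    return min(
        pairs,
        key=lambda p: TYPE_RANK[p[0]] + TYPE_RANK[p[1]],
        default=None,
    )

local = ["host", "srflx", "relay"]
remote = ["host", "srflx", "relay"]
# Suppose direct host checks failed (e.g., symmetric NAT), while the
# reflexive pair and the relayed pair both succeeded:
works = {("srflx", "srflx"), ("relay", "relay")}
choice = best_pair(local, remote, works)   # reflexive beats relay
```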
&lt;h2 id="4-nat-traversal-stun-and-turn">4. NAT Traversal: STUN and TURN&lt;/h2>
&lt;p>To achieve P2P connections, WebRTC heavily relies on STUN and TURN servers to solve NAT-related issues.&lt;/p>
&lt;h3 id="41-stun-servers">4.1. STUN Servers&lt;/h3>
&lt;p>STUN (Session Traversal Utilities for NAT) servers are very lightweight, with a simple task: telling a client behind NAT what its public IP address and port are.&lt;/p>
&lt;p>When a WebRTC client sends a request to a STUN server, the server checks the source IP and port of the request and returns them to the client. This way, the client knows &amp;ldquo;what it looks like on the internet&amp;rdquo; and can share this public address as an ICE candidate with other peers.&lt;/p>
&lt;p>Using STUN servers is the preferred approach for establishing P2P connections because they are only needed during the connection establishment phase and don't participate in actual data transmission, resulting in minimal overhead.&lt;/p>
&lt;h3 id="42-turn-servers">4.2. TURN Servers&lt;/h3>
&lt;p>However, in some complex network environments (such as symmetric NAT), peers cannot establish direct connections even if they know their public addresses. This is where TURN (Traversal Using Relays around NAT) servers come in.&lt;/p>
&lt;p>A TURN server is a more powerful relay server. When P2P connection fails, both clients connect to the TURN server, which then forwards all audio, video, and data between them. This is no longer true P2P communication, but it ensures that connections can still be established under the worst network conditions.&lt;/p>
&lt;p>Using TURN servers increases latency and server bandwidth costs, so they are typically used as a last resort.&lt;/p>
&lt;h2 id="5-security">5. Security&lt;/h2>
&lt;p>Security is a core principle in WebRTC design, with all communications mandatorily encrypted and unable to be disabled.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Signaling security&lt;/strong>: The WebRTC standard doesn't specify a signaling protocol but recommends using secure WebSocket (WSS) or HTTPS to encrypt signaling messages.&lt;/li>
&lt;li>&lt;strong>Media encryption&lt;/strong>: All audio/video streams use &lt;strong>SRTP (Secure Real-time Transport Protocol)&lt;/strong> for encryption. SRTP prevents eavesdropping and content tampering by encrypting and authenticating RTP packets.&lt;/li>
&lt;li>&lt;strong>Data encryption&lt;/strong>: All &lt;code>RTCDataChannel&lt;/code> data is encrypted using &lt;strong>DTLS (Datagram Transport Layer Security)&lt;/strong>. DTLS is a protocol based on TLS that provides the same security guarantees for datagrams.&lt;/li>
&lt;/ul>
&lt;p>Key exchange is automatically completed during the &lt;code>RTCPeerConnection&lt;/code> establishment process through the DTLS handshake. This means a secure channel is established before any media or data exchange occurs.&lt;/p>
&lt;h2 id="6-practical-application-cases">6. Practical Application Cases&lt;/h2>
&lt;p>With its powerful features, WebRTC has been widely applied in various scenarios:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Video conferencing systems&lt;/strong>: Such as Google Meet, Jitsi Meet, etc., allowing multi-party real-time audio/video calls.&lt;/li>
&lt;li>&lt;strong>Online education platforms&lt;/strong>: Enabling remote interactive teaching between teachers and students.&lt;/li>
&lt;li>&lt;strong>Telemedicine&lt;/strong>: Allowing doctors to conduct video consultations with patients remotely.&lt;/li>
&lt;li>&lt;strong>P2P file sharing&lt;/strong>: Using &lt;code>RTCDataChannel&lt;/code> for fast file transfers between browsers.&lt;/li>
&lt;li>&lt;strong>Cloud gaming and real-time games&lt;/strong>: Providing low-latency instruction and data synchronization for games.&lt;/li>
&lt;li>&lt;strong>Online customer service and video support&lt;/strong>: Businesses providing real-time video support services to customers through web pages.&lt;/li>
&lt;/ul>
&lt;h2 id="7-conclusion">7. Conclusion&lt;/h2>
&lt;p>WebRTC is a revolutionary technology that brings real-time communication capabilities directly into browsers, greatly lowering the barrier to developing rich media applications. Through the three core APIs of &lt;code>RTCPeerConnection&lt;/code>, &lt;code>MediaStream&lt;/code>, and &lt;code>RTCDataChannel&lt;/code>, combined with powerful signaling, ICE, and security mechanisms, WebRTC provides a complete, robust, and secure real-time communication solution.&lt;/p>
&lt;p>As network technology develops and 5G becomes more widespread, WebRTC's application scenarios will become even broader, with its potential in emerging fields such as IoT, augmented reality (AR), and virtual reality (VR) gradually becoming apparent. For developers looking to integrate high-quality, low-latency communication features into their applications, WebRTC is undoubtedly one of the most worthwhile technologies to focus on and learn about today.&lt;/p></description></item></channel></rss>