<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Network Communication | Ziyang Lin</title><link>https://ziyanglin.netlify.app/en/categories/network-communication/</link><atom:link href="https://ziyanglin.netlify.app/en/categories/network-communication/index.xml" rel="self" type="application/rss+xml"/><description>Network Communication</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Sat, 28 Jun 2025 14:00:00 +0000</lastBuildDate><image><url>https://ziyanglin.netlify.app/img/icon-192.png</url><title>Network Communication</title><link>https://ziyanglin.netlify.app/en/categories/network-communication/</link></image><item><title>SIP and VoIP Communication Technology: A Comprehensive Guide from Principles to Practice</title><link>https://ziyanglin.netlify.app/en/post/sip-voip-technical-analysis/</link><pubDate>Sat, 28 Jun 2025 14:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/sip-voip-technical-analysis/</guid><description>&lt;h2 id="1-introduction-the-world-of-voip-and-sip">1. Introduction: The World of VoIP and SIP&lt;/h2>
&lt;h3 id="11-what-is-voip">1.1 What is VoIP?&lt;/h3>
&lt;p>VoIP (Voice over Internet Protocol) is a revolutionary technology that transmits voice communications over IP networks. Essentially, it digitizes, compresses, and packages human voice (analog signals), transmits them through IP networks (like the internet), and then unpacks, decompresses, and converts them back to sound at the receiving end.&lt;/p>
&lt;p>&lt;strong>Core Concept&lt;/strong>: Treating voice as data, transmitting it over networks just like sending emails or browsing websites.&lt;/p>
&lt;p>This removes the dependency on dedicated physical telephone lines that characterizes the traditional telephone system (PSTN, Public Switched Telephone Network), bringing considerable flexibility and cost advantages.&lt;/p>
&lt;h3 id="12-sip-the-traffic-director-of-voip">1.2 SIP: The &amp;ldquo;Traffic Director&amp;rdquo; of VoIP&lt;/h3>
&lt;p>If VoIP is a complete communication system, then SIP (Session Initiation Protocol) is its brain and traffic director.&lt;/p>
&lt;p>SIP itself doesn't transmit voice data. Its core responsibility is &lt;strong>signaling&lt;/strong>, handling the &lt;strong>creation (Setup), management, and termination (Teardown)&lt;/strong> of communication sessions.&lt;/p>
&lt;p>It can be understood this way:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>You want to call a friend&lt;/strong>: SIP is responsible for finding where your friend is (address resolution), telling their phone &amp;ldquo;someone's looking for you,&amp;rdquo; making their phone ring (session invitation).&lt;/li>
&lt;li>&lt;strong>Your friend answers the call&lt;/strong>: SIP confirms both parties are ready and the conversation can begin.&lt;/li>
&lt;li>&lt;strong>The call ends, you hang up&lt;/strong>: SIP notifies both parties that the call has ended and resources can be released.&lt;/li>
&lt;/ul>
&lt;p>SIP is a text-based application layer protocol modeled on HTTP and SMTP, which makes it easy to read and extend. Because of this flexibility, SIP has become the mainstream signaling protocol in modern VoIP systems.&lt;/p>
&lt;h3 id="13-voip-vs-pstn-a-communication-revolution">1.3 VoIP vs. PSTN: A Communication Revolution&lt;/h3>
&lt;p>To more intuitively understand the disruptive nature of VoIP, we can compare it with traditional PSTN.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Feature&lt;/th>
&lt;th align="left">PSTN (Traditional Telephone)&lt;/th>
&lt;th align="left">VoIP (Network Telephone)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">&lt;strong>Network Foundation&lt;/strong>&lt;/td>
&lt;td align="left">Dedicated, circuit-switched network&lt;/td>
&lt;td align="left">Common, packet-switched IP network&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Connection Method&lt;/strong>&lt;/td>
&lt;td align="left">Establishes a physical exclusive line before calling&lt;/td>
&lt;td align="left">Data packets are independently routed in the network, sharing bandwidth&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Core Principle&lt;/strong>&lt;/td>
&lt;td align="left">Circuit switching&lt;/td>
&lt;td align="left">Packet switching&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Functionality&lt;/strong>&lt;/td>
&lt;td align="left">Mainly limited to voice calls&lt;/td>
&lt;td align="left">Integrates voice, video, messaging, presence display, etc.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Cost&lt;/strong>&lt;/td>
&lt;td align="left">Depends on distance and call duration, expensive long-distance calls&lt;/td>
&lt;td align="left">Mainly depends on network bandwidth cost, no difference between long-distance and local calls&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Flexibility&lt;/strong>&lt;/td>
&lt;td align="left">Number bound to physical line&lt;/td>
&lt;td align="left">Number (address) bound to user, can be used anywhere with network access&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Phone A] -- Analog Signal --&amp;gt; B(PSTN Switch)
B -- Establish Physical Circuit --&amp;gt; C(PSTN Switch)
C -- Analog Signal --&amp;gt; D[Phone B]
subgraph Traditional PSTN Call
A &amp;amp; B &amp;amp; C &amp;amp; D
end
E[VoIP Terminal A] -- Digital Packets --&amp;gt; F{Internet / IP Network}
F -- Digital Packets --&amp;gt; G[VoIP Terminal B]
subgraph VoIP Call
E &amp;amp; F &amp;amp; G
end
&lt;/code>&lt;/pre>
&lt;p>In the following chapters, we will delve into the technology stack that makes up VoIP systems and analyze every detail of the SIP protocol.&lt;/p>
&lt;h2 id="2-voip-core-technology-stack-macro-perspective">2. VoIP Core Technology Stack (Macro Perspective)&lt;/h2>
&lt;p>From a macro perspective, VoIP is not a single technology but a complex yet orderly technological system composed of multiple protocols working together. Understanding its layered architecture is key to grasping the global view of VoIP.&lt;/p>
&lt;h3 id="21-layered-architecture">2.1 Layered Architecture&lt;/h3>
&lt;p>The VoIP technology stack can be roughly divided into four layers, each depending on the services provided by the layer below it.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;&amp;lt;b&amp;gt;Application Layer&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;SIP, SDP, Voice/Video Applications&amp;quot;]
B[&amp;quot;&amp;lt;b&amp;gt;Transport Layer&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;UDP, TCP, RTP, RTCP&amp;quot;]
C[&amp;quot;&amp;lt;b&amp;gt;Network Layer&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;IP&amp;quot;]
D[&amp;quot;&amp;lt;b&amp;gt;Data Link &amp;amp; Physical Layer&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Ethernet, Wi-Fi, 4G/5G&amp;quot;]
A --&amp;gt; B --&amp;gt; C --&amp;gt; D
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Application Layer&lt;/strong>: This is the layer closest to users.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Signaling Protocols&lt;/strong>: Such as &lt;strong>SIP&lt;/strong>, which we focus on, and the older ITU-T alternative &lt;strong>H.323&lt;/strong>. They are responsible for control operations like &amp;ldquo;making calls&amp;rdquo; and &amp;ldquo;hanging up.&amp;rdquo;&lt;/li>
&lt;li>&lt;strong>Media Description Protocol&lt;/strong>: &lt;strong>SDP (Session Description Protocol)&lt;/strong> plays a crucial role. It doesn't transmit media but is used to describe media stream attributes in detail, such as: What codec to use (G.711, Opus)? What are the IP address and port? Is it audio or video? SDP content is typically exchanged &amp;ldquo;carried&amp;rdquo; by SIP.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Transport Layer&lt;/strong>: Responsible for end-to-end data transmission.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>UDP (User Datagram Protocol)&lt;/strong>: Because it is connectionless and adds very little overhead, it is the &lt;strong>preferred choice&lt;/strong> for transmitting VoIP media data (voice packets). It does not guarantee delivery, which is acceptable for real-time voice: losing a packet or two causes at most a brief glitch, whereas waiting for a retransmission would add serious delay and jitter. &lt;strong>RTP (Real-time Transport Protocol)&lt;/strong> normally runs on top of UDP.&lt;/li>
&lt;li>&lt;strong>TCP (Transmission Control Protocol)&lt;/strong>: Signaling messages (like SIP) must not be lost, so TCP is a common choice for carrying them. It ensures critical commands like &amp;ldquo;INVITE&amp;rdquo; or &amp;ldquo;BYE&amp;rdquo; are delivered. SIP can also run over UDP, in which case it ensures reliability through its own retransmission mechanism.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Network Layer&lt;/strong>: The core is &lt;strong>IP (Internet Protocol)&lt;/strong>, responsible for packet routing and addressing, ensuring data packets can travel from the source through complex networks to reach their destination.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Data Link &amp;amp; Physical Layer&lt;/strong>: This is the most fundamental infrastructure, including Ethernet, Wi-Fi, fiber optics, etc., responsible for transmitting data bit streams over physical media.&lt;/p>
&lt;/li>
&lt;/ul>
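&lt;p>As a rough illustration of the retransmission mechanism mentioned above, the sketch below computes when a client would resend an unanswered &lt;code>INVITE&lt;/code> over UDP. It assumes the default timer values from RFC 3261 (T1 = 500 ms, the interval doubles after each retransmission, and the transaction gives up after 64 × T1). The function name &lt;code>invite_retransmit_times&lt;/code> is made up for this example.&lt;/p>

```python
def invite_retransmit_times(t1: float = 0.5) -> list[float]:
    """Return the times (seconds after the first send) at which an
    unanswered INVITE is retransmitted over UDP.

    Assumes RFC 3261 defaults: the interval starts at T1 and doubles
    after every retransmission; the transaction times out at 64 * T1.
    """
    timeout = 64 * t1
    times = []
    interval = t1
    elapsed = interval
    while timeout > elapsed:
        times.append(elapsed)
        interval *= 2
        elapsed += interval
    return times


print(invite_retransmit_times())
```

&lt;p>With the default T1 this gives six retransmissions before the 32-second timeout. Any provisional response such as &lt;code>100 Trying&lt;/code> would stop the schedule early.&lt;/p>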
&lt;h3 id="22-key-protocols-overview">2.2 Key Protocols Overview&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Protocol&lt;/th>
&lt;th align="left">Full Name&lt;/th>
&lt;th align="left">Layer&lt;/th>
&lt;th align="left">Core Function&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">&lt;strong>SIP&lt;/strong>&lt;/td>
&lt;td align="left">Session Initiation Protocol&lt;/td>
&lt;td align="left">Application Layer&lt;/td>
&lt;td align="left">Establish, manage, and terminate multimedia sessions (signaling control).&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>SDP&lt;/strong>&lt;/td>
&lt;td align="left">Session Description Protocol&lt;/td>
&lt;td align="left">Application Layer&lt;/td>
&lt;td align="left">Describe media session parameters, such as IP address, port, codec, etc.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>RTP&lt;/strong>&lt;/td>
&lt;td align="left">Real-time Transport Protocol&lt;/td>
&lt;td align="left">Transport Layer&lt;/td>
&lt;td align="left">Carry real-time data (such as voice, video), provide timestamps and sequence numbers.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>RTCP&lt;/strong>&lt;/td>
&lt;td align="left">Real-time Transport Control Protocol&lt;/td>
&lt;td align="left">Transport Layer&lt;/td>
&lt;td align="left">Used in conjunction with RTP, providing Quality of Service (QoS) monitoring and feedback.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>UDP&lt;/strong>&lt;/td>
&lt;td align="left">User Datagram Protocol&lt;/td>
&lt;td align="left">Transport Layer&lt;/td>
&lt;td align="left">Provide low-latency, unreliable datagram transmission for RTP.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>TCP&lt;/strong>&lt;/td>
&lt;td align="left">Transmission Control Protocol&lt;/td>
&lt;td align="left">Transport Layer&lt;/td>
&lt;td align="left">Provide reliable, connection-oriented transmission for signaling like SIP.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>STUN/TURN/ICE&lt;/strong>&lt;/td>
&lt;td align="left">(See NAT chapter)&lt;/td>
&lt;td align="left">Application Layer&lt;/td>
&lt;td align="left">Used to solve connectivity issues brought by Network Address Translation (NAT).&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>SRTP&lt;/strong>&lt;/td>
&lt;td align="left">Secure Real-time Transport Protocol&lt;/td>
&lt;td align="left">Transport/Application Layer&lt;/td>
&lt;td align="left">Secure version of RTP, providing encryption and authentication for media streams.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>TLS&lt;/strong>&lt;/td>
&lt;td align="left">Transport Layer Security&lt;/td>
&lt;td align="left">Transport Layer&lt;/td>
&lt;td align="left">Used to encrypt SIP signaling (SIPS), ensuring confidentiality and integrity of signaling.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>With this macro picture in place, we can now turn to the most important protocol, SIP, and see how it directs communication.&lt;/p>
&lt;h2 id="3-sip-protocol-indepth-analysis-micro-details">3. SIP Protocol In-Depth Analysis (Micro Details)&lt;/h2>
&lt;p>Now, we formally enter the world of SIP. SIP's design philosophy is &amp;ldquo;simplicity&amp;rdquo; and &amp;ldquo;extensibility,&amp;rdquo; borrowing heavily from HTTP design concepts. If you understand HTTP, learning SIP will feel very familiar.&lt;/p>
&lt;h3 id="31-sip-core-components">3.1 SIP Core Components&lt;/h3>
&lt;p>A typical SIP network consists of the following logical components:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph User A
UAC_A[User Agent Client UAC]
UAS_A[User Agent Server UAS]
UAC_A &amp;lt;--&amp;gt; UAS_A
end
subgraph SIP Network Infrastructure
Proxy[Proxy Server]
Registrar[Registrar Server]
Redirect[Redirect Server]
Proxy --- Registrar
end
subgraph User B
UAC_B[User Agent Client UAC]
UAS_B[User Agent Server UAS]
UAC_B &amp;lt;--&amp;gt; UAS_B
end
UAC_A -- SIP Request --&amp;gt; Proxy;
Proxy -- SIP Request --&amp;gt; UAS_B;
UAS_B -- SIP Response --&amp;gt; Proxy;
Proxy -- SIP Response --&amp;gt; UAC_A;
UAC_A -- Register --&amp;gt; Registrar;
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>User Agent (UA)&lt;/strong>: This is the terminal device in the SIP world. It can be:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Hardware Phone&lt;/strong>: Looks like a traditional phone but runs the SIP protocol internally.&lt;/li>
&lt;li>&lt;strong>Softphone&lt;/strong>: An application installed on a computer or mobile phone.&lt;/li>
&lt;li>Any device capable of initiating or receiving SIP sessions.&lt;/li>
&lt;/ul>
&lt;p>A UA contains two parts:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>User Agent Client (UAC)&lt;/strong>: Responsible for &lt;strong>initiating&lt;/strong> SIP requests. When you make a call, your device is a UAC.&lt;/li>
&lt;li>&lt;strong>User Agent Server (UAS)&lt;/strong>: Responsible for &lt;strong>receiving&lt;/strong> SIP requests and providing responses. When your phone rings, your device is a UAS.
In a complete two-way call, &lt;strong>each party's device is simultaneously both a UAC and a UAS&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Proxy Server&lt;/strong>: This is the central nervous system of the SIP network. It receives requests from UACs and &lt;strong>forwards&lt;/strong> them to the target UAS. The proxy server itself does not initiate requests, but it may modify certain parts of the request for policy enforcement (such as billing, routing policies). It is the &amp;ldquo;middleman&amp;rdquo; of the call.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Registrar Server&lt;/strong>: It functions like an &amp;ldquo;address book.&amp;rdquo; When a UA starts up and connects to the network, it sends a &lt;code>REGISTER&lt;/code> request to the Registrar, telling the server: &amp;ldquo;I'm Bob, my SIP address is &lt;code>sip:bob@example.com&lt;/code>, and my current IP address is &lt;code>192.168.1.100&lt;/code>&amp;rdquo;. The Registrar is responsible for maintaining this address mapping relationship (i.e., the binding between the user's SIP URI and their actual network location). When someone wants to call Bob, the Proxy server queries the Registrar to find Bob's current location.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Redirect Server&lt;/strong>: It's somewhat similar to a Proxy, but &amp;ldquo;lazier.&amp;rdquo; When it receives a request, it doesn't forward it itself but directly replies to the UAC with a &amp;ldquo;3xx&amp;rdquo; response, telling the UAC: &amp;ldquo;The person you're looking for is at &lt;code>sip:bob@192.168.1.100&lt;/code>, go find him yourself.&amp;rdquo; The UAC needs to initiate a new request based on this new address. This mode is less common in practical applications than the proxy mode.&lt;/p>
&lt;/li>
&lt;/ul>
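&lt;p>The division of labor between Registrar, Proxy, and Redirect Server can be sketched with a toy in-memory location table. This is only an illustration under simplifying assumptions: the names &lt;code>bindings&lt;/code>, &lt;code>register&lt;/code>, &lt;code>proxy_route&lt;/code>, and &lt;code>redirect_route&lt;/code> are hypothetical, and a real server would also handle authentication, binding expiry, and multiple contacts per user.&lt;/p>

```python
# Hypothetical in-memory location service: SIP URI (address-of-record) -> contact URI.
bindings: dict[str, str] = {}


def register(aor: str, contact: str) -> str:
    """Registrar: store where the user can currently be reached."""
    bindings[aor] = contact
    return 'SIP/2.0 200 OK'


def proxy_route(request_uri: str) -> str:
    """Proxy: look up the target and forward the request itself."""
    contact = bindings.get(request_uri)
    if contact is None:
        return 'SIP/2.0 404 Not Found'
    return f'forward INVITE to {contact}'


def redirect_route(request_uri: str) -> str:
    """Redirect server: tell the caller where to go instead of forwarding."""
    contact = bindings.get(request_uri)
    if contact is None:
        return 'SIP/2.0 404 Not Found'
    return f'SIP/2.0 302 Moved Temporarily; Contact: {contact}'


register('sip:bob@example.com', 'sip:bob@192.168.1.100')
print(proxy_route('sip:bob@example.com'))
print(redirect_route('sip:bob@example.com'))
```

&lt;p>Both routing functions read the same binding. The only difference is who sends the next request: the proxy itself, or the original UAC after receiving the 3xx response.&lt;/p>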
&lt;h3 id="32-sip-messages-the-harmony-with-http">3.2 SIP Messages: The Harmony with HTTP&lt;/h3>
&lt;p>SIP messages are plain text and come in two types: &lt;strong>Request&lt;/strong> and &lt;strong>Response&lt;/strong>.&lt;/p>
&lt;p>&lt;strong>A typical SIP request (INVITE):&lt;/strong>&lt;/p>
&lt;pre>&lt;code>INVITE sip:bob@biloxi.com SIP/2.0
Via: SIP/2.0/UDP pc33.atlanta.com;branch=z9hG4bK776asdhds
Max-Forwards: 70
To: Bob &amp;lt;sip:bob@biloxi.com&amp;gt;
From: Alice &amp;lt;sip:alice@atlanta.com&amp;gt;;tag=1928301774
Call-ID: a84b4c76e66710
CSeq: 314159 INVITE
Contact: &amp;lt;sip:alice@pc33.atlanta.com&amp;gt;
Content-Type: application/sdp
Content-Length: 142
(Message body: SDP content here...)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Request message structure analysis:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Request Line&lt;/strong>: &lt;code>Method Request-URI Version&lt;/code>
&lt;ul>
&lt;li>&lt;strong>Method&lt;/strong>: Defines the purpose of the request. Common methods include:
&lt;ul>
&lt;li>&lt;code>INVITE&lt;/code>: Initiates a session invitation.&lt;/li>
&lt;li>&lt;code>ACK&lt;/code>: Confirms a final response to an &lt;code>INVITE&lt;/code>.&lt;/li>
&lt;li>&lt;code>BYE&lt;/code>: Terminates an established session.&lt;/li>
&lt;li>&lt;code>CANCEL&lt;/code>: Cancels an incomplete &lt;code>INVITE&lt;/code> request.&lt;/li>
&lt;li>&lt;code>REGISTER&lt;/code>: Registers user location with a Registrar server.&lt;/li>
&lt;li>&lt;code>OPTIONS&lt;/code>: Queries the capabilities of a server or UA.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Request-URI&lt;/strong>: The target address of the request, i.e., &lt;code>sip:user@domain&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Header Fields&lt;/strong>: Key-value pairs in the form of &lt;code>Field Name: Field Value&lt;/code>, providing detailed information about the message.
&lt;ul>
&lt;li>&lt;code>Via&lt;/code>: Records the path the request has taken. Each hop proxy adds its own address at the top. Response messages will return along the path specified by the &lt;code>Via&lt;/code> header. The &lt;code>branch&lt;/code> parameter is a key part of the transaction ID.&lt;/li>
&lt;li>&lt;code>From&lt;/code> / &lt;code>To&lt;/code>: Represent the initiator and recipient of the call, respectively. The &lt;code>tag&lt;/code> parameter uniquely identifies a party in a call and is key to the dialog.&lt;/li>
&lt;li>&lt;code>Call-ID&lt;/code>: Uniquely identifies a complete call globally. All requests and responses related to this call use the same &lt;code>Call-ID&lt;/code>.&lt;/li>
&lt;li>&lt;code>CSeq&lt;/code>: Command Sequence, containing a number and a method name, used to order and distinguish multiple transactions under the same &lt;code>Call-ID&lt;/code>.&lt;/li>
&lt;li>&lt;code>Contact&lt;/code>: Provides a direct contact address (URI) for the request initiator. In an &lt;code>INVITE&lt;/code>, it tells the other party where subsequent requests (like &lt;code>BYE&lt;/code>) should be sent directly.&lt;/li>
&lt;li>&lt;code>Content-Type&lt;/code>: Describes the media type of the message body, typically &lt;code>application/sdp&lt;/code>.&lt;/li>
&lt;li>&lt;code>Content-Length&lt;/code>: The length of the message body in bytes.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
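&lt;p>Because SIP requests are plain text, the structure described above can be pulled apart with a few string operations. The following is a minimal sketch, not a complete parser: &lt;code>parse_sip_request&lt;/code> is a hypothetical helper, and it ignores repeated headers, header line folding, and compact header names.&lt;/p>

```python
def parse_sip_request(raw: str) -> dict:
    """Split a SIP request into method, Request-URI, version, headers and body."""
    # Headers and body are separated by an empty line (CRLF CRLF).
    head, _, body = raw.partition('\r\n\r\n')
    lines = head.split('\r\n')
    # Request line: Method Request-URI Version
    method, request_uri, version = lines[0].split(' ', 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(':')
        headers[name.strip()] = value.strip()
    return {'method': method, 'uri': request_uri, 'version': version,
            'headers': headers, 'body': body}


raw = (
    'INVITE sip:bob@biloxi.com SIP/2.0\r\n'
    'Via: SIP/2.0/UDP pc33.atlanta.com;branch=z9hG4bK776asdhds\r\n'
    'Call-ID: a84b4c76e66710\r\n'
    'CSeq: 314159 INVITE\r\n'
    'Content-Length: 0\r\n'
    '\r\n'
)
msg = parse_sip_request(raw)
print(msg['method'], msg['uri'], msg['headers']['CSeq'])
```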
&lt;p>&lt;strong>A typical SIP response (200 OK):&lt;/strong>&lt;/p>
&lt;pre>&lt;code>SIP/2.0 200 OK
Via: SIP/2.0/UDP pc33.atlanta.com;branch=z9hG4bK776asdhds;received=192.0.2.4
To: Bob &amp;lt;sip:bob@biloxi.com&amp;gt;;tag=a6c85cf
From: Alice &amp;lt;sip:alice@atlanta.com&amp;gt;;tag=1928301774
Call-ID: a84b4c76e66710
CSeq: 314159 INVITE
Contact: &amp;lt;sip:bob@198.51.100.3&amp;gt;
Content-Type: application/sdp
Content-Length: 131
(Message body: SDP content here...)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Response message structure analysis:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Status Line&lt;/strong>: &lt;code>Version Status-Code Reason-Phrase&lt;/code>&lt;/li>
&lt;/ul>
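&lt;p>The status line can be split the same way as a request line, and the first digit of the status code identifies the response class (1xx provisional through 6xx global failure). A small sketch, with &lt;code>parse_status_line&lt;/code> and &lt;code>STATUS_CLASSES&lt;/code> as illustrative names:&lt;/p>

```python
STATUS_CLASSES = {
    1: 'Provisional',
    2: 'Success',
    3: 'Redirection',
    4: 'Client Error',
    5: 'Server Error',
    6: 'Global Failure',
}


def parse_status_line(line: str) -> tuple[str, int, str, str]:
    """Return (version, status code, reason phrase, response class)."""
    # The reason phrase may contain spaces, so split at most twice.
    version, code, reason = line.split(' ', 2)
    status = int(code)
    return version, status, reason, STATUS_CLASSES[status // 100]


print(parse_status_line('SIP/2.0 180 Ringing'))
print(parse_status_line('SIP/2.0 200 OK'))
```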
&lt;h3 id="33-a-complete-call-sip-session-flow-explained">3.3 A Complete Call: SIP Session Flow Explained&lt;/h3>
&lt;p>Below, we use a Mermaid sequence diagram to break down a typical SIP call flow, from user registration, to Alice calling Bob, to the final hang-up.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant Alice as Alice's UA
participant Proxy as SIP Proxy Server
participant Registrar as Registrar Server
participant Bob as Bob's UA
Alice-&amp;gt;&amp;gt;Registrar: REGISTER
Registrar-&amp;gt;&amp;gt;Alice: 200 OK
Bob-&amp;gt;&amp;gt;Registrar: REGISTER
Registrar-&amp;gt;&amp;gt;Bob: 200 OK
Alice-&amp;gt;&amp;gt;Proxy: INVITE
Proxy-&amp;gt;&amp;gt;Bob: INVITE
Bob-&amp;gt;&amp;gt;Proxy: 180 Ringing
Proxy-&amp;gt;&amp;gt;Alice: 180 Ringing
Bob-&amp;gt;&amp;gt;Proxy: 200 OK
Proxy-&amp;gt;&amp;gt;Alice: 200 OK
Alice-&amp;gt;&amp;gt;Proxy: ACK
Proxy-&amp;gt;&amp;gt;Bob: ACK
Alice-&amp;gt;&amp;gt;Bob: RTP Media Stream
Bob-&amp;gt;&amp;gt;Alice: RTP Media Stream
Bob-&amp;gt;&amp;gt;Proxy: BYE
Proxy-&amp;gt;&amp;gt;Alice: BYE
Alice-&amp;gt;&amp;gt;Proxy: 200 OK
Proxy-&amp;gt;&amp;gt;Bob: 200 OK
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Flow Breakdown&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Registration (1-4)&lt;/strong>: After coming online, Alice and Bob each register their locations with the Registrar. This is the prerequisite for others to find them.&lt;/li>
&lt;li>&lt;strong>Call (5-12)&lt;/strong>: This is the famous &amp;ldquo;three-way handshake&amp;rdquo; process (&lt;code>INVITE&lt;/code> -&amp;gt; &lt;code>200 OK&lt;/code> -&amp;gt; &lt;code>ACK&lt;/code>).
&lt;ul>
&lt;li>&lt;strong>INVITE&lt;/strong>: Alice initiates the call, carrying her prepared media information (SDP) in the request, describing the media types, codecs, and IP/port she can receive.&lt;/li>
&lt;li>&lt;strong>1xx Provisional Responses&lt;/strong>: The Proxy and Bob return &lt;code>100 Trying&lt;/code> (not shown in the diagram) and &lt;code>180 Ringing&lt;/code>, telling Alice &amp;ldquo;please wait, processing/the other phone is ringing.&amp;rdquo; This effectively prevents the UAC from resending &lt;code>INVITE&lt;/code> due to timeout.&lt;/li>
&lt;li>&lt;strong>200 OK&lt;/strong>: When Bob answers the call, his UA sends a &lt;code>200 OK&lt;/code> response containing &lt;strong>his own SDP information&lt;/strong>. At this point, media negotiation is complete, and both parties know each other's media capabilities and receiving addresses.&lt;/li>
&lt;li>&lt;strong>ACK&lt;/strong>: After receiving the &lt;code>200 OK&lt;/code>, Alice must send an &lt;code>ACK&lt;/code> request to confirm. &lt;code>ACK&lt;/code> is an independent transaction used to confirm the final response. When Bob receives the &lt;code>ACK&lt;/code>, a complete SIP dialog is formally established.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Media Transmission&lt;/strong>: After the dialog is established, Alice and Bob can bypass the Proxy server and &lt;strong>directly&lt;/strong> send RTP voice packets to each other based on the IP and port information obtained from each other's SDP. &lt;strong>The path taken by signaling (through Proxy) and media (P2P) can be different&lt;/strong>, which is an important feature of SIP.&lt;/li>
&lt;li>&lt;strong>Termination (13-16)&lt;/strong>: Either party can end the call by sending a &lt;code>BYE&lt;/code> request. Upon receiving it, the other party replies with a &lt;code>200 OK&lt;/code>, and the call is cleanly terminated.&lt;/li>
&lt;/ol>
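&lt;p>Seen from Alice's side, the call flow above behaves like a small state machine. The sketch below is a deliberate simplification: the state names and the &lt;code>TRANSITIONS&lt;/code> table are invented for illustration and do not correspond to the formal transaction state machines in the SIP specification.&lt;/p>

```python
# Hypothetical caller-side view of the flow: (state, event) -> next state.
TRANSITIONS = {
    ('idle', 'send INVITE'): 'calling',
    ('calling', 'recv 100 Trying'): 'calling',
    ('calling', 'recv 180 Ringing'): 'ringing',
    ('ringing', 'recv 200 OK'): 'answered',
    ('answered', 'send ACK'): 'in call',
    ('in call', 'recv BYE'): 'terminating',
    ('terminating', 'send 200 OK'): 'idle',
}


def run_call(events: list[str], state: str = 'idle') -> list[str]:
    """Apply events in order and return every state visited."""
    visited = [state]
    for event in events:
        state = TRANSITIONS[(state, event)]  # KeyError on an out-of-order message
        visited.append(state)
    return visited


alice_events = ['send INVITE', 'recv 180 Ringing', 'recv 200 OK',
                'send ACK', 'recv BYE', 'send 200 OK']
print(run_call(alice_events))
```

&lt;p>RTP media flows only while the machine is in the &lt;code>in call&lt;/code> state. An out-of-order message, such as a &lt;code>BYE&lt;/code> before the &lt;code>ACK&lt;/code>, raises an error in this toy model.&lt;/p>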
&lt;h3 id="34-sdp-blueprint-for-media-sessions">3.4 SDP: Blueprint for Media Sessions&lt;/h3>
&lt;p>SDP (Session Description Protocol) is the natural companion of SIP, but it is an independent protocol (RFC 4566, since obsoleted by RFC 8866). It doesn't transmit any media data itself but is used to &lt;strong>describe&lt;/strong> media sessions. It's like a blueprint, detailing the specifications of the &amp;ldquo;communication building&amp;rdquo; to be constructed.&lt;/p>
&lt;p>&lt;strong>A typical SDP example (in an INVITE request):&lt;/strong>&lt;/p>
&lt;pre>&lt;code>v=0
o=alice 2890844526 2890844526 IN IP4 pc33.atlanta.com
s=SIP Call
c=IN IP4 192.0.2.4
t=0 0
m=audio 49170 RTP/AVP 0 8 97
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:97 iLBC/8000
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Key fields analysis&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;code>v=0&lt;/code>: Protocol version.&lt;/li>
&lt;li>&lt;code>o=&lt;/code>: (owner/creator) Describes the session initiator's information, including username, session ID, version number, etc.&lt;/li>
&lt;li>&lt;code>s=&lt;/code>: Session name.&lt;/li>
&lt;li>&lt;code>c=&lt;/code>: (connection data) Connection information. &lt;strong>Very important&lt;/strong>, it specifies the address where media streams should be sent (&lt;code>IN&lt;/code> means Internet, &lt;code>IP4&lt;/code> means IPv4, followed by the IP address).&lt;/li>
&lt;li>&lt;code>t=&lt;/code>: (time) Session start and end times, &lt;code>0 0&lt;/code> means permanent.&lt;/li>
&lt;li>&lt;code>m=&lt;/code>: (media description) Media description. &lt;strong>Crucial&lt;/strong>.
&lt;ul>
&lt;li>&lt;code>audio&lt;/code>: Media type is audio.&lt;/li>
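&lt;li>&lt;code>49170&lt;/code>: &lt;strong>Port on which the sender of this SDP wants to receive media&lt;/strong>, i.e., the port the other party should send to.&lt;/li>
&lt;li>&lt;code>RTP/AVP&lt;/code>: Transport protocol used is RTP with the standard Audio/Video Profile.&lt;/li>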
&lt;li>&lt;code>0 8 97&lt;/code>: &lt;strong>Proposed codec list&lt;/strong> (payload type). This is a priority list, meaning &amp;ldquo;I prefer to use 0, then 8, then 97.&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;code>a=rtpmap: ...&lt;/code>: (attribute) Attribute line, mapping the payload type numbers in the &lt;code>m&lt;/code> line to specific codec names and clock frequencies. For example, &lt;code>a=rtpmap:0 PCMU/8000&lt;/code> means payload type 0 corresponds to G.711u (PCMU) with a sampling rate of 8000Hz.&lt;/li>
&lt;/ul>
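&lt;p>The fields analyzed above are simple &lt;code>key=value&lt;/code> lines, so the parts that matter for media setup can be extracted with a short loop. This is a sketch under assumptions: &lt;code>parse_sdp_audio&lt;/code> is a hypothetical helper that handles a single audio stream and a session-level &lt;code>c=&lt;/code> line only.&lt;/p>

```python
def parse_sdp_audio(sdp: str) -> dict:
    """Extract the connection address, audio port and payload-type map."""
    result = {'address': None, 'port': None, 'payload_types': [], 'codecs': {}}
    for line in sdp.strip().splitlines():
        key, _, value = line.strip().partition('=')
        if key == 'c':
            # c=IN IP4 192.0.2.4 -> network type, address type, address
            result['address'] = value.split()[2]
        elif key == 'm' and value.startswith('audio'):
            # m=audio 49170 RTP/AVP 0 8 97 -> media, port, profile, payload types
            fields = value.split()
            result['port'] = int(fields[1])
            result['payload_types'] = [int(pt) for pt in fields[3:]]
        elif key == 'a' and value.startswith('rtpmap:'):
            # a=rtpmap:0 PCMU/8000 -> payload type, encoding name/clock rate
            pt, encoding = value[len('rtpmap:'):].split(' ', 1)
            result['codecs'][int(pt)] = encoding
    return result


offer = '''
v=0
o=alice 2890844526 2890844526 IN IP4 pc33.atlanta.com
s=SIP Call
c=IN IP4 192.0.2.4
t=0 0
m=audio 49170 RTP/AVP 0 8 97
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:97 iLBC/8000
'''
print(parse_sdp_audio(offer))
```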
&lt;p>This model is called the &lt;strong>Offer/Answer Model&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Offer&lt;/strong>: Alice sends her SDP in the &lt;code>INVITE&lt;/code>, which is an Offer, listing all the codecs she supports and her receiving address/port.&lt;/li>
&lt;li>&lt;strong>Answer&lt;/strong>: Upon receiving it, Bob selects a codec he also supports from Alice's list (e.g., PCMA) and returns this selected codec along with &lt;strong>his own&lt;/strong> receiving address/port in the SDP of the &lt;code>200 OK&lt;/code>.&lt;/li>
&lt;/ol>
&lt;p>When Alice receives this Answer, both parties have reached a consensus: use the PCMA codec, Alice sends RTP packets to Bob's IP/port, and Bob sends RTP packets to Alice's IP/port.&lt;/p>
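&lt;p>The answerer's choice can be sketched as picking the first payload type in the offer that it also supports. The helper &lt;code>choose_codec&lt;/code> and Bob's codec set are assumptions made for this example; real user agents may apply their own preferences or answer with more than one codec.&lt;/p>

```python
from typing import Optional


def choose_codec(offered: list[int], supported: set[int]) -> Optional[int]:
    """Return the first offered payload type the answerer supports.

    The offer is walked in order, so the offerer's preference wins.
    """
    for payload_type in offered:
        if payload_type in supported:
            return payload_type
    return None


alice_offer = [0, 8, 97]   # PCMU, PCMA, iLBC, in Alice's order of preference
bob_supported = {8, 18}    # assumption: Bob's UA only has PCMA (8) and G.729 (18)
print(choose_codec(alice_offer, bob_supported))
```

&lt;p>Under these assumptions the result is payload type 8 (PCMA), matching the outcome described above. If there is no overlap, the function returns &lt;code>None&lt;/code> and the call cannot proceed with that offer.&lt;/p>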
&lt;h2 id="4-media-stream-transmission-rtp-and-rtcp">4. Media Stream Transmission: RTP and RTCP&lt;/h2>
&lt;p>We have successfully established the &amp;ldquo;signaling&amp;rdquo; connection through SIP/SDP, like two airport control centers coordinating flight plans between cities. Now, we need the actual &amp;ldquo;airplanes&amp;rdquo;—the RTP protocol—to transport our &amp;ldquo;passengers&amp;rdquo;—voice and video data.&lt;/p>
&lt;h3 id="41-rtp-born-for-realtime-data">4.1 RTP: Born for Real-time Data&lt;/h3>
&lt;p>RTP (Real-time Transport Protocol, RFC 3550) is a network protocol specifically designed for end-to-end transmission of real-time data such as audio and video. It typically runs on top of &lt;strong>UDP&lt;/strong>.&lt;/p>
&lt;p>&lt;strong>Why UDP?&lt;/strong>
TCP provides reliable, ordered transmission, but at a cost: when a packet is lost, the receiver cannot hand any later data to the application until the lost packet has been retransmitted and received (head-of-line blocking). For real-time voice, this delay paid for &amp;ldquo;reliability&amp;rdquo; is fatal. Losing a small piece of voice (perhaps just a fraction of a second of silence or faint noise) is far better than making the entire conversation stutter while waiting for it. RTP is based on this principle of &amp;ldquo;tolerating packet loss, not tolerating delay,&amp;rdquo; choosing UDP as its ideal carrier.&lt;/p>
&lt;p>However, pure UDP just throws data packets to the other side on a &amp;ldquo;best effort&amp;rdquo; basis, providing no timing information or knowledge of packet order. RTP adds an additional header on top of UDP, giving data packets &amp;ldquo;life&amp;rdquo;: &lt;strong>timestamps&lt;/strong> and &lt;strong>sequence numbers&lt;/strong>.&lt;/p>
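&lt;p>A short sketch of what these two fields buy the receiver: the timestamp advances by a fixed number of clock ticks per packet, and gaps in the sequence numbers reveal lost packets. The function names are illustrative, and the loss check ignores sequence number wrap-around to stay short.&lt;/p>

```python
def timestamp_step(clock_rate_hz: int, packet_ms: int) -> int:
    """RTP timestamp increment between consecutive packets."""
    return clock_rate_hz * packet_ms // 1000


def missing_sequence_numbers(received: list[int]) -> list[int]:
    """Reorder packets by sequence number and list the gaps.

    Simplification: ignores the 16-bit wrap-around from 65535 to 0.
    """
    ordered = sorted(received)
    seen = set(ordered)
    return [seq for seq in range(ordered[0], ordered[-1] + 1) if seq not in seen]


print(timestamp_step(8000, 20))
print(missing_sequence_numbers([100, 101, 104, 103]))
```

&lt;p>For 8000 Hz audio sent in 20 ms packets the timestamp grows by 160 per packet. In the second call, packets 103 and 104 arrived out of order and packet 102 never arrived.&lt;/p>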
&lt;p>&lt;strong>RTP Header Structure Explained&lt;/strong>&lt;/p>
&lt;p>A standard RTP header is at least 12 bytes, structured as follows:&lt;/p>
&lt;pre>&lt;code> 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       Sequence Number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Synchronization Source (SSRC) identifier            |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|            Contributing source (CSRC) identifiers             |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
&lt;/code>&lt;/pre>
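&lt;p>The fixed part of this layout maps directly onto a 12-byte binary structure. The sketch below decodes it with Python's &lt;code>struct&lt;/code> module; &lt;code>parse_rtp_header&lt;/code> is a hypothetical helper, and the sample packet values (sequence number, timestamp, SSRC) are arbitrary. Shifts and remainders are used in place of bit masks, which yields the same bits.&lt;/p>

```python
import struct


def parse_rtp_header(packet: bytes) -> dict:
    """Decode the 12-byte fixed RTP header (CSRC list and payload are not parsed)."""
    if 12 > len(packet):
        raise ValueError('RTP packet shorter than the 12-byte fixed header')
    # Network byte order: two single bytes, 16-bit sequence, 32-bit timestamp, 32-bit SSRC.
    b0, b1, seq, timestamp, ssrc = struct.unpack('!BBHII', packet[:12])
    return {
        'version': b0 >> 6,            # top 2 bits
        'padding': (b0 >> 5) % 2,      # next bit
        'extension': (b0 >> 4) % 2,    # next bit
        'csrc_count': b0 % 16,         # low 4 bits
        'marker': b1 >> 7,             # top bit of second byte
        'payload_type': b1 % 128,      # low 7 bits
        'sequence_number': seq,
        'timestamp': timestamp,
        'ssrc': ssrc,
    }


# Version 2, no padding/extension/CSRC -> first byte 0x80; PT 0 (PCMU), marker clear.
packet = struct.pack('!BBHII', 0x80, 0, 4711, 160, 0x1234ABCD) + b'\x00' * 160
print(parse_rtp_header(packet))
```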
&lt;ul>
&lt;li>&lt;strong>V (Version, 2 bits)&lt;/strong>: RTP protocol version, currently version 2.&lt;/li>
&lt;li>&lt;strong>P (Padding, 1 bit)&lt;/strong>: Padding bit. If set, indicates there are additional padding bytes at the end of the packet.&lt;/li>
&lt;li>&lt;strong>X (Extension, 1 bit)&lt;/strong>: Extension bit. If set, indicates an extension header follows the standard header.&lt;/li>
&lt;li>&lt;strong>CC (CSRC Count, 4 bits)&lt;/strong>: Contributing source count, indicating the number of CSRC identifiers following the fixed header.&lt;/li>
&lt;li>&lt;strong>M (Marker, 1 bit)&lt;/strong>: Marker bit. Its specific meaning is defined by the particular application profile. For example, in video streams, it can mark the end of a frame. In audio, it typically marks the first packet of a talkspurt, that is, the point where speech resumes after a silence period.&lt;/li>
&lt;li>&lt;strong>PT (Payload Type, 7 bits)&lt;/strong>: &lt;strong>Payload type&lt;/strong>. This is a very critical field, used to identify what format the media data in the RTP packet is. This number corresponds exactly to what we negotiated in the &lt;code>m=&lt;/code> line and &lt;code>a=rtpmap&lt;/code> line of SDP. For example, if SDP negotiation decides to use PCMU (payload type 0), then all RTP packets carrying PCMU data will have their PT field set to 0. When the receiver sees PT=0, it knows to use the PCMU decoder to process the data.&lt;/li>
&lt;li>&lt;strong>Sequence Number (16 bits)&lt;/strong>: &lt;strong>Sequence number&lt;/strong>. Increments by 1 for each RTP packet sent. This field has two core functions:
&lt;ol>
&lt;li>&lt;strong>Detecting packet loss&lt;/strong>: The receiver can determine if packets have been lost by checking if the received sequence numbers are consecutive.&lt;/li>
&lt;li>&lt;strong>Reordering&lt;/strong>: Due to different paths packets may take in the network, packets sent earlier might arrive later. The receiver can restore the original order of packets using the sequence number.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>&lt;strong>Timestamp (32 bits)&lt;/strong>: &lt;strong>Timestamp&lt;/strong>. &lt;strong>This is the soul of RTP&lt;/strong>. It records the sampling moment of the media data in the packet. &lt;strong>Note: This timestamp is not a &amp;ldquo;wall clock&amp;rdquo;&lt;/strong> but is based on the media's sampling clock. For example, for audio sampled at 8000Hz, the clock &amp;ldquo;ticks&amp;rdquo; 8000 times per second. If a packet contains 20 milliseconds of audio data, the timestamp of the next packet will increase by &lt;code>8000 * 0.020 = 160&lt;/code>.
The main functions of the timestamp are:
&lt;ol>
&lt;li>&lt;strong>Synchronizing playback and eliminating jitter&lt;/strong>: Jitter refers to variations in packet arrival delay. The receiver sets up a &amp;ldquo;jitter buffer&amp;rdquo; to play media smoothly based on the timestamps on the packets, rather than playing at varying speeds, thus providing a smooth auditory/visual experience.&lt;/li>
&lt;li>&lt;strong>Multimedia synchronization&lt;/strong>: In a call containing both audio and video, audio and video are two separate RTP streams (with different SSRCs), each with its own clock rate and random initial timestamp, so their RTP timestamps cannot be compared directly. The receiver uses the NTP/RTP timestamp pairs carried in RTCP Sender Reports to map both streams onto a common reference clock, which lets it align audio and video precisely and achieve &amp;ldquo;lip sync.&amp;rdquo;&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>&lt;strong>SSRC (Synchronization Source, 32 bits)&lt;/strong>: &lt;strong>Synchronization source&lt;/strong>. In an RTP session, each media stream source (such as a microphone or a camera) is assigned a randomly generated SSRC value that must be unique within that session. For example, if Alice is sending both audio and video in a call, she will generate two SSRCs, one for the audio stream and one for the video stream. Intermediate devices such as translators and mixers can distinguish different streams based on SSRC.&lt;/li>
&lt;li>&lt;strong>CSRC (Contributing Source, 0 to 15 entries of 32 bits each)&lt;/strong>: Contributing source. When multiple source media streams pass through a mixer and are merged into one stream, this list carries the SSRCs of the original contributors.
&lt;/li>
&lt;/ul>
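&lt;p>The RTP header fields described above map directly onto a few bytes on the wire. The following Python sketch is illustrative and not taken from any particular VoIP stack: it decodes the 12-byte fixed RTP header and reproduces the timestamp arithmetic from the 8000Hz example. The function names (&lt;code>parse_rtp_header&lt;/code>, &lt;code>timestamp_step&lt;/code>, &lt;code>packets_lost_between&lt;/code>) are hypothetical, and the loss check assumes packets arrive in order.&lt;/p>
&lt;pre>&lt;code class="language-python">import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Decode the 12-byte fixed RTP header (RFC 3550, section 5.1)."""
    if 12 > len(packet):
        raise ValueError("RTP packet shorter than the fixed header")
    first, second, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": first // 64,        # top 2 bits of the first byte
        "csrc_count": first % 16,      # low 4 bits of the first byte
        "marker": second // 128,       # top bit of the second byte
        "payload_type": second % 128,  # low 7 bits of the second byte
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }

def timestamp_step(clock_rate_hz: int, packet_ms: int) -> int:
    """Timestamp increment between packets carrying packet_ms of media."""
    return clock_rate_hz * packet_ms // 1000

def packets_lost_between(prev_seq: int, new_seq: int) -> int:
    """Packets missing between two received sequence numbers, modulo 2**16.

    Assumes in-order arrival; a reordered packet would show up as a huge gap.
    """
    return (new_seq - prev_seq - 1) % 65536
&lt;/code>&lt;/pre>
&lt;p>For 8000Hz audio sent in 20-millisecond packets, &lt;code>timestamp_step(8000, 20)&lt;/code> gives 160, matching the calculation above. The modulo in &lt;code>packets_lost_between&lt;/code> keeps the gap correct when the 16-bit sequence number wraps around.&lt;/p>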
&lt;h3 id="42-rtcp-rtps-control-partner">4.2 RTCP: RTP's &amp;ldquo;Control Partner&amp;rdquo;&lt;/h3>
&lt;p>RTP is only responsible for &amp;ldquo;cargo transport,&amp;rdquo; but it doesn't know the quality of the &amp;ldquo;shipping.&amp;rdquo; RTCP (Real-time Transport Control Protocol) is the accompanying &amp;ldquo;quality supervisor.&amp;rdquo; It works in parallel with RTP, periodically sending control packets between participants to monitor the quality of service (QoS) of data transmission.&lt;/p>
&lt;p>RTCP and RTP traditionally use different UDP ports (by convention, the RTP port number + 1), although both can be multiplexed on a single port, as WebRTC does. RTCP doesn't transmit any media data itself, only control information, and its bandwidth usage is typically limited to about 5% of the session bandwidth.&lt;/p>
&lt;p>&lt;strong>Core RTCP packet types and their functions:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Sender Report (SR)&lt;/strong>: Sent by the &lt;strong>media sender&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Content&lt;/strong>: Contains the sender's SSRC, an &lt;strong>NTP timestamp&lt;/strong> (used for synchronization with the &amp;ldquo;wall clock,&amp;rdquo; enabling absolute time synchronization and cross-media stream synchronization), the RTP timestamp corresponding to the NTP timestamp, and the total number of packets and bytes sent.&lt;/li>
&lt;li>&lt;strong>Function&lt;/strong>: Lets the receiver know how much data has been sent and provides key information needed for cross-media stream synchronization (such as audio-video synchronization).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Receiver Report (RR)&lt;/strong>: Sent by the &lt;strong>media receiver&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Content&lt;/strong>: Contains the SSRC of the source it is receiving from, the &lt;strong>fraction lost&lt;/strong> since the last report, the &lt;strong>cumulative number of packets lost&lt;/strong> since reception began, the &lt;strong>highest sequence number received&lt;/strong>, an estimate of &lt;strong>interarrival jitter&lt;/strong>, and timing fields that let the sender estimate round-trip time.&lt;/li>
&lt;li>&lt;strong>Function&lt;/strong>: &lt;strong>This is the most important QoS feedback mechanism&lt;/strong>. After receiving an RR, the sender can understand the network conditions. If the report shows a high packet loss rate, the sender's application might make intelligent adjustments, such as switching to a more loss-resistant, lower bitrate codec, or notifying the user of poor network conditions.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Source Description (SDES)&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Content&lt;/strong>: Provides additional information associated with an SSRC, most importantly the &lt;strong>CNAME (Canonical Name)&lt;/strong>. CNAME is a unique, persistent identifier for each endpoint.&lt;/li>
&lt;li>&lt;strong>Function&lt;/strong>: Used to associate different media streams (such as SSRC_audio and SSRC_video) from the same user. When the receiver sees two streams with the same CNAME, it knows they come from the same participant and can synchronize their playback.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>BYE&lt;/strong>: Used to explicitly indicate that a participant is leaving the session, closing a stream.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>APP&lt;/strong>: Used for application-specific extensions.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Through the collaborative work of RTP and RTCP, VoIP systems can not only efficiently transmit real-time media but also intelligently sense network quality and make adaptive adjustments, which is the technical cornerstone for achieving high-quality call experiences.&lt;/p>
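&lt;p>Two of the Receiver Report values can be computed with very little code. The sketch below follows the formulas given in RFC 3550: the running jitter estimate moves one sixteenth of the way toward each new transit-time difference, and the fraction lost is an 8-bit fixed-point ratio. The function names are hypothetical, and arrival times are assumed to be already expressed in RTP timestamp units.&lt;/p>
&lt;pre>&lt;code class="language-python">def update_jitter(jitter: float, prev_arrival: int, prev_rtp_ts: int,
                  arrival: int, rtp_ts: int) -> float:
    """One step of the RFC 3550 interarrival jitter estimator.

    D = (arrival spacing) - (sending spacing); J = J + (|D| - J) / 16.
    All values are in RTP timestamp units.
    """
    transit_delta = (arrival - prev_arrival) - (rtp_ts - prev_rtp_ts)
    return jitter + (abs(transit_delta) - jitter) / 16.0

def fraction_lost(expected: int, received: int) -> int:
    """Loss fraction for the reporting interval, scaled to 0..255 as in an RR."""
    lost = expected - received
    if expected == 0 or 0 >= lost:
        return 0
    return lost * 256 // expected
&lt;/code>&lt;/pre>
&lt;p>If two packets are sent 160 ticks apart but arrive 240 ticks apart, the first update moves the jitter estimate from 0 to 5.0. The division by 16 smooths the estimate so that a single late packet does not dominate the report.&lt;/p>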
&lt;h2 id="5-nat-traversal-breaking-through-network-barriers">5. NAT Traversal: Breaking Through Network Barriers&lt;/h2>
&lt;p>So far, our discussion of SIP and RTP flows has been based on an ideal assumption: both parties in the call have public IP addresses and can directly access each other. However, in the real world, the vast majority of user devices (computers, phones, IP phones) are behind home or office routers, using private IP addresses (such as &lt;code>192.168.x.x&lt;/code>).&lt;/p>
&lt;p>Network Address Translation (NAT) devices (i.e., what we commonly call routers) play the role of &amp;ldquo;gatekeepers,&amp;rdquo; allowing internal devices to access the internet but, by default, blocking unsolicited connections from the outside. This poses a huge challenge for VoIP communications.&lt;/p>
&lt;h3 id="51-the-nat-challenge">5.1 The NAT Challenge&lt;/h3>
&lt;p>Imagine Alice and Bob are both in their respective home networks, and they both have IP addresses of &lt;code>192.168.1.10&lt;/code>.&lt;/p>
&lt;ol>
&lt;li>Alice initiates a call and honestly fills in her media receiving address in the SDP of her &lt;code>INVITE&lt;/code> request: &lt;code>c=IN IP4 192.168.1.10&lt;/code> and &lt;code>m=audio 49170 ...&lt;/code>.&lt;/li>
&lt;li>This &lt;code>INVITE&lt;/code> request successfully reaches Bob through the SIP proxy.&lt;/li>
&lt;li>Bob's UA sees this SDP and becomes confused. It dutifully tries to send its RTP packets to the address &lt;code>192.168.1.10&lt;/code>. But this address is a private address in Bob's own network (it might even be his neighbor's printer address), not Alice on the public internet!&lt;/li>
&lt;li>The result is: &lt;strong>Media streams (RTP packets) cannot be delivered, so the call connects but one or both parties hear only silence&lt;/strong>.&lt;/li>
&lt;/ol>
&lt;p>This is the core challenge NAT poses to VoIP: &lt;strong>The private address information carried in SDP is useless and misleading to the other party on the public internet&lt;/strong>. To solve this problem, we need a mechanism to discover the device's &amp;ldquo;identity&amp;rdquo; on the public internet and establish a path that can penetrate NAT.&lt;/p>
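&lt;p>The problem described above can be detected mechanically. The following sketch scans the &lt;code>c=&lt;/code> lines of an SDP body and reports any address in the RFC 1918 private ranges. It is a simplified illustration: the function name &lt;code>private_sdp_addresses&lt;/code> is hypothetical, and hostnames or malformed values are simply skipped.&lt;/p>
&lt;pre>&lt;code class="language-python">import ipaddress

# RFC 1918 private ranges; these are not routable on the public internet.
PRIVATE_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]

def private_sdp_addresses(sdp: str) -> list:
    """Return the c= line addresses that fall inside an RFC 1918 range."""
    found = []
    for raw in sdp.splitlines():
        line = raw.strip()
        if not line.startswith("c="):
            continue
        fields = line[2:].split()  # e.g. ["IN", "IP4", "192.168.1.10"]
        if len(fields) != 3:
            continue
        try:
            # Drop an optional "/ttl" suffix before parsing the address.
            address = ipaddress.ip_address(fields[2].split("/")[0])
        except ValueError:
            continue  # hostname or malformed value: skipped in this sketch
        if any(address in network for network in PRIVATE_NETWORKS):
            found.append(str(address))
    return found
&lt;/code>&lt;/pre>
&lt;p>Run against Alice's SDP from the example, this returns &lt;code>["192.168.1.10"]&lt;/code>, which signals that the peer will not be able to reach the advertised media address.&lt;/p>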
&lt;h3 id="52-the-three-musketeers-of-nat-traversal-stun-turn-ice">5.2 The Three Musketeers of NAT Traversal: STUN, TURN, ICE&lt;/h3>
&lt;p>To solve the connectivity problems brought by NAT, IETF defined a complete solution, with the ICE protocol at its core, while ICE's work depends on two auxiliary protocols: STUN and TURN.&lt;/p>
&lt;h4 id="1-stun-session-traversal-utilities-for-nat">1. STUN (Session Traversal Utilities for NAT)&lt;/h4>
&lt;p>STUN (RFC 5389, since obsoleted by RFC 8489) is a simple client-server protocol whose core functionality acts like a &amp;ldquo;mirror.&amp;rdquo;&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Working Principle&lt;/strong>: The UA (client) behind a private network sends a request to a STUN server on the public internet. Upon receiving the request, the STUN server checks which public IP and port the request came from, then packages this address (called the &lt;strong>Server-Reflexive Address&lt;/strong>) in a response and returns it to the client along the original path.&lt;/li>
&lt;li>&lt;strong>Function&lt;/strong>: After receiving the response, the client sees its &amp;ldquo;appearance&amp;rdquo; on the public internet in the &amp;ldquo;mirror.&amp;rdquo; It now knows: &amp;ldquo;Oh, when I send packets outward, my router maps my source address &lt;code>192.168.1.10:49170&lt;/code> to the public address &lt;code>203.0.113.10:8001&lt;/code>.&amp;rdquo; This way, it can fill this public address and port in the SDP and send it to the other party.&lt;/li>
&lt;/ul>
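&lt;p>The original version of STUN (RFC 3489) also tried to classify the type of NAT (e.g., full cone, restricted cone, port restricted cone, symmetric). That classification proved unreliable in practice and was dropped from later versions of the protocol, but understanding NAT behavior still helps explain which traversal strategies can work.&lt;/p>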
&lt;p>&lt;strong>Limitations&lt;/strong>: STUN is powerless against &amp;ldquo;Symmetric NAT.&amp;rdquo; In this strictest type of NAT, the router not only allocates a public port for each outbound session, but this port mapping relationship is &lt;strong>only valid for a specific destination IP and port&lt;/strong>. The public address &lt;code>203.0.113.10:8001&lt;/code> that Alice discovers through the STUN server is a dedicated mapping for her communication with the STUN server; Bob cannot use this address to send data to Alice.&lt;/p>
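&lt;p>To make the &amp;ldquo;mirror&amp;rdquo; exchange concrete, the sketch below builds a STUN Binding Request and decodes the &lt;code>XOR-MAPPED-ADDRESS&lt;/code> attribute that carries the server-reflexive address in the response. It only covers IPv4 and assumes the attribute value has already been extracted from the response; the function names are hypothetical, and no network traffic is sent.&lt;/p>
&lt;pre>&lt;code class="language-python">import os
import struct

MAGIC_COOKIE = 0x2112A442   # fixed value defined by the STUN specification
BINDING_REQUEST = 0x0001    # message type for a Binding Request

def build_binding_request() -> bytes:
    """20-byte STUN header: type, length 0 (no attributes), cookie, transaction ID."""
    transaction_id = os.urandom(12)
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + transaction_id

def decode_xor_mapped_ipv4(value: bytes) -> tuple:
    """Decode an IPv4 XOR-MAPPED-ADDRESS attribute value into (ip, port)."""
    _reserved, family, xport, xaddr = struct.unpack("!BBHI", value[:8])
    if family != 0x01:
        raise ValueError("only IPv4 is handled in this sketch")
    port = xport ^ (MAGIC_COOKIE >> 16)  # port is XORed with the top 16 cookie bits
    addr = xaddr ^ MAGIC_COOKIE          # address is XORed with the full cookie
    ip = ".".join(str(b) for b in addr.to_bytes(4, "big"))
    return ip, port
&lt;/code>&lt;/pre>
&lt;p>The address is XORed with the magic cookie so that NAT devices which rewrite anything resembling their own public address inside packet payloads leave it untouched. A real client would send the request over UDP, match the transaction ID in the response, and walk the attribute list to find this value.&lt;/p>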
&lt;h4 id="2-turn-traversal-using-relays-around-nat">2. TURN (Traversal Using Relays around NAT)&lt;/h4>
&lt;p>When STUN fails due to symmetric NAT or other firewall policies, TURN (RFC 8656) is needed as the final &amp;ldquo;fallback&amp;rdquo; solution.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Working Principle&lt;/strong>: A TURN server is not just a &amp;ldquo;mirror&amp;rdquo;; it is a fully functional &lt;strong>public media relay&lt;/strong>.
&lt;ol>
&lt;li>The client first &lt;strong>allocates&lt;/strong> a relay address (public IP and port) on the TURN server.&lt;/li>
&lt;li>Then, the client tells the peer (through SIP/SDP) to send its media packets to this relay address.&lt;/li>
&lt;li>At the same time, the client also sends its media packets to the peer through the TURN server.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>&lt;strong>Function&lt;/strong>: All media streams are forwarded through the TURN server. Although this increases latency and consumes server bandwidth, it &lt;strong>works even behind symmetric NAT&lt;/strong> because both communicating parties are actually communicating with the TURN server, which has a public address.&lt;/li>
&lt;/ul>
&lt;h4 id="3-ice-interactive-connectivity-establishment">3. ICE (Interactive Connectivity Establishment)&lt;/h4>
&lt;p>ICE (RFC 8445) is the real &amp;ldquo;commander-in-chief.&amp;rdquo; It doesn't invent new protocols but cleverly integrates STUN and TURN to form a systematic framework, establishing media paths between communicating parties in the most effective way.&lt;/p>
&lt;p>The ICE workflow can be divided into the following stages:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph 1. Gathering Candidates
A[UA-A] --&amp;gt;|STUN Request| B(STUN Server);
B --&amp;gt;|Server-Reflexive Addr| A;
A --&amp;gt;|Allocate Request| C(TURN Server);
C --&amp;gt;|Relayed Addr| A;
A --&amp;gt; D{Host Address};
D &amp;amp; B &amp;amp; C --&amp;gt; E((A's Candidate List));
end
subgraph 2. Exchanging Candidates
E -- via SIP/SDP --&amp;gt; F((B's Candidate List));
F -- via SIP/SDP --&amp;gt; E;
end
subgraph 3. Connectivity Checks
G(Candidate Pairs);
E &amp;amp; F --&amp;gt; G;
G --&amp;gt;|STUN Binding Requests| H{Check All Possible Paths};
end
subgraph 4. Selecting Best Path
H --&amp;gt;|Select Highest Priority&amp;lt;br/&amp;gt;Valid Path| I[Establish RTP/RTCP Streams];
end
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>ICE Process Explained&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Gathering Candidates&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Host Candidates&lt;/strong>: The UA first collects all IP addresses and ports on its local network interfaces.&lt;/li>
&lt;li>&lt;strong>Server-Reflexive Candidates&lt;/strong>: The UA uses a STUN server to discover its public mapping address.&lt;/li>
&lt;li>&lt;strong>Relayed Candidates&lt;/strong>: The UA allocates a relay address using a TURN server.&lt;/li>
&lt;li>In the end, each UA generates a list of candidates of various types with different priorities.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Exchanging Candidates&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Both parties exchange their candidate lists through the signaling channel (typically in the SDP of SIP's &lt;code>INVITE&lt;/code>/&lt;code>200 OK&lt;/code> messages).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Connectivity Checks&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>After receiving the other party's address list, each UA pairs its local candidates with the other party's candidates, forming a &lt;strong>Candidate Pair&lt;/strong> list, sorted by priority (Host &amp;gt; Server-Reflexive &amp;gt; Relayed).&lt;/li>
&lt;li>ICE begins &lt;strong>connectivity checks (STUN Binding Requests)&lt;/strong>. It starts from the highest priority address pair, sending STUN requests to each other. If a request successfully receives a response, that path is considered &lt;strong>valid&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Selecting the Best Path and Starting Media Transmission&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Once a valid path pair is found, the UA can start using it to send media data. But it doesn't stop immediately; it continues to check other possible path pairs.&lt;/li>
&lt;li>When all checks are complete, ICE selects the validated path with the highest priority as the final communication path.&lt;/li>
&lt;li>&lt;strong>Final Result&lt;/strong>:
&lt;ul>
&lt;li>If a Host-to-Host or Host-to-ServerReflexive path works, a P2P (or quasi-P2P) connection is achieved, which is most efficient.&lt;/li>
&lt;li>If all P2P attempts fail, ICE will ultimately choose a path relayed through the TURN server, sacrificing some performance to ensure the successful establishment of the call.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;p>Through ICE, VoIP systems can intelligently and dynamically adapt to various complex network environments, maximizing attempts to establish efficient P2P connections while gracefully degrading to relay mode when necessary, greatly improving the success rate and quality of VoIP calls.&lt;/p>
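&lt;p>The priority ordering used during connectivity checks comes from two formulas in RFC 8445. The sketch below applies them using the type preference values the RFC recommends (host 126, peer-reflexive 110, server-reflexive 100, relayed 0). The function names and the default local preference of 65535 are assumptions made for illustration.&lt;/p>
&lt;pre>&lt;code class="language-python"># Recommended type preferences from RFC 8445; higher means more preferred.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}

def candidate_priority(cand_type: str, local_pref: int = 65535,
                       component_id: int = 1) -> int:
    """priority = 2^24 * type pref + 2^8 * local pref + (256 - component ID)."""
    return ((2 ** 24) * TYPE_PREFERENCE[cand_type]
            + (2 ** 8) * local_pref
            + (256 - component_id))

def pair_priority(controlling: int, controlled: int) -> int:
    """pair priority = 2^32 * min(G, D) + 2 * max(G, D) + (1 if G > D else 0)."""
    g, d = controlling, controlled
    return (2 ** 32) * min(g, d) + 2 * max(g, d) + (1 if g > d else 0)

def sort_pairs(local_types: list, remote_types: list) -> list:
    """All (local, remote) type pairs, highest pair priority first.

    Treats the local agent as the controlling one.
    """
    pairs = [(lt, rt) for lt in local_types for rt in remote_types]
    return sorted(
        pairs,
        key=lambda p: pair_priority(candidate_priority(p[0]), candidate_priority(p[1])),
        reverse=True,
    )
&lt;/code>&lt;/pre>
&lt;p>With these defaults a host candidate gets priority 2130706431 and a relayed candidate 16777215, so a host-to-host pair is checked first and a relay-to-relay pair last, which matches the fallback behavior described above.&lt;/p>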
&lt;h2 id="6-voip-security-protecting-your-call-privacy">6. VoIP Security: Protecting Your Call Privacy&lt;/h2>
&lt;p>As VoIP becomes more widespread, its security also becomes increasingly important. An unprotected VoIP communication system faces risks of eavesdropping, fraud, and denial of service attacks. Fortunately, we have mature solutions to protect the two key parts of communication: signaling and media.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph UA-A
A[Alice's UA];
end;
subgraph UA-B
B[Bob's UA];
end;
subgraph SIP Proxy
P[Proxy Server];
end;
A -- &amp;quot;SIPS (SIP over TLS)&amp;lt;br&amp;gt;Signaling Encryption&amp;quot; --&amp;gt; P;
P -- &amp;quot;SIPS (SIP over TLS)&amp;lt;br&amp;gt;Signaling Encryption&amp;quot; --&amp;gt; B;
A -- &amp;quot;SRTP&amp;lt;br&amp;gt;Media Encryption&amp;quot; --&amp;gt; B;
style A fill:#D5F5E3;
style B fill:#D5F5E3;
style P fill:#EBF5FB;
&lt;/code>&lt;/pre>
&lt;h3 id="61-signaling-encryption-sips-sip-over-tls">6.1 Signaling Encryption: SIPS (SIP over TLS)&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Problem&lt;/strong>: Ordinary SIP messages are transmitted in plaintext. Attackers can easily sniff these messages on the network, obtaining metadata such as who the parties in the call are (&lt;code>From&lt;/code>/&lt;code>To&lt;/code> headers) and the unique identifier of the call (&lt;code>Call-ID&lt;/code>), and can even tamper with message content to perform call hijacking or fraud.&lt;/li>
&lt;li>&lt;strong>Solution&lt;/strong>: &lt;strong>TLS (Transport Layer Security)&lt;/strong>, the same protocol used by HTTPS to encrypt web traffic.
&lt;ul>
&lt;li>&lt;strong>SIPS (Secure SIP)&lt;/strong>: When SIP runs on top of TLS, the entire SIP message (requests and responses) travels inside an encrypted TLS channel. The &lt;code>sips:&lt;/code> URI scheme additionally requires that every hop toward the destination use TLS.&lt;/li>
&lt;li>&lt;strong>Working Method&lt;/strong>: The UA and SIP proxy first perform a standard TLS handshake, exchanging certificates and negotiating encryption keys. Once the TLS connection is established, all subsequent SIP messages are transmitted within this encrypted channel, preventing outsiders from reading or modifying their content.&lt;/li>
&lt;li>&lt;strong>SIP URI&lt;/strong>: Addresses using SIPS are typically represented as &lt;code>sips:alice@example.com&lt;/code> and use port &lt;code>5061&lt;/code> by default instead of &lt;code>5060&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
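&lt;p>Through SIPS, we ensure the &lt;strong>confidentiality and integrity of call signaling&lt;/strong> on each hop. Because TLS protects each connection separately, the proxies along the path still see the messages in plaintext.&lt;/p>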
&lt;h3 id="62-media-encryption-srtp">6.2 Media Encryption: SRTP&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Problem&lt;/strong>: Even if signaling is encrypted, the actual voice/video data (RTP packets) are still in plaintext by default! Attackers may not know who is on the call, but if they can intercept the RTP stream, they can still eavesdrop on the conversation content.&lt;/li>
&lt;li>&lt;strong>Solution&lt;/strong>: &lt;strong>SRTP (Secure Real-time Transport Protocol)&lt;/strong>, RFC 3711.
&lt;ul>
&lt;li>&lt;strong>Working Method&lt;/strong>: SRTP is not an entirely new protocol but adds a layer of encryption and authentication on top of the RTP protocol. It &lt;strong>encrypts the payload portion of RTP&lt;/strong> but keeps the RTP header in plaintext (because network devices may need to read header information for QoS processing).&lt;/li>
&lt;li>&lt;strong>Key Exchange&lt;/strong>: SRTP itself does not specify how keys are exchanged. Two mechanisms are common. With &lt;strong>SDES (SDP Security Descriptions)&lt;/strong>, the keys are carried in the SDP itself, so the signaling channel must be protected (i.e., SIP/SDP messages encrypted with SIPS/TLS). With the more modern &lt;strong>DTLS-SRTP&lt;/strong>, the keys are negotiated by a DTLS handshake on the media path, and the SDP only carries a certificate fingerprint used to authenticate that handshake.&lt;/li>
&lt;li>&lt;strong>Functions&lt;/strong>:
&lt;ol>
&lt;li>&lt;strong>Confidentiality&lt;/strong>: Using symmetric encryption algorithms (such as AES) to encrypt RTP payloads, ensuring that only the communicating parties with the key can decrypt the conversation content.&lt;/li>
&lt;li>&lt;strong>Message Authentication&lt;/strong>: Generating an Authentication Tag through algorithms like HMAC-SHA1. The receiver can use this to verify whether the message has been tampered with during transmission.&lt;/li>
&lt;li>&lt;strong>Replay Protection&lt;/strong>: Preventing attackers from capturing packets and resending them to conduct malicious attacks.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Alongside SRTP, there is also &lt;strong>SRTCP&lt;/strong>, which provides the same level of encryption and authentication protection for RTCP control packets.&lt;/p>
&lt;p>By combining SIPS and SRTP, we can build a VoIP communication system in which both signaling and media are protected in transit, covering the entire process from &amp;ldquo;who is calling&amp;rdquo; to &amp;ldquo;what is said on the phone.&amp;rdquo;&lt;/p>
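&lt;p>As an illustration of SDES key exchange, the sketch below parses an &lt;code>a=crypto&lt;/code> line from an SDP body and splits the inline key material into the SRTP master key and master salt. It assumes the &lt;code>AES_CM_128_HMAC_SHA1_80&lt;/code> suite, where the material is a 16-byte key followed by a 14-byte salt. The function name is hypothetical, and optional lifetime and MKI parameters are ignored.&lt;/p>
&lt;pre>&lt;code class="language-python">import base64

def parse_crypto_attribute(line: str) -> dict:
    """Parse an SDES line such as: a=crypto:1 AES_CM_128_HMAC_SHA1_80 inline:KEY"""
    prefix = "a=crypto:"
    if not line.startswith(prefix):
        raise ValueError("not an a=crypto line")
    tag, suite, key_params = line[len(prefix):].split()[:3]
    method, _, key_info = key_params.partition(":")
    if method != "inline":
        raise ValueError("only inline keys are handled in this sketch")
    # Anything after the first "|" is lifetime or MKI information; ignored here.
    material = base64.b64decode(key_info.split("|")[0])
    if len(material) != 30:
        raise ValueError("expected 16-byte key plus 14-byte salt")
    return {
        "tag": int(tag),
        "suite": suite,
        "master_key": material[:16],
        "master_salt": material[16:30],
    }
&lt;/code>&lt;/pre>
&lt;p>Because the key appears in the SDP in a directly decodable form, anyone who can read the signaling can decrypt the media, which is why SDES is only as safe as the signaling channel that carries it.&lt;/p>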
&lt;h2 id="7-conclusion-and-future-outlook">7. Conclusion and Future Outlook&lt;/h2>
&lt;h3 id="conclusion">Conclusion&lt;/h3>
&lt;p>This document has provided an in-depth analysis of the two core technologies supporting modern network voice communications: VoIP and SIP, from macro to micro perspectives.&lt;/p>
&lt;ul>
&lt;li>We started with the &lt;strong>basic concept of VoIP&lt;/strong>, understanding how it transforms voice into data packets on IP networks, revolutionizing the traditional PSTN system.&lt;/li>
&lt;li>At the &lt;strong>macro level&lt;/strong>, we outlined VoIP's layered technology stack, clarifying the positions and collaborative relationships of key protocols such as SIP (signaling), RTP/RTCP (media), SDP (description), and UDP/TCP (transport).&lt;/li>
&lt;li>At the &lt;strong>micro level&lt;/strong>, we thoroughly analyzed the &lt;strong>SIP protocol&lt;/strong>&amp;rsquo;s core components (UA, Proxy, Registrar), its HTTP-like text message structure, and the detailed signaling flow of a complete call from registration and establishment to termination. We also understood how &lt;strong>SDP&lt;/strong> negotiates media parameters through the Offer/Answer model.&lt;/li>
&lt;li>We delved into the &lt;strong>RTP protocol&lt;/strong> responsible for carrying actual voice data, understanding the critical importance of sequence numbers and timestamps in its header for handling out-of-order packets, jitter, and achieving synchronization, as well as the key role of &lt;strong>RTCP&lt;/strong> in QoS monitoring.&lt;/li>
&lt;li>We faced the biggest obstacle in real-world network deployment—&lt;strong>NAT&lt;/strong>, and detailed how the &amp;ldquo;three musketeers&amp;rdquo; &lt;strong>STUN, TURN, ICE&lt;/strong> work together to intelligently establish a media path that can penetrate routers.&lt;/li>
&lt;li>Finally, we discussed &lt;strong>VoIP security&lt;/strong> mechanisms, protecting signaling through &lt;strong>SIPS (TLS)&lt;/strong> and media through &lt;strong>SRTP&lt;/strong>, building end-to-end secure communications.&lt;/li>
&lt;/ul>
&lt;h3 id="future-outlook">Future Outlook&lt;/h3>
&lt;p>VoIP technology is still developing; it is evolving towards being more intelligent, integrated, and seamless.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Deep Integration with WebRTC&lt;/strong>: WebRTC (Web Real-Time Communication) has brought high-quality audio and video communication capabilities directly into browsers. Although WebRTC exposes its functionality through JavaScript APIs on the browser side, its underlying core concepts (ICE, STUN, TURN, (S)RTP, DTLS-SRTP) are the same as those of the VoIP technology stack we've discussed. In the future, traditional SIP systems and WebRTC-based applications will be more tightly interconnected, forming a seamless unified communication ecosystem.&lt;/li>
&lt;li>&lt;strong>AI-Empowered Communication Experience&lt;/strong>: Artificial intelligence is reshaping VoIP. For example:
&lt;ul>
&lt;li>&lt;strong>Intelligent Codecs (AI Codec)&lt;/strong>: Using machine learning to reconstruct high-quality voice at extremely low bandwidth.&lt;/li>
&lt;li>&lt;strong>Intelligent Noise Reduction and Echo Cancellation&lt;/strong>: Precisely separating human voice from background noise through AI models, achieving studio-level call quality.&lt;/li>
&lt;li>&lt;strong>Network Path Optimization&lt;/strong>: AI can analyze RTCP data and network telemetry data, predict network congestion, and proactively switch to better servers or network paths.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Immersive Communication&lt;/strong>: With the popularization of 5G and the rise of the metaverse concept, VoIP will no longer be limited to voice and flat video. Spatial Audio, VR/AR calls, and other immersive experiences will place higher demands on VoIP's latency, bandwidth, and synchronization, spurring new technological evolution.&lt;/li>
&lt;/ul>
&lt;p>From electric current on analog telephone lines, to data packets racing through IP networks, to future AI-empowered virtual space conversations, the revolution in communication technology never ceases. A profound understanding of the core technological principles represented by SIP and VoIP will be our solid foundation as we move forward in this wave.&lt;/p></description></item><item><title>WebRTC Technical Guide: Web-Based Real-Time Communication Framework</title><link>https://ziyanglin.netlify.app/en/post/webrtc-documentation/</link><pubDate>Thu, 26 Jun 2025 01:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/webrtc-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>WebRTC (Web Real-Time Communication) is an open-source technology that enables real-time voice and video communication in web browsers. It allows direct peer-to-peer (P2P) audio, video, and data sharing between browsers without requiring any plugins or third-party software.&lt;/p>
&lt;p>The main goal of WebRTC is to provide high-quality, low-latency real-time communication, making it easy for developers to build rich communication features into web applications.&lt;/p>
&lt;h3 id="core-advantages">Core Advantages&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Cross-platform and browser compatibility&lt;/strong>: WebRTC is an open standard by W3C and IETF, widely supported by major browsers (Chrome, Firefox, Safari, Edge).&lt;/li>
&lt;li>&lt;strong>No plugins required&lt;/strong>: Users can use real-time communication features directly in their browsers without downloading or installing any extensions.&lt;/li>
&lt;li>&lt;strong>Peer-to-peer communication&lt;/strong>: When possible, data is transmitted directly between users, reducing server bandwidth pressure and latency.&lt;/li>
&lt;li>&lt;strong>High security&lt;/strong>: All WebRTC communications are mandatorily encrypted (via SRTP and DTLS), ensuring data confidentiality and integrity.&lt;/li>
&lt;li>&lt;strong>High-quality audio and video&lt;/strong>: WebRTC includes advanced signal processing components like echo cancellation, noise suppression, and automatic gain control to provide excellent audio/video quality.&lt;/li>
&lt;/ul>
&lt;h2 id="2-core-concepts">2. Core Concepts&lt;/h2>
&lt;p>WebRTC consists of several key JavaScript APIs that work together to enable real-time communication.&lt;/p>
&lt;h3 id="21-rtcpeerconnection">2.1. &lt;code>RTCPeerConnection&lt;/code>&lt;/h3>
&lt;p>&lt;code>RTCPeerConnection&lt;/code> is the core interface of WebRTC, responsible for establishing and managing connections between two peers. Its main responsibilities include:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Media negotiation&lt;/strong>: Handling parameters for audio/video codecs, resolution, etc.&lt;/li>
&lt;li>&lt;strong>Network path discovery&lt;/strong>: Finding the best connection path through the ICE framework.&lt;/li>
&lt;li>&lt;strong>Connection maintenance&lt;/strong>: Managing the connection lifecycle, including establishment, maintenance, and closure.&lt;/li>
&lt;li>&lt;strong>Data transmission&lt;/strong>: Handling the actual transmission of audio/video streams (SRTP) and data channels (SCTP/DTLS).&lt;/li>
&lt;/ul>
&lt;p>An &lt;code>RTCPeerConnection&lt;/code> object represents a WebRTC connection from the local computer to a remote peer.&lt;/p>
&lt;h3 id="22-mediastream">2.2. &lt;code>MediaStream&lt;/code>&lt;/h3>
&lt;p>The &lt;code>MediaStream&lt;/code> API represents streams of media content. A &lt;code>MediaStream&lt;/code> object can contain one or more media tracks (&lt;code>MediaStreamTrack&lt;/code>), which can be:&lt;/p>
&lt;ul>
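&lt;li>&lt;strong>Audio tracks&lt;/strong> (a &lt;code>MediaStreamTrack&lt;/code> whose &lt;code>kind&lt;/code> is &lt;code>"audio"&lt;/code>): Audio data from a microphone.&lt;/li>
&lt;li>&lt;strong>Video tracks&lt;/strong> (a &lt;code>MediaStreamTrack&lt;/code> whose &lt;code>kind&lt;/code> is &lt;code>"video"&lt;/code>): Video data from a camera.&lt;/li>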
&lt;/ul>
&lt;p>Developers typically use the &lt;code>navigator.mediaDevices.getUserMedia()&lt;/code> method to obtain a local &lt;code>MediaStream&lt;/code>, which prompts the user to authorize access to their camera and microphone. The obtained stream can then be added to an &lt;code>RTCPeerConnection&lt;/code> for transmission to the remote peer.&lt;/p>
&lt;h3 id="23-rtcdatachannel">2.3. &lt;code>RTCDataChannel&lt;/code>&lt;/h3>
&lt;p>In addition to audio and video, WebRTC supports the transmission of arbitrary binary data between peers through the &lt;code>RTCDataChannel&lt;/code> API. This provides powerful functionality for:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>File sharing&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Real-time text chat&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Online game state synchronization&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Remote desktop control&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>The &lt;code>RTCDataChannel&lt;/code> API is designed similarly to WebSockets, offering reliable and unreliable, ordered and unordered transmission modes that developers can choose based on application requirements. It uses the SCTP protocol (Stream Control Transmission Protocol) for transmission and is encrypted via DTLS.&lt;/p>
&lt;h2 id="3-connection-process-in-detail">3. Connection Process in Detail&lt;/h2>
&lt;p>Establishing a WebRTC connection is a complex multi-stage process involving signaling, session description, and network path discovery.&lt;/p>
&lt;h3 id="31-signaling">3.1. Signaling&lt;/h3>
&lt;p>Interestingly, the WebRTC API itself does not include a signaling mechanism. Signaling is the process of exchanging metadata between peers before establishing communication. Developers must choose or implement their own signaling channel. Common technologies include WebSocket or XMLHttpRequest.&lt;/p>
&lt;p>The signaling server acts as an intermediary, helping two clients who want to communicate exchange three types of information:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Session control messages&lt;/strong>: Used to open or close communication.&lt;/li>
&lt;li>&lt;strong>Network configuration&lt;/strong>: Information about the client's IP address and port.&lt;/li>
&lt;li>&lt;strong>Media capabilities&lt;/strong>: Codecs and resolutions supported by the client.&lt;/li>
&lt;/ol>
&lt;p>This process typically follows these steps:&lt;/p>
&lt;ol>
&lt;li>Client A sends a &amp;ldquo;request call&amp;rdquo; message to the signaling server.&lt;/li>
&lt;li>The signaling server forwards this request to client B.&lt;/li>
&lt;li>Client B agrees to the call.&lt;/li>
&lt;li>Afterward, clients A and B exchange SDP and ICE candidates through the signaling server until they find a viable connection path.&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant ClientA as Client A
participant SignalingServer as Signaling Server
participant ClientB as Client B
ClientA-&amp;gt;&amp;gt;SignalingServer: Initiate call request (join room)
SignalingServer-&amp;gt;&amp;gt;ClientB: Forward call request
ClientB--&amp;gt;&amp;gt;SignalingServer: Accept call
SignalingServer--&amp;gt;&amp;gt;ClientA: B has joined
loop Offer/Answer &amp;amp; ICE Exchange
ClientA-&amp;gt;&amp;gt;SignalingServer: Send SDP Offer / ICE Candidate
SignalingServer-&amp;gt;&amp;gt;ClientB: Forward SDP Offer / ICE Candidate
ClientB-&amp;gt;&amp;gt;SignalingServer: Send SDP Answer / ICE Candidate
SignalingServer-&amp;gt;&amp;gt;ClientA: Forward SDP Answer / ICE Candidate
end
&lt;/code>&lt;/pre>
&lt;h3 id="32-session-description-protocol-sdp">3.2. Session Description Protocol (SDP)&lt;/h3>
&lt;p>SDP (Session Description Protocol) is a standard format for describing multimedia connection content. It doesn't transmit media data itself but describes the connection parameters. An SDP object includes:&lt;/p>
&lt;ul>
&lt;li>Session unique identifier and version.&lt;/li>
&lt;li>Media types (audio, video, data).&lt;/li>
&lt;li>Codecs used (e.g., VP8, H.264, Opus).&lt;/li>
&lt;li>Network transport information (IP addresses and ports).&lt;/li>
&lt;li>Bandwidth information.&lt;/li>
&lt;/ul>
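&lt;p>To make the fields listed above concrete, here is a minimal sketch that pulls them out of a hand-written SDP sample. The sample text and the &lt;code>summarize_sdp&lt;/code> helper are assumptions made for illustration. Real browser SDP is much longer, and this sketch ignores session-level &lt;code>c=&lt;/code> and &lt;code>b=&lt;/code> lines.&lt;/p>
&lt;pre>&lt;code class="language-python">SAMPLE_SDP = """\
v=0
o=- 4611731400430051336 2 IN IP4 127.0.0.1
s=-
t=0 0
m=audio 9 UDP/TLS/RTP/SAVPF 111
c=IN IP4 0.0.0.0
b=AS:64
a=rtpmap:111 opus/48000/2
m=video 9 UDP/TLS/RTP/SAVPF 96
c=IN IP4 0.0.0.0
a=rtpmap:96 VP8/90000
"""


def summarize_sdp(sdp: str):
    """Collect session id/version, media types, codecs, addresses and bandwidth."""
    summary = {"session_id": None, "session_version": None, "media": []}
    for line in sdp.splitlines():
        key, _, value = line.partition("=")
        if key == "o":
            # o=username sess-id sess-version nettype addrtype address
            fields = value.split()
            summary["session_id"] = fields[1]
            summary["session_version"] = fields[2]
        elif key == "m":
            # m=media port proto format-list
            kind, port, proto, *formats = value.split()
            summary["media"].append(
                {"kind": kind, "port": int(port), "codecs": [], "address": None, "bandwidth": None}
            )
        elif summary["media"]:
            # Lines after an m= line belong to that media section.
            section = summary["media"][-1]
            if key == "c":
                section["address"] = value.split()[2]
            elif key == "b":
                section["bandwidth"] = value
            elif key == "a" and value.startswith("rtpmap:"):
                # a=rtpmap:payload-type encoding/clock-rate[/channels]
                section["codecs"].append(value.split()[1].split("/")[0])
    return summary


summary = summarize_sdp(SAMPLE_SDP)
for section in summary["media"]:
    print(section["kind"], section["codecs"], section["bandwidth"])
&lt;/code>&lt;/pre>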
&lt;p>WebRTC uses the &lt;strong>Offer/Answer model&lt;/strong> to exchange SDP information:&lt;/p>
&lt;ol>
&lt;li>The &lt;strong>Caller&lt;/strong> creates an &lt;strong>Offer&lt;/strong> SDP describing the communication parameters it wants and sends it to the callee through the signaling server.&lt;/li>
&lt;li>The &lt;strong>Callee&lt;/strong> receives the Offer and creates an &lt;strong>Answer&lt;/strong> SDP describing the communication parameters it can support, sending it back to the caller through the signaling server.&lt;/li>
&lt;li>Once both parties accept each other's SDP, they have reached a consensus on the session parameters.&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant Caller
participant SignalingServer as Signaling Server
participant Callee
Caller-&amp;gt;&amp;gt;Caller: createOffer()
Caller-&amp;gt;&amp;gt;Caller: setLocalDescription(offer)
Caller-&amp;gt;&amp;gt;SignalingServer: Send Offer
SignalingServer-&amp;gt;&amp;gt;Callee: Forward Offer
Callee-&amp;gt;&amp;gt;Callee: setRemoteDescription(offer)
Callee-&amp;gt;&amp;gt;Callee: createAnswer()
Callee-&amp;gt;&amp;gt;Callee: setLocalDescription(answer)
Callee-&amp;gt;&amp;gt;SignalingServer: Send Answer
SignalingServer-&amp;gt;&amp;gt;Caller: Forward Answer
Caller-&amp;gt;&amp;gt;Caller: setRemoteDescription(answer)
&lt;/code>&lt;/pre>
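&lt;p>The call order in the diagram matters because each peer tracks a signaling state. The sketch below models only the four transitions used in the basic flow, with state names taken from the WebRTC specification. The &lt;code>PeerModel&lt;/code> class is a hypothetical stand-in, not the browser's &lt;code>RTCPeerConnection&lt;/code>, and it leaves out provisional answers and rollback.&lt;/p>
&lt;pre>&lt;code class="language-python"># Allowed transitions, keyed by (current state, side, description type).
TRANSITIONS = {
    ("stable", "local", "offer"): "have-local-offer",
    ("stable", "remote", "offer"): "have-remote-offer",
    ("have-local-offer", "remote", "answer"): "stable",
    ("have-remote-offer", "local", "answer"): "stable",
}


class PeerModel:
    """Tracks only the signaling state; it does no networking or media work."""

    def __init__(self, name):
        self.name = name
        self.state = "stable"
        self.local = None
        self.remote = None

    def _apply(self, side, description):
        key = (self.state, side, description["type"])
        if key not in TRANSITIONS:
            raise ValueError(
                f"{self.name}: cannot apply {side} {description['type']} in state {self.state}"
            )
        self.state = TRANSITIONS[key]
        setattr(self, side, description)

    def set_local_description(self, description):
        self._apply("local", description)

    def set_remote_description(self, description):
        self._apply("remote", description)


caller, callee = PeerModel("caller"), PeerModel("callee")
offer = {"type": "offer", "sdp": "placeholder offer"}
answer = {"type": "answer", "sdp": "placeholder answer"}

states = []
caller.set_local_description(offer)
states.append(caller.state)   # have-local-offer
callee.set_remote_description(offer)
states.append(callee.state)   # have-remote-offer
callee.set_local_description(answer)
states.append(callee.state)   # stable
caller.set_remote_description(answer)
states.append(caller.state)   # stable
print(states)
&lt;/code>&lt;/pre>
&lt;p>Applying an answer before any offer raises an error in this model, which mirrors why the steps cannot be reordered.&lt;/p>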
&lt;h3 id="33-interactive-connectivity-establishment-ice">3.3. Interactive Connectivity Establishment (ICE)&lt;/h3>
&lt;p>Since most devices are behind NAT (Network Address Translation) or firewalls and don't have public IP addresses, establishing direct P2P connections becomes challenging. ICE (Interactive Connectivity Establishment) is a framework specifically designed to solve this problem.&lt;/p>
&lt;p>The ICE workflow is as follows:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Gather candidate addresses&lt;/strong>: Each client collects its network address candidates from different sources:
&lt;ul>
&lt;li>&lt;strong>Local (host) address&lt;/strong>: The device's own IP address on the local network.&lt;/li>
&lt;li>&lt;strong>Server Reflexive Address&lt;/strong>: The device's public IP address and port as discovered through a STUN server.&lt;/li>
&lt;li>&lt;strong>Relayed Address&lt;/strong>: A relay address obtained from a TURN server. When a direct P2P connection fails, all data is forwarded through the TURN server.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Exchange candidates&lt;/strong>: Clients exchange their collected ICE candidate lists through the signaling server.&lt;/li>
&lt;li>&lt;strong>Connectivity checks&lt;/strong>: Clients pair up the received candidate addresses and send STUN requests for connectivity checks (called &amp;ldquo;pings&amp;rdquo;) to determine which paths are available.&lt;/li>
&lt;li>&lt;strong>Select the best path&lt;/strong>: Among the address pairs that pass the connectivity checks, the ICE agent selects the highest-priority pair as the communication path and begins transmitting media data. Direct P2P paths are typically prioritized over relayed ones because they have the lowest latency.&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph Client A
A1(Start) --&amp;gt; A2{Gather Candidates};
A2 --&amp;gt; A3[Local Address];
A2 --&amp;gt; A4[STUN Address];
A2 --&amp;gt; A5[TURN Address];
end
subgraph Client B
B1(Start) --&amp;gt; B2{Gather Candidates};
B2 --&amp;gt; B3[Local Address];
B2 --&amp;gt; B4[STUN Address];
B2 --&amp;gt; B5[TURN Address];
end
A2 --&amp;gt; C1((Signaling Server));
B2 --&amp;gt; C1;
C1 --&amp;gt; A6(Exchange Candidates);
C1 --&amp;gt; B6(Exchange Candidates);
A6 --&amp;gt; A7{Connectivity Checks};
B6 --&amp;gt; B7{Connectivity Checks};
A7 -- STUN Request --&amp;gt; B7;
B7 -- STUN Response --&amp;gt; A7;
A7 --&amp;gt; A8(Select Best Path);
B7 --&amp;gt; B8(Select Best Path);
A8 --&amp;gt; A9((P2P Connection Established));
B8 --&amp;gt; B9((P2P Connection Established));
&lt;/code>&lt;/pre>
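&lt;p>The preference for direct paths comes from candidate priorities. The sketch below uses the priority formulas and recommended type preferences from RFC 8445. The &lt;code>select_pair&lt;/code> helper and its &lt;code>reachable&lt;/code> callback are hypothetical stand-ins for real STUN connectivity checks, and candidates are reduced to their type name for brevity.&lt;/p>
&lt;pre>&lt;code class="language-python">from itertools import product

# Type preferences recommended by RFC 8445 (higher is preferred).
TYPE_PREFERENCE = {"host": 126, "srflx": 100, "relay": 0}


def candidate_priority(kind, local_preference=65535, component_id=1):
    # RFC 8445: 2^24 * type preference + 2^8 * local preference + (256 - component ID)
    return 2**24 * TYPE_PREFERENCE[kind] + 2**8 * local_preference + (256 - component_id)


def pair_priority(controlling, controlled):
    # RFC 8445: 2^32 * MIN(G, D) + 2 * MAX(G, D) + (1 if G is the larger value, else 0)
    low, high = min(controlling, controlled), max(controlling, controlled)
    return 2**32 * low + 2 * high + int(controlling != low)


def select_pair(local_kinds, remote_kinds, reachable):
    """Return the highest-priority pair whose connectivity check succeeded."""
    pairs = sorted(
        product(local_kinds, remote_kinds),
        key=lambda p: pair_priority(candidate_priority(p[0]), candidate_priority(p[1])),
        reverse=True,
    )
    # The reachable callback stands in for sending STUN checks on each pair.
    return next((p for p in pairs if reachable(*p)), None)


kinds = ["host", "srflx", "relay"]
print(candidate_priority("host"))   # 2130706431
print(candidate_priority("relay"))  # 16777215

# Open network: every pair works, so the direct host-to-host path wins.
print(select_pair(kinds, kinds, lambda a, b: True))              # ('host', 'host')
# Restrictive NAT: only pairs that involve a relayed address succeed.
print(select_pair(kinds, kinds, lambda a, b: "relay" in (a, b)))  # ('host', 'relay')
&lt;/code>&lt;/pre>
&lt;p>Because relayed candidates get a type preference of 0, a pair that involves a relay is only chosen when every higher-priority pair has failed its checks.&lt;/p>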
&lt;h2 id="4-nat-traversal-stun-and-turn">4. NAT Traversal: STUN and TURN&lt;/h2>
&lt;p>To achieve P2P connections, WebRTC heavily relies on STUN and TURN servers to solve NAT-related issues.&lt;/p>
&lt;h3 id="41-stun-servers">4.1. STUN Servers&lt;/h3>
&lt;p>STUN (Session Traversal Utilities for NAT) servers are very lightweight, with a simple task: telling a client behind NAT what its public IP address and port are.&lt;/p>
&lt;p>When a WebRTC client sends a request to a STUN server, the server checks the source IP and port of the request and returns them to the client. This way, the client knows &amp;ldquo;what it looks like on the internet&amp;rdquo; and can share this public address as an ICE candidate with other peers.&lt;/p>
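&lt;p>The exchange described above can be sketched at the byte level. The example follows the STUN message layout (20-byte header, magic cookie, XOR-MAPPED-ADDRESS attribute) for IPv4 only. Both sides run in memory here: no packet is sent, the observed source address is a documentation-range value chosen for illustration, and the function names are hypothetical.&lt;/p>
&lt;pre>&lt;code class="language-python">import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request():
    # 20-byte header: type, body length, magic cookie, 96-bit transaction ID.
    transaction_id = os.urandom(12)
    header = struct.pack("!HHI12s", BINDING_REQUEST, 0, MAGIC_COOKIE, transaction_id)
    return header, transaction_id


def build_binding_response(transaction_id, source_ip, source_port):
    """Server side: echo the observed source address back, XOR-obfuscated."""
    x_port = source_port ^ 0x2112  # 0x2112 is the upper 16 bits of the magic cookie
    (ip_as_int,) = struct.unpack("!I", socket.inet_aton(source_ip))
    x_addr = ip_as_int ^ MAGIC_COOKIE
    # Attribute value: reserved byte, family (0x01 = IPv4), X-Port, X-Address.
    value = struct.pack("!BBHI", 0, 0x01, x_port, x_addr)
    attribute = struct.pack("!HH", XOR_MAPPED_ADDRESS, len(value)) + value
    header = struct.pack("!HHI12s", BINDING_SUCCESS, len(attribute), MAGIC_COOKIE, transaction_id)
    return header + attribute


def parse_binding_response(data, expected_transaction_id):
    """Client side: recover the public IP and port the server saw."""
    msg_type, length, cookie, transaction_id = struct.unpack("!HHI12s", data[:20])
    if (msg_type, cookie, transaction_id) != (BINDING_SUCCESS, MAGIC_COOKIE, expected_transaction_id):
        raise ValueError("not the Binding success response we were waiting for")
    attr_type, attr_len = struct.unpack("!HH", data[20:24])
    if attr_type != XOR_MAPPED_ADDRESS:
        raise ValueError("this sketch only handles a leading XOR-MAPPED-ADDRESS attribute")
    _, family, x_port, x_addr = struct.unpack("!BBHI", data[24:24 + attr_len])
    ip = socket.inet_ntoa(struct.pack("!I", x_addr ^ MAGIC_COOKIE))
    return ip, x_port ^ 0x2112


request, tid = build_binding_request()
response = build_binding_response(tid, "203.0.113.7", 54321)
print(parse_binding_response(response, tid))  # ('203.0.113.7', 54321)
&lt;/code>&lt;/pre>
&lt;p>The returned address and port are what the client would advertise as its server reflexive ICE candidate.&lt;/p>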
&lt;p>Using STUN servers is the preferred approach for establishing P2P connections because they are only needed during the connection establishment phase and don't participate in actual data transmission, resulting in minimal overhead.&lt;/p>
&lt;h3 id="42-turn-servers">4.2. TURN Servers&lt;/h3>
&lt;p>However, in some complex network environments (such as symmetric NAT), peers cannot establish direct connections even if they know their public addresses. This is where TURN (Traversal Using Relays around NAT) servers come in.&lt;/p>
&lt;p>A TURN server is a more powerful relay server. When a direct P2P connection fails, both clients connect to the TURN server, which then forwards all audio, video, and data between them. This is no longer true P2P communication, but it ensures that a connection can still be established under the worst network conditions.&lt;/p>
&lt;p>Using TURN servers increases latency and server bandwidth costs, so they are typically used as a last resort.&lt;/p>
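&lt;p>The cost argument is easy to see in a toy model. The &lt;code>ToyTurnRelay&lt;/code> class below is a hypothetical simplification: it hands out relayed addresses and forwards payloads, but it does not implement the actual TURN protocol messages, authentication, or permissions. The relay's IP address and port range are placeholders.&lt;/p>
&lt;pre>&lt;code class="language-python">class ToyTurnRelay:
    """Toy model of a relay: every byte between the peers passes through it."""

    def __init__(self, public_ip="198.51.100.1", first_port=49152):
        self.public_ip = public_ip
        self.next_port = first_port
        self.allocations = {}  # relayed (ip, port) mapped to the owning client's inbox
        self.bytes_relayed = 0

    def allocate(self, client_inbox):
        # The client asks for a relayed address it can advertise as an ICE candidate.
        relayed = (self.public_ip, self.next_port)
        self.next_port += 1
        self.allocations[relayed] = client_inbox
        return relayed

    def send_to(self, relayed_address, payload: bytes):
        # A peer sends to the relayed address; the relay forwards to its owner.
        self.allocations[relayed_address].append(payload)
        self.bytes_relayed += len(payload)


relay = ToyTurnRelay()
inbox_a, inbox_b = [], []
addr_a = relay.allocate(inbox_a)
addr_b = relay.allocate(inbox_b)

relay.send_to(addr_b, b"audio frame from A")
relay.send_to(addr_a, b"audio frame from B")
print(relay.bytes_relayed)  # 36
&lt;/code>&lt;/pre>
&lt;p>Unlike STUN, the relay's byte counter grows for the whole duration of the call, which is where the bandwidth cost comes from.&lt;/p>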
&lt;h2 id="5-security">5. Security&lt;/h2>
&lt;p>Security is a core principle of WebRTC's design: encryption is mandatory for all communications and cannot be disabled.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Signaling security&lt;/strong>: The WebRTC standard doesn't specify a signaling protocol but recommends using secure WebSocket (WSS) or HTTPS to encrypt signaling messages.&lt;/li>
&lt;li>&lt;strong>Media encryption&lt;/strong>: All audio/video streams use &lt;strong>SRTP (Secure Real-time Transport Protocol)&lt;/strong> for encryption. SRTP prevents eavesdropping and content tampering by encrypting and authenticating RTP packets.&lt;/li>
&lt;li>&lt;strong>Data encryption&lt;/strong>: All &lt;code>RTCDataChannel&lt;/code> data is encrypted using &lt;strong>DTLS (Datagram Transport Layer Security)&lt;/strong>. DTLS is a protocol based on TLS that provides the same security guarantees for datagrams.&lt;/li>
&lt;/ul>
&lt;p>Key exchange is completed automatically through the DTLS handshake while the &lt;code>RTCPeerConnection&lt;/code> is being established, and the SRTP keys for media are derived from that handshake (DTLS-SRTP). This means a secure channel is established before any media or data exchange occurs.&lt;/p>
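&lt;p>One detail that ties signaling security to the DTLS handshake is the certificate fingerprint carried in SDP as an &lt;code>a=fingerprint&lt;/code> line: each peer checks that the certificate presented during the handshake matches the fingerprint it received over signaling. The sketch below shows that comparison for SHA-256 only. The helper names are hypothetical and the certificate bytes are a placeholder, not a real DER-encoded certificate.&lt;/p>
&lt;pre>&lt;code class="language-python">import hashlib
import hmac


def fingerprint(cert_der: bytes):
    """Format a SHA-256 digest as upper-case hex pairs joined by colons."""
    digest = hashlib.sha256(cert_der).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def fingerprint_from_sdp(sdp: str):
    # Look for a line of the form: a=fingerprint:sha-256 AB:CD:...
    for line in sdp.splitlines():
        if line.startswith("a=fingerprint:sha-256 "):
            return line.split(" ", 1)[1].strip()
    return None


def certificate_matches(sdp: str, presented_cert_der: bytes):
    advertised = fingerprint_from_sdp(sdp)
    if advertised is None:
        return False
    # Constant-time comparison of the advertised and computed fingerprints.
    return hmac.compare_digest(advertised.upper(), fingerprint(presented_cert_der))


# Placeholder bytes standing in for a DER-encoded certificate.
cert = b"placeholder certificate bytes"
remote_sdp = "v=0\r\na=fingerprint:sha-256 " + fingerprint(cert) + "\r\n"

print(certificate_matches(remote_sdp, cert))                  # True
print(certificate_matches(remote_sdp, b"some other cert"))    # False
&lt;/code>&lt;/pre>
&lt;p>This check is only as trustworthy as the signaling channel that delivered the SDP, which is one reason WSS or HTTPS is recommended for signaling.&lt;/p>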
&lt;h2 id="6-practical-application-cases">6. Practical Application Cases&lt;/h2>
&lt;p>With its powerful features, WebRTC has been widely applied in various scenarios:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Video conferencing systems&lt;/strong>: Such as Google Meet, Jitsi Meet, etc., allowing multi-party real-time audio/video calls.&lt;/li>
&lt;li>&lt;strong>Online education platforms&lt;/strong>: Enabling remote interactive teaching between teachers and students.&lt;/li>
&lt;li>&lt;strong>Telemedicine&lt;/strong>: Allowing doctors to conduct video consultations with patients remotely.&lt;/li>
&lt;li>&lt;strong>P2P file sharing&lt;/strong>: Using &lt;code>RTCDataChannel&lt;/code> for fast file transfers between browsers.&lt;/li>
&lt;li>&lt;strong>Cloud gaming and real-time games&lt;/strong>: Providing low-latency instruction and data synchronization for games.&lt;/li>
&lt;li>&lt;strong>Online customer service and video support&lt;/strong>: Businesses providing real-time video support services to customers through web pages.&lt;/li>
&lt;/ul>
&lt;h2 id="7-conclusion">7. Conclusion&lt;/h2>
&lt;p>WebRTC is a revolutionary technology that brings real-time communication capabilities directly into browsers, greatly lowering the barrier to developing rich media applications. Through the three core APIs of &lt;code>RTCPeerConnection&lt;/code>, &lt;code>MediaStream&lt;/code>, and &lt;code>RTCDataChannel&lt;/code>, combined with powerful signaling, ICE, and security mechanisms, WebRTC provides a complete, robust, and secure real-time communication solution.&lt;/p>
&lt;p>As network technology develops and 5G becomes more widespread, WebRTC's application scenarios will become even broader, with its potential in emerging fields such as IoT, augmented reality (AR), and virtual reality (VR) gradually becoming apparent. For developers looking to integrate high-quality, low-latency communication features into their applications, WebRTC is undoubtedly one of the most worthwhile technologies to focus on and learn about today.&lt;/p></description></item></channel></rss>