Posts

OpenAI WebRTC API Review

Image
There is a new interface added to OpenAI RealTime models. Now it supports WebRTC! Given the people working on it I'm sure it has to be great so as usual let’s take a look and see what is under the hood in terms of audio transmission. Signalling or Establishment of the connection There are two options for the establishment of a RealTime session with the OpenAI servers: WebSocket signalling : much nicer API without ugly SDPs involved but less suited for public networks. HTTP/WebRTC signalling : has an uglier API including SDP offer/answer negotiations but can work well in real networks that is critical for most of the use cases. In the rest of the post we will focus only in the later (HTTP/WebRTC) that is the most interesting one. Authentication The first step to use these RealTime APIs sending audio data directly from clients to OpenAI servers is to obtain an ephemeral key using you OpenAI API Secret. This is a simple HTTP request that for testing you can do from the command line: `...

Target Bitrates vs Max Bitrates

Image
  Not all the simulcast layers have the same encoding quality When using simulcast video encoding with WebRTC, the encoder generates different versions or layers of the video input with varying resolutions. Using this techniques a multiparty video server (SFU) can adapt the video that each participant in a room receives based on factors such as available bandwidth, CPU/battery level, or the rendering size of those videos in each receiver. How simulcast works with an SFU forwarding layers selectively These different versions of the video have varying resolutions, but what about their encoding quality? For example, if a user is receiving a video and rendering it in a window of 640x360, would he get the same quality if he receives the 640x360 layer as if he receives the highest layer of 1280x720? To answer this question about the quality of each resolution, we can examine first the bitrates used by each. But the interesting thing is that the bitrate of each resolution is not always th...

The Impact of Bursty Packet Loss on Audio Quality in WebRTC

Image
 Ensuring high-quality audio in WebRTC encounters a pivotal challenge amidst less than ideal network conditions, predominantly driven by the burstiness of packet loss. This phenomenon is prevalent in congested networks, areas with low mobile coverage, and public Wi-Fi setups. Within the WebRTC framework, an array of strategies exists to mitigate packet loss, yet their efficacy varies depending on the specific network dynamics. Among the most prevalent techniques are: OPUS Forward Error Correction (FEC): Each audio packet incorporates low-bitrate data from preceding packet, facilitating potential recovery in the event of a single packet lost. Packet Retransmissions: Leveraging standard NACK/RTX mechanisms, the receiver requests retransmission upon detecting packet sequence gaps. Packet Duplication: Sending multiple instances of the same packet aims to compensate for potential losses. It is like sending preemptive retransmissions to mitigate the impact of potential packet loss. Re...

Loss based bandwidth estimation in WebRTC

Image
Measuring available bandwidth and avoiding congestion is the most critical and complex part of the video pipeline in WebRTC. The concept of bandwidth estimation (BWE) is simple: monitor packet latency, and if latency increases or packet loss occurs, back off and send less data. The first part is known as delay-based estimation, while the second part, less known, is referred to as loss-based estimation. In the original implementation of WebRTC, the logic for loss-based estimation was straightforward: if there was more than 2% packet loss don't increase the bitrate sent and if it is more than 10% reduce the bitrate being sent. However, this naive approach had a flaw. Some networks also experience packet loss not due to congestion but inherent to the network itself (e.g., certain WiFi networks). We call that packet loss static or inherent packet loss. To address this issue, the latest versions of Google’s WebRTC library introduced a more modern and sophisticated solution after seve...

Audio Mixing or Forwarding

Image
How many audio streams should your WebRTC server forward to the participants in a room? There are various options, ranging from the simplest approach of forwarding everything, to the most extreme option of mixing all audio and sending just a single stream. A few weeks ago, we engaged in a Twitter conversation about this very topic . Following that discussion, bloggeek also wrote a post on the subject . For me it is always interesting to see what different types of applications are doing because at least in some of those cases they have the ability to do A/B testing and compare the results with millions of users before making a decision. The simplest way to determine the best approach is to enter a room with different applications and inspect the SDP (Session Description Protocol) in chrome://webrtc-internals . Within this tool, you can examine how many channels are being forwarded when you're in a room and look for potential clues within the SDP (some people use the "mixed...

Architecture for AI integration in conferencing applications

Image
With the latest improvements in ML technology, especially generative algorithms and large language models, more and more conferencing applications are adding these capabilities to their offerings. This ML technology can be applied to conferencing applications at two different levels: the infrastructure level with improvements in media handling and transmission, and the application level with new features or capabilities for the users. At the infrastructure level (codecs, noise suppression, etc.) most of the high-level ideas were covered in this other post . Some interesting recent advances are applying “ML codecs” for audio redundancy, and the next frontier is applying generative algorithms also to video, as well as general applications to photorealistic avatars. This post focuses on the second level (the application part) and how to implement typical features such as summarization, image generation, or moderation. The idea here is to present a reference architecture that can be used...

Review of Signaling in different WebRTC applications

Image
  This post provides a quick review of the signaling channel implementation in various popular WebRTC platforms. It examines the protocol used for the channel, how messages are serialized, and whether the applications use Session Description Protocol (SDP) as an opaque string over the wire, or if they instead send the required parameters in a custom format. To provide a variety of platforms, I have included a mix of popular end-user applications, cloud providers, and open-source implementations in the table. If you would like, I am happy to add others to the list. How was it tested? To test it, join a room and check in Chrome Developer Tools whether there are WebSocket connections established or periodic HTTP requests being made. Then, inspect the messages of those connections and requests and check if the format is Binary/JSON/XML. In case of Binary messages, it's harder to see the content, and there's a chance that the information is compressed/encrypted, and there's s...