May 18, 2024
  • At Meta, we support real-time communication (RTC) for billions of people through our apps, including Messenger, Instagram, and WhatsApp.
  • We’ve seen significant benefits by adopting the AV1 codec for RTC.
  • Here’s how we’re improving the RTC video quality for our apps with tools like the AV1 codec, the challenges we face, and how we mitigate those challenges.

The last few decades have seen tremendous improvements in mobile phone camera quality as well as video quality for streaming video services. But if we look at real-time communication (RTC) applications, while the video quality has also improved over time, it has always lagged behind camera quality.

When we looked at ways to improve video quality for RTC across our family of apps, AV1 stood out as the best option. Meta has increasingly adopted the AV1 codec over the years because it offers high video quality at bitrates much lower than older codecs. However, as we’ve implemented AV1 for mobile RTC, we’ve also had to address a number of challenges, including scaling, improving video quality for low-bandwidth users as well as high-end networks, CPU and battery usage, and maintaining quality stability.

Improving video quality for low-bandwidth networks

This post focuses on peer-to-peer (P2P, or 1:1) calls, which involve two participants.

People who use our products and services experience a range of network conditions: some have really great networks, while others are using throttled or low-bandwidth networks.

This chart illustrates what the distribution of bandwidth looks like for some of these calls on Messenger:

Figure 1: Bandwidth distribution of P2P calls on Messenger.

As seen in Figure 1, some calls operate in very low-bandwidth conditions.

We consider anything less than 300 Kbps to be a low-end network, but we also see a lot of video calls operating at just 50 Kbps, or even below 25 Kbps.

Note that this bandwidth is the share for the video encoder. Total bandwidth is shared with audio, RTP overhead, signaling overhead, RTX (retransmission of packets to recover from packet loss), FEC (forward error correction), packet duplication, and so on. The big assumption here is that the bandwidth estimator is working correctly and estimating true bitrates.

There are no universal definitions for low, mid, and high networks, but for the purpose of this blog post, less than 300 Kbps is considered low, 300-800 Kbps mid, and above 800 Kbps a high, HD-capable, or high-end network.
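As a minimal sketch, the tiers above can be written as a small classifier. The function and constant names are illustrative only, and treating exactly 300 and 800 Kbps as "mid" is an assumption about how the boundaries are handled.

```python
# Thresholds taken from the definitions in this post (video encoder share, in Kbps).
LOW_MAX_KBPS = 300
MID_MAX_KBPS = 800


def classify_network(video_kbps: float) -> str:
    """Map the video encoder's bandwidth share to the tiers used in this post."""
    if video_kbps < LOW_MAX_KBPS:
        return "low"
    if video_kbps <= MID_MAX_KBPS:  # boundary handling is an assumption
        return "mid"
    return "high"
```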

When we looked into improving the video quality for low-bandwidth users, there were a few key options. Migrating to a newer codec such as AV1 presented the greatest opportunity. Other options such as better video scalers and region-of-interest encoding offered incremental improvements.

Video scalers

We use WebRTC in most of our apps, but the video scalers shipped with WebRTC don’t have the best quality video scaling. We have been able to improve the video scaling quality significantly by leveraging in-house scalers.

At low bitrates, we often end up downscaling the video to encode at ¼ resolution (assuming the camera capture is 640×480 or 1280×720). With our custom scaler implementations, we have seen significant improvements in video quality. In public tests we saw gains in peak signal-to-noise ratio (PSNR) of 0.75 dB on average.

Here is a snapshot showing results with the default libyuv scaler (a box filter):

Figure 2.a: Video image results using WebRTC/libyuv video scaler.

And the results after downscaling with our video scaler:

Figure 2.b: Video image results using Meta’s video scaler.
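For readers unfamiliar with the terms, here is a rough sketch of what a box filter and PSNR look like on a single 8-bit plane stored as nested lists. It is a textbook illustration, not libyuv's code and not Meta's in-house scaler; halving width and height gives ¼ of the original pixel count.

```python
import math


def box_downscale_2x(plane):
    """Halve width and height by averaging each 2x2 block (a box filter)."""
    height, width = len(plane), len(plane[0])
    return [
        [
            # +2 rounds to nearest before the integer division by 4.
            (plane[y][x] + plane[y][x + 1] + plane[y + 1][x] + plane[y + 1][x + 1] + 2) // 4
            for x in range(0, width - 1, 2)
        ]
        for y in range(0, height - 1, 2)
    ]


def psnr(reference, test, peak=255):
    """Peak signal-to-noise ratio in dB between two equally sized planes."""
    count = len(reference) * len(reference[0])
    mse = sum(
        (a - b) ** 2
        for ref_row, test_row in zip(reference, test)
        for a, b in zip(ref_row, test_row)
    ) / count
    return float("inf") if mse == 0 else 10 * math.log10(peak * peak / mse)
```

A better scaler keeps more detail through the downscale, which shows up as a higher PSNR against the reference.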

Region-of-interest encoding

Identifying the region of interest (ROI) allowed us to optimize by spending more encoder bitrate in the area that’s most important to a viewer (the speaker’s face in a talking-head video, for example). Most mobile devices have APIs to locate the face region without any CPU overhead. Once we have found the face region, we can configure the encoder to spend more bits on this important region and fewer on the rest. The easiest way to do this was to have APIs on encoders to configure the quantization parameters (QP) for the ROI versus the rest of the image. These changes provided incremental improvements in video quality metrics like PSNR.
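To make the idea concrete, here is a minimal sketch of a per-block QP map built from a face rectangle. Everything in it is hypothetical: the function name, the 16-pixel block size, and the QP offset are illustrative choices, and real encoders expose ROI control through their own APIs.

```python
def roi_qp_map(width, height, face, base_qp, roi_delta=-6, block=16):
    """Build a per-block QP grid with a lower QP (finer quantization) on the face.

    `face` is (x, y, w, h) in pixels. `roi_delta` and `block` are illustrative
    values, not settings from any particular encoder.
    """
    fx, fy, fw, fh = face
    cols = (width + block - 1) // block
    rows = (height + block - 1) // block
    grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            bx, by = c * block, r * block
            # A block belongs to the ROI if it overlaps the face rectangle.
            overlaps = bx < fx + fw and bx + block > fx and by < fy + fh and by + block > fy
            row.append(max(0, base_qp + roi_delta) if overlaps else base_qp)
        grid.append(row)
    return grid
```

Blocks covering the face get a lower QP and therefore more bits, while the rest of the frame stays at the base QP.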

Adopting the AV1 video codec

The video encoder is a key element when it comes to video quality for RTC. H.264 has been the most popular codec over the last decade, with hardware support and most applications supporting it. But it’s a 20-year-old codec. Back in 2018, the Alliance for Open Media (AOMedia) standardized the AV1 video codec. Since then, several companies including Meta, YouTube, and Netflix have deployed it at a large scale for video streaming.

At Meta, moving from H.264 to AV1 led us to our greatest improvements in video quality at low bitrates.

Figure 3: Improvements over time, moving from H.264 to AV1 and H.266

Why AV1?

We chose to use AV1 in part because it’s royalty-free. Codec licensing (and concurrent fees) was an important aspect in our decision-making process. Typically, if an application uses a device’s hardware codec, no additional codec licensing costs will be incurred. But if an application is shipping a software version of the codec, there will most likely be licensing costs to cover.

But why do we need to use software codecs even though most phones have hardware-supported codecs?

Most mobile devices have dedicated hardware for video encoding and decoding. And these days most mobile devices support H.264 and even H.265. But these encoders are designed for common use cases such as camera capture, which uses much higher resolutions, frame rates, and bitrates. Most mobile device hardware is currently capable of encoding 4K 60 FPS in real time with very low battery usage, but the results of encoding a 7 FPS, 320×180, 200 Kbps video are often worse than those of software encoders running on the same mobile device.

The reason for that is prioritization of the RTC use case. Most independent hardware vendors (IHVs) are not aware of the network conditions where RTC calls operate; hence, these hardware codecs are not optimized for RTC scenarios, especially for low bitrates, resolutions, and frame rates. So, we leverage software encoders when operating at these low bitrates to provide high-quality video.

And since we can’t ship software codecs without a license, AV1 is a very good option for RTC.

AV1 for RTC

The biggest reason to move to a more advanced video codec is simple: The same quality experience can be delivered with a much lower bitrate, and we can deliver a much higher-quality real-time calling experience for our users who are on bandwidth-constrained networks.

Measuring video quality is a complex topic, but a relatively simple way to look at it is to use the Bjontegaard Delta-Bit Rate (BD-BR) metric. BD-BR compares how much bitrate various codecs need to produce a certain quality level. By generating multiple samples at different bitrates and measuring the quality of the produced video, you get a rate-distortion (RD) curve, and from the RD curves you can derive the BD-BR (as shown below).

As can be seen in Figure 4, AV1 provided higher quality for all bitrate ranges in our local tests.

Figure 4: Bitrate distortion comparison chart.
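To show how a BD-BR number falls out of two RD curves, here is a simplified sketch. The standard Bjontegaard method fits a cubic polynomial to log-bitrate as a function of quality; this version swaps in piecewise-linear interpolation to stay dependency-free, so its results will differ slightly from reference tools. It assumes each curve has strictly increasing quality values and that the two curves overlap in quality.

```python
import math


def _interp(x, xs, ys):
    """Piecewise-linear interpolation of ys over strictly increasing xs."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x is outside the curve")


def _avg_log_rate(points, lo, hi, steps=200):
    """Average log-bitrate over the quality range [lo, hi] (trapezoid rule)."""
    pts = sorted(points, key=lambda p: p[1])
    quality = [p[1] for p in pts]
    log_rate = [math.log(p[0]) for p in pts]
    h = (hi - lo) / steps
    vals = [_interp(min(hi, lo + i * h), quality, log_rate) for i in range(steps + 1)]
    area = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    return area / (hi - lo)


def bd_rate(anchor, test):
    """Average bitrate change (%) of `test` vs `anchor` at equal quality.

    Each curve is a list of (bitrate_kbps, psnr_db) points. Negative means
    `test` needs less bitrate for the same quality.
    """
    lo = max(min(p[1] for p in anchor), min(p[1] for p in test))
    hi = min(max(p[1] for p in anchor), max(p[1] for p in test))
    diff = _avg_log_rate(test, lo, hi) - _avg_log_rate(anchor, lo, hi)
    return (math.exp(diff) - 1) * 100
```

For example, if a test codec reaches every quality level of the anchor at half the bitrate, `bd_rate` returns about -50.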

Screen-encoding tools

AV1 also has a few key tools that are useful for RTC. Screen content quality is becoming an increasingly important factor for Meta, with relevant use cases, including screen sharing, game streaming, and VR remote desktop, requiring high-quality encoding. In these areas, AV1 truly shines.

Traditionally, video encoders aren’t well suited to complex content such as text with a lot of high-frequency content, and humans are sensitive to reading blurry text. AV1 has a set of coding tools (palette mode and intra-block copy) that drastically improve performance for screen content. Palette mode is designed according to the observation that the pixel values in a screen-content frame usually concentrate on a limited number of color values. Palette mode can represent the screen content efficiently by signaling the color clusters instead of the quantized transform-domain coefficients. In addition, for typical screen content, repetitive patterns can usually be found within the same picture. Intra-block copy facilitates block prediction within the same frame, so that the compression efficiency can be improved significantly. That AV1 provides these two tools in the baseline profile is a huge plus.
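The intuition behind palette mode can be sketched in a few lines: a block with few distinct colors is stored as a small palette plus one index per pixel. This is a conceptual illustration only; the actual AV1 tool bounds the palette size and entropy-codes the indices, which this sketch does not attempt.

```python
def palette_encode(block):
    """Represent a block as (palette, index map) instead of raw pixel values."""
    palette = sorted({px for row in block for px in row})
    lookup = {color: i for i, color in enumerate(palette)}
    indices = [[lookup[px] for px in row] for row in block]
    return palette, indices


def palette_decode(palette, indices):
    """Rebuild the exact pixel values from the palette and index map."""
    return [[palette[i] for i in row] for row in indices]
```

For black text on a white background, the palette has two entries and every pixel needs only a one-bit index, with no blur from transform quantization.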

Reference picture resampling: Fewer key frames

Another useful feature is reference picture resampling (RPR), which allows resolution changes without generating a key frame. In video compression, a key frame is one that’s encoded independently, like a still image. It’s the only type of frame that can be decoded without having another frame as reference.

For RTC applications, since the available bandwidth changes often, frequent resolution changes are needed to adapt to these network changes. With older codecs like H.264, each of these resolution changes requires a key frame that’s much larger in size and thus inefficient for RTC apps. Such large key frames increase the amount of data needing to be sent over the network and result in higher end-to-end latencies and congestion.

By using RPR, we can avoid generating key frames on resolution changes.

Challenges around improving video quality for low-bandwidth users

CPU/battery usage

AV1 is great for coding efficiency, but codecs achieve this at the cost of higher CPU and battery usage. A lot of modern codecs pose these challenges when running real-time applications on mobile platforms.

Based on local lab testing, we expected a roughly 4 percent increase in battery usage, and we saw similar results in public tests. We used a power meter to do this local battery measurement.

Even though the AV1 encoder itself increased CPU usage three-fold when compared to the H.264 implementation, the overall contribution of CPU usage from the encoder was a small part of the battery usage. The phone’s display screen, networking/radio, and other processes using the CPU contribute significantly to battery usage, hence the increase in battery usage was 5-6 percent (a significant increase in battery usage).

A lot of calls end because the device runs out of battery, or people hang up once their operating system indicates a low battery, so increasing battery usage isn’t worthwhile for users unless it provides increased value such as video quality improvement. Even then it’s a trade-off between video quality and battery use.

We use WebRTC and Session Description Protocol (SDP) for codec negotiation, which allows us to negotiate multiple codecs (e.g., AV1 and H.264) up front and then switch the codecs without any need for signaling or a handshake during the call. This means the codec switch is seamless, without users noticing any glitches or pauses in video.

We created a custom encoder that encapsulates both the H.264 and AV1 encoders. We call it a hybrid encoder. This allowed us to switch the codec during the call based on triggers such as CPU usage, battery level, or encoding time, and to switch to the more battery-efficient H.264 encoder when needed.
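The selection logic of such a hybrid encoder might look roughly like the sketch below. The three triggers come from the description above; the type, the function, and every threshold are hypothetical values chosen for illustration, not the ones used in production.

```python
from dataclasses import dataclass


@dataclass
class DeviceStats:
    cpu_load: float       # 0.0 to 1.0
    battery_level: float  # 0.0 to 1.0
    encode_ms: float      # average time to encode one frame


def pick_codec(stats, frame_budget_ms=33.0, max_cpu=0.85, min_battery=0.20):
    """Return "AV1" unless any trigger says to fall back to H.264.

    All thresholds are illustrative assumptions.
    """
    if stats.cpu_load > max_cpu:
        return "H264"
    if stats.battery_level < min_battery:
        return "H264"
    if stats.encode_ms > frame_budget_ms:  # encoder cannot keep up with the frame rate
        return "H264"
    return "AV1"
```

Because both codecs are negotiated up front, a wrapper like this can re-evaluate the choice periodically during a call and start the next frame with the other encoder.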

Increased crashes and out of memory errors

Even without new leaks added, AV1 used more memory than H.264. Any time additional memory is used, apps are more likely to hit out of memory (OOM) crashes or hit OOM sooner because of other leaks or memory demands on the system from other apps. To mitigate this, we had to disable AV1 on devices with low memory. This is one area for improvement and for further optimizing the encoder’s memory usage.

In-product quality measurement

To compare the quality between H.264 and AV1 using public tests, we needed a low-complexity metric. Metrics such as encoded bitrates and frame rates won’t show any gains because the total bandwidth available to send video is still the same: it is limited by the network capacity, which means the bitrates and frame rates for video will not change much with the change in the codec. We had been using composite metrics that combine the quantization parameter (QP is often used as a proxy for video quality, as quantization introduces pixel data loss during the encoding process), resolutions, frame rate, and freezes to generate video composite metrics, but QP is not comparable between the AV1 and H.264 codecs, and hence can’t be used.

PSNR is a standard metric, but it’s reference-based and hence doesn’t work for RTC. Non-reference video-quality metrics are quite CPU-intensive (e.g., BRISQUE: Blind/Referenceless Image Spatial Quality Evaluator), though we’re exploring these as well.

Figure 6: High-level architecture for PSNR computation in RTC.

We have come up with a framework for PSNR computation. We first modified the encoder to report distortions caused by compression (most software encoders already have support for this metric). Then we designed a lightweight scaling-distortion algorithm that estimates the distortion introduced by video scaling. This algorithm can combine these scaling distortions with the encoder distortions to produce output PSNR. We developed and verified this algorithm locally and will be sharing the findings in publications and at academic conferences over the next year. With this lightweight PSNR metric, we saw 2 dB improvements with AV1 compared to H.264.
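The details of that algorithm are not described here, so the following is only a guess at the simplest possible combination step. It assumes both distortions are expressed as mean squared error on the same pixel grid and that they are uncorrelated, so they can be added before converting to PSNR. The function name and that additivity assumption are ours, not the published method.

```python
import math


def combined_psnr(encoder_mse, scaling_mse, peak=255.0):
    """Estimate end-to-end PSNR from two distortion terms.

    Assumption (not from the source): the compression and scaling errors are
    uncorrelated, so their mean squared errors simply add.
    """
    total_mse = encoder_mse + scaling_mse
    if total_mse == 0:
        return float("inf")
    return 10 * math.log10(peak * peak / total_mse)
```

Under this assumption, ignoring the scaling term would overstate quality whenever the video is downscaled before encoding, which is exactly the low-bitrate case.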

Challenges around improving video quality for high-end networks

As a quick review: For our purposes, high bandwidth covers users for whom bandwidth is greater than 800 Kbps.

Over the years, there have been huge improvements in camera capture quality. As a result, people’s expectations have gone up, and they want to see RTC video quality on par with local camera capture quality.

Based on local testing, we settled on settings resulting in video quality that looks similar to that of camera recordings. We call this HD mode. We found that with a video codec like H.264 encoding at 3.5 Mbps and 30 frames per second, 720p resolution looked very similar to local camera recordings. We also compared 720p to 1080p in subjective quality tests and found that the difference is not noticeable on most devices except for those with a larger screen.

Bandwidth estimator improvements

Improving the video quality for users who have high-end phones with good CPUs, good batteries, hardware codecs, and good network speeds seems trivial. It may seem like all you have to do is increase the maximum bitrate, capture resolution, and capture frame rates, and users will send high-quality video. But, in reality, it’s not that simple.

If you increase the bitrate, you expose your bandwidth estimation and congestion detection algorithm to hit congestion more often, and your algorithm will be tested many more times than if you were not using these higher bitrates.

Figure 7: Example showing how using higher bandwidth increases the instances of congestion.

If you look at the network pipeline in Figure 7, the higher the bitrates you are using, the more your algorithm/code will be tested for robustness over the time of the RTC call. Figure 7 shows how using 1 Mbps hits more congestion than using 500 Kbps, and using 3 Mbps hits more congestion than 1 Mbps, and so on. If you are using bandwidths lower than the minimum throughput of the network, however, you won’t hit congestion at all. For example, see the 500-Kbps call in Figure 7.
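The same point can be made with a toy calculation: count how many samples of a network throughput trace fall below a given send bitrate. The trace values below are made up for illustration and are not taken from the figure.

```python
def congested_samples(throughput_kbps, send_kbps):
    """Count the samples where the send bitrate exceeds the available throughput."""
    return sum(1 for capacity in throughput_kbps if send_kbps > capacity)


# Hypothetical per-second throughput of one network path, in Kbps.
trace = [3500, 2000, 900, 600, 1500, 4000]

for send in (500, 1000, 3000):
    print(send, "Kbps ->", congested_samples(trace, send), "congested samples")
```

With this trace, a 500 Kbps call never exceeds the path capacity, a 1 Mbps call exceeds it twice, and a 3 Mbps call exceeds it four times, so the congestion handling gets exercised far more often at higher bitrates.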

To mitigate these issues, we improved congestion detection. For example, we added custom ISP throttling detection, something that was not being caught by the traditional delay-based estimator of WebRTC.

Bandwidth estimation and network resilience comprise a complex area on their own, and this is where RTC products stand out. They have their own custom algorithms that work best for their products and customers.

Stable quality

People don’t like oscillations in video quality. These can happen when we send high-quality video for a few seconds and then drop back to low-quality because of congestion. Learning from past history, we added support in bandwidth estimation to prevent these oscillations.
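One common way to damp this kind of oscillation is hysteresis: step down immediately when the estimate drops, but step up only after the estimate has stayed high for a while. The sketch below illustrates that general technique; it is not the mechanism described in this post, and the thresholds and hold length are hypothetical.

```python
def next_quality(current, estimates_kbps, up_kbps=1200, down_kbps=800, hold=5):
    """Pick the next quality level ("low" or "high") from recent bandwidth estimates.

    Steps down at once when the latest estimate is below `down_kbps`; steps up
    only after `hold` consecutive estimates at or above `up_kbps`.
    All numbers are illustrative assumptions.
    """
    latest = estimates_kbps[-1]
    if current == "high" and latest < down_kbps:
        return "low"
    recent = estimates_kbps[-hold:]
    if current == "low" and len(recent) == hold and min(recent) >= up_kbps:
        return "high"
    return current
```

The gap between the two thresholds, plus the hold period, keeps a call from flipping between modes when the estimate hovers around a single cutoff.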

Audio is more important than video for RTC

When network congestion occurs, all media packets could be lost. This causes video freezes and broken audio (aka robotic audio). For RTC, both are bad, but audio quality is more important than video.

Broken audio often completely prevents conversations from happening, often causing people to hang up or redial the call. Broken video, on the other hand, usually results in less pleasant conversations, but, depending on the scenario, it can also be a blocker for some users.

At high bitrates like 2.5 Mbps and higher, you can afford to have three to five times more audio packets or duplication without any noticeable degradation to video. When operating at these higher bitrates on cell phone connections, we saw more of these congestion, packet loss, and ISP throttling issues, so we had to make changes to our network resiliency algorithms. And since people are highly sensitive to data usage on their cell phones, we disabled high bitrates on cellular connections.
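A quick back-of-the-envelope calculation shows why extra audio redundancy is affordable at high bitrates. The 32 Kbps audio bitrate below is an assumed figure for illustration and does not come from this post.

```python
def audio_share(total_kbps, audio_kbps=32.0, copies=5):
    """Fraction of the total bitrate used when audio is sent `copies` times.

    `audio_kbps` is an assumed audio bitrate, not a measured value.
    """
    return audio_kbps * copies / total_kbps


print(f"{audio_share(2500):.1%} of a 2.5 Mbps call")
print(f"{audio_share(300):.1%} of a 300 Kbps call")
```

Under that assumption, sending audio five times over takes a single-digit percentage of a 2.5 Mbps call but more than half of a 300 Kbps call.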

When to enable HD?

We used ML-based targeting to guess which calls should be HD-capable. We relied on the network stats from the users’ previous calls to predict whether HD should be enabled or not.

Battery regressions

We have a number of metrics, including performance, networking, and media quality, to track the quality of RTC calls. When we ran tests for HD, we noticed regressions in battery metrics. What we found was that most battery regressions don’t come from higher bitrates or resolution but from the capture frame rates.

To mitigate the regressions, we built a mechanism for detecting both caller and callee device capabilities, including device model, battery levels, Wi-Fi or mobile usage, and so on. To enable high-quality modes, we check both sides of the call to ensure that they satisfy the requirements and only then do we enable these high-quality, resource-intensive configurations.

Figure 8: Signaling server setup for turning HD on or off.
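The two-sided check can be sketched as follows. The fields mirror the capabilities listed above, but the type, the device tiers, and the battery threshold are hypothetical and stand in for whatever the signaling server actually evaluates.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    device_tier: str      # e.g. "high" or "low", derived from the device model
    battery_level: float  # 0.0 to 1.0
    on_wifi: bool


def can_use_hd(endpoint, min_battery=0.30):
    """Check one side of the call. The battery threshold is an assumption."""
    return (
        endpoint.device_tier == "high"
        and endpoint.battery_level >= min_battery
        and endpoint.on_wifi
    )


def enable_hd(caller, callee):
    """Enable HD only when both sides of the call qualify."""
    return can_use_hd(caller) and can_use_hd(callee)
```

Requiring both sides to qualify avoids spending the sender's battery on frames the receiver cannot decode or display at full quality.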

What the future holds for RTC

Hardware manufacturers are acknowledging the significant benefits of using AV1 for RTC. The new Apple iPhone 15 Pro supports AV1 hardware decoding, and the Google Pixel 8 supports AV1 encoding and decoding. Hardware codecs are an absolute necessity for high-end networks and HD resolutions. Video calling is becoming as ubiquitous as traditional audio calling, and we hope that as hardware manufacturers recognize this shift, there will be more opportunities for collaboration between RTC app creators and hardware manufacturers to optimize encoders for these scenarios.

On the software side, we will continue to work on optimizing AV1 software encoders and developing new encoder implementations. We try to provide the best experience for our users, but at the same time we want to let people have full control over their RTC experience. We will provide controls to the users so that they can choose whether they want higher quality at the cost of battery and data usage, or vice versa.

We also plan to work with IHVs to collaborate on hardware codec development to make these codecs usable for RTC scenarios including low-bandwidth use cases.

We will also investigate forward-looking features such as video processing to increase the resolution and frame rates on the receiver’s rendering stack and leveraging AI/ML to improve bandwidth estimation (BWE) and network resiliency.

Further, we’re investigating Pixel Codec Avatar technologies that will allow us to transmit the model/share once and then send the geometry/vectors for receiver-side rendering. This enables video rendering with much smaller bandwidth usage than traditional video codecs for RTC scenarios.