May 18, 2024
  • Bandwidth estimation (BWE) and congestion control play an important role in delivering high-quality real-time communication (RTC) across Meta’s family of apps.
  • We’ve adopted a machine learning (ML)-based approach that allows us to solve networking problems holistically across layers such as BWE, network resiliency, and transport.
  • We’re sharing our experiment results from this approach, some of the challenges we encountered during execution, and learnings for new adopters.

Our existing bandwidth estimation (BWE) module at Meta is based on WebRTC’s Google Congestion Controller (GCC). We have made several improvements through parameter tuning, but this has resulted in a more complex system, as shown in Figure 1.

Figure 1: BWE module’s system diagram for congestion control in RTC.

One challenge with the tuned congestion control (CC)/BWE algorithm was that it had many parameters and actions that were dependent on network conditions. For example, there was a trade-off between quality and reliability; improving quality for high-bandwidth users often led to reliability regressions for low-bandwidth users, and vice versa, making it challenging to optimize the user experience across different network conditions.

Additionally, we noticed some inefficiencies in improving and maintaining the complex BWE module:

  1. Due to the absence of realistic network conditions during our experimentation process, fine-tuning the parameters for user clients required multiple attempts.
  2. Even after the rollout, it wasn’t clear whether the optimized parameters were still applicable for the targeted network types.
  3. This resulted in complex code logic and branches for engineers to maintain.

To solve these inefficiencies, we developed a machine learning (ML)-based, network-targeting approach that offers a cleaner alternative to hand-tuned rules. This approach also allows us to solve networking problems holistically across layers such as BWE, network resiliency, and transport.

Network characterization

An ML model-based approach leverages time series data to improve bandwidth estimation by using offline parameter tuning for characterized network types.

For an RTC call to be established, the endpoints must be connected to each other through network devices. The optimal configs that have been tuned offline are stored on the server and can be updated in real time. During the call connection setup, these optimal configs are delivered to the client. During the call, media is transferred directly between the endpoints or through a relay server. Depending on the network signals collected during the call, an ML-based approach characterizes the network into different types and applies the optimal configs for the detected type.
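The flow above can be sketched in a few lines. This is a hypothetical illustration, not Meta's actual API: the config names, values, and lookup functions are all assumptions; the point is only the shape of the server-side delivery and client-side selection by detected network type.

```python
# Offline-tuned parameter sets keyed by characterized network type
# (all names and values here are illustrative placeholders).
TUNED_CONFIGS = {
    "random_loss": {"loss_tolerance_pct": 10, "rampup_factor": 1.5, "fec_ratio": 0.2},
    "bursty_loss": {"loss_tolerance_pct": 2, "rampup_factor": 1.0, "fec_ratio": 0.1},
    "default": {"loss_tolerance_pct": 5, "rampup_factor": 1.2, "fec_ratio": 0.0},
}

def configs_for_call() -> dict:
    """Server side: bundle the tuned configs into the call-setup payload."""
    return dict(TUNED_CONFIGS)

def apply_config(detected_type: str, configs: dict) -> dict:
    """Client side: pick the config for the detected type, falling back to default."""
    return configs.get(detected_type, configs["default"])

config = apply_config("random_loss", configs_for_call())
print(config["fec_ratio"])  # 0.2
```

Because the configs live on the server and are delivered at setup, they can be updated without shipping a new client, which is the operational benefit the approach is after.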

Figure 2 illustrates an example of an RTC call that’s optimized using the ML-based approach.

Figure 2: An example RTC call configuration with optimized parameters delivered from the server and based on the current network type.

Model learning and offline parameter tuning

At a high level, network characterization consists of two main components, as shown in Figure 3. The first component is offline ML model learning, which categorizes the network type (random packet loss versus bursty loss). The second component uses offline simulations to tune parameters optimally for the categorized network type.

Figure 3: Offline ML-model learning and parameter tuning.

For model learning, we leverage time series data (network signals and non-personally identifiable information; see Figure 6, below) from production calls and simulations. Compared to the aggregate metrics logged after the call, time series data captures the time-varying nature and dynamics of the network. We use FBLearner, our internal AI stack, for the training pipeline and deliver the PyTorch model files on demand to the clients at the start of the call.

For offline tuning, we use simulations to run network profiles for the detected types and choose the optimal parameters for the modules based on improvements in technical metrics (such as quality, freeze, and so forth).
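The offline tuning step can be sketched as a simple grid search. The simulator and its metric below are stand-ins (the real system runs full network-profile simulations): each candidate parameter set is evaluated against a simulated profile, and the one with the best technical metric, here a toy freeze percentage, is kept.

```python
from itertools import product

def simulate_freeze_pct(profile: dict, loss_tolerance: int, rampup: float) -> float:
    """Toy simulator stand-in: freezes shrink as the tolerance approaches the
    profile's loss rate and grow if ramp-up exceeds the link's headroom."""
    mismatch = abs(profile["loss_pct"] - loss_tolerance)
    return mismatch * 0.5 + max(0.0, rampup - profile["headroom"]) * 2.0

def tune(profile: dict, tolerances, rampups):
    """Grid search: return the (tolerance, rampup) pair minimizing freezes."""
    return min(
        product(tolerances, rampups),
        key=lambda p: simulate_freeze_pct(profile, *p),
    )

best = tune({"loss_pct": 8, "headroom": 1.3}, tolerances=[2, 5, 8], rampups=[1.0, 1.5])
print(best)  # (8, 1.0)
```

The winning pair is what would be stored server-side as the tuned config for that network type.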

Model architecture

From our experience, we’ve found that it’s necessary to combine time series features with non-time series features (i.e., metrics derived from the time window) for highly accurate modeling.

To handle both time series and non-time series data, we’ve designed a model architecture that can process input from both sources.

The time series data passes through a long short-term memory (LSTM) layer that converts the time series input into a one-dimensional vector representation, such as 16×1. The non-time series, or dense, data passes through a dense layer (i.e., a fully connected layer). The two vectors are then concatenated, to fully represent the network condition in the past, and passed through another fully connected layer. The final output of the neural network model is the predicted output for the target/task, as shown in Figure 4.
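A minimal PyTorch sketch of this combined architecture follows (the post says the shipped models are PyTorch). The layer sizes and feature counts are assumptions for illustration, not the production dimensions.

```python
import torch
import torch.nn as nn

class CombinedNet(nn.Module):
    def __init__(self, ts_features=4, dense_features=6, hidden=16, classes=2):
        super().__init__()
        # LSTM compresses the (batch, time, features) series into one vector.
        self.lstm = nn.LSTM(ts_features, hidden, batch_first=True)
        # Fully connected layer embeds the non-time-series (dense) features.
        self.dense = nn.Linear(dense_features, hidden)
        # Head operates on the concatenation of both representations.
        self.head = nn.Linear(hidden * 2, classes)

    def forward(self, ts, dense):
        _, (h_n, _) = self.lstm(ts)       # h_n: (num_layers, batch, hidden)
        ts_vec = h_n[-1]                  # one 16-dim vector per sample
        dense_vec = torch.relu(self.dense(dense))
        return self.head(torch.cat([ts_vec, dense_vec], dim=1))

model = CombinedNet()
logits = model(torch.randn(8, 10, 4), torch.randn(8, 6))
print(logits.shape)  # torch.Size([8, 2])
```

The final hidden state of the LSTM plays the role of the 16×1 time series representation described above; the two-class head corresponds to a binary task such as the random-loss classification below.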

Figure 4: Combined-model architecture with LSTM and dense layers.

Use case: Random packet loss classification

Let’s consider the use case of categorizing packet loss as either random or congestion-caused. The former is due to network components, and the latter is due to limits in queue length (which are delay dependent). Here is the ML task definition:

Given the network conditions in the past N seconds (N = 10), and that the network is currently incurring packet loss, the goal is to characterize the packet loss at the current timestamp as RANDOM or not.

Figure 5 illustrates how we leverage the architecture to achieve that goal:

Figure 5: Model architecture for a random packet loss classification task.

Time series features

We leverage the following time series features gathered from logs:

Figure 6: Time series features used for model training.

BWE optimization

When the ML model detects random packet loss, we perform local optimization on the BWE module by:

  • Increasing the tolerance to random packet loss in the loss-based BWE (holding the bitrate).
  • Increasing the ramp-up speed, depending on the link capacity, on high bandwidths.
  • Increasing network resiliency by sending additional forward-error-correction packets to recover from packet loss.
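The three adjustments above can be sketched as a single state transition. This is a hedged illustration: the `BweState` fields, thresholds, and constants are assumptions chosen for readability, not production values.

```python
from dataclasses import dataclass

@dataclass
class BweState:
    loss_tolerance_pct: float   # loss level the loss-based estimator tolerates
    rampup_step_kbps: float     # how fast the estimate probes upward
    fec_ratio: float            # share of additional FEC packets sent

def on_random_loss_detected(state: BweState, link_capacity_kbps: float) -> BweState:
    """Apply the random-loss optimizations: tolerate more loss (holding
    bitrate), ramp up faster on high-capacity links, and add FEC protection."""
    high_capacity = link_capacity_kbps > 1000  # assumed threshold
    return BweState(
        loss_tolerance_pct=max(state.loss_tolerance_pct, 10.0),
        rampup_step_kbps=state.rampup_step_kbps * (2.0 if high_capacity else 1.0),
        fec_ratio=min(state.fec_ratio + 0.1, 0.3),
    )

new = on_random_loss_detected(BweState(2.0, 50.0, 0.0), link_capacity_kbps=2000)
print(new.rampup_step_kbps)  # 100.0
```

Keeping the update a pure function of the previous state makes it easy to swap in a different tuned config per detected network type.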

Network prediction

The network characterization problem discussed in the previous sections focuses on classifying network types based on past information using time series data. For such simple classification tasks, hand-tuned rules can achieve this, with some limitations. The real power of leveraging ML for networking, however, comes from using it to predict future network conditions.

We have applied ML to congestion-prediction problems to optimize the experience of low-bandwidth users.

Congestion prediction

From our analysis of production data, we found that low-bandwidth users often incur congestion due to the behavior of the GCC module. By predicting this congestion, we can improve reliability for these users. Toward this, we addressed the following problem statement using round-trip time (RTT) and packet loss:

Given the historical time-series data from production/simulation (“N” seconds), the goal is to predict packet loss due to congestion, or the congestion itself, in the next “N” seconds; that is, a spike in RTT followed by packet loss or a further growth in RTT.

Figure 7 shows an example from a simulation where the bandwidth alternates between 500 Kbps and 100 Kbps every 30 seconds. As we lower the bandwidth, the network incurs congestion, and the ML model predictions fire (the green spikes) even before the delay spikes and packet loss occur. This early prediction of congestion enables faster reactions and thus improves the user experience by preventing video freezes and connection drops.

Figure 7: Simulated network scenario with alternating bandwidth for congestion prediction.

Generating training samples

The main challenge in modeling is generating training samples for a variety of congestion situations. With simulations, it’s harder to capture the different kinds of congestion that real user clients encounter in production networks. As a result, we used actual production logs for labeling congestion samples, applying RTT-spike criteria to the past and future windows according to the following assumptions:

  • Absent past RTT spikes, packet losses in the past and future are independent.
  • Absent past RTT spikes, we cannot predict future RTT spikes or fractional losses (i.e., flosses).

We split the time window into past (4 seconds) and future (4 seconds) for labeling.
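The labeling criteria above can be sketched as follows. This is an illustrative reading of the stated assumptions, not the production labeler: the spike threshold and baseline choice are assumed placeholders.

```python
SPIKE_FACTOR = 2.0  # assumed: RTT above 2x the window's baseline counts as a spike

def has_spike(rtts):
    """Detect an RTT spike relative to the window's minimum as baseline."""
    baseline = min(rtts)
    return any(r > SPIKE_FACTOR * baseline for r in rtts)

def label_sample(past_rtts, future_rtts, future_loss):
    """Positive (congestion) label only when a past RTT spike precedes future
    loss or further RTT growth; absent past spikes, future losses are treated
    as unpredictable and the sample is labeled negative."""
    if not has_spike(past_rtts):
        return 0
    return 1 if (future_loss > 0 or has_spike(future_rtts)) else 0

# Past spike followed by continued RTT growth: positive sample.
print(label_sample([50, 52, 130, 55], [60, 140, 150, 160], future_loss=0))  # 1
# Loss with no preceding spike: negative by assumption.
print(label_sample([50, 51, 52, 50], [60, 140, 150, 160], future_loss=3))   # 0
```

Each labeled sample pairs a 4-second past window (model input) with a 4-second future window (label source), matching the split described above.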

Figure 8: Labeling criteria for congestion prediction.

Model performance

Unlike network characterization, where ground truth is unavailable, here we can obtain ground truth by examining the future time window after it has passed and then comparing it with the prediction made four seconds earlier. With this logging information gathered from real production clients, we compared the performance of offline training against online data from user clients:
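Scoring predictions against this delayed ground truth reduces to a standard classification evaluation: each prediction made at time t is matched with the label observed once the following 4-second window has closed. The metric computation below is standard; the data is made up for illustration.

```python
def precision_recall(predictions, labels):
    """predictions/labels: parallel lists of 0/1, one per evaluation window."""
    tp = sum(1 for p, l in zip(predictions, labels) if p and l)
    fp = sum(1 for p, l in zip(predictions, labels) if p and not l)
    fn = sum(1 for p, l in zip(predictions, labels) if not p and l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Predictions fired at time t, labels collected after the future window closed.
preds = [1, 0, 1, 1, 0, 0]
truth = [1, 0, 0, 1, 1, 0]
print(precision_recall(preds, truth))  # (0.6666666666666666, 0.6666666666666666)
```

Running the same computation over offline training data and over online client logs gives the two columns compared in Figure 9.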

Figure 9: Offline versus online model performance comparison.

Experiment results

Here are some highlights from our deployment of various ML models to improve bandwidth estimation:

Reliability wins for congestion prediction

✅ connection_drop_rate -0.326371 +/- 0.216084
✅ last_minute_quality_regression_v1 -0.421602 +/- 0.206063
✅ last_minute_quality_regression_v2 -0.371398 +/- 0.196064
✅ bad_experience_percentage -0.230152 +/- 0.148308
✅ transport_not_ready_pct -0.437294 +/- 0.400812

✅ peer_video_freeze_percentage -0.749419 +/- 0.180661
✅ peer_video_freeze_percentage_above_500ms -0.438967 +/- 0.212394

Quality and user engagement wins for random packet loss characterization in high bandwidth

✅ peer_video_freeze_percentage -0.379246 +/- 0.124718
✅ peer_video_freeze_percentage_above_500ms -0.541780 +/- 0.141212
✅ peer_neteq_plc_cng_perc -0.242295 +/- 0.137200

✅ total_talk_time 0.154204 +/- 0.148788

Reliability and quality wins for cellular low bandwidth classification

✅ connection_drop_rate -0.195908 +/- 0.127956
✅ last_minute_quality_regression_v1 -0.198618 +/- 0.124958
✅ last_minute_quality_regression_v2 -0.188115 +/- 0.138033

✅ peer_neteq_plc_cng_perc -0.359957 +/- 0.191557
✅ peer_video_freeze_percentage -0.653212 +/- 0.142822

Reliability and quality wins for cellular high bandwidth classification

✅ avg_sender_video_encode_fps 0.152003 +/- 0.046807
✅ avg_sender_video_qp -0.228167 +/- 0.041793
✅ avg_video_quality_score 0.296694 +/- 0.043079
✅ avg_video_sent_bitrate 0.430266 +/- 0.092045

Future plans for applying ML to RTC

From our project execution and experimentation on production clients, we noticed that an ML-based approach is more efficient than traditional hand-tuned rules for networking in targeting, end-to-end monitoring, and updating. However, the effectiveness of ML solutions largely depends on data quality and labeling (using simulations or production logs). By applying ML-based solutions to network prediction problems, congestion in particular, we fully leveraged the power of ML.

In the future, we will consolidate all the network characterization models into a single model using a multi-task approach, to fix the inefficiency caused by redundancy in model download, inference, and so on. We will build a shared representation model for the time series to solve different tasks (e.g., bandwidth classification, packet loss classification, etc.) in network characterization. We will also focus on building realistic production network scenarios for model training and validation. This will enable us to use ML to identify optimal network actions given the network conditions. We will continue refining our learning-based methods to improve network performance based on existing network signals.