September 11, 2024
Ran Zhang
The Airbnb Tech Blog

Airbnb’s Use of A New Flink platform developed from Apache Hadoop® Yarn

At Airbnb, Apache Flink was launched in 2018 as a supplementary answer for stream processing. It ran alongside Apache Spark™ Streaming for a number of years earlier than transitioning to turn out to be the first stream processing platform. On this weblog put up, we’ll delve into the evolution of Flink structure at Airbnb and examine our prior Hadoop Yarn platform with the present Kubernetes-based structure. Moreover, we’ll focus on the efforts undertaken all through the migration course of and discover the challenges that arose throughout this journey. Ultimately we’ll summarize the impression, learnings alongside the way in which and future plans.

The evolution of Airbnb’s streaming processing structure primarily based on Apache Flink may be categorized into three distinct phases:

Section One: Flink jobs operated on Hadoop Yarn with Apache Airflow serving because the job scheduler.

Round 2018, a number of groups at Airbnb adopted Flink as their streaming processing engine, primarily attributable to its superior low-latency capabilities in comparison with Spark Streaming. Throughout this era, Flink jobs have been working on Hadoop Yarn, and Airflow was employed because the workflow supervisor for process scheduling and dependency administration.

The collection of Airflow because the workflow supervisor was largely influenced by its widespread use in addressing numerous job scheduling wants, as there have been no different user-friendly open-source alternate options available at the moment. Every crew was answerable for dealing with their Airflow Directed Acyclic Graphs (DAGs), job supply code, and the requisite dependency JARs. Usually, Flink JAR information have been domestically constructed earlier than deployment to Amazon S3.

The structure catered to our necessities throughout that interval with a restricted vary of use instances.

From 2019 onwards, Apache Flink gained important traction at Airbnb, changing Spark Streaming as the first stream processing platform. With the scaling in utilization of Flink we encountered numerous challenges and limitations on this structure. To start with, Airflow’s batch-oriented design, counting on polling intervals, didn’t match Airbnb’s wants, and we skilled important delays in job begin and failure restoration, typically inflicting SLA violations for low-latency use instances. Airflow additionally precipitated a singleton situation as duplicate job submissions sometimes happen attributable to race circumstances amongst Airflow staff and consumer operations not following anticipated patterns. In addition to, Airflow’s Directed Acyclic Graph (DAG) construction is complicated and doesn’t perform nicely with a few of Airbnb’s streaming use instances. We additionally encountered engineering context mismatch on this structure: product engineers would possibly discover themselves unfamiliar with Apache Airflow and Hadoop, leading to a steep studying curve when organising new Apache Flink jobs.

To sort out the above technical and operational challenges, we began to discover new potentialities. Our preliminary step concerned changing Airflow with a personalized light-weight streaming job scheduler, marking the inception of Section Two.

Section Two: Flink jobs operated on Hadoop Yarn, with a light-weight streaming job scheduler.

At a excessive stage, Airflow was changed by a light-weight streaming job scheduler working on Kubernetes. The job scheduler accommodates a grasp node and a pool of employee nodes:

The grasp node is answerable for managing the metadata of all Flink jobs and guaranteeing the right life cycle of every employee node. This contains duties reminiscent of parsing user-provided job configurations, synchronizing metadata and job statuses with Apache Zookeeper™, and guaranteeing that employee nodes constantly preserve their anticipated states.

A employee node is answerable for dealing with the dependencies and life cycle of a single Flink job. Staff package deal the required dependencies, submit the Flink job to Hadoop Yarn, constantly monitor its standing, and within the occasion of a failure, it triggers a direct restart.

The Section 2 design resulted in quicker turnaround time and decreased downtime throughout job restarts. It additionally resolved single level of failure points with Zookeeper.

As utilization of Flink grew, we encountered new challenges in Section Two:

  • Lack of CI/CD: Flink builders needed to devise their very own model management methods.
  • Absence of native secrets and techniques administration: There is no such thing as a vanilla secrets and techniques administration on Hadoop Yarn.
  • Restricted useful resource and dependency isolation: Every supported Flink model needed to be manually preinstalled on the Yarn cluster. Whereas Yarn’s useful resource queues might present some stage of useful resource isolation, job-level isolation was absent.
  • Service Discovery complexity: As extra use instances have been onboarded, every doubtlessly requiring entry to varied inner Airbnb companies, configuring service entry on Yarn proved to be cumbersome. It pressured a binary selection between enabling service entry for your complete cluster or none in any respect.
  • Monitoring and debugging challenges: Managing and sustaining the logging pipeline and SSH entry grew to become non-trivial duties on a multi-tenant Yarn cluster.
  • Ongoing complexity and dependencies: Though the Flink job scheduler was light-weight in comparison with Airflow, it launched further complexities.

Section Three (present state): Flink jobs run on Kubernetes, and the job scheduler is eradicated.

Deploying Flink on Kubernetes permits direct Flink deployment on a working Kubernetes cluster. With this integration we will discover enabling environment friendly autoscaling and the Kubernetes operator to simplify the administration of Flink jobs and clusters.

Flink on Kubernetes gives a number of benefits over Hadoop Yarn addressing the above challenges:

  • Developer expertise: Standardized by integrating with the present CI/CD techniques.
  • Secrets and techniques Administration: With Flink on Kubernetes, every Flink job can securely retailer its personal secrets and techniques inside the pods. This gives a safer approach to handle delicate info.
  • Remoted Surroundings: Jobs working on Flink on Kubernetes profit from isolation at each the useful resource and dependency ranges. Every job can run on its devoted Flink model if supported by its picture, permitting for higher administration of dependencies.
  • Enhanced Monitoring: Integration with Airbnb’s pre-defined logging and metric sidecars on Kubernetes simplifies setup and improves monitoring. This permits detailed insights into particular person pods and charge limiting for logging per pod, making it simpler to trace and troubleshoot points.
  • Service Discovery: Flink jobs now adhere to Airbnb’s standardized method for service discovery, utilizing the cluster mesh. This ensures constant and dependable communication between companies.
  • Simplified SSH entry: Customers with the suitable permissions can now SSH into the Flink pod with out the necessity for an SSH tunnel. This gives larger flexibility and management over SSH permissions per job.

Moreover, we’ve noticed an growing stage of Kubernetes assist and adoption inside the Flink neighborhood, which elevated our confidence in working Flink on Kubernetes.

It’s price mentioning that Kubernetes brings its personal dangers and limitations. As an illustration, a single Flink process supervisor failover can result in the pause of your complete job course of. This could pose points in eventualities with frequent node rotations inside Kubernetes and enormous jobs deployed with a whole lot of process managers. For context, node rotation on Kubernetes is carried out to make sure the operability and stability of the cluster. It entails changing current nodes with new ones, sometimes with up to date configurations or to carry out upkeep duties, with the targets of making use of host configuration modifications, sustaining node stability and enhancing operational effectivity. As compared, node rotations on Yarn happen much less regularly, so the impression on job availability is much less important. We’ll discover how we’re mitigating these challenges within the Future Work part.

Beneath is an summary of our present structure:

To offer a greater understanding of the system, under is a deep dive of the 5 major parts, in addition to how customers work together with them when organising a brand new Flink job:

  • Job configurations: This serves as an abstraction layer over Kubernetes and CI/CD parts, offering Flink customers with a simplified interface for creating Flink utility templates. It shields customers from the complexities of the underlying Kubernetes infrastructure. Flink customers outline the core specs of their Flink job through a configuration file. This contains crucial info just like the entrypoint class identify, job parallelism, and the required ingress companies and sinks.
  • Picture administration: This element entails the pre-construction of Flink base photos, that are bundled with important dependencies required to entry Airbnb sources. These photos are saved in Amazon Elastic Container Registry and may be readily deployed with consumer Jars or additional personalized to fulfill particular consumer wants.
  • CI/CD: By introducing just a few customizations to assist Flink’s stateful deployment, we’ve built-in Flink with our current CI/CD system, offering a standardized model management and steady supply expertise. Flink jobs are deployed inside Kubernetes, every residing in its distinct namespace to make sure isolation and efficient administration.
  • Flink portal: an API service that gives important options for managing the states of Flink jobs. These options embody stopping a Flink job with a savepoint and querying accomplished checkpoints on Amazon S3. Moreover, it gives a self-service UI portal, enabling customers to watch and verify the standing of their jobs. Customers additionally achieve entry to crucial job state administration functionalities, empowering them to both provoke the job from a bootstrapped savepoint or resume it from a earlier checkpoint.
  • Flink job runtime: Every Flink job is deployed as an unbiased utility cluster on Kubernetes. To make sure fault tolerance and state storage, Zookeeper, ETCD, and Amazon S3 are utilized. Moreover, pre-configured sidecar containers accompany the Flink containers to supply assist for crucial capabilities reminiscent of logging, metrics, DNS, and extra. A service mesh is employed to facilitate communication between Flink jobs and different microservices.

Improved Developer Velocity

Onboarding Flink jobs is quicker, the place our builders famous that it takes hours as a substitute of days, and builders can focus extra on their utility logic.

Enchancment in Flink Job Availability and Latency

The structure of Flink on Kubernetes improves job availability and scheduling latency by eliminating sure parts of the Flink consumer and job scheduler present in Flink on Yarn.

Price Financial savings in Infrastructure

The streamlining of Flink infrastructure complexity and the elimination of sure parts, such because the job scheduler, have resulted in price financial savings in our infrastructure. Moreover, by working Flink jobs on a shared Kubernetes cluster at Airbnb, we might doubtlessly enhance the general price effectivity of our firm’s infrastructure.

Enchancment in Job Availability

Within the Flink world, node rotations in Kubernetes could cause job restarts and lead to downtime. Whereas Flink itself can recuperate from job restarts with out information loss, the potential downtime and availability impression could also be unfavorable for extremely latency-sensitive purposes. To handle this, there are just a few approaches we’re evaluating.

  1. Decreasing the variety of node rotations to reduce job restarts.
  2. Sooner job restoration.

Allow Job Autoscaling

With the introduction of Reactive Mode in Flink 1.13, customers can dynamically alter the parallelism of their jobs with out the necessity for a job restart. This job auto scaling characteristic can improve job stability and price effectivity. Sooner or later we might allow autoscaling for Flink Kubernetes workloads by leveraging system metrics (reminiscent of CPU utilization) and Flink metrics (reminiscent of backpressure), to find out the suitable parallelism.

Flink Kubernetes Operator

The Flink Kubernetes Operator makes use of Customized Sources and capabilities as a controller to handle your complete manufacturing lifecycle of Flink purposes. By leveraging the operator, we will streamline the operation and deployment processes for Flink jobs. It gives higher management over deployment and lifecycle of jobs, and an out of field answer for autoscaling and auto tuning.

To summarize, the migration of Airbnb’s streaming processing structure primarily based on Apache Flink from Hadoop Yarn to Kubernetes has been a big milestone in enhancing our streaming information processing capabilities. This transition has resulted in a extra streamlined and user-friendly expertise for Flink builders. By overcoming challenges that have been complicated to handle on Yarn, we now have laid the muse for extra environment friendly and efficient streaming information processing.

As we glance forward, we’re dedicated to additional refining our method and resolving any remaining challenges. We’re enthusiastic concerning the ongoing development and potential of Apache Flink inside our firm, and we anticipate continued innovation and enchancment sooner or later.

If this type of work sounds interesting to you, try our open roles — we’re hiring!

The Flink on Kubernetes platform wouldn’t have been doable with out cross-functional and cross-org collaborators in addition to management assist. They embody, however are usually not restricted to: Jingwei Lu, Lengthy Zhang, Daniel Low, Weibo He, Zack Loebel-Begelman, Justin Cunningham, Adam Kocoloski, Liyin Tang and Nathan Towery.

Particular because of the broader Airbnb information neighborhood members who supplied enter or help to the implementation crew all through the design, growth, and launch phases.

We additionally wish to thank Wei Hou and Xu Zhang for his or her assist in authoring this put up throughout their time at Airbnb.

Apache Spark™, Apache Airflow™, and Apache ZooKeeper™ are logos of The Apache Software program Basis.

Apache Flink® and Apache Hadoop® are registered logos of The Apache Software program Basis.

Kubernetes® is a registered trademark of The Linux Basis.

Amazon S3 and AWS are logos of Amazon.com, Inc. or its associates.

All product names, logos, and types are property of their respective house owners. All firm, product and repair names used on this web site are for identification functions solely. Use of those names, logos, and types doesn’t suggest endorsement.