February 12, 2025
  • Data lineage is an instrumental part of Meta’s Privacy Aware Infrastructure (PAI) initiative, a suite of technologies that efficiently protect user privacy. It is a critical and powerful tool for scalable discovery of relevant data and data flows, which supports privacy controls across Meta’s systems. This allows us to verify that our users’ everyday interactions are protected across our family of apps, such as their religious views in the Facebook Dating app, the example we’ll walk through in this post.
  • In order to build high-quality data lineage, we developed different techniques to collect data flow signals across different technology stacks: static code analysis for different languages, runtime instrumentation, input and output data matching, and more. We then built an intuitive UX into our tooling that enables developers to effectively consume all of this lineage data in a systematic way, saving significant engineering time for building privacy controls.
  • As we expanded PAI across Meta, we gained valuable insights about the data lineage space. Our understanding of the privacy space evolved, revealing the need for early focus on data lineage, tooling, a cohesive ecosystem of libraries, and more. These initiatives have helped accelerate the development of data lineage and the implementation of purpose limitation controls.

At Meta, we believe that privacy enables product innovation. This belief has led us to develop Privacy Aware Infrastructure (PAI), which offers efficient and reliable first-class privacy constructs embedded in Meta infrastructure to address different privacy requirements, such as purpose limitation, which restricts the purposes for which data can be processed and used.

In this blog post, we’ll delve into an early stage in PAI implementation: data lineage. Data lineage refers to the process of tracing the journey of data as it moves through various systems, illustrating how data transitions from one data asset, such as a database table (the source asset), to another (the sink asset). We’ll also walk through how we track the lineage of users’ “religion” information in our Facebook Dating app.

Millions of data assets are vital for supporting our product ecosystem, ensuring the functionality our users expect, maintaining high product quality, and safeguarding user safety and integrity. Data lineage enables us to efficiently navigate these assets and protect user data. It enhances the traceability of data flows within systems, ultimately empowering developers to swiftly implement privacy controls and create innovative products.

Note that data lineage depends on having already completed important and complex preliminary steps to inventory, schematize, and annotate data assets into a unified asset catalog. This took Meta multiple years to complete across our millions of disparate data assets, and we’ll cover each of these more deeply in future blog posts:

  • Inventorying involves collecting various code and data assets (e.g., web endpoints, data tables, AI models) used across Meta.
  • Schematization expresses data assets in structural detail (e.g., indicating that a data asset has a field called “religion”).
  • Annotation labels data to describe its content (e.g., specifying that the identity column contains religion data).

Understanding data lineage at Meta

To establish robust privacy controls, an essential part of our PAI initiative is to understand how data flows across different systems. Data lineage is part of this discovery step in the PAI workflow, as shown in the following diagram:

Data lineage is a key precursor to implementing Policy Zones, our information flow control technology, because it answers the question, “Where does my data come from and where does it go?”, helping inform the right places to apply privacy controls. In conjunction with Policy Zones, data lineage provides the following key benefits to thousands of developers at Meta:

  • Scalable data flow discovery: Data lineage answers the question above by providing an end-to-end, scalable graph of relevant data flows. We can leverage the lineage graphs to visualize and explain the flow of relevant data from the point where it is collected to all the places where it is processed.
  • Efficient rollout of privacy controls: By leveraging data lineage to track data flows, we can easily pinpoint the optimal integration points for privacy controls like Policy Zones within the codebase, streamlining the rollout process. To that end, we have developed a powerful flow discovery tool as part of our PAI tool suite, Policy Zone Manager (PZM), based on data lineage. PZM enables developers to rapidly identify multiple downstream assets from a set of sources simultaneously, thereby accelerating the rollout of privacy controls.
  • Continuous compliance verification: Once the privacy requirement has been fully implemented, data lineage plays a vital role in monitoring and validating data flows continuously, in addition to enforcement mechanisms such as Policy Zones.

Traditionally, data lineage has been collected via code inspection using manually authored data flow diagrams and spreadsheets. However, this approach does not scale in large and dynamic environments, such as Meta, with billions of lines of continuously evolving code. To tackle this challenge, we’ve developed a robust and scalable lineage solution that uses static code analysis signals as well as runtime signals.

Walkthrough: Implementing data lineage for religion data

We’ll share how we have automated lineage tracking to identify religion data flows through our core systems, eventually creating an end-to-end, precise view of downstream religion assets being protected, via the following two key stages:

  1. Collecting data flow signals: a process to capture data flow signals from many processing activities across different systems, not only for religion but for all other types of data, to create an end-to-end lineage graph.
  2. Identifying relevant data flows: a process to identify the specific subset of data flows (“subgraph”) within the lineage graph that pertains to religion.

These stages propagate through various systems, including function-based systems that load, process, and propagate data through stacks of function calls in different programming languages (e.g., Hack, C++, Python), such as web systems and backend services, and batch-processing systems that process data rows in batch (primarily via SQL), such as data warehouse and AI systems.

For simplicity, we’ll demonstrate these for the web, the data warehouse, and AI, per the diagram below.

Collecting data flow signals for the web system

When setting up a profile on the Facebook Dating app, people can populate their religious views. This information is then used to identify relevant matches with other people who have specified matching values in their dating preferences. On Dating, religious views are subject to purpose limitation requirements; for example, they will not be used to personalize experiences on other Facebook Products.

We start with someone entering their religion information on their dating profile using their mobile device, which is then transmitted to a web endpoint. The web endpoint subsequently logs the data into a logging table and stores it in a database, as depicted in the following code snippet:
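
As a rough illustration of the kind of endpoint described here, the following Python sketch receives a religion value, logs it, and stores it. It is hypothetical: `InMemorySink`, `update_dating_profile`, and the sink names are assumptions made for illustration and do not reflect Meta’s actual code or APIs.

```python
from dataclasses import dataclass, field


@dataclass
class InMemorySink:
    """Stand-in for a logging table or database; it records every write."""
    name: str
    rows: list = field(default_factory=list)

    def write(self, row: dict) -> None:
        self.rows.append(dict(row))


# Hypothetical sinks; the names are illustrative only.
dating_profile_log = InMemorySink("dating_profile_log_tbl")
dating_profile_db = InMemorySink("dating_profile_db")


def update_dating_profile(request: dict) -> dict:
    """Endpoint handler: receive religion from the client, log it, store it."""
    user_id = request["user_id"]
    religion = request["religion"]  # source: the request payload
    row = {"user_id": user_id, "religion": religion}
    dating_profile_log.write(row)   # sink 1: logging table
    dating_profile_db.write(row)    # sink 2: database
    return {"status": "ok"}
```

The two `write` calls are the places where a lineage system would need to observe that the request parameter flows into a logger and a database.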

Now let’s see how we collect lineage signals. To do this, we need to employ both static and runtime analysis tools to effectively discover data flows, particularly focusing on where religion is logged and stored. By combining static and runtime analysis, we enhance our ability to accurately track and manage data flows.

Static analysis tools simulate code execution to map out data flows within our systems. They also emit quality signals to indicate the confidence that a data flow signal is a true positive. However, these tools are limited by their lack of access to runtime data, which can lead to false positives from unexecuted code.

To address this limitation, we utilize Privacy Probes, a key component of our PAI lineage technologies. Privacy Probes automate data flow discovery by collecting runtime signals. These signals are gathered in real time during the execution of requests, allowing us to trace the flow of data into loggers, databases, and other services.

We have instrumented Meta’s core data frameworks and libraries at both the data origin points (sources) and their eventual outputs (sinks), such as the logging framework, which allows for comprehensive data flow tracking. This approach is exemplified in the following code snippet:
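
One way to picture source and sink instrumentation is the minimal Python sketch below. It is not the Privacy Probes implementation: `probed_request`, `probe`, `read_request_param`, and `log_row` are hypothetical names, and the per-request buffer and sampling scheme are assumptions chosen to keep the example small.

```python
import random
import time
from contextlib import contextmanager

# Capture buffer for the request in flight (None when the request is not sampled).
_current_request = None


@contextmanager
def probed_request(sample_rate: float = 1.0, rng=random.random):
    """Open a capture buffer for one request, on a sampled basis."""
    global _current_request
    sampled = rng() < sample_rate
    _current_request = {"sources": [], "sinks": []} if sampled else None
    try:
        yield _current_request
    finally:
        _current_request = None


def probe(kind: str, asset_id: str, payload) -> None:
    """Record a source or sink payload together with basic metadata."""
    if _current_request is None:
        return
    _current_request[kind].append(
        {"asset": asset_id, "payload": payload, "ts": time.time()}
    )


def read_request_param(endpoint: str, params: dict, key: str):
    """Instrumented source: reading a parameter also records its payload."""
    value = params[key]
    probe("sources", f"{endpoint}:{key}", value)
    return value


def log_row(table: str, row: dict) -> None:
    """Instrumented sink: the logging call also records what was written."""
    probe("sinks", table, row)
    # The real write to the logging framework would happen here.
```

Because the probes live inside the shared read and write helpers, application code calling those helpers would not need any lineage-specific changes.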


During runtime execution, Privacy Probes does the following:

  1. Capturing payloads: It captures source and sink payloads in memory on a sampled basis, along with supplementary metadata such as event timestamps, asset identifiers, and stack traces as evidence for the data flow.
  2. Comparing payloads: It then compares the source and sink payloads within a request to identify data matches, which helps in understanding how data flows through the system.
  3. Categorizing results: It categorizes results into two sets. The match-set includes pairs of source and sink assets where the data matches exactly or one is contained in the other, therefore providing high-confidence evidence of data flow between the assets. The full-set includes all source and sink pairs within a request, regardless of whether the sink is tainted by the source. The full-set is a superset of the match-set with some noise, but it is still important to send to human reviewers since it may contain transformed data flows.

The above procedure is depicted in the diagram below:

Let’s look at the following examples, where various religions are received in an endpoint and various values (copied or transformed) are logged in three different loggers:

| Input Value (source) | Output Value (sink) | Data Operation | Match Result | Flow Confidence |
|---|---|---|---|---|
| “Atheist” | “Atheist” | Data Copy | EXACT_MATCH | HIGH |
| “Buddhist” | metadata: religion: Buddhist | Substring | CONTAINS | HIGH |
| religions: [“Catholic”, “Christian”] | count: 2 | Transformed | NO_MATCH | LOW |


In the examples above, the first two rows show a precise match of religions in the source and the sink values, thus belonging to the high-confidence match-set. The third row depicts a transformed data flow where the input string value is transformed to a count of values before being logged, so it belongs only to the full-set.
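
The comparison and categorization steps can be sketched in a few lines of Python. This is an illustrative approximation only: `classify_match` and `categorize` are hypothetical names, and serializing structured payloads to JSON text before comparing them is an assumption, not a description of how Privacy Probes compares payloads.

```python
import json


def _as_text(value) -> str:
    """Compare strings directly; serialize structured payloads to JSON text."""
    return value if isinstance(value, str) else json.dumps(value)


def classify_match(source_value, sink_value):
    """Return (match_result, flow_confidence) for one source/sink pair."""
    src, snk = _as_text(source_value), _as_text(sink_value)
    if not src or not snk:
        return ("NO_MATCH", "LOW")      # empty payloads carry no evidence
    if src == snk:
        return ("EXACT_MATCH", "HIGH")
    if src in snk or snk in src:
        return ("CONTAINS", "HIGH")     # one payload is contained in the other
    return ("NO_MATCH", "LOW")


def categorize(pairs):
    """Split (source_asset, source_value, sink_asset, sink_value) pairs.

    Returns (match_set, full_set); the full-set holds every pair in the
    request, and the match-set holds only the high-confidence ones.
    """
    match_set, full_set = [], []
    for source_asset, source_value, sink_asset, sink_value in pairs:
        result, confidence = classify_match(source_value, sink_value)
        record = (source_asset, sink_asset, result, confidence)
        full_set.append(record)
        if result != "NO_MATCH":
            match_set.append(record)
    return match_set, full_set
```

Applied to the three example rows, the first two pairs land in both sets, while the count-transformed pair appears only in the full-set, which is why that set is still sent to reviewers.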

These signals collectively are used to construct a lineage graph to understand the flow of data through our web system, as shown in the following diagram:

Collecting data flow signals for the data warehouse system

With the user’s religion logged in our web system, it can propagate to the data warehouse for offline processing. To gather data flow signals, we employ a combination of runtime instrumentation and static code analysis, in a different way from the web system. The SQL queries involved are logged for data processing activities by the Presto and Spark compute engines (among others). Static analysis is then performed on the logged SQL queries and job configs in order to extract data flow signals.

Let’s examine a simple SQL query example that processes data for the data warehouse:


We’ve developed a SQL analyzer to extract data flow signals between the input table, “safety_log_tbl”, and the output table, “safety_training_tbl”, as shown in the following diagram. In practice, we also collect more granular lineage, such as at the column level (e.g., “user_id” -> “target_user_id”, “religion” -> “target_religion”).
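
To make the idea concrete, here is a deliberately simplified Python sketch that pulls table-level and column-level lineage out of a single `INSERT ... SELECT` statement. The query text is a hypothetical reconstruction built from the table and column names mentioned above, `extract_lineage` is an assumed name, and the regular expressions only handle this simple shape; a production analyzer would rely on a full SQL parser.

```python
import re

# Hypothetical query, assembled from the table and column names in the text.
SQL = """
INSERT INTO safety_training_tbl
SELECT user_id AS target_user_id, religion AS target_religion
FROM safety_log_tbl
"""


def extract_lineage(sql: str) -> dict:
    """Extract (source, sink) table and column edges from INSERT ... SELECT."""
    flags = re.IGNORECASE | re.DOTALL
    sink = re.search(r"INSERT\s+INTO\s+(\w+)", sql, flags).group(1)
    source = re.search(r"\bFROM\s+(\w+)", sql, flags).group(1)
    select_list = re.search(r"SELECT\s+(.*?)\s+FROM\b", sql, flags).group(1)

    column_edges = []
    for item in select_list.split(","):
        match = re.fullmatch(r"\s*(\w+)(?:\s+AS\s+(\w+))?\s*", item, re.IGNORECASE)
        if match:
            # An unaliased column keeps its name in the output table.
            column_edges.append((match.group(1), match.group(2) or match.group(1)))
    return {"table_edge": (source, sink), "column_edges": column_edges}
```

Calling `extract_lineage(SQL)` yields one table edge from `safety_log_tbl` to `safety_training_tbl` plus the two column mappings.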

There are instances where data is not fully processed by SQL queries, resulting in logs that contain data flow signals for either reads or writes, but not both. To ensure we have complete lineage data, we leverage contextual information (such as execution environments and job or trace IDs) collected at runtime to connect these reads and writes together.
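
A small sketch of this stitching step, assuming each logged signal carries an operation type, an asset name, and a job ID. The record layout and the `stitch_reads_and_writes` name are hypothetical, and pairing every read with every write in the same job is a simplification that may over-approximate the true flows.

```python
from collections import defaultdict


def stitch_reads_and_writes(signals):
    """Connect read-only and write-only signals that share a job ID.

    Each signal is a dict such as
    {"op": "read", "asset": "safety_log_tbl", "job_id": "job-1"}.
    Returns a set of (source_asset, sink_asset) edges.
    """
    reads, writes = defaultdict(set), defaultdict(set)
    for signal in signals:
        bucket = reads if signal["op"] == "read" else writes
        bucket[signal["job_id"]].add(signal["asset"])

    edges = set()
    for job_id, sources in reads.items():
        for source in sources:
            for sink in writes.get(job_id, ()):
                edges.add((source, sink))
    return edges
```

Reads with no matching write in the same job (and vice versa) produce no edge, so incomplete context degrades to missing lineage rather than spurious lineage.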

The following diagram illustrates how the lineage graph has expanded:

Collecting data flow signals for the AI system

For our AI systems, we collect lineage signals by tracking relationships between various assets, such as input datasets, features, models, workflows, and inferences. A common approach is to extract data flows from job configurations used for different AI activities such as model training.

For instance, in order to improve the relevance of dating matches, we use an AI model to recommend potential matches based on shared religious views from users. Let’s take a look at the following training config example for this model that uses religion data:

By parsing this config obtained from the model training service, we can track the data flow from the input dataset (with asset ID asset://hive.table/dating_training_tbl) and feature (with asset ID asset://ai.feature/DATING_USER_RELIGION_SCORE) to the model (with asset ID asset://ai.model/dating_ranking_model).
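
The parsing step might look like the Python sketch below. The shape of `training_config` is an assumption made for illustration (only the three asset IDs come from the text), and `lineage_edges_from_config` is a hypothetical helper rather than part of any Meta service.

```python
# Hypothetical training config; only the asset IDs are taken from the text.
training_config = {
    "model": "asset://ai.model/dating_ranking_model",
    "inputs": {
        "datasets": ["asset://hive.table/dating_training_tbl"],
        "features": ["asset://ai.feature/DATING_USER_RELIGION_SCORE"],
    },
}


def lineage_edges_from_config(config: dict) -> list:
    """Emit one (input_asset, model_asset) edge per dataset and feature."""
    model = config["model"]
    inputs = config.get("inputs", {})
    edges = []
    for kind in ("datasets", "features"):
        for asset in inputs.get(kind, []):
            edges.append((asset, model))
    return edges
```

Each returned edge is a data flow signal of the same form as those collected from the web and warehouse systems, so it can be merged into the same lineage graph.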

Our AI systems are also instrumented so that asset relationships and data flow signals are captured at various points at runtime, including data-loading layers (e.g., DPP) and libraries (e.g., PyTorch), workflow engines (e.g., FBLearner Flow), training frameworks, and inference systems (as backend services). Lineage collection for backend services uses the approach for function-based systems described above. By matching the source and sink assets for different data flow signals, we are able to capture a holistic lineage graph at the desired granularities:

Identifying relevant data flows from a lineage graph

Now that we have the lineage graph at our disposal, how can we effectively distill a subset of data flows pertinent to a specific privacy requirement for religion data? To address this question, we have developed an iterative analysis tool that enables developers to pinpoint precise data flows and systematically filter out irrelevant ones. The tool kicks off a repetitive discovery process, aided by the lineage graph and privacy controls from Policy Zones, to narrow down the most relevant flows. This refined data allows developers to make a final determination about the flows they would like to use, producing an optimal path for traversing the lineage graph. The following are the major steps involved, captured holistically in the diagram below:

  1. Discover data flows: identify data flows from source assets and stop at downstream assets with low-confidence flows (yellow nodes).
  2. Exclude and include candidates: Developers or automated heuristics exclude candidates (red nodes) that don’t have religion data and include the remaining ones (green nodes). Excluding the red nodes early on also excludes all of their downstream assets in a cascaded manner, which saves significant developer effort. As an additional safeguard, developers also implement privacy controls via Policy Zones, so all relevant data flows can be captured.
  3. Repeat discovery cycle: use the green nodes as new sources and repeat the cycle until no more green nodes are confirmed.
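
The three steps above can be sketched as a loop over the lineage graph. This is one possible interpretation rather than the tool’s actual algorithm: the graph encoding, the `is_relevant` callback standing in for developer review or automated heuristics, and the choice to follow high-confidence edges automatically are all assumptions.

```python
def discover_relevant_assets(graph, sources, is_relevant):
    """Iteratively find assets that carry the data of interest.

    graph: {asset: [(downstream_asset, "HIGH" | "LOW"), ...]}
    is_relevant: callback deciding whether a low-confidence candidate is kept.
    Returns (confirmed, excluded) sets of assets.
    """
    confirmed, excluded = set(sources), set()
    frontier = list(sources)
    while frontier:
        # Step 1: follow high-confidence flows; stop at low-confidence ones.
        candidates, stack = set(), list(frontier)
        while stack:
            node = stack.pop()
            for child, confidence in graph.get(node, []):
                if child in confirmed or child in excluded:
                    continue
                if confidence == "HIGH":
                    confirmed.add(child)
                    stack.append(child)
                else:
                    candidates.add(child)   # "yellow" node awaiting review
        # Step 2: review candidates; excluded nodes are never expanded,
        # so everything downstream of them is pruned as well.
        frontier = []
        for candidate in sorted(candidates):
            if candidate in confirmed:
                continue
            if is_relevant(candidate):
                confirmed.add(candidate)    # "green" node
                frontier.append(candidate)
            else:
                excluded.add(candidate)     # "red" node
        # Step 3: the loop repeats with the newly confirmed nodes as sources.
    return confirmed, excluded
```

Because excluded candidates are never used as sources, none of their downstream assets are visited, which mirrors the cascaded exclusion described in step 2.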

With the collection and data flow identification steps complete, developers are able to successfully locate granular data flows that contain religion across Meta’s complex systems, allowing them to move forward in the PAI workflow to apply the necessary privacy controls to safeguard the data. This once-intimidating task has been completed efficiently.

Our data lineage technology has provided developers with an unprecedented ability to quickly understand and protect religion and similar sensitive data flows. It enables Meta to scalably and efficiently implement privacy controls via PAI to protect our users’ privacy and deliver products safely.

Learnings and challenges

As we’ve worked to develop and implement lineage as a core PAI technology, we’ve gained valuable insights and overcome significant challenges, yielding some important lessons:

  • Focus on lineage early and reap the rewards: As we developed privacy technologies like Policy Zones, it became clear that gaining a deep understanding of data flows across various systems is essential for scaling the implementation of privacy controls. By investing in lineage, we not only accelerated the adoption of Policy Zones but also uncovered new opportunities for applying the technology. Lineage can also be extended to other use cases such as security and integrity.
  • Build lineage consumption tools to gain engineering efficiency: We initially focused on building a lineage solution but didn’t give sufficient attention to consumption tools for developers. As a result, owners had to use raw lineage signals to discover relevant data flows, which was overwhelmingly complex. We addressed this issue by developing the iterative tooling to guide engineers in discovering relevant data flows, reducing engineering effort by orders of magnitude.
  • Integrate lineage with systems to scale the coverage: Collecting lineage from diverse Meta systems was a significant challenge. Initially, we tried to ask every system to collect lineage signals to ingest into the centralized lineage service, but progress was slow. We overcame this by developing reliable, computationally efficient, and widely applicable PAI libraries with built-in lineage collection logic in various programming languages (Hack, C++, Python, etc.). This enabled much smoother integration with a broad range of Meta’s systems.
  • Measurement improves our outcomes: By incorporating the measurement of coverage, we’ve been able to evolve our data lineage so that we stay ahead of the ever-changing landscape of data and code at Meta. By enhancing our signals and adapting to new technologies, we can maintain a strong focus on privacy outcomes and drive ongoing improvements in lineage coverage across our tech stacks.

The future of data lineage

Data lineage is a vital component of Meta’s PAI initiative, providing a comprehensive view of how data flows across different systems. While we’ve made significant progress in establishing a strong foundation, our journey is ongoing. We’re committed to:

  • Expanding coverage: continuously enhance the coverage of our data lineage capabilities to ensure a comprehensive understanding of data flows.
  • Improving consumption experience: streamline the consumption experience to make it easier for developers and stakeholders to access and utilize data lineage information.
  • Exploring new frontiers: investigate new applications and use cases for data lineage, driving innovation and collaboration across the industry.

By advancing data lineage, we aim to foster a culture of privacy awareness and drive progress in the broader fields of study. Together, we can create a more transparent and accountable data ecosystem.

Acknowledgements

The authors would like to acknowledge the contributions of many current and former Meta employees who have played a crucial role in developing data lineage technologies over the years. In particular, we would like to extend special thanks to (in alphabetical order) Amit Jain, Aygun Aydin, Ben Zhang, Brian Romanko, Brian Spanton, Daniel Ramagem, David Molnar, Dzmitry Charnahalau, Gayathri Aiyer, George Stasa, Guoqiang Jerry Chen, Graham Bleaney, Haiyang Han, Howard Cheng, Ian Carmichael, Ibrahim Mohamed, Jerry Pan, Jiang Wu, Jonathan Bergeron, Joanna Jiang, Jun Fang, Kiran Badam, Komal Mangtani, Kyle Huang, Maharshi Jha, Manuel Fahndrich, Marc Celani, Lei Zhang, Mark Vismonte, Perry Stoll, Pritesh Shah, Qi Zhou, Rajesh Nishtala, Rituraj Kirti, Seth Silverman, Shelton Jiang, Sushaant Mujoo, Vlad Fedorov, Yi Huang, Xinbo Gao, and Zhaohui Zhang. We would also like to express our gratitude to all reviewers of this post, including (in alphabetical order) Aleksandar Ilic, Avtar Brar, Benjamin Renard, Bogdan Shubravyi, Brianna O’Steen, Chris Wiltz, Daniel Chamberlain, Hannes Roth, Imogen Barnes, Jason Hendrickson, Koosh Orandi, Rituraj Kirti, and Xenia Habekoss. We would like to especially thank Jonathan Bergeron for overseeing the effort and providing all of the guidance and valuable feedback, Supriya Anand for leading the editorial effort to shape the blog content, and Katherine Bates for pulling all required support together to make this blog post happen.