May 18, 2024
  • Programs and utility logs play a key function in operations, observability, and debugging workflows at Meta.
  • Logarithm is a hosted, serverless, multitenant service, used solely internally at Meta, that consumes and indexes these logs and gives an interactive question interface to retrieve and look at logs.
  • On this put up, we current the design behind Logarithm, and present the way it powers AI coaching debugging use instances.

Logarithm indexes 100+GB/s of logs in actual time, and hundreds of queries a second. We designed the system to help service-level ensures on log freshness, completeness, sturdiness, question latency, and question outcome completeness. Customers can emit logs utilizing their alternative of logging library (the frequent library at Meta is the Google Logging Library [glog]). Customers can question utilizing common expressions on log strains, arbitrary metadata fields connected to logs, and throughout log information of hosts and companies.

Logarithm is written in C++20 and the codebase follows trendy C++ patterns, together with coroutines and async execution. This has supported each efficiency and maintainability, and helped the crew transfer quick – growing Logarithm in simply three years.

Logarithm’s information mannequin

Logarithm represents logs as a named log stream of (host-local) time-ordered sequences of immutable unstructured textual content, akin to a single log file. A course of can emit a number of log streams (stdout, stderr, and customized log information). Every log line can have zero or extra metadata key-value pairs connected to it. A standard instance of metadata is rank ID in machine studying (ML) coaching, when a number of sequences of log strains are multiplexed right into a single log stream (e.g., in PyTorch).

Logarithm helps typed buildings in two methods – through typed APIs (ints, floats, and strings), and extraction from a log line utilizing regex-based parse-and-extract guidelines – a typical instance is metrics of tensors in ML mannequin logging. The extracted key-value pairs are added to the log line’s metadata.

Determine 1: Logarithm information mannequin. The bins on textual content symbolize typed buildings.

AI coaching debugging with Logarithm

Earlier than taking a look at Logarithm’s internals, we current help for coaching methods and mannequin difficulty debugging, one of many distinguished use instances of Logarithm at Meta. ML mannequin coaching workflows are likely to have a variety of failure modes, spanning information inputs, mannequin code and hyperparameters, and methods elements (e.g., PyTorch, information readers, checkpointing, framework code, and {hardware}). Additional, failure root causes evolve over time sooner than conventional service architectures because of rapidly-evolving workloads, from scale to architectures to sharding and optimizations. With a purpose to triage such a dynamic nature of failures, it’s crucial to gather detailed telemetry on the methods and mannequin telemetry.

Since coaching jobs run for prolonged durations of time, coaching methods and mannequin telemetry and state must be repeatedly captured so as to have the ability to debug a failure with out reproducing the failure with extra logging (which is probably not deterministic and wastes GPU assets).

Given the size of coaching jobs, methods and mannequin telemetry are usually detailed and really high-throughput – logs are comparatively low-cost to put in writing (e.g., in comparison with metrics, relational tables, and traces) and have the knowledge content material to energy debugging use instances.

We stream, index and question high-throughput logs from methods and mannequin layers utilizing Logarithm.

Logarithm ingests each methods logs from the coaching stack, and mannequin telemetry from coaching jobs that the stack executes. In our setup, every host runs a number of PyTorch ranks (processes), one per GPU, and the processes write their output streams to a single log file. Debugging distributed job failures results in ambiguity because of lack of rank info in log strains, and including it implies that we modify all logging websites (together with third-party code). With the Logarithm metadata API, course of context resembling rank ID is connected to each log line – the API provides it into thread-local context and attaches a glog handler.

We added UI instruments to allow frequent log-based interactive debugging primitives. The next figures present screenshots of two such options (on high of Logarithm’s filtering operations).

Filter–by-callsite allows hiding recognized log strains or verbose/noisy logging websites when strolling by a log stream. Strolling by a number of log streams side-by-side allows discovering rank state that’s totally different from different ranks (i.e., extra strains or lacking strains), which usually is a symptom or root trigger. That is straight a results of the only program, a number of information nature of manufacturing coaching jobs, the place each rank iterates on information batches with the identical code (with batch-level boundaries).

Determine 2: Logarithm UI options for coaching methods debugging (Logs proven are for demonstration functions).

Logarithm ingests steady mannequin telemetry and abstract statistics that span mannequin enter and output tensors, mannequin properties (e.g., studying charge), mannequin inner state tensors (e.g., neuron activations) and gradients throughout coaching. This powers dwell coaching mannequin monitoring dashboards resembling an inner deployment of TensorBoard, and is utilized by ML engineers to debug mannequin convergence points and coaching failures (because of gradient/loss explosions) utilizing notebooks on uncooked telemetry.

Mannequin telemetry tends to be iteration-based tensor timeseries with dimensions (e.g., mannequin structure, neuron, or module names), and tends to be high-volume and high-throughput (which makes low-cost ingestion in Logarithm a pure alternative). Collocating methods and mannequin telemetry allows debugging points that cascade from one layer to the opposite. The mannequin telemetry APIs internally write timeseries and dimensions as typed key-value pairs utilizing the Logarithm metadata API. Multimodal information (e.g., pictures) are captured as references to information written to an exterior blob retailer.

Mannequin telemetry dashboards sometimes are usually numerous timeseries visualizations organized in a grid – this allows ML engineers to eyeball spatial and temporal dynamics of the mannequin exterior and inner state over time and discover anomalies and correlation construction. A single dashboard therefore must get a considerably giant variety of timeseries and their tensors. With a purpose to render at interactive latencies, dashboards batch and fan out queries to Logarithm utilizing the streaming API. The streaming API returns outcomes with random ordering, which allows dashboards to incrementally render all plots in parallel – inside 100s of milliseconds to the primary set of samples and inside seconds to the total set of factors.

Determine 3: TensorBoard mannequin telemetry dashboard powered by Logarithm. Renders 722 metric time sequence directly (whole of 450k samples).

Logarithm’s system structure

Our purpose behind Logarithm is to construct a extremely scalable and fault tolerant system that helps high-throughput ingestion and interactive question latencies; and gives robust ensures on availability, sturdiness, freshness, completeness, and question latency.

Determine 4: Logarithm’s system structure.

At a excessive stage, Logarithm contains the next elements:

  1. Software processes emit logs utilizing logging APIs. The APIs help emitting unstructured log strains together with typed metadata key-value pairs (per-line).
  2. A bunch-side agent discovers the format of strains and parses strains for frequent fields, resembling timestamp, severity, course of ID, and callsite.
  3. The ensuing object is buffered and written to a distributed queue (for that log stream) that gives sturdiness ensures with days of object lifetime.
  4. Ingestion clusters learn objects from queues, and help extra parsing primarily based on any user-defined regex extraction guidelines – the extracted key-value pairs are written to the road’s metadata.
  5. Question clusters help interactive and bulk queries on a number of log streams with predicate filters on log textual content and metadata.

Logarithm shops locality of knowledge blocks in a central locality service. We implement this on a hosted, extremely partitioned and replicated assortment of MySQL situations. Each block that’s generated at ingestion clusters is written as a set of locality rows (one for every log stream within the block) to a deterministic shard, and reads are distributed throughout replicas for a shard. For scalability, we don’t use distributed transactions because the workload is append-only. Observe that because the ingestion processing throughout log streams isn’t coordinated by design (for scalability), federated queries throughout log streams might not return the identical last-logged timestamps between log streams.

Our design decisions focus on layering storage, question, and log analytics and ease in state distribution. We design for 2 frequent properties of logs: they’re written greater than queried, and up to date logs are usually queried greater than older ones.

Design choices

Logarithm shops logs as blocks of textual content and metadata and maintains secondary indices to help low latency lookups on textual content and/or metadata. Since logs quickly lose question probability with time, Logarithm tiers the storage of logs and secondary indices throughout bodily reminiscence, native SSD, and a distant sturdy and extremely out there blob storage service (at Meta we use Manifold). Along with secondary indices, tiering additionally ensures the bottom latencies for probably the most accessed (current) logs.

Light-weight disaggregated secondary indices. Sustaining secondary indices on disaggregated blob storage magnifies information lookup prices at question time. Logarithm’s secondary indices are designed to be light-weight, utilizing Bloom filters. The Bloom filters are prefetched (or loaded on-query) right into a distributed cache on the question clusters when blocks are printed on disaggregated storage, to cover community latencies on index lookups. We later added help for information blocks within the question cache when executing a question. The system tries to collocate information from the identical log stream with the intention to cut back fan outs and stragglers throughout question processing. The logs and metadata are carried out as ORC information. The Bloom filters presently index log stream locality and metadata key-value info (i.e., min-max values and Bloom filters for every column of ORC stripes).

Logarithm separates compute (ingestion and question) and storage to quickly scale out the quantity of log blocks and secondary indices. The exception to that is the in-memory memtable on ingestion clusters that buffer time-ordered lists of log streams, which is a staging space for each writes and reads. The memtable is a bounded per-log stream buffer of the newest and lengthy sufficient time window of logs which might be anticipated to be queried. The ingestion implementation is designed to be I/O-bound and never compute or host reminiscence bandwidth-heavy to deal with near GB/s of per-host ingestion streaming. To reduce memtable rivalry, we implement a number of memtables, for staging, and an immutable prior model for serializing to disk. Ingestion implementation follows zero-copy semantics.

Equally, Logarithm separates ingestion and question assets to make sure bulk processing (ingestion) and interactive workloads don’t affect one another. Observe that Logarithm’s design makes use of schema-on-write, however the information mannequin and parsing computation is distributed between the logging hosts (which scales ingestion compute), and optionally, the ingestion clusters (for user-defined parsing). Clients can add extra anticipated capability for storage (e.g., elevated retention limits), ingestion and question workloads.

Logarithm pushes down distributed state upkeep to disaggregated storage layers (as an alternative of replicating compute at ingestion layer). The disaggregated storage in Manifold makes use of read-write quorums to supply robust consistency, sturdiness and availability ensures. The distributed queues in Scribe use LogDevice for sustaining objects as a sturdy replicated log. This simplifies ingestion and question tier fault tolerance. Ingestion nodes stream serialized objects on native SSDs to Manifold in 20-min. epochs, and checkpoint Scribe offsets on Manifold. When a failed ingestion node is changed, the brand new node downloads the final epoch of knowledge from Manifold, and restarts ingesting uncooked logs from the final Scribe checkpoint.

Ingestion elasticity. The Logarithm management aircraft (primarily based on Shard Supervisor) tracks ingestion node well being and log stream shard-level hotspots, and relocates shards to different nodes when it finds points or load. When there is a rise in logs written in a log stream, the management aircraft scales out the shard rely and allocates new shards on ingestion nodes with out there assets. The system is designed to supply useful resource isolation at ingestion-time between log streams. If there’s a vital surge in very brief timescales, the distributed queues in Scribe take up the spikes, however when the queues are full, the log stream can lose logs (till elasticity mechanisms enhance shard counts). Such spikes sometimes are likely to outcome from logging bugs (e.g., verbosity) in utility code.

Question processing. Queries are routed randomly throughout the question clusters. When a question node receives a request, it assumes the function of an aggregator and partitions the request throughout a bounded subset of question cluster nodes (balancing between cluster load and question latency). The aggregator pushes down filter and type operators to question nodes and returns sorted outcomes (an end-to-end blocking operation). The question nodes learn their partitions of logs by wanting up locality, adopted by secondary indices and information blocks – the learn can span the question cache, ingestion nodes (for most up-to-date logs) and disaggregated storage. We added 2x replication of the question cache to help question cluster load distribution and quick failover (with out ready for cache shard motion). Logarithm additionally gives a streaming question API with randomized and incremental sampling that returns filtered logs (an end-to-end non-blocking operation) for lower-latency reads and time-to-first-log. Logarithm paginates outcome units.

Logarithm can tradeoff question outcome completeness or ordering to take care of question latency (and flag to the consumer when it does so). For instance, this may be the case when a partition of a question is sluggish or when the variety of blocks to be learn is simply too excessive. Within the former, it instances out and skips the straggler. Within the latter state of affairs, it begins from skipped blocks (or offsets) when processing the subsequent outcome web page. In apply, we offer ensures for each outcome completeness and question latency. That is primarily possible because the system has mechanisms to scale back the probability of root causes that result in stragglers. Logarithm additionally does question admission management at consumer or user-level.

The next figures characterize Logarithm’s mixture manufacturing efficiency and scalability throughout all log streams. They spotlight scalability on account of design decisions that make the system easier (spanning disaggregation, ingestion-query separation, indexes, and fault tolerance design). We current our manufacturing service-level targets (SLOs) over a month, that are outlined because the fraction of time they violate thresholds on availability, sturdiness (together with completeness), freshness, and question latency.

Determine 5: Logarithm’s ingestion-query scalability for the month of January 2024 (one level per day).
Determine 6: Logarithm SLOs for the month of January 2024 (one level per day).

Logarithm helps robust safety and privateness ensures. Entry management may be enforced on a per-log line granularity at ingestion and query-time. Log streams can have configurable retention home windows with line-level deletion operations.

Subsequent steps

Over the previous couple of years, a number of use instances have been constructed over the foundational log primitives that Logarithm implements. Programs resembling relational algebra on structured information and log analytics are being layered on high with Logarithm’s question latency ensures – utilizing pushdowns of search-filter-sort and federated retrieval operations. Logarithm helps a local UI for interactive log exploration, search, and filtering to help debugging use instances. This UI is embedded as a widget in service consoles throughout Meta companies. Logarithm additionally helps a CLI for bulk obtain of service logs for scripting analyses.

The Logarithm design has centered round simplicity for scalability ensures. We’re repeatedly constructing domain-specific and agnostic log analytics capabilities inside or layered on Logarithm with applicable pushdowns for efficiency optimizations. We proceed to put money into storage and query-time enhancements, resembling light-weight disaggregated inverted indices for textual content search, storage layouts optimized for queries and distributed debugging UI primitives for AI methods.

Acknowledgements

We thank Logarithm crew’s present and previous members, significantly our leads: Amir Alon, Stavros Harizopoulos, Rukmani Ravisundaram, Laurynas Sukys, Hanwen Zhang; and our management: Vinay Perneti, Shah Rahman, Nikhilesh Reddy, Gautam Shanbhag, Girish Vaitheeswaran, and Yogesh Upadhay. Thanks to our companions and clients: Sergey Anpilov, Jenya (Eugene) Lee, Aravind Ram,  Vikram Srivastava, and Mik Vyatskov.