April 24, 2024

Liang Ma | Software Engineer, Core Eng; Wei Zhu | Software Engineer, Observability

In early 2020, during a critical iOS out-of-memory incident (we have a blog post about that), we realized that we didn’t have much visibility into how the app was running, or a good system to turn to for monitoring and troubleshooting.

At that time, on the client side, developers had a few ways of logging in their daily work:

  • Context logging: built for logging and reporting impressions or anything related to business, and thus a time-critical, first-class endpoint. Developers have to explicitly define keys, which would otherwise be rejected by the endpoint. Some companies call it “analytics logging.”
  • Misc: logging to a local file on disk, or even logging to a crash monitoring service as an error type.

The problems are:

  • Not all logs fall into these categories, and people often abuse certain types of logging
  • None of these tools provide a good way to visualize or aggregate. For example, developers have to make code changes to populate information like “what the metric looks like on app version A, on device B, and under network type C”
  • There isn’t a system that can easily monitor logs in real time, not to mention set up real-time alerts with log-based custom metrics.

We decided to create an end-to-end pipeline with the following characteristics:

  • It’s built with the least resistance: the log payload is schemaless and flexible, basically key-value pairs. That’s one of the reasons we call it JSON logging.
  • It offers ready-to-use logging APIs on each platform
  • Developers don’t need to touch any backend stuff
  • It’s easy to query and visualize logs
  • It performs in real time!

With these in mind, the following key design decisions were made:

  • The logging service endpoint will handle log validation, parsing, and processing.
  • Logs will be persisted in Hive, thus supporting any SQL-based queries.
  • A single, shared Kafka topic will be used for all logs going through this pipeline.
  • It’s integrated with OpenSearch (Amazon’s fork of Elasticsearch and Kibana) as a real-time visualization and query tool.
  • It will be easy to set up real-time alerting with log-based custom metrics.
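
To make the first and third decisions more concrete, here is a minimal sketch of how an endpoint might validate a batch of schemaless logs and send every accepted entry to one shared topic. It is not the actual Logservice code: the function names, the `json_logs` topic name, and the `publish` callback are all hypothetical stand-ins.

```python
import json
from typing import Any, Callable, Dict, Tuple

SHARED_TOPIC = "json_logs"  # hypothetical name for the single shared topic


def validate_entry(entry: Any) -> Dict[str, Any]:
    """Accept any JSON object with a non-empty string name and an object payload."""
    if not isinstance(entry, dict):
        raise ValueError("log entry must be a JSON object")
    name = entry.get("name")
    if not isinstance(name, str) or not name:
        raise ValueError("'name' is required")
    payload = entry.get("payload", {})
    if not isinstance(payload, dict):
        raise ValueError("'payload' must be a key-value object")
    return {"name": name, "payload": payload}


def handle_batch(body: str, publish: Callable[[str, str], None]) -> Tuple[int, int]:
    """Parse a JSON array of logs, publish the valid ones, return (accepted, rejected)."""
    entries = json.loads(body)
    if not isinstance(entries, list):
        raise ValueError("batch must be a JSON array")
    accepted = rejected = 0
    for entry in entries:
        try:
            message = validate_entry(entry)
        except ValueError:
            rejected += 1  # bad entries are skipped, the rest of the batch continues
            continue
        publish(SHARED_TOPIC, json.dumps(message))
        accepted += 1
    return accepted, rejected
```

Because the payload is only checked for being a key-value object, developers can add new keys without any backend change, which is the point of keeping the pipeline schemaless.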

High level

Flow map: Pinterest app to JSON logs batch to Logservice — a batch logging endpoint (/log) that handles perf logs, device info(Android) and new JSON log type to json messages to Singer to Pub/Sub with arrows to Logstash, Merced and Other analytics tools. Logstash goes to Open Search. Open Search goes to OpenSearch Dashboards and Metric generator to Statsboard. Merced goes to S3/Hive.
Figure 1: architecture of the logging pipeline

Schema

The client-side service integration provides the metadata, and developers just need to provide the name of the log and the actual log payload. Nothing else is needed.

A sample payload

 {
   "name" = "network_metrics"; // required, set by users
   "timestamp" = 2022121512345; // required, set by pipeline
   "metadata" = { // required, set by pipeline
     "app_version" = "8.40";
     "os_version" = "14.0";
     "device_model" = "IPHONE11,2";
     "build_type" = "Production" // "OTA", "Development", "Alpha", etc.
     "network_type" = "wifi" // or "cellular"
     "country" = "United States";
     "platform" = "Android";
     …
   };
   "payload" = {
     // users reported payload will appear here
   };
 };
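
As an illustration of how little a developer has to supply, the sketch below wraps a name and a payload in the envelope shown above. The `build_log` helper, the payload keys, and the epoch-millisecond timestamp format are assumptions made for this example; the real client APIs are platform-specific and are not shown in this post.

```python
import json
import time
from typing import Any, Dict, Optional


def build_log(name: str, payload: Dict[str, Any], metadata: Dict[str, str],
              now_ms: Optional[int] = None) -> Dict[str, Any]:
    """Wrap a developer-supplied name and payload in the pipeline envelope."""
    if not name:
        raise ValueError("name is required")
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # assumption: epoch milliseconds
    return {
        "name": name,                # supplied by the developer
        "timestamp": now_ms,         # filled in automatically
        "metadata": dict(metadata),  # filled in automatically
        "payload": dict(payload),    # free-form key-value pairs
    }


# Metadata would normally be collected once by the client integration.
metadata = {"app_version": "8.40", "os_version": "14.0", "platform": "Android",
            "network_type": "wifi", "build_type": "Production"}
log = build_log("network_metrics", {"host": "example.com", "latency_ms": 120},
                metadata, now_ms=1700000000000)
print(json.dumps(log, indent=2))
```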

Visualize and query

Visualizing logs in OpenSearch is relatively simple when following the self-service guidance provided for this pipeline. Developers can also use SQL queries and any other query/visualization tools that are supported by this pipeline.

Example on how to visualize network metrics in real-time with six separate graphs: mobile_json_log::platforms, mobile_networking::host, mobile_json_log::total_count_timeline, mobile_networking::req_num_by_ver, mobile_networking::request_latency, and mobile_networking::status.
Figure 2: a sample dashboard of network logs from both iOS and Android apps

Real-time alerting

Log-based metrics are a cost-efficient way to summarize log data from the entire ingest stream. With log-based metrics, users can generate a count metric of logs that match a Lucene query. For more advanced use cases, users can generate metrics from an OpenSearch terms aggregation query to dissect log data across different dimensions.

Example on how to create a log-based metric. “Succeeded. Metric Name: es.mobile_json.story_pin_by_event_type. Query Name: name: story_pin_creation_event AND metadata.build_type:Production. Index Name: mobile_json_log. Begin: -30mins. End: -5min. Term Aggs (optional) Field: payload.eventType.key. Tag Key: event_type. Size: 10. Order: desc. Field: metadata.platform.key. Tag Key: platform. Size: 10. Order: desc.”
Figure 3: example of how to create a log-based metric

Log-based metrics can be used to build dashboards and real-time alerts:

Title of Tab: ES Mobile JSON Story Pin Event. sum_aggregator: zimsum:1m-avg-none. Two Statsboard graphs with red lines and dots titled “iOS story pin event SR” and “Android story pin event SR”.
Figure 4: example of a real-time alerting setup based on the log-based metric, on Statsboard
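
The form in Figure 3 maps fairly directly onto an OpenSearch search request. The sketch below builds such a request body from the same inputs: a Lucene query, a time window, and a list of terms aggregations. The `build_metric_query` helper and the `timestamp` field name are assumptions for illustration; the field names in the usage example are copied as they appear in the figure, and the metric generator's real implementation is not shown in this post.

```python
from typing import Any, Dict, List, Tuple


def build_metric_query(lucene_query: str, begin: str, end: str,
                       term_aggs: List[Tuple[str, str, int]],
                       time_field: str = "timestamp") -> Dict[str, Any]:
    """Build a search body: Lucene filter, time window, nested terms aggregations."""
    body: Dict[str, Any] = {
        "size": 0,  # only counts and buckets are needed, not the documents
        "track_total_hits": True,
        "query": {
            "bool": {
                "filter": [
                    {"query_string": {"query": lucene_query}},
                    {"range": {time_field: {"gte": begin, "lt": end}}},
                ]
            }
        },
    }
    # Nest each aggregation inside the previous one, so buckets are split
    # by the first tag key, then by the second, and so on.
    node = body
    for tag_key, field, size in term_aggs:
        node["aggs"] = {tag_key: {"terms": {"field": field, "size": size,
                                            "order": {"_count": "desc"}}}}
        node = node["aggs"][tag_key]
    return body


query = build_metric_query(
    "name:story_pin_creation_event AND metadata.build_type:Production",
    begin="now-30m", end="now-5m",
    term_aggs=[("event_type", "payload.eventType.key", 10),
               ("platform", "metadata.platform.key", 10)],
)
```

With no term aggregations, the total hit count of the response is the count metric; with them, each bucket's document count becomes one data point tagged with the corresponding tag keys.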

Since this pipeline was built without any real push for adoption, developers have been proactively adopting this logging system, mainly for:

Client visibility

  • Networking metrics and crash metrics, so that they know better how the clients perform and can feed those client-side signals into the topline Pinner Uptime metric
  • Performance insight, such as data provided by iOS MetricKit
  • Custom error reporting, such as exceptions, soft errors, and assertions that were previously either not reported, or reported somewhere without a good tool to analyze them

Product surface/feature SLA

  • Some product teams leverage this system to report product feature health, such as Pin creation results, so they can monitor success/failure rates in real time. This often catches issues much earlier than the usual daily metric aggregation, and it’s especially useful for issues that API-side monitoring wouldn’t alert on immediately.

Developer logs

  • Developers like to use this pipeline to gain visibility into certain logic or code paths in production, e.g. “has this code ever run?”, “how often does this happen?”, and many similar questions that no one can answer except the data.
  • Developers add logs to help troubleshoot odd bugs that are very hard to reproduce locally, or issues that only occur on certain device models, OS versions, etc.

Real-time alerting

  • Thanks to the ease of reporting and alerting setup, product teams often use it just for the sake of real-time alerting.
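
As a small worked example of the success-rate idea, the sketch below computes a per-platform success rate from a handful of result logs. The `result` payload key and its values are hypothetical; in the pipeline itself this kind of number would come from log-based metrics rather than from client-side code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def success_rates(logs: Iterable[dict]) -> Dict[str, float]:
    """Return the fraction of successful events per platform."""
    # platform -> [successes, total]
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
    for log in logs:
        platform = log["metadata"]["platform"]
        counts[platform][1] += 1
        if log["payload"].get("result") == "success":  # hypothetical payload key
            counts[platform][0] += 1
    return {platform: ok / total for platform, (ok, total) in counts.items()}


def make_log(platform: str, result: str) -> dict:
    """Build a minimal log entry for the example."""
    return {"name": "story_pin_creation_event",
            "metadata": {"platform": platform},
            "payload": {"result": result}}


sample = ([make_log("iOS", "success")] * 3 + [make_log("iOS", "failure")]
          + [make_log("Android", "success"), make_log("Android", "failure")])
rates = success_rates(sample)
```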

Next steps

  • On the OpenSearch side, create sub-level indexes by name, which can boost query performance and also better isolate logs
  • Explore the alerting function provided by OpenSearch
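
One possible way to derive a per-name index is sketched below. This is only an illustration of the idea: the `index_for` helper and the `<base>-<name>` naming scheme are assumptions, not a description of how the indexes are or will be laid out.

```python
import re


def index_for(log_name: str, base: str = "mobile_json_log") -> str:
    """Map a log name to a per-name index, lowercased and free of unsafe characters."""
    slug = re.sub(r"[^a-z0-9_-]+", "_", log_name.lower()).strip("_")
    if not slug:
        raise ValueError("log name has no usable characters")
    return f"{base}-{slug}"
```

Queries for a single log name would then only touch that name's index, and a noisy log would no longer share an index with everything else.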

Acknowledgements: huge thanks to Stephen Blanco, Darren Gyles, Sha Sha Chu, Nadine Harik, Roger Wang, and our data & infra team for their contributions, feedback, and support.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore life at Pinterest, visit our Careers page.