July 17, 2024
Tracing Notifications – Slack Engineering

Notifications are a key aspect of the Slack user experience. Users rely on timely notifications of mentions and DMs to stay on top of important information. Poor notification completeness erodes the trust of all Slack users.

Notifications flow through almost all of the systems in our infrastructure. As illustrated in Figure 1 below, a notification request flows through the webapp (our application logic and web / Desktop client monorepo), job queue, push service, and several third-party services before hitting our iOS, Android, Desktop, or web clients.

Further, the decision about when and where to send a notification is also very complicated, as shown in Figure 2 below, which is from our 2017 blog post (also summarized here).

Since 2017, our notification workflow has only grown more complex with the addition of new features like Huddles and Canvas. As a result, fixing notification issues can lead to multi-day debugging sessions across multiple teams. Customer tickets related to notifications also had the lowest NPS scores and took the longest time to resolve compared to other customer issues.

Debugging notification issues within our systems was difficult because each system had a different logging pipeline and data format, making it necessary to look at data with different formats and backends. This process required deep technical expertise and took several days to complete. The context in which events were logged also varied across systems, prolonging any investigation. The result was a time-consuming process requiring expertise in all parts of the stack just to understand what happened.

We began a project to trace the flow of notifications across our systems to address these challenges. The goal was to standardize the data format and semantics of events to make it easier to understand and debug notification data. We wanted to answer questions about a notification such as: whether it was sent, where it was sent, whether it was viewed, and whether the user had opened it. This post documents our multi-quarter, cross-organizational journey of tracing notifications throughout Slack’s backend systems, and how we use this trace data to improve the Slack customer experience for everyone.

Notification flow

The sequence of steps to understand how notifications were sent and received is something we’ve dubbed the “notification flow.” The first step to improve the notification flow was to model the steps in the notification process the same way across all our clients. We also aimed to capture all events in a common data model, consistently in the same format.

We created a notification spec to understand all the events in a notification trace. This involved identifying all the events in a trace, creating an idealized funnel, and setting the context in which each event would be logged. We also had to agree on the semantics of a span and the names of the events, which was a challenging task across different platforms. The result is a notification flow (simplified for this blog post), shown in the image below.

Mapping notification flow to a trace

After we finished planning the flow of our system, we needed to pick a way to keep track of that information. We chose to use SlackTrace because a trace was a natural way to represent a flow, and all the parts of our system can already send information in the span event format. However, we encountered two major challenges when modeling notification flows as traces.

  1. 100% sampling for notification flows: Unlike backend requests, which were sampled at 1%, notification flows should not be sampled since our CE team wanted 100% fidelity to answer all customer requests. In some scenarios like `@here` and `@channel`, a push notification message would potentially be sent to hundreds of thousands of users across multiple devices, resulting in billions of spans for a single trace of a Slack message. A trace with potentially billions of spans would wreak havoc on our trace ingestion pipeline and storage backends. No sampling would also force us to trace every Slack message sent.
  2. Tracing notifications as a flow separate from the original message sent trace: Currently, OpenTelemetry (OpenTracing) instrumentation tightly couples tracing to a request context. In a notification flow, this tight coupling would break since the notification flow executes in multiple contexts and doesn’t cleanly map to a single request context. Further, mixing multiple trace contexts also made implementing tracing across our code challenging.

To solve both of these challenges we decided to model each notification sent as its own trace. To tie the sender’s trace to each of the notifications sent, we used span links to causally link the spans together. Each notification was assigned a notification_id which was used as the trace_id for the notification flow.

This approach has several advantages:

  • Since SlackTrace’s instrumentation doesn’t tightly couple trace context propagation with request context propagation, modeling these flows drastically simplifies the trace instrumentation.
  • Since each notification sent was its own trace, it made the traces smaller and easier to store and query.
  • It allowed 100% sampling for notification traces, while keeping the sender’s sampling rate at 1%.
  • Span linking helped us preserve causality for the trace data.
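
The modeling described above can be sketched in a few lines of Python. This is only an illustration, not SlackTrace’s actual API: the `Span` and `SpanLink` classes, the `fan_out` helper, and the use of random UUIDs for IDs are all assumptions made for the example. It shows one sender trace whose trigger span links to several notification traces, each of which uses its notification_id as its trace_id.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SpanLink:
    # Hypothetical link record: points at a span that lives in another trace.
    trace_id: str
    span_id: str


@dataclass
class Span:
    # Hypothetical span shape; field names are illustrative only.
    trace_id: str
    name: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None
    links: List[SpanLink] = field(default_factory=list)


def fan_out(request_trace_id: str, recipients: List[str]) -> Tuple[Span, List[Span]]:
    """Create one trigger span in the sender's trace and one root span per notification."""
    trigger = Span(trace_id=request_trace_id, name="notification:trigger")
    notify_spans = []
    for _user in recipients:
        notification_id = uuid.uuid4().hex  # doubles as the trace_id of the new trace
        notify = Span(
            trace_id=notification_id,
            name="notification:notify",
            parent_id=trigger.span_id,
        )
        # The link preserves causality between the sender's trace and this one.
        trigger.links.append(SpanLink(notify.trace_id, notify.span_id))
        notify_spans.append(notify)
    return trigger, notify_spans
```

Because every notification trace is small and independent, it can be kept at 100% while the sender’s request trace stays subject to its own sampling decision.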

Different teams worked together to map each step in the notification flow to a span. The result is the table shown below.

| Span name | Description | Trace id | Parent span id | Span tags |
| --- | --- | --- | --- | --- |
| notification:trigger | Decide if the notification should be sent or not. | Trace_id is the request id. Span links have a list of notification_id’s sent. | | trigger_type (DM, @here, @channel), user_id, team_id, channel_id, message_ts, notification_id |
| notification:notify | Notify the user on all of their clients. | Trace_id is notification_id. | ID of notification:trigger span. | user_id, team_id, channel_id, message_ts |
| notification:sent | Notification is sent to all of the Slack clients on the user’s device. | Trace_id is notification_id. | ID of notification:notify span. | channel_id, platform-specific notification tags. |
| notification:received | Notification is received on the user’s Slack client. | Trace_id is notification_id. | ID of notification:sent span. | Service name is client name and client tags. |
| notification:opened | User opened a notification on the device. | Trace_id is notification_id. | ID of notification:received span. | Service name is client name and client tags. |
| notification:read in app | User clicked on the notification to view the notification in the app. The start of the span is right after opening. The end of the span is when the message is rendered in the channel. | Trace_id is notification_id. | ID of notification:opened span. | Service name is client name and client tags. |
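
Given spans named as in the table above, finding where a notification was dropped reduces to walking the expected span names in order. The sketch below assumes the span names from the table; the `FLOW` list and the `last_stage_reached` helper are hypothetical names introduced for illustration, not part of Slack’s tooling.

```python
from typing import Iterable, Optional

# Span names from the table, in funnel order. The trigger span lives in the
# sender's trace, so it is not part of the per-notification trace.
FLOW = [
    "notification:notify",
    "notification:sent",
    "notification:received",
    "notification:opened",
    "notification:read in app",
]


def last_stage_reached(span_names: Iterable[str]) -> Optional[str]:
    """Return the furthest consecutive stage present in one notification trace."""
    seen = set(span_names)
    reached = None
    for stage in FLOW:
        if stage not in seen:
            break  # the notification was dropped before this stage
        reached = stage
    return reached
```

For a trace containing only `notification:notify` and `notification:sent`, the helper returns `notification:sent`, which points the investigation at delivery to the client.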

Advantages of modeling a notification flow as a trace

Representing the notification flow as a Trace/SpanEvent has the following advantages over our existing methods.

  • Consistent data format: Since all the services reported the data as a Span, the data from various backend and client systems was in the same format.
  • Service name to identify source: We set the service name field to Desktop, iOS, or Android to uniquely identify the client or service that generated an event.
  • Standard names for contexts: We used the span name and service name to uniquely identify an event across systems. For example, the service name for a notification:received event would be iOS, Android, or Web to accurately tag these events. Previously, the events from these three clients would have different formats and it was hard to query them uniformly.
  • Standardized timestamps and duration fields: All the events have a consistent timestamp in the same resolution and time zone as the rest of the events. If there is a duration associated with an event, we set the duration field; when reporting a one-off event, we set it to a default value of 1. This provided a single place for storing all of our duration information.
  • Built-in sessions: We use the notification ID as the trace ID for the entire flow. As a result, all the events in a flow are already sessionized and there is no need to further sessionize the data. Previously, we couldn’t use the notification ID as the join key everywhere since only some events would have a notification ID. For example, the notification triggered or notification read events wouldn’t have a notification ID in them. We can use the trace ID to tie these events together instead of using bespoke events.
  • Clean, simple, and reliable instrumentation: Since a trace is sessionized, we only need to add the tags to the trace once when we model the notification flow as a trace. This also made the instrumentation code cleaner, simpler, and more reliable since the changes were localized to small parts of the code that can be unit tested well. It also made the data easier to use since there is only one join key instead of a bespoke join key for some subset of events.
  • Flexible data model: This model is also flexible and extensible. If a client needs to add additional context, they can add additional tags to an existing span. If none of the existing spans are a good fit, they can add a new span to the trace, without changing the existing trace data or trace queries.
  • No duplicate events: The SpanID in the event helped capture the uniqueness of events at the source. This reduced the number of events that were double reported and removed the need to de-dupe events in our backend again. The older method reported Thrift objects without unique IDs, which led to using de-dupe jobs to identify double reporting of events.
  • Span linking for tying related traces together: Linking spans across traces helps preserve causality without resorting to ad hoc data modeling.
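
Several of the properties above (a single event shape, a default duration of 1, uniqueness by span ID, and sessions keyed by trace ID) can be illustrated with a small sketch. The `SpanEvent` fields below are assumptions chosen for the example, including the microsecond UTC timestamp unit; the real SpanEvent schema is not shown in this post, and `dedupe` and `sessionize` are hypothetical helpers.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class SpanEvent:
    # Illustrative fields only; not the real SpanEvent schema.
    trace_id: str          # the notification_id for notification flows
    span_id: str           # unique per event, assigned at the source
    name: str              # e.g. "notification:received"
    service: str           # e.g. "iOS", "Android", "Desktop"
    start_ts_us: int       # assumed unit: microseconds since the epoch, UTC
    duration_us: int = 1   # one-off events default to a duration of 1


def dedupe(events: Iterable[SpanEvent]) -> List[SpanEvent]:
    """Keep the first event seen for each (trace_id, span_id) pair."""
    seen: Set[Tuple[str, str]] = set()
    unique = []
    for event in events:
        key = (event.trace_id, event.span_id)
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


def sessionize(events: Iterable[SpanEvent]) -> Dict[str, List[SpanEvent]]:
    """Group events by trace_id; each group is one notification flow in time order."""
    sessions: Dict[str, List[SpanEvent]] = {}
    for event in events:
        sessions.setdefault(event.trace_id, []).append(event)
    for flow in sessions.values():
        flow.sort(key=lambda e: e.start_ts_us)
    return sessions
```

With a span ID on every event, dropping a double-reported event is a key lookup, and grouping by trace ID is the only join needed to reconstruct a flow.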

How we use notification trace data at Slack

After several quarters of hard work by multiple teams we were able to trace notifications end-to-end across all the Slack clients. Our traces were sent to a real-time store and our data warehouse using the trace ingestion pipeline.

Developers use the notification trace data to triage issues. Previously, tracking notification failures involved going through logs of multiple systems to understand where a notification was dropped. This process was involved and took several hours of very senior engineers’ time to understand what went on. With notification tracing, however, anyone can look at the trace of a notification to see precisely where it was sent and where in the flow it was dropped.

Our customer experience team uses trace data to triage customer issues much faster these days. We now know precisely where in the notification flow a message dropped. Since our traces are easier to read, our CE engineers can look at a trace to learn what happened to a notification and answer a customer’s query instead of escalating it to the development team, who then had to comb through the many logs. This helped us triage our notifications much more quickly, and reduced the time to triage notification tickets for our CE team by 30%.

Notification analytics

Currently, we ingest notification trace data to ElasticSearch/Grafana and our data warehouse.

Our iOS engineers and Android engineers have started using this data to build Grafana dashboards and alerts to understand the performance of our clients. Typically, client engineers don’t use dashboarding tools like Grafana, but our client engineers have used them very effectively to triage and debug issues in our notification flow.

We have also ingested this data into our data warehouse, where anyone can run complex analytics on it. Initially, data scientists used this data to understand performance regressions in our clients over long periods of time.

The span event format and tracing system also had an unexpected benefit. Our data scientists used this data to build a product analytics dashboard showing funnel analytics on notification flows, to better understand notification open rates. Typically, that product analytics data would be captured by a separate set of instrumentation ingested via a different pipeline into the data warehouse. However, since we sent the trace data to the data warehouse, our data scientists can use it to compute funnel analytics and get the same insights.
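
As a rough illustration of how funnel analytics fall out of trace data, the sketch below counts distinct notifications (trace IDs) that reached each stage and derives an open rate. The input shape, a stream of `(trace_id, span_name)` rows, and the `funnel` and `open_rate` helpers are assumptions for the example; the actual warehouse schema and queries are not described in this post.

```python
from typing import Dict, Iterable, Set, Tuple

# Funnel stages, using span names from the notification flow.
STAGES = ["notification:sent", "notification:received", "notification:opened"]


def funnel(rows: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct notifications (trace IDs) that reached each stage."""
    reached: Dict[str, Set[str]] = {stage: set() for stage in STAGES}
    for trace_id, span_name in rows:
        if span_name in reached:
            reached[span_name].add(trace_id)
    return {stage: len(ids) for stage, ids in reached.items()}


def open_rate(counts: Dict[str, int]) -> float:
    """Share of sent notifications that were opened; 0.0 when nothing was sent."""
    sent = counts["notification:sent"]
    return counts["notification:opened"] / sent if sent else 0.0
```

Counting trace IDs rather than raw events means a notification sent to several clients of the same user is still counted once per stage.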

An even more extraordinary outcome was that the data scientists were able to mine the trace data to identify and report bugs in the application and its instrumentation. In the two years since, notification traces have been used many times outside of the initial use case. This shows the advantage of using trace data as a single source of truth: it supports multiple use cases.

Conclusion

Modeling flows or funnels as a trace is a great idea, but there are some challenges. In this blog post we have shown how Slack modeled notification flows as traces, the challenges we faced, and how we overcame those challenges through careful modeling.

Implementing notification tracing wouldn’t have been possible without decoupling trace context propagation from the request context in the SlackTrace framework. The instrumentation helped us quickly and cleanly implement tracing across multiple backend services, while avoiding the negative side effects of existing libraries, such as cluttered instrumentation and large traces. Currently, we instrument several other flows in the production Slack app using the same strategy.

Modeling notification flows as trace data helped our CE team resolve notification issues 30% faster while also reducing escalations to the development team.

In addition to the original use case of debugging notification issues, notification trace data was also used for calculating funnel analytics for production analytics use cases. Modeling product analytics data as traces provides high-quality data in a consistent data format across all of our complex stack. Further, the built-in sessionization of trace data simplified our analytics pipeline by eliminating additional jobs to de-dupe and sessionize the trace data. In the past two years, backend and frontend developers and data scientists have used the trace data as a single source of truth for multiple use cases.

The success of notification tracing has encouraged several other use cases where flows are modeled as traces at Slack. Today there are at least a dozen tracers running concurrently in the Slack app.

Interested in taking on interesting projects, making people’s work lives easier, or optimizing some code? We’re hiring! 💼 Apply now