April 13, 2024

At Slack, the goal of the Mobile Developer Experience Team (DevXp) is to empower developers to ship code with confidence while enjoying a pleasant and productive engineering experience. We use metrics and surveys to measure productivity and developer experience, such as developer sentiment, CI stability, time to merge (TTM), and test failure rate.

We have gotten a lot of value out of our focus on mobile developer experience, and we think most companies under-invest in this area. In this post we will discuss why having a DevXp team improves efficiency and happiness, the cost of not having a team, and how the team identified and resolved some common developer pain points to optimize the developer experience.

How it started

A few mobile engineers realized early on that engineers who were hired to write native mobile code might not necessarily have expertise in the technical areas around their developer experience. They thought that if they could make the developer experience better for all mobile engineers, they could not only help engineers be more productive, but also delight our customers with faster, higher-quality releases. They got together and formed an ad-hoc team to address the most common developer pain points. The mobile developer experience team has grown from three people in 2017 to eight people today. In our five years as a team, we’ve focused on these areas:

  • Local development experience and IDE usability
  • Our growing codebase. Ensuring visibility into problematic areas of the codebase that require attention
  • Continuous Integration usability and extensibility
  • Automation test infrastructure and automated test flakiness
  • Keeping the main branch green. Making sure the latest main is always buildable and shippable

The cost of not investing in a mobile developer experience team

A mobile engineer usually starts a feature by creating a branch on their local machine and committing their code to GitHub. When they are ready, they create a pull request and assign it to a reviewer. Once a pull request is opened or a subsequent commit is added to the branch, the following CI jobs get kicked off:

  • Jobs that build artifacts
  • Jobs that run tests
  • Jobs that run static analysis

Once the reviewer approves the pull request and all checks pass on CI, the engineer can merge the pull request into the main branch. Here is the visualization of the developer flow and the flow interruptions associated with each area.

Here is a rough estimate of the cost of some developer pain points and the cost to the company of not addressing these pain points as the team grows:

While developers can learn to resolve some of these issues, the time spent and the cost incurred are not justifiable as the team grows. Having a dedicated team that can focus on these problem areas and identify ways to make the developer teams more efficient will ensure that developers can maintain an intense product focus.

Approach

Our team partners with the mobile engineering teams to prioritize which developer pain points to focus on, using the following approach:

  • Listen to customers and work alongside them. We partner with a mobile engineer as they work on a feature and observe their challenges.
  • Survey the developers. We conduct a quarterly survey of our mobile engineers where we track the general Net Promoter Score (NPS) around mobile development.
  • Summarize developer pain points. We distill the feedback into working areas that we can split up as a team and tackle.
  • Gather metrics. It is important that we measure before we start addressing a pain point to ensure that a solution we deploy actually fixes the issue, and to know the exact impact our solution had on the problem area. We come up with metrics that correlate with the problem areas developers have and track them on dashboards. This allows us to see the metrics change over time.
  • Invest in experiments that improve developer pain points. We evaluate solutions to the problems by either consulting with other companies that also develop at this scale, or by coming up with a unique solution ourselves.
  • Consider using third-party tools. We evaluate whether it makes more sense to use existing solutions or to build out our own solutions.
  • Repeat this process. Once we launch a solution, we look at the metrics to ensure that it moves the needle in the right direction; only then do we move on to the next problem area.

Developer pains

Let’s dive into some developer pain points in order of severity and examine how the mobile developer experience team addressed them. For each pain point, we will start with some quotes from our developers and then outline the steps we took.

CI test jobs that take a long time to complete

When a developer has to wait a long time for tests to run on their pull requests, they switch to working on a different task and lose context on the original pull request. When the test results return, if there is an issue they need to address, they have to re-orient themselves with the original task they were working on. This context switching takes a toll on developer productivity. The following are two quotes from our quarterly mobile engineering survey in 2018.

 

Faster CI time! I think this is asked a lot, but it would be amazing to have this improved

Jenkins build times are quite high and it would be great if we can reduce these

From 1 to 10 developers, we had a couple hundred tests and ran all of them serially using Xcodebuild for iOS and Firebase Test Lab for Android.

Running the tests serially worked for a few years, until the test job started to take almost an hour. One of the solutions we considered was introducing parallelization to the test suites. Instead of running all of the tests serially, we could split them into shards and run them in parallel. Here is how we solved this problem on the iOS and Android platforms.

iOS 

We considered writing our own tool to achieve this, but then discovered a tool called Bluepill that was open sourced by LinkedIn. It uses Xcodebuild under the hood, but adds the ability to shard and execute tests in parallel. Integrating Bluepill decreased our total test execution time to about 20 minutes.
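The sharding idea itself can be sketched independently of any particular tool. The following is a minimal illustration, not how Bluepill works internally: it assumes we know a rough duration for each test suite (the suite names and timings are made up) and uses a greedy longest-first assignment so that shards finish at roughly the same time. The `run_shard` callable is a hypothetical stand-in for whatever actually executes a shard.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_tests(durations, shard_count):
    """Split test suites into shards of roughly equal total duration.

    Greedy longest-first: take suites from slowest to fastest and always
    place the next one on the shard with the smallest running total.
    Returns (shards, totals).
    """
    shards = [[] for _ in range(shard_count)]
    totals = [0.0] * shard_count
    # Sort by duration descending, then by name so the result is deterministic.
    ordered = sorted(durations.items(), key=lambda item: (-item[1], item[0]))
    for name, seconds in ordered:
        lightest = totals.index(min(totals))
        shards[lightest].append(name)
        totals[lightest] += seconds
    return shards, totals


def run_shards_in_parallel(shards, run_shard):
    """Run every shard concurrently; run_shard is a hypothetical executor."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        return list(pool.map(run_shard, shards))


# Illustrative timings in seconds (assumed, not measured at Slack).
durations = {
    "LoginTests": 300,
    "MessageTests": 240,
    "SearchTests": 180,
    "ProfileTests": 120,
    "SettingsTests": 60,
}
shards, totals = shard_tests(durations, shard_count=2)
serial_seconds = sum(durations.values())
parallel_seconds = max(totals)  # wall-clock time is set by the slowest shard
```

With these made-up numbers the serial run takes 900 seconds while the slowest of two shards takes 480, which is why balancing shards matters as much as adding more of them.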

Using Bluepill worked for a few more years until our unit test job once again started to take almost 50 minutes. Slack iOS engineers were adding more test suites to run, and we could no longer rely solely on parallelization to lower TTM.

How moving to a modern build system helped drive down CI job times

Our next strategy was to implement a caching layer for our test suites. The goal was to only run the tests that needed to be run on a particular pull request, and return the remaining test results from cache. The problem was that Xcodebuild does not support caching. To implement test caching we needed to move to a different build system: Bazel. We utilized Bazel’s disk cache on CI machines so builds from different pull requests can reuse build outputs from another user’s build rather than building each new output locally.

In addition to the Bazel disk cache, we use the bazel-diff tool, which allows us to determine the exact set of impacted targets between two Git revisions. The two revisions we compare are the tip of the main branch and the last commit on the developer’s branch. Once we have the list of targets that were impacted, we only test those targets.

With the Bazel build system and bazel-diff, we were able to decrease TTM to a median of 9 minutes, with a minimum TTM of 4.5 minutes. This means developers can get the feedback they need on their pull request faster, and more quickly get back to collaborating with others and working on their features.

Android 

In the early days, TTM was around 50 minutes, and Firebase Test Lab (FTL) did not have test sharding. We built an in-house test sharder on top of FTL called Fuel to break tests into multiple shards and call FTL APIs to run each test shard in parallel. This brought TTM from 50+ minutes to under 20 minutes. Here is the high-level overview:

We continued using Fuel for two and a half years, and then moved to an open source test sharder called Flank. We continue to use Flank today to run Android functional and end-to-end UI tests.

Test-related failures

When a check fails on a pull request because of flaky or unrelated test failures, it has the potential to take the developer out of flow, and possibly impact other developers as well. Let’s take a look at a few culprits causing unrelated pull request failures and how we have addressed them.

Fragile automation frameworks

From 2015 to early 2017, we used the Calabash testing framework, which interacted with the UI, and wrapped that logic in Cucumber to make the steps human readable. Calabash is a “blackbox” test automation framework and needs a dedicated automation team to write and maintain tests. We saw that the more tests were added, the more fragile the test suites became. When a test failed on a pull request, the developer would reach out to an Automation Engineer to understand the failure, attempt to fix it, then rerun it and hope that another fragile test did not fail their build. This resulted in a longer feedback loop and increased TTM.

As the team grew we decided to move away from Calabash and switched to Espresso, because Espresso is tightly coupled with the Android OS and its tests are written in the native language (Java or Kotlin). Espresso is powerful because it is aware of the inner workings of the Android OS and can interface with it easily. This also meant that Android developers could easily write and modify tests because they were written in the language they were most comfortable with. A few benefits of the migration to highlight:

  • This helped to shift testing responsibility from our dedicated automation team to developers, so they can write tests as needed to test the logic in the code
  • Testing time went from ~350 minutes to ~60 minutes when we moved from Calabash to Espresso and FTL

Flaky tests

In early 2018 developer sentiment towards testing was poor, and testing caused a lot of developer pain. Here are a couple of quotes from our developer survey:

 

Flaky tests are still a bottleneck sometimes. We should have a better way tracking them and ping the owner to fix before it causes too much friction

Flaky tests slow me down to a halt – there should be a more streamlined process in place for proceeding with PR’s once flaky tests are found (instead of blocking a merge as it happens now)

At one point, 57% of the test failures in our main branch were due to flaky tests, and the percentage was even higher on developer pull requests. We spent some time learning about flaky tests and managed to get them under control in recent years by building a system to auto-detect and suppress flaky tests so that developer experience and flow are uninterrupted. Here is a detailed article outlining our approach and how we reduced the test failure rate from 57% to 4%.
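One simple way to picture auto-detection is the following sketch. It is an assumption-laden illustration rather than the system described in the linked article: it treats a test as flaky when it has both passed and failed on the same commit (the code did not change, so the test itself is unreliable), and then filters those tests out of the failures that would block a merge. The test names and commit identifiers are invented.

```python
from collections import defaultdict


def find_flaky_tests(runs):
    """Flag tests that both passed and failed on the same commit.

    runs is an iterable of (test_name, commit_sha, passed) tuples.
    """
    outcomes = defaultdict(set)
    for test_name, commit_sha, passed in runs:
        outcomes[(test_name, commit_sha)].add(passed)
    flaky = {name for (name, _), seen in outcomes.items() if seen == {True, False}}
    return sorted(flaky)


def blocking_failures(failed_tests, suppressed):
    """Failures that should still block a merge once flaky tests are suppressed."""
    return sorted(set(failed_tests) - set(suppressed))


# Hypothetical run history: testLogin is inconsistent on commit "abc",
# while testSend fails consistently there and passes on a different commit.
history = [
    ("testLogin", "abc", True),
    ("testLogin", "abc", False),
    ("testSend", "abc", False),
    ("testSend", "abc", False),
    ("testSend", "def", True),
]
flaky = find_flaky_tests(history)
still_blocking = blocking_failures(["testLogin", "testSend"], flaky)
```

A consistent failure such as `testSend` on commit `abc` still blocks the merge, which is the property a suppression system has to preserve so that real regressions are not hidden.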

CI-related failures

For many years we used Jenkins to power the mobile CI infrastructure, using Groovy-based .jenkinsfiles. While it worked, it was also the source of a lot of frustration for developers. These problems were the most impactful:

  • Frequent downtime
  • Degraded performance of the system
  • Failure to pick up Git webhooks, and therefore not starting pull request CI jobs
  • Failure to update the pull request when a job fails
  • Difficulty in debugging failures due to poor UX

After flaky tests, CI downtime was the biggest bottleneck negatively impacting the mobile team’s productivity. Here are some quotes from our developers regarding Jenkins:

 

Need more reliable hooks between the jenkins CI and GitHub. When things do go wrong, there are sometimes no links in GH to go to the right place. Also, sometimes CI passes but doesn’t report back to GH so PR is stuck in limbo until I manually rebuild stuff

Jenkins is a pain. Remove the Blue Ocean jenkins UI that’s confusing and everyone hates

Jenkins is a mess to me. There are too many links and I only care about what broke and what button/link do I need to click on to retry. Everything else is noise

After using Jenkins for more than six years, we migrated away from it to Buildkite, which has had 99.96% uptime so far. Webhook-related issues have completely disappeared, and the UX is simple enough for developers to navigate without needing our team’s help. This has not only improved developer experience but also decreased the triage load for our team.

The immediate impact of the migration was an increase in CI stability of 8 percentage points, from ~87% to 95%, and a 41% reduction in Time to Merge, from ~34 minutes to ~20 minutes.
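The two figures are measured differently, which a quick calculation makes visible. This short sketch uses only the approximate numbers quoted above: the stability figure of 8 is the difference between the two percentages (percentage points), while the 41% Time to Merge figure is a change relative to the starting value.

```python
def percentage_point_change(before_pct, after_pct):
    """Absolute difference between two percentages, in percentage points."""
    return after_pct - before_pct


def relative_change_pct(before, after):
    """Change relative to the starting value, expressed as a percentage."""
    return (after - before) / before * 100


# CI stability: ~87% before the migration, 95% after.
stability_points = percentage_point_change(87, 95)
stability_relative = relative_change_pct(87, 95)

# Time to Merge: ~34 minutes before, ~20 minutes after.
ttm_relative = relative_change_pct(34, 20)

print(f"CI stability: +{stability_points} points ({stability_relative:.1f}% relative)")
print(f"Time to Merge: {ttm_relative:.1f}% relative change")
```

Since the inputs are rounded, the outputs are approximate too: roughly 8 points (about 9% in relative terms) for stability and roughly a 41% drop in Time to Merge.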

Merge conflicts

Conflicts while adding new modules or files to the Xcode project for iOS

As the number of iOS engineers at Slack grew past 20, one area of constant frustration was the checked-in Xcode project file. The Xcode project file is an XML file that defines all of the Xcode project’s targets, build configurations, preprocessor macros, schemes, and much more. As a small team, it’s easy to make changes to this file and commit them to the main branch without causing any issues, but as the number of engineers increases, the chance of causing a conflict by making a change in this file also increases.

 

“I think the concern is more so the xcode project file, resolving conflicts on that thing is painful and error prone. I’m not sure what the best approach is to alleviating this possible pain point, especially if they have added new code files.”

“I had a dozen or so conflicts in the project file that I had to manually resolve. Not a huge issue in itself but when you’re expecting to merge a PR it can be a surprise”

The solution we implemented was to use a tool called Xcodegen. Xcodegen allowed us to delete the checked-in .xcodeproj file and create an Xcode project dynamically using a YAML file that contained definitions of all of our Xcode targets. We connected this tool to a command line interface so that iOS engineers could create an Xcode project from the command line. Another benefit was that all of the project and target level settings are defined in code, not in the Xcode GUI, which made the settings easier to find and edit.

After adopting Bazel we took it a step further and created the YAML file dynamically from our Bazel build descriptions.

Multiple concurrent merges to main have the potential to break main

So far we have talked about different issues that developers can experience when writing code locally and opening a pull request. But what happens when multiple developers try to land their pull requests on the main branch at the same time? With a large team, multiple merges to main happen throughout the day, which can make a developer’s pull requests stale quickly. The longer a developer waits to merge, the larger the chance of a merge conflict.

An increasing number of merge conflicts caused by concurrent merges started making the main branch fail and negatively affecting developer productivity. Until the merge conflict was resolved, the main branch would remain broken and pause all productivity. At one point merge conflicts were breaking the main branch multiple times a day. More developers started requesting a merge queue.

 

We keep breaking the main branch. We need a merge queue.

We brainstormed different solutions and ultimately landed on using a third-party solution called Aviator, combined with our in-house tool Mergebot. We felt that building and maintaining a merge queue would be too much work for us and that the best solution was to rely on a company that was spending all of their time working on this problem. With Aviator, developers add their pull request to a queue instead of directly merging to the main branch, and once in the queue, Aviator merges main into the developer branches and runs all of the required checks. If a pull request is found to break main, the merge queue rejects it and the developer is notified via Slack. This system helps keep merge conflicts from breaking main.
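The core behavior of a merge queue can be sketched in a few lines. This is a simplified model under stated assumptions, not Aviator’s implementation or API: the repository is represented as a dictionary of file contents, `checks_pass` stands in for the required CI checks, and `notify` stands in for the Slack notification. All names and file contents are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class PullRequest:
    number: int
    author: str
    changes: dict  # file path -> new content


def process_queue(main, queue, checks_pass, notify):
    """Land queued pull requests one at a time and return the final main.

    Each pull request is validated on top of the latest main, so a change
    that only passes against a stale main is rejected instead of merged.
    """
    pending = deque(queue)
    while pending:
        pr = pending.popleft()
        candidate = {**main, **pr.changes}  # latest main with the PR applied
        if checks_pass(candidate):
            main = candidate  # main only advances when checks pass
        else:
            notify(pr.author, f"PR #{pr.number} was removed from the queue: "
                              "checks failed on top of the latest main")
    return main


def checks_pass(state):
    """Toy check: every file other than the API must use the current API version."""
    api = state["api.txt"]
    return all(v == f"uses {api}" for k, v in state.items() if k != "api.txt")


notifications = []
main = {"api.txt": "v1", "client.txt": "uses v1"}
queue = [
    PullRequest(1, "alice", {"api.txt": "v2", "client.txt": "uses v2"}),
    PullRequest(2, "bob", {"feature.txt": "uses v1"}),  # written against v1
]
final_main = process_queue(
    main, queue, checks_pass,
    notify=lambda author, message: notifications.append((author, message)),
)
```

Both pull requests pass the checks against the original main on their own, yet merging both directly would break main. Because the queue validates the second one against main after the first has landed, it is rejected and its author is notified.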

 

Way better now with Aviator. Only pain point is I cannot merge my pull requests and have to rely on Aviator. Aviator takes hours to merge my PR to master. Which makes me anxious.

Being an early adopter means you get some benefits but also some pain. We worked closely with the Aviator team to identify and address developer pains such as increased time to merge a pull request into the main branch and failure reporting on a pull request when it is dropped out of the queue due to a conflict.

Checking pull request progress/status

This is a request we received in 2017 in one of our developer surveys:

 

Would really love timely alerts for PR assignments, comments, approvals etc. Also would be nice if we could get a DM if our builds pass (rather than only the alert for when they fail) with the option to merge it right there from slack if we have all the needed approvals.

Later in the year we created a service that monitors Git events and sends Slack notifications to the pull request author and pull request reviewer accordingly. The bot is called “Mergebot” and notifies the pull request author when a comment is added to their pull request or its status changes. It also notifies the pull request reviewer when a pull request is assigned to them. Mergebot has helped shorten the pull request review process and keep developers in flow. This is yet another example of how saving just five minutes of developer time saved ~$240,000 for a 100-developer team in a year.
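The routing rules described above are simple enough to sketch. This is a hypothetical illustration, not Mergebot’s actual code: the event kinds, field names, and message formats are all assumptions, and delivering the message to Slack is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequestEvent:
    kind: str        # assumed kinds: "comment_added", "status_changed", "review_requested"
    pr_number: int
    author: str
    reviewer: str
    detail: str


def route_notification(event):
    """Return (recipient, message) for an event, or None if nobody needs to know."""
    # Comments and status changes matter to the person who opened the PR.
    if event.kind in ("comment_added", "status_changed"):
        return event.author, f"PR #{event.pr_number}: {event.detail}"
    # A new assignment matters to the person asked to review.
    if event.kind == "review_requested":
        return event.reviewer, f"You were asked to review PR #{event.pr_number}"
    return None


events = [
    PullRequestEvent("review_requested", 42, "alice", "bob", ""),
    PullRequestEvent("comment_added", 42, "alice", "bob", "bob left a comment"),
    PullRequestEvent("status_changed", 42, "alice", "bob", "all checks passed"),
    PullRequestEvent("label_added", 42, "alice", "bob", "needs-design"),
]
routed = [route_notification(event) for event in events]
```

Keeping the routing as a pure function of the event makes it easy to test without a chat client, and unknown event kinds are dropped rather than sent to everyone.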

Recently GitHub rolled out a similar feature called “GitHub scheduled reminders” which, once opted into, notifies a developer of any PR update through a Slack notification. While it covers the basic reminder part, Mergebot is still our developers’ preferred bot because it does not require explicit opt-in and also allows pull requests to be merged with the click of a button in Slack.

Conclusion

We want Slack to be the best place in the world to make software, and one way that we are doing that is by investing in the mobile developer experience. Our team’s mission is to keep developers in the flow and make their working lives easier, more pleasant, and more productive. Here are some direct quotes from our mobile developers:

 

Dev XP is great. Thanks for always taking feedback from the mobile development teams! I know you care 💪

We’re using modern practices. Bazel is great. I feel highly supported by DevXP and their hard work.

The tools work well. The code is modularized well. Devxp is responsive and helpful and continues to iterate and improve.

Are these types of developer experience challenges interesting to you? If so, join us!