Blossom: I’m going to talk about user simulation for rapid outage mitigation. My name is Carissa Blossom. I have been an infrastructure engineer at Uber for over four years, working primarily in the SRE and production reliability space. For more than two years, I have been a commander for Uber’s elite outage mitigation volunteer squad. I’m currently working to scale Uber Eats reliably.
The Story of March 30, 2020
Let’s start with a story, March 30, 2020, just before 4 p.m. Most of us in San Francisco were busy trying to wrap our heads around our new quarantine lifestyle, that radical shift that most of us thought wouldn’t last more than a few months at most. Irina, this week’s on-call engineer for Ring0, Uber’s elite outage mitigation volunteer squad, had no time for such thoughts. Her phone had just lit up with a page, five cities down in the East Coast U.S. region. Within 10 minutes, all traffic was drained out of the region. Incident mitigated. It would take another 90 minutes for one of the hardware management teams to realize that a thoroughly tested upgrade had caused almost 4000 hosts in the region’s data center to go down. The process to recover those hosts would take days. Uber’s rides, restaurant, and freight business recovered without significant degradation in a matter of minutes. Until my telling of this story, nobody outside Uber engineering was any the wiser, all because one woman got one single page and decisively took action to mitigate. I’d like to tell you that this story is rare, that Uber’s engineers are perfect, infallible, and that our systems never make mistakes. This is not the case.
Uber’s Observability Solution
I’m not here to talk to you about failure, heroism, or even mitigation tooling. I’m here to talk to you about what comes before that. I’m here to talk to you about that page. I’m going to start with the observability solution that Uber developed to efficiently identify broad outages across our applications. You first need a little bit of background on our architecture. Behind Uber is an architecture composed of over 4000 microservices, as well as 4 monorepos that still involve microservice-like individual service deployments. The core flows for the Eats and Rides apps alone can involve several hundred services, and the dependency graph between them is quite challenging to navigate. At this point in Uber’s development, no single person can map out the entire architecture from memory.
Multiple Deployment Approaches
Adding to this complexity, there is no structure or process around who can deploy what service or when, besides a strong encouragement that service owners deploy slowly and incrementally across zones, starting with a canary zone and hopefully preceded by a staging rollout. In addition to deploy based rollouts, Uber has three different processes for rolling out different features or product configurations. These changes are rolled out not just by engineers, but by operations team members in cities all around the world. They’re rolling out these changes on the city, zone, region, and global levels. This system may seem chaotic, but it’s central to our let builders build method, which allows us to build fast and cater to the unique demands of cities and local governments.
Necessary as it may be, it presents a challenge. How do we identify problems across this broad interconnected network of services, especially when the issue might be at one of these very points of interconnection? If we relied on standard business metrics to determine when something was wrong with some facet of one of Uber’s applications, how would we make sense of that data when the implementation of each product in each city is so different? Even if we could take these business level metrics, and easily make them into alerts, how would they have any significance given all of these different markets? We needed an external monitoring system, one separate from Uber’s complex architecture, one that could simulate the experience that users would have, riding, driving, preparing, or eating through an Uber app, in distinct cities all over the world. We’d still have our standard business metrics and service based metrics to back up our new monitoring system. They’d only have to validate and provide further color to problems we’d be much better suited to identify and tackle.
Blackbox, as we named it, runs separately on an entirely independent infrastructure stack, leveraging multiple cloud providers to get us as close as possible to the real user’s experience. How do we anticipate and simulate the pain of our users without actually causing them pain? The first piece of the puzzle is called test accounts. Test accounts are, with a few exceptions, identical to real production accounts. That’s intentional. Certain services need to know that an account isn’t real, so that we don’t count or bill test users like production accounts or match them with a production user. With these cases handled, all other services essentially treat all accounts, production or test, exactly the same.
Special headers, which we call tenancy headers, have been added to all accounts. They identify each account as production or test, so that services that need to distinguish between the two are able to do so. While almost identical to real production accounts, it is important to note that test accounts are in no way based on real user data. It’s all randomly generated. Test accounts by themselves don’t actually do anything, they just exist. We needed a system to coordinate the actions of different users as they traverse Uber’s systems. What we needed was a series of large, extensive integration tests, which could move the test accounts through a real user’s experience. Uber’s stack is far too complex for someone to be able to effectively write an integration test that covers it all. Besides, there’s no real ROI for any one person to take on the task of writing the whole thing themselves, let alone maintaining it, even if they could.
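As an illustration of the tenancy header idea, here is a minimal Go sketch of how a billing-like service might branch on it. The header name `X-Tenancy`, its values, and the `ShouldBill` and `newHeader` helpers are assumptions made for this example; the talk does not give the real names.

```go
package main

import (
	"fmt"
	"net/http"
)

// TenancyHeader is a hypothetical header name; the talk does not name the real one.
const TenancyHeader = "X-Tenancy"

// Tenancy values assumed for illustration.
const (
	TenancyProduction = "production"
	TenancyTest       = "test"
)

// newHeader builds a header set tagged with the given tenancy.
func newHeader(tenancy string) http.Header {
	h := http.Header{}
	h.Set(TenancyHeader, tenancy)
	return h
}

// IsTestTraffic reports whether a request is tagged as coming from a test account.
func IsTestTraffic(h http.Header) bool {
	return h.Get(TenancyHeader) == TenancyTest
}

// ShouldBill is the kind of check only a few services need: everything
// that is not tagged as test traffic is billed as usual.
func ShouldBill(h http.Header) bool {
	return !IsTestTraffic(h)
}

func main() {
	fmt.Println("bill test account:", ShouldBill(newHeader(TenancyTest)))
	fmt.Println("bill production account:", ShouldBill(newHeader(TenancyProduction)))
}
```

Services that do not care about the distinction would simply never read the header, which is what lets test accounts exercise the same code paths as real ones.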
Composable Integration Tests
Bring in the composable testing framework. A given service owner may not know the whole stack, but they’re experts on their own services, and they know exactly how those services fit into the bigger picture. They know what potential states users are in when they interact with this service, and what states they should be living in. We leverage this expertise to create a system that works a bit more like Legos. We empower users to write miniature tests, with expected input states for execution, that can function as nodes in a series of finite state machines.
Sample Composable Integration Test
Let’s say that we own one of the services responsible for the driver’s flow, and we want to make sure that drivers properly receive a trip flow offer. We expect the driver to already be online and available before it gets to our test case. We can determine by writing the test what state the user should be in when it finishes our test case. Does he choose to take the offer or not? We leverage the library of test modules available to test our trip offer in the context of a broader scope of the flow, and to even bring in other types of test accounts like rider accounts to bring a much broader test. In this way, service owners can take everything else for granted and still test their code out in the context of as much or as little of the stack as they want.
Let’s try writing it ourselves. Here we have a basic struct for a test case written in Golang. In this case, it is the Driver Get Request struct, which takes a data provider that helps us populate the test case with test accounts and whatever other data the test case may need. We define the run function next, which tells the test framework what to do when it hits this specific test case in a given flow. Now, we expect our driver first to be online before they can receive an offer, but our team doesn’t do that. That’s ok, we don’t actually need to know how that part works. We can just import the Driver Go Online module from the test case library, and then use it to create a bigger integration test with multiple test cases. With this framework, writing integration tests becomes simple; it just requires each team to keep their Lego pieces, or nodes, up to date. They’re incentivized to do so. Otherwise, we don’t have proper insight into potential outages in their part of the stack.
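The slide code is not part of this transcript, so the following is only a rough Go sketch of the shape described above. `DataProvider`, `TestCase`, `State`, and `RunFlow` are hypothetical names chosen for this example and are not the framework’s actual API; only the Driver Get Request and Driver Go Online names come from the talk.

```go
package main

import "fmt"

// State is the state a test account is in as it moves through a flow.
type State string

const (
	Offline      State = "offline"
	Online       State = "online"
	OfferPending State = "offer_pending"
)

// DataProvider supplies test accounts and other data to a test case.
// In this sketch it only tracks the driver's state.
type DataProvider struct {
	DriverState State
}

// TestCase is one composable node: it expects an input state and
// leaves the account in an output state.
type TestCase interface {
	Name() string
	Run(dp *DataProvider) error
}

// DriverGoOnline stands in for a module maintained by another team.
type DriverGoOnline struct{}

func (DriverGoOnline) Name() string { return "DriverGoOnline" }

func (DriverGoOnline) Run(dp *DataProvider) error {
	if dp.DriverState != Offline {
		return fmt.Errorf("expected driver %q, got %q", Offline, dp.DriverState)
	}
	dp.DriverState = Online
	return nil
}

// DriverGetRequest is our own test case: the driver must already be online.
type DriverGetRequest struct{}

func (DriverGetRequest) Name() string { return "DriverGetRequest" }

func (DriverGetRequest) Run(dp *DataProvider) error {
	if dp.DriverState != Online {
		return fmt.Errorf("expected driver %q, got %q", Online, dp.DriverState)
	}
	dp.DriverState = OfferPending
	return nil
}

// RunFlow executes test cases in order and stops at the first failure.
func RunFlow(dp *DataProvider, cases ...TestCase) error {
	for _, tc := range cases {
		if err := tc.Run(dp); err != nil {
			return fmt.Errorf("%s: %w", tc.Name(), err)
		}
	}
	return nil
}

func main() {
	dp := &DataProvider{DriverState: Offline}
	err := RunFlow(dp, DriverGoOnline{}, DriverGetRequest{})
	fmt.Println("final state:", dp.DriverState, "error:", err)
}
```

Because each node checks its own expected input state, running `DriverGetRequest` without `DriverGoOnline` in front of it fails immediately, which is the finite-state-machine behaviour the composition relies on.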
Composable Integration Test Use Cases
To make our testing framework just that little bit more awesome, we wrote it so that it would work everywhere. Not just on our external monitoring system, Blackbox, you can also run these tests pre-commit or pre-deploy via Jenkins. It’s also how we simulate peak load. We estimate the number of users on top of current production traffic that we want to run a load test for, and we spin up the additive delta of test accounts with whatever configurations we want. We have them all execute the same integration test. This is how Uber learned to anticipate impending high load days, like New Year’s Eve.
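The additive delta mentioned above is simple arithmetic, sketched here in Go. The function name and the user counts are made up for illustration and are not Uber figures.

```go
package main

import "fmt"

// AdditiveDelta returns how many test accounts to spin up on top of
// current production traffic to reach the target load. It never
// returns a negative number.
func AdditiveDelta(targetUsers, currentUsers int) int {
	if targetUsers <= currentUsers {
		return 0
	}
	return targetUsers - currentUsers
}

func main() {
	// Hypothetical numbers: simulate 150,000 concurrent users while
	// 100,000 real users are already on the platform.
	fmt.Println("test accounts needed:", AdditiveDelta(150000, 100000))
}
```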
Back to Blackbox
Now, we’ve got all these tests that we can run everywhere with all of their test accounts and we have the ability to simulate business features, and it’s all built into Blackbox. We configure Blackbox to run these integration tests based on all of the unique products, features, and configurations for each city. That’s a lot of information about a lot of cities, so a key feature of Blackbox is the ability to automatically provide a high level assessment of all failures, broken down by different factors associated with the tests that are currently being viewed. It automatically bubbles up the most critical failure information for mitigation to the top. Let’s see what this actually looks like.
This is a slightly scrubbed version of Blackbox. Here you can see the failure domains, which refers to the specific test that we are currently looking at and gathering all of the information around. In this case, you’re looking at Uber’s two default tests: trip flow, and Eats. These are the impact tests, which encompass the entirety of our core flows for our two major businesses. This is all of the critical features, as well as some additional ones, all in one test. Even as we’re looking at these two all-encompassing tests, it might also be helpful to know what other more specialized tests from different teams are failing at this current time. They can provide color and insight into whatever outage we are currently trying to solve.
The next valuable bit of information is this timeline. This is the timeline of failing tests over a given period of time, in this case, the last 15 minutes. The way that this graph manifests can give us a lot of valuable information about what type of system might have rolled out the problematic code or change. For instance, something that was deployed via a configuration flag, which would essentially be an on/off switch, might manifest as a sudden massive spike in failures. Whereas a problematic change rolled out via a deployment system, because deployments are slowly and incrementally rolled out across hosts in a given zone and then region, will manifest more gradually on this graph.
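To make the spike-versus-ramp reading concrete, here is a small Go sketch of one possible heuristic: if a single step between time buckets accounts for most of the peak failure count, treat it as a sudden spike. The 80 percent threshold and all names are assumptions for illustration; the talk describes an engineer reading the graph, not an automated classifier.

```go
package main

import "fmt"

// Shape is a coarse description of a failure timeline.
type Shape string

const (
	NoFailures Shape = "no failures"
	Spike      Shape = "sudden spike (flag-like rollout)"
	Gradual    Shape = "gradual ramp (deploy-like rollout)"
)

// spikePct is a made-up threshold: a single jump covering at least
// this percentage of the peak counts as a spike.
const spikePct = 80

// Classify looks at failure counts per time bucket, oldest first.
func Classify(failures []int) Shape {
	peak, maxJump := 0, 0
	for i, f := range failures {
		if f > peak {
			peak = f
		}
		if i > 0 && f-failures[i-1] > maxJump {
			maxJump = f - failures[i-1]
		}
	}
	switch {
	case peak == 0:
		return NoFailures
	case maxJump*100 >= peak*spikePct:
		return Spike
	default:
		return Gradual
	}
}

func main() {
	fmt.Println(Classify([]int{0, 0, 0, 50, 52, 51}))
	fmt.Println(Classify([]int{0, 5, 12, 20, 31, 40}))
}
```

A window that starts in the middle of an outage would defeat this heuristic, since the jump itself falls outside the data.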
Next, you see the failure domain. These are the different attributes that we mentioned earlier, which have been ranked based on the criticality of the particular attribute. In this case, you’ll notice that zone has been bubbled up higher than the failure cause. That’s intentional, because at Uber we mitigate using zone and region drains for a single service or an entire zone or region. That’s actually more important for mitigating an outage than the failure cause, which refers to the endpoint and the status code and is more important for actual root cause analysis down the line. Finally, we see a list of recent test runs shown as a graph, so each of these dots refers to an individual integration test run on what we call a prober. Probers are essentially Docker containers with an identical base image that exist on hosts on distinct cloud providers in regions all over the world. Whether the dot is red or green tells you whether that particular test run failed or succeeded. By clicking on any of these individual dots, Blackbox takes you to a secondary screen, which gives you a significant amount of data about that particular test run, including the full integration test list of endpoints, as well as the error trace and further information.
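As a rough illustration of bubbling up attributes, the Go sketch below counts failed runs per attribute value and reports the dominant zone before the dominant failure cause. The `Run` type, the zone names, and the endpoint strings are hypothetical; the talk does not describe how Blackbox computes its ranking.

```go
package main

import "fmt"

// Run is one integration test run reported by a prober.
type Run struct {
	Zone   string
	Cause  string // endpoint plus status code
	Failed bool
}

// ByZone and ByCause pick out the attribute to group on.
func ByZone(r Run) string  { return r.Zone }
func ByCause(r Run) string { return r.Cause }

// TopFailing returns the attribute value with the most failed runs.
// Ties are broken alphabetically so the result is deterministic.
func TopFailing(runs []Run, attr func(Run) string) (string, int) {
	counts := map[string]int{}
	for _, r := range runs {
		if r.Failed {
			counts[attr(r)]++
		}
	}
	best, bestN := "", 0
	for v, n := range counts {
		if n > bestN || (n == bestN && v < best) {
			best, bestN = v, n
		}
	}
	return best, bestN
}

func main() {
	runs := []Run{
		{"zone-a", "POST /offer 500", true},
		{"zone-a", "GET /status 503", true},
		{"zone-a", "POST /offer 500", true},
		{"zone-b", "POST /offer 500", false},
		{"zone-b", "POST /offer 500", true},
	}
	// Zone first, because it drives the drain decision; cause second,
	// because it matters for root cause analysis later.
	zone, zn := TopFailing(runs, ByZone)
	cause, cn := TopFailing(runs, ByCause)
	fmt.Printf("zone: %s (%d failures)\ncause: %s (%d failures)\n", zone, zn, cause, cn)
}
```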
Let’s dig a bit more into those probers. Blackbox was intended to simulate the experience of real users as closely as possible. That means test accounts on Blackbox need to reflect the diversity of different operating systems running different versions on different carrier networks, just like real users. We refer to the Docker containers where these individual tests are executed as probers.
They run on hosts in different cloud providers in regions all over the world. They report the success or failure of the individual tests to a separate special set of hosts called aggregators.
Aggregators answer the question of how we get an actionable signal. We don’t want too many false alerts or to wake people up at every hiccup or individual prober failure. Instead, we developed probe aggregators, which are a ring of three to five nodes with a master node, which develop consensus so that we have confidence that there’s an actual issue. They aggregate the results from the probers, and the master node determines when to raise the alert about a given failure. The addition of Blackbox has allowed us to achieve best-in-class mitigation times. It has changed our approach to on-call management, from one based solely on identifying root cause and fixing or undoing whatever the issue was, to one where we focus first on mitigating the outage, so we then have the privilege to be able to dig into whatever the underlying root cause is, without any user impact. It empowered engineers, like the on-call engineer Irina, to mitigate our hardware outage in a matter of minutes instead of hours.
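The talk does not describe the consensus protocol the aggregators use, so the Go sketch below substitutes a plain majority vote as a stand-in: an alert fires only when more than half of the aggregator nodes independently see a failure rate above a threshold. The `NodeView` type and the 50 percent threshold are assumptions.

```go
package main

import "fmt"

// NodeView is what one aggregator node has observed from its probers.
type NodeView struct {
	Failed int
	Total  int
}

// SeesOutage reports whether this node's failure rate is at or above
// thresholdPct percent. A node with no data never votes for an outage.
func (v NodeView) SeesOutage(thresholdPct int) bool {
	return v.Total > 0 && v.Failed*100 >= v.Total*thresholdPct
}

// ShouldAlert returns true only when a strict majority of nodes agree
// that there is an outage, so a single noisy node cannot page anyone.
func ShouldAlert(views []NodeView, thresholdPct int) bool {
	votes := 0
	for _, v := range views {
		if v.SeesOutage(thresholdPct) {
			votes++
		}
	}
	return votes*2 > len(views)
}

func main() {
	ring := []NodeView{{Failed: 9, Total: 10}, {Failed: 8, Total: 10}, {Failed: 1, Total: 10}}
	fmt.Println("alert:", ShouldAlert(ring, 50))
}
```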
What do you do when the problem isn’t isolated in such a way that a mitigation strategy like a zone or region drain is possible? What do you do when all of your zones or regions are impacted at once? What do you do when there is no viable mitigation strategy? Now that we have significantly reduced our time to mitigation, our next challenge is this time to resolution, which still takes hours if not days. How do we get to failure attribution faster? It turns out that the answer was waiting for us, by pairing machine learning with one of Uber’s most famous open source tools, Jaeger tracing. Jaeger is a distributed system that leverages distributed context propagation and transaction monitoring to provide observability into microservice based systems.
Failure Attribution via Machine Learning with Jaeger
At Uber, we’ve already got Jaeger tracing across most of our critical services. By creating a new root span at the start of every integration test by default, as part of the test framework itself, and ensuring that all services on our core flow continue and propagate it with further spans, we have been able to complete tracing across our integration tests. Starting from this root span, we create traces several thousand spans long, each encompassing an integration test’s traversal through the entire synchronous flow for the Eats or Rides product. With tracing turned on for all of Blackbox’s impact tests, the two core tests for Eats and Rides, we have been able to leverage the frequency of tests run on Blackbox to compile significant amounts of data on the success and failure paths for a given test in a given city. By feeding this data into a machine learning model, we are gaining the ability, with increasing accuracy, to predict which specific service is most likely responsible for an outage, given that Blackbox already tells us which endpoint failed. With this system, we’re closer than ever to being able to accurately predict the root cause of an outage right from Blackbox.
Let me show you what it looks like. This is a simplified version of our failure attribution system at Uber, which we call KAIJU. In this case, we have a simplified version of an integration test, which only has four services and, let’s say, two or three test cases. KAIJU shows you a map of all of the services for a given endpoint that would have been hit in a success case. If the endpoint failed, services that did not get hit in this particular test run are grayed out, while the service that KAIJU thinks is most likely responsible for the failure at that endpoint is bolded and marked with a red line. In this case, our first endpoint succeeded, so you see a green dot next to the endpoint name, but our second one failed. If we click on that endpoint with a red dot, you’ll see a map that looks like this.
One might assume, based on the fact that the failing endpoint is Driver Go Online, that the issue would be with the driver service or the underlying database. Without KAIJU, that would be a very valid assumption to make. As KAIJU shows, that is very wrong. The database never even got touched. What was actually responsible was a very unexpected service, the city safety service. By helping us to efficiently identify this service as the root cause, KAIJU has saved us the significant amount of time that it would take to talk first to the driver service team and then to the storage team before one would even think of checking other parts of the stack. Just in case you prefer the classic Jaeger view, which has a more span based focus, we have a button at the top right of the screen which will take you directly to the old-school Jaeger view.
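KAIJU’s attribution comes from a machine learning model trained on trace data, which is not shown in the talk. The Go sketch below swaps in a much simpler heuristic purely to illustrate the map described above: compare the services on a known success path with those actually hit in the failed run, gray out the ones never reached, and flag the deepest service that was reached. The service identifiers and the `Attribute` function are hypothetical.

```go
package main

import "fmt"

// Attribute compares the ordered success path for an endpoint with
// the set of services hit in a failed run. It returns the deepest
// service that was reached (the suspect under this simple heuristic)
// and the services that were never reached (grayed out).
func Attribute(successPath []string, hit map[string]bool) (suspect string, grayedOut []string) {
	for _, svc := range successPath {
		if hit[svc] {
			suspect = svc // deepest service reached so far
		} else {
			grayedOut = append(grayedOut, svc)
		}
	}
	return suspect, grayedOut
}

func main() {
	// Hypothetical success path for the Driver Go Online endpoint.
	path := []string{"driver-service", "city-safety-service", "driver-database"}
	// Services that emitted spans in the failed run.
	hit := map[string]bool{"driver-service": true, "city-safety-service": true}

	suspect, grayed := Attribute(path, hit)
	fmt.Println("suspect:", suspect)
	fmt.Println("never reached:", grayed)
}
```

A real trace is a graph rather than a single ordered path, which is part of why a learned model is a better fit than a rule like this one.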
Recapping Introduced Tools
We’ve developed test accounts to closely simulate real users. We created a composable integration testing framework to simulate real user’s experience on the ground, and solve the specialization problem that comes from a stack too large for any one engineer to fully comprehend. We created an external testing and monitoring system to run these tests with these accounts configured for all the unique city and product specifications one might want to cover. We’ve created a failure attribution tool built on Jaeger that empowers us to narrow down complex outages to a single potential root cause. Blackbox has enabled teams to develop quickly but safely, and it encourages us to always keep the customer foremost in our consciousness.
With Blackbox and the failure attribution built on top of it, we are closer than ever to the dream of automating the mitigation of outages and providing reliable predictive failure attribution at Uber.
If you’d like to learn more about our stack here at Uber, please check out our engineering blog.
Questions and Answers
Tucker: There were a couple of questions in here about adoption. How do you incentivize teams to write these? Have you found it to be difficult to get them to buy in?
Blossom: The composable testing framework is something I actually helped build. I was there when this all took place. To be honest, I think part of what helped was the fact that I also run the outage mitigation squad. Whenever Uber has really terrible outages, when these Blackbox alerts go off, I’m one of five people in the company that is there on the ground at 3:00 in the morning, helping these teams to actually solve their problems and get out of these outages. It gives me a little bit of additional leverage, frankly, that when I’m asking them to do something like this, people usually respect my opinion because of that.
Another thing that helped was we strategically chose to build this system directly with our central marketplace team. Just to clarify what that is, we call the marketplace the meeting place where the riders, drivers, eaters, couriers, all these different critical users to our system, where they come together and they meet. That’s our marketplace. The system that builds those holds a lot of the really critical modules. They’re the ones who would develop the Rider Go Online module, and the driver accepts a request, all this stuff. Having those teams, even before we launched it more broadly, come in and actually have all of these really critical modules built, made things really simple because people were already trying to solve this problem. They had been experiencing this pain of, “I don’t know how that works. I just take it for granted.” Having that all there, right when we launched it, meant that people were pretty enthusiastic about this right from the beginning. It made them a little bit more forgiving as we discovered issues and had to tweak things along the way.
Tucker: Can you just clarify for users, what environments are these tests executing in? Are they stable? Are you able to get real solid signals out of them?
Blossom: Yes. Just to clarify, I’m going to talk about Blackbox and then CTF because the answer is a bit nuanced here. The composable testing framework was meant to solve the problem of write once, run everywhere. Prior to CTF, the Composable Testing Framework, we had a simpler version of the complete integration test that the marketplace team actually built entirely on their own. They had to go and talk to all these different teams, and our mitigation squad was instrumental in helping bridge those connections. It was written entirely in Python, and it only worked on Blackbox. The first thing people started asking us was, “That test is awesome, can I run it in more places? This thing is catching outages after production, but I want to catch it beforehand.” Yes, they are like heartbeats. They wanted to take advantage of this, people wanted to create more reliable systems. We developed CTF to give that to them, and to help solve this problem. We have a different set of runners to allow it to run in every environment. We have one which is meant for running it on your local environment, and it sets up that same Docker container. Even though each runner is different, the underlying system of having this identical image is a critical part of making it work the same everywhere. That’s how it works locally.
Staging gets a bit more complicated. Yes, you can run it on staging, but each team manages and maintains their own staging environment. Can you imagine, with 4000 microservices? We don’t have a fully integrated staging environment. I dream about that day. We’re not there yet. In every other regard, we do run this. Hailstorm is what we call our load testing service, which is a critical way this runs, and this is how we prepare for high load days. We’re running a lot of these right now as we get up to New Year’s. Maybe not this year, but most years people are taking a ton of Uber trips at the same time. These are some of the different ways it runs. Blackbox is essentially our production version of this test and the most critical utilization of it, but it’s just one.
Tucker: Do you isolate the integration tests to a specific domain at all, or how do you isolate it so that downstream services owned by other teams are not executed as well, or do you do that at all?
Blossom: Actually, the downstream impact is one of the best parts of this whole thing. We have so many teams, it’s really hard to force every team to think about reliability. Different teams have different sizes and different bandwidth and workloads, and sometimes it’s just really hard for a team to prioritize this with everything else. One of the benefits they get is if their upstream service is super gung-ho about reliability and maintaining these regular integration tests, their services in that particular zone or region are going to get hit. Yes, we can do some degree of domain isolation, by strategically picking which zone we want to run it in. I’m referring to zones and regions, just to be clear, in the same way that one might hear about them in GCP. Two to three zones make up a single region, and we have multiple regions. We can pick a strategic zone and run it there. We even sometimes will spin up new zones, specifically for testing purposes, at specific sizes.
Sometimes we want to test on an empty zone with all the traffic drained out of it, and then run just a very concentrated number of tests. We can simulate load much higher than the number of servers through basically picking a zone size and being like, so we’re going to run the standard number of tests on a zone a third of the size to really push the servers. Or maybe we want to test the same thing on a regular sized zone and see how there’s that variation, in order to test different systems. The nice thing about running it on a zone that’s a third of the size is that you can essentially simulate a higher load for that service without actually impacting the downstream services with the same amount of higher load, because to them, which are running on a normal sized zone, it’s an easy, acceptable amount of traffic. That is some way that we can simulate domain isolation, but I generally don’t like it when teams do that, because you’re losing a lot of benefit.
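The effect of shrinking the zone can be shown with simple arithmetic, sketched here in Go. The host counts and test rate are invented for illustration; only the one-third ratio comes from the answer above.

```go
package main

import "fmt"

// PerHostLoad returns how many test runs each host sees per minute
// when the load is spread evenly. It returns 0 for an empty zone.
func PerHostLoad(testsPerMinute, hosts int) int {
	if hosts <= 0 {
		return 0
	}
	return testsPerMinute / hosts
}

func main() {
	const tests = 9000 // hypothetical standard number of tests per minute

	// The service under test runs in a zone a third of the normal size,
	// while its downstream services stay in a normal sized zone.
	fmt.Println("service under test, per host:", PerHostLoad(tests, 100))
	fmt.Println("downstream service, per host:", PerHostLoad(tests, 300))
}
```

The same total traffic produces three times the per-host load on the service under test, while downstream services in a normal sized zone see their usual per-host load.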
Tucker: Do you mind talking a little bit about the prerequisites or requirements for getting started writing composable tests, what you would recommend teams get?
Blossom: I don’t know if there are any hard prerequisites. I think starting early is always a great idea at any stage in your development. I think the concept of a composable testing system is something that works at any size, especially if you build it in early. One of the things we learned the hard way was to make sure that, for each team, you allow them to isolate and distinguish between the startup, which is basically all of the modules leading up to the part they care about, the stuff they actually care about, and then the tear-down. People don’t necessarily want to know when stuff belonging to other teams breaks, but we do. As a system, as production engineers, we want to know. Giving that to them really ups the morale of people as they use the system. As for when you start using it, start today.
As for the question about test accounts messing up business traffic. The security assessment for test accounts was one of my favorite projects. This is where this tenancy header is really critical. It’s part of why test accounts are almost identical but not identical. There are certain services that we use to track, basically, like our actual on-trip numbers and stuff like that, that they won’t hit. In terms of separating in business metrics on the core flow, the tenancy headers are critical to that. We actually expose those to our metrics. You can look at any of our core dashboards on the business metrics, and distinguish between, do you want to look at just production or do you want to look at test and production? There might be reasons why you want to look at both, or one or the other.
Tucker: Do you have a feeling for how often you’re catching things with your tests as opposed to actual users? I would imagine you still have alerting for your real users as well. How often do you think the tests catch something versus your other production alerts?
Blossom: Usually what happens, in my experience, and I have been the first person receiving these alerts for over three years, is that either the Blackbox system goes off and our regular metrics at the same time, or Blackbox just goes off by itself. It’s really rare that something is caught by our business metrics, and Blackbox is totally mute. When that happens, it usually has to do with the dispersal of the problem. There are some really narrow nuances to why that might happen. We work pretty closely with our Ops teams, especially on Uber Eats, we’re really tight with our Ops groups. Sometimes they tell us about things that Blackbox doesn’t catch, and neither do business metrics. I am super grateful to those people, but at the same time then I need to work to do better. That’s where we’ve found a hole in our integration test. It’s usually not from business metrics. It’s real people on the ground telling us that there is an issue.