Richardson: I’m Chris Richardson. Welcome to my talk on minimizing design-time coupling in a microservice architecture. In this talk, I’m going to answer three questions. What is design-time coupling? What problems does it create? How do we design loosely coupled services? I’ve done a few things over the years. Most notably, I wrote the book, “POJOs in Action.” I created the original Cloud Foundry which was a PaaS for deploying Java applications on AWS. These days, I focus on the microservice architecture. I wrote the book, “Microservices Patterns.” I help organizations around the world adopt and use microservices successfully through consulting and training.
Let’s now talk about design-time coupling. First, I’m going to describe the essential characteristics of the microservice architecture, including loose design-time coupling. After that, I’ll describe some of the techniques for minimizing design-time coupling. Then finally, I’m going to use the problem of ordering takeout burritos to illustrate potential coupling problems, and then show how you could eliminate them.
Microservice Architecture = Architectural Style
The microservice architecture is an architectural style that structures an application as a set of services. The services are loosely coupled. Each service is owned by a small team. Each service is independently deployable. The lead time for each service, which is the time from commit to deploy, must be under 15 minutes.
Why Microservices: Success Triangle
Why use microservices? The adoption of microservices is driven by two important trends. The first trend is as Marc Andreessen said back in 2011, software is eating the world. What this phrase means is that a business’s products and services are increasingly powered by software. It doesn’t matter whether your company is a financial services company, an airline, or a mining company, software is central to your business. The second trend is that the world is becoming increasingly volatile, uncertain, complex, and ambiguous. Sadly, there’s no better example of that than COVID, which has been the ultimate disruptor. Because of the dynamic and unpredictable nature of the world, businesses need to be nimble. They need to be agile. They need to innovate faster. Because software is powering those businesses, IT must deliver software much more rapidly, frequently, and reliably.
To deliver software rapidly, frequently, and reliably, you need what I call the success triangle. You need a combination of three things: process, organization, and architecture. The process, which is DevOps, embraces concepts like continuous delivery and deployment, and delivers a stream of small changes frequently to production. You must structure your organization as a network of autonomous, empowered, loosely coupled, long-lived product teams. You need an architecture that is loosely coupled and modular. Once again, loose coupling is playing a role. If you have a large team developing a large, complex application, you must typically use microservices. That’s because the microservice architecture gives you the testability and deployability that you need in order to do DevOps, and it gives you the loose coupling that enables your teams to be loosely coupled.
I’ve talked a lot about loose coupling, but what is that exactly? Operations that span services create coupling between them. Coupling between services is the degree of connectedness. For example, in the customer and order example that I use throughout this talk, the create order operation reserves credit in the customer service and creates an order in the order service. As a result, there is a degree of coupling between these two services.
There are two main types of coupling. The first type of coupling is runtime coupling. Runtime coupling is the degree to which the availability of one service is impacted by the availability of another service. Let’s imagine that the order service handles a POST request to create an order by making a PUT request to the customer service to reserve credit. While this seems simple, it’s actually an example of tight runtime coupling. The order service cannot respond to the POST request until it receives a response from the customer service. The availability of the create order endpoint is the product of the availability of both services, which is less than the availability of a single service. This is a simple example of a common antipattern in a distributed application. A good way to eliminate tight runtime coupling is to use asynchronous messaging mechanisms such as the saga pattern. The order service could, for example, respond immediately to the create request. The response would tell the client that the request to create the order had been received, and that it would need to check back later to determine the outcome. The order service would then exchange messages with the customer service to finalize the creation of the order.
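As a rough sketch of that saga-style interaction (class and method names here are hypothetical, not taken from the book’s actual code), the order service can record the order as PENDING, respond immediately, and settle the outcome later when the customer service’s reply message arrives:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the order service accepts a create-order request,
// records the order as PENDING, and responds at once. The outcome is
// settled later by messages exchanged with the customer service.
public class OrderService {
    public enum OrderState { PENDING, APPROVED, REJECTED }

    private final Map<String, OrderState> orders = new ConcurrentHashMap<>();

    // Handles the synchronous POST: returns the new order's ID immediately,
    // without waiting for the customer service.
    public String createOrder(String customerId, long amount) {
        String orderId = UUID.randomUUID().toString();
        orders.put(orderId, OrderState.PENDING);
        // In a real system, a ReserveCredit message would be published here.
        return orderId;
    }

    // Invoked later by the reply message from the customer service.
    public void onCreditReserved(String orderId, boolean reserved) {
        orders.put(orderId, reserved ? OrderState.APPROVED : OrderState.REJECTED);
    }

    // Clients check back here to learn the outcome.
    public OrderState getOrderState(String orderId) {
        return orders.get(orderId);
    }
}
```

A client that receives the order ID would then poll this state, or subscribe to a notification, to discover whether the order ended up APPROVED or REJECTED.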
The second type of coupling is design-time coupling, which is the focus of this talk. Design-time coupling is the degree to which one service is forced to change because of a change to another service. Coupling occurs because one service directly or indirectly depends upon concepts that are owned by another service. Let’s imagine that the order service consumes the API of the customer service. It either invokes the service’s operations, or it subscribes to its events. Dependencies are not inherently bad. Quite often, they’re absolutely necessary. However, this creates design-time coupling from the order service to the customer service. The reason design-time coupling is a potential problem is that concepts can change. There is a risk, for example, that a change to the customer service will force its API to change in a way that requires the order service to also change. The degree of coupling is a function of the stability of the customer domain, the design of the customer service API, and how much of that API is consumed by the order service. The tighter the coupling, the greater the risk of lock-step changes. As I describe later, lock-step changes require teams to collaborate, which can reduce productivity.
Consequently, loose coupling is essential. It’s important to remember that loose coupling is not guaranteed. You must carefully design your services to be loosely coupled. Ideally, we should design services in a way that avoids any design-time coupling. For example, we might consider turning create order into a local operation by putting the customer and order subdomains in the same service. This might not be a good idea however, if it creates a service that is too large for a small team to maintain. In general, while we can try to avoid design-time coupling, it’s usually not practical to eliminate it. Instead, the goal is to minimize it.
Modularity and Loose Coupling Is an Old Idea
This is a talk about loose coupling and microservices. Loose coupling is an ancient concept that spans the entire design space. Parnas, for example, wrote a famous paper about modularization back in 1972. The title was “On the Criteria To Be Used in Decomposing Systems into Modules.” Many of the ideas in this paper are very relevant to microservices. At the other end of the spectrum, they also apply when designing classes.
Why Loose Coupling Is Important
Why is loose coupling important? The authors of the book, “Accelerate,” which is a must-read book, have found there’s a strong correlation between business success and the performance of the software development organization. They have also found that developers in high performing organizations agree with the following statements: they can “complete their work without communicating and coordinating with people outside their team,” and “make large-scale changes to the design of their system without depending on other teams to make changes in their systems, or creating significant work for other teams.” Being able to work this way requires an architecture that is loosely coupled from a design-time perspective. In other words, loose design-time coupling makes the business more profitable.
Lock-step Change: Adding a COVID Delivery Surcharge
The opposite of loose design-time coupling is tight design-time coupling. Tight design-time coupling is an obstacle to high performance because it causes lock-step changes that require teams to coordinate their work. Let’s look at a simple example. Let’s imagine that the order service has an API endpoint for retrieving an order. The order has four fields: subtotal, tax, service fee, and delivery fee. What’s missing is a field for the order total. Perhaps this endpoint is automatically generated from the database schema, which does not store the order total. As a result, clients such as the accounting service must calculate the order total themselves. Initially, this was not much of a problem since it’s a very simple calculation. However, in March 2020, the organization needed to implement a COVID surcharge to cover the costs of PPE. Since the calculation wasn’t centralized, multiple teams needed to track down and change the multiple places in the code base that calculated the order total. That was a slow and error prone process. It’s a good example of the kind of change that impacts multiple services. To make matters worse, let’s suppose that the required change to the accounting service required a breaking change to its API. This will force the clients of the accounting service to also be changed in lock-step, requiring more meetings for coordination. In the worst case scenario, you can have what’s known as a distributed monolith where many or all of the services are constantly changing in lock-step. It’s an architecture that combines the worst aspects of both architectural styles.
Cross-team Change: Monolith vs. Microservices
This cross-team coordination also occurs in a monolithic architecture, and it’s just as undesirable there. However, in a monolithic architecture, it’s easier to build, test, and deploy changes made by multiple teams. You can simply make the required changes on a branch, and then build, test, and deploy them. In contrast, deploying changes that span multiple services is much more difficult. Because the services are deployed independently, with zero downtime, you can’t simply deploy a breaking change to a service API. First, you must deploy a version of the service that supports both the old and the new versions of the API. Next, you must migrate all of the clients to the newer API. Then, finally, you can remove the old API version. That’s a lot more work than in a monolith. As you can see, coupling at the architecture level results in coupling between teams. This is a great example of Conway’s Law in action. Here’s an interesting little tip: Mel Conway is on Twitter, and has some very interesting things to say.
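The three-step migration described above can be sketched as follows. This is a hypothetical order endpoint; the field names and the version-selection mechanism are invented for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the zero-downtime sequence: the service first ships a version
// that can answer BOTH the old and the new representation (keyed here by a
// requested API version). Once every client has migrated to v2, the v1
// branch can be deleted and the old version removed.
public class OrderEndpoint {
    public Map<String, Object> getOrder(String version) {
        Map<String, Object> body = new LinkedHashMap<>();
        body.put("subtotal", 1000);
        body.put("tax", 80);
        body.put("serviceFee", 50);
        body.put("deliveryFee", 200);
        if ("v2".equals(version)) {
            // New field, added without removing anything the v1 clients use.
            body.put("orderTotal", 1330);
        }
        return body;
    }
}
```

During the migration window, old clients keep requesting v1 and are unaffected, while migrated clients read the new orderTotal field from v2.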
DRY (Don’t Repeat Yourself) Services
There are several techniques that you can use to minimize design-time coupling. The first is to apply a classic design principle, Don’t Repeat Yourself. This principle states that every concept, such as the order total calculator, has a single representation in the application. In other words, there should be one place that calculates the order total. You might be tempted to use the traditional approach of implementing the calculation in a library that’s embedded in multiple services. While using a library for stable utility concepts like money is generally OK, a library that contains changing business logic is insufficiently DRY. That’s because all services must use the same version of the library. When the business logic changes and a new version of the library is released, numerous teams must simultaneously upgrade to that version, which means yet more coordination and collaboration between the teams. In order to properly apply the DRY principle in a microservice architecture, every concept must be represented in a single service. For example, the order service must calculate the order total. Any service that needs to know the order total must query the order service. That’s the DRY principle.
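A minimal sketch of the DRY rule applied to the earlier surcharge example (the class name and the surcharge-as-constructor-parameter design are illustrative assumptions): the order total is computed in exactly one place, inside the order service, so adding the COVID surcharge changes one method instead of every client.

```java
// Single representation of the order-total concept. Clients of the order
// service query this result instead of re-implementing the calculation.
public class OrderTotalCalculator {
    private final long covidSurcharge; // in cents; a hypothetical policy knob

    public OrderTotalCalculator(long covidSurcharge) {
        this.covidSurcharge = covidSurcharge;
    }

    // All amounts in cents. The surcharge is applied here and nowhere else.
    public long total(long subtotal, long tax, long serviceFee, long deliveryFee) {
        return subtotal + tax + serviceFee + deliveryFee + covidSurcharge;
    }
}
```

When the surcharge policy changes again, only the order service's team deploys a change; the accounting service and other clients are untouched.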
Icebergs: Expose As Little As Possible
Another principle that helps achieve loose design-time coupling is the iceberg principle. In the same way that most of an iceberg is below the surface of the water, the surface area of a service API should be much smaller than the implementation. That’s because what’s hidden can easily be changed, whereas what’s exposed via an API is much more difficult to change because of its impact on the service clients. A service API should encapsulate or hide as much of the implementation as possible. A great example of the iceberg principle in action is a simple API such as the Stripe or Twilio API. The Twilio API for SMS lets you send an SMS to subscribers in over 150 countries, yet the API endpoint only has three required parameters: the destination number, the from number, and the message. This incredibly simple API hides all of the complexity of routing the message to the appropriate country. We should strive to apply the same principle to our services. The 1972 paper by Parnas even contained a few words of wisdom. First, list the most important and/or unstable design decisions. Second, design modules, or, in this scenario, services, that encapsulate those decisions.
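A toy sketch of the iceberg shape, modeled loosely on the Twilio SMS example (the carrier names and routing rules here are invented for illustration): the public surface is a single method with three parameters, while everything below the waterline is private and free to change.

```java
// Iceberg-shaped service: one small public method, hidden routing logic.
public class SmsGateway {
    // The entire public API surface: destination, origin, and message body.
    public String send(String to, String from, String body) {
        return routeToCarrier(to) + " accepted message of length " + body.length();
    }

    // Hidden implementation detail: clients never see this, so the routing
    // table can change without forcing any client to change.
    private String routeToCarrier(String to) {
        return to.startsWith("+44") ? "uk-carrier" : "default-carrier";
    }
}
```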
The iceberg principle is concerned with minimizing a service’s surface area. To ensure loose coupling, a service should also consume as little as possible. We should minimize the number of dependencies that a service has, since each one is a potential trigger for changes. Also, a service should consume as little as possible from each dependency. Moreover, it’s important to apply Postel’s Robustness Principle and implement each service in a way that it ignores response and event attributes that it doesn’t need. That’s because if a service selectively deserializes a message or a response, then it’s unaffected by changes to attributes that it doesn’t actually use. One thing to keep in mind, interestingly, is that code-generated deserialization logic typically deserializes all attributes.
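A sketch of this tolerant-reader idea: the order service binds only the attributes it needs from a restaurant event, so a newly added attribute (like the parent menu item ID from the earlier example) is silently ignored. With a library like Jackson you would get a similar effect by disabling failure on unknown properties; here the event payload is modeled as a plain map to keep the sketch self-contained, and all names are illustrative.

```java
import java.util.Map;

// Tolerant reader: deserialize only what this consumer actually uses.
public class MenuItemEventReader {
    // The consumer's own minimal view of a menu item event.
    public record MenuItemView(String id, long price) {}

    public MenuItemView read(Map<String, Object> payload) {
        // Only the fields this service needs are bound; any other attribute
        // in the payload is ignored rather than rejected.
        return new MenuItemView(
                (String) payload.get("menuItemId"),
                ((Number) payload.get("price")).longValue());
    }
}
```

Because the reader never enumerates the full payload, the publisher can add attributes without this consumer changing at all.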
Use a Database-Per-Service
Another key principle that promotes loose coupling is a database per service. For example, let’s imagine that you refactored your monolith to services but left the database unchanged. In this partially refactored architecture, the order service reserves credit by directly accessing the customer table. It seems simple, but this results in tight design-time coupling. If the team that owns the customer service changes the customer table, the order service would need to be changed in lock-step. In order to ensure loose design-time coupling, services must not share tables. Instead, they must only communicate via APIs.
Takeout Burritos – A Case Study in Design-Time Coupling
I now want to discuss an example of design-time coupling that’s motivated by my excessive consumption of takeout food over the past year. I’ve had a lot of time on my hands to study the domain quite thoroughly. We’ll explore how to improve an architecture so that it’s better able to handle evolving requirements. The example application in both of my books is the Food to Go application. It’s a food delivery application like Deliveroo or DoorDash, but unlike those two companies, its fictitious stock has actually increased in value since it IPO’d. Originally, Food to Go had a monolithic architecture, but over time the application’s team grew, and it was migrated to a microservice architecture. Here are some of the key services. The order service is responsible for creating and managing orders. It implements the create order command using the saga pattern. The order service first validates a request to create an order using a CQRS replica of the restaurant information, which is owned by the restaurant service. Next, it responds to the client with the order ID. The order service then finalizes the creation of the order by asynchronously communicating with other services. It invokes the consumer service to verify that the consumer can place orders. Next, it invokes the accounting service to authorize the consumer’s credit card. Finally, it creates a ticket.
I want to focus on the design-time coupling between the order service and the restaurant service. The primary responsibility of the restaurant service is to know information about restaurants. In particular, its API exposes the menus. In this example, the restaurant service publishes events, but the design-time coupling would be the same if it had a REST endpoint. The menu information is used by the order service to validate and price orders. Let’s now explore the impact of changes to the restaurant subdomain. The first change I want to discuss is supporting menu items that come in different sizes. For example, let’s imagine a restaurant that sells chips and salsa in two different sizes, small and large. We can support this requirement by introducing the concept of sub-menu items. A menu item such as chips and salsa can have two sub-menu items, one for each size. We can expose the menu item hierarchy to clients by adding a parent menu item ID to the DTOs and events. This is an additive change and so it’s a non-breaking change. The order service can ignore this attribute and so is unaffected by the change.
Let’s now look at another change which is superficially very similar but has a much bigger impact. Some restaurants have menu items that are configurable. For example, one of my favorite restaurants lets you customize your burrito. There are numerous options including paid add-ons such as roasted chilies and guacamole, so delicious. Adding support for customizable menu items requires numerous changes to the restaurant, order, and kitchen domains. A menu item needs to describe the possible options. It has a base price. A menu item has zero or more menu item option groups, each of which is named and has min and max selection attributes. Each menu item option group has one or more menu item options, and a menu item option has a name and a price. In order to calculate the subtotal, an order line item needs to describe the chosen options. An order line item has zero or more order line item options, which describe the selected options. Similarly, in order for the kitchen to prepare an order, the ticket line item must also describe the selected options. However, it just needs to know the names of the chosen options for each one of the line items. This is an example of a change that has a widespread impact. The teams that own the three affected services would need to spend time planning, implementing, and deploying the changes.
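The configurable-menu-item structure described above can be sketched with a few records. This is a hypothetical model, not the book’s actual schema; the names and fields are illustrative. A menu item has a base price and option groups, and each chosen option adds its price to the line-item subtotal:

```java
import java.util.List;

// Illustrative model of configurable menu items and line-item pricing.
public class ConfigurableMenu {
    public record MenuItemOption(String name, long price) {}
    public record MenuItemOptionGroup(String name, int minSelections, int maxSelections,
                                      List<MenuItemOption> options) {}
    public record MenuItem(String name, long basePrice, List<MenuItemOptionGroup> groups) {}

    // Subtotal for one order line item: base price plus the chosen options'
    // prices. In the talk's improved design, this logic lives in the
    // restaurant service rather than being replicated in the order service.
    public static long lineItemSubtotal(MenuItem item, List<MenuItemOption> chosen) {
        return item.basePrice() + chosen.stream().mapToLong(MenuItemOption::price).sum();
    }
}
```

The ticket, by contrast, would only need the option names, not this whole structure, which is exactly why exposing the full hierarchy to every service is so costly.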
Ideally, it would be nice to architect the application in a way that avoids this scenario. As a general rule, concepts that are hidden can be changed. Therefore, we need an architecture that encapsulates the menu structure within the restaurant service. Let’s look at how to do that. In this example, the order service is coupled to the restaurant service because it uses the menu items, and it stores line items which reference the menu items in order to record the actual order. The order service also uses the menu items to validate the order and calculate the subtotal. We could therefore reduce coupling by moving those responsibilities to the restaurant service. In this new design, the order service is significantly less coupled to the restaurant service. It simply depends upon the concepts of order validation and the calculated subtotal, which are much simpler and much more stable. Perhaps the one downside of this approach is that the restaurant service is now part of the critical path of the ordering flow. Previously, the order service had a replica of the menu items that was maintained with events published by the restaurant service. In a sense, we have reduced design-time coupling, but increased runtime coupling. This is an example of the types of trade-offs that you must make when defining a microservice architecture. We can also decouple the kitchen service from the menu item structure by using API composition. Instead of the tickets storing those line items, the UI can dynamically fetch them from the restaurant service when displaying the ticket.
I want to discuss the design of the saga that coordinates the creation of the order and the ticket. There are a couple of options. The first option is to use a choreography-based saga. The API gateway publishes an order creation requested event. Each service subscribes to that event. The ticket service creates a ticket. The order service creates an order. The restaurant service attempts to validate the order. If it’s successful, it publishes an order validated event containing the order subtotal. If it’s unsuccessful, the restaurant service publishes an order validation failed event. Other services subscribe to those events and react accordingly.
Another option is to use orchestration. The API gateway routes the create order request to an orchestration service. The orchestration service invokes each of the services, starting with the restaurant service, using asynchronous request-response. Orchestration and choreography are roughly equivalent. They differ, however, in some of the details of the coupling. All of the participants in the choreography-based saga depend upon the order creation requested event. In fact, the teams actually need to collaborate to define that type. In contrast, the saga orchestrator depends upon the APIs of the participants. In a given situation, one approach might be better than the other.
Rapid and frequent development requires loose design-time coupling. You must carefully design your services to achieve loose coupling. You can apply the DRY principle. You can design services to be icebergs. You can carefully design service dependencies. Above all, you should avoid sharing database tables.
Questions and Answers
Watt: One question which got a plus one for quite a few things was about your suggestions for solving problems involving asynchronous APIs as an entry point, when the API initiates an asynchronous communication, but then still needs to respond to a synchronous request. Maybe you’d like to elaborate on that a little?
Richardson: Ideally, a synchronous request just initiates things and then the request handler can return immediately, like it’s returned a 201, whatever it is, to say created. If it has to wait until that whole saga has completed, then each instance of the service that handles those requests can have its own private subscription to the events that would indicate the outcome of the operation that was initiated. Like, it could subscribe to order created and order failed events. You can imagine this in a reactive interface, where the synchronous request handler returns a CompletableFuture or whatever reactive abstraction you’re using. Then there’s a HashMap keyed by Request ID, you need some Correlation ID, so that when an event comes back saying that that order was created successfully or unsuccessfully, the event handler can then take the Correlation ID, look up the Mono or CompletableFuture, and complete it, which would then trigger the sending back of the response. It’s a little messy. It’s a little involved. It has the downside that there’s runtime coupling in this architecture. I’ve worked with clients that have just had to do that. One of them even had a SOAP API, of all things, and the thread actually had to block until the message handling had completed.
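The correlation-ID pattern described here can be sketched as follows (a simplified illustration; the class name and String-typed outcome are assumptions). The synchronous handler parks a CompletableFuture keyed by a correlation ID, and the event handler for the saga’s outcome completes it, which releases the pending HTTP response:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Bridges the synchronous request handler and the asynchronous event
// handler via a map of correlation ID to pending future.
public class PendingRequests {
    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    // Called by the synchronous request handler. The returned future is what
    // the reactive web layer turns into the eventual HTTP response.
    public CompletableFuture<String> register(String correlationId) {
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.put(correlationId, future);
        return future;
    }

    // Called by the event handler when an order created/failed event arrives
    // carrying the correlation ID; completing the future sends the response.
    public void complete(String correlationId, String outcome) {
        CompletableFuture<String> future = pending.remove(correlationId);
        if (future != null) {
            future.complete(outcome);
        }
    }
}
```

In a Reactor-based stack the same shape applies with a Mono in place of the CompletableFuture; a production version would also need a timeout so abandoned entries don’t accumulate.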
Watt: Sometimes it’s not always straightforward to do it in the perfect way.
If domain driven design is done properly, and you can identify aggregates, aggregate roots and entities, and shared kernel properties, can design-time coupling be fully addressed?
Richardson: I want to say yes and no, but one of the things is if it’s done properly. If you can do it properly, I think that does address some aspects of design-time coupling. On the other hand, you do have to make decisions about decomposition into services. I think I made this point in the talk that service boundaries are these physical boundaries, because they involve network communication, and so on. There’s this slightly different set of concerns that you have to address, which aren’t totally aligned with just traditional DDD or traditional modularity. I don’t think just doing DDD is sufficient.
Watt: Absolutely. I think there’s many ways to sometimes decide when things need to be microservices in the boundaries, which can be technical as well, not necessarily just domain driven.
With a database per service, what is your suggestion for creating or maintaining a unified data model across the enterprise? Is that even a good thing?
Richardson: I know that’s a really interesting thing, because I think in enterprises, there’s a strong desire to do that. If you look at even just one of the key ideas in domain driven design, it’s of a bounded context and this notion of having multiple models instead of one large, unified, global view of what a customer is. Even just from a DDD perspective, never mind a microservice perspective, a global model is generally not a thing. One way of looking at it is the model exists in terms of the APIs that your services expose. I think, yes, you can have consistency around that. Like, a customer name is represented in a consistent way across all of your APIs, or an address. I think it’s a more distributed way of thinking.
Watt: I think sometimes there’s an under-appreciation for some of the coupling that comes even with the messages when you use asynchronous communication through messages and things like schemas and things like that. Do you have any practical tools and tips that you would use to minimize the impact of when schemas change in messages between services?
Richardson: It is tricky. In the ideal world, your events evolve. The schema of the events evolve in an always backwards compatible way, so that the changes that you make are additive. The events in the life cycle of some domain object could, in theory, change in incompatible ways. That’s rooted in business concepts, which I think have some stability. Part of it is, if they change in incompatible ways, you have to update all of the consumers so that they can handle the old and the new schema. Then once you’ve upgraded them, you can then switch over to publishing events in the new schema. Any major changes like that, I think, involve a certain amount of pain.
Watt: In your talk, you mentioned about an additive change to an API. Can you expand a little bit on what you mean by that?
Richardson: There’s two parts of synchronous, there’s a request and there’s a response. You can add optional attributes to a request. Old clients obviously don’t know about that attribute. They can still send their old request, the server can provide a default value. Then, conversely, in the response, the server can return additional attributes, and the client can be written in a way that it just ignores the ones that it does not understand. There’s extra attributes, but they’re not relevant.
Someone commented about teams creating lots of fine-grained services. My recommendation is to start with one service per team, unless there’s a good reason to have more. Whereas I’ve definitely seen quite a common antipattern, which is one service per developer. Then there are some more extreme examples publicly, but that one service per developer seems quite common. To me, yes, it just seems like you’re creating an excessively fine-grained architecture. I would just simplify things, because there’s a chance that down the line when there’s some change, suddenly, you’ve got to make changes in a lot of places. Or, you just end up ultimately building this overly complex system that you will find is just cognitively overwhelming.