If a microservice in the cloud stops responding,
and no logs were generated,
did it fail at all?
Suppose I have two applications: one a “classical” monolith, called so even though it was built using a multi-tiered architecture, the other built using a microservices architecture. Furthermore, let’s keep it simple and assume the functional requirements are the same, so they both provide the same functionality to the organisation and its users. What can we say about how difficult they are to build and run? How do they compare? Why would I prefer one over the other?
The monolith looks like the easy winner in build effort if I were starting from scratch, having the advantage of a simple infrastructure and application design. Hokay, now let’s add in a sense of reality: I have performance requirements, splitting the users into internal (employees), external (customers), and IT staff (“keep everything running”). So the infrastructure needs a bit more thought: probably a separate database server and web-application server. Maybe a separate web-application server for the internal users, so their work won’t inconvenience the customers and vice versa, and a common back-office system to tie it all together. Security is a thing as well, so set up a secure directory server to manage accounts, a DMZ around the front-end to secure the back-end further, some DDoS protection; you know the drill.
We’re hitting all the important points, so let’s take a look at the alternative: we open an account with one of the major cloud providers, select the appropriate Lego blocks from the catalogue, and start assembling components. The funny thing is, this sounds easier, because we don’t have a lot of the infrastructure hassles.
Fast-forward to the working application, and we’re ready to throw a wrench into the system. A monkey wrench? Yeah, a Chaos Monkey wrench. Customers start calling to say the site isn’t working. Now what?
Sometimes, size matters
The attractive thing about monoliths is that everything is within reach. If I implement an order flow, I can take the order from the customer’s input and pass it directly into the appropriate employee’s inbox. If I want to keep track, I store it in “the” database with a status, and everyone can immediately see what the progress is. However, if the number of customers grows, especially the number of customers shopping around at the same time, I suddenly have a problem: the amount of work I do per order starts to slow new orders down. Or worse, it overloads parts of the system and starts causing errors. Now I have unhappy customers, lost orders, and lots of manual work trying to figure out how to correct things.
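That in-process directness is the monolith’s charm, and it can be sketched in a few lines. This is a deliberately naive illustration; the dictionaries, function, and field names are all hypothetical stand-ins for a real database and inbox.

```python
# Minimal sketch of an in-process "monolith" order flow.
# All names here are hypothetical illustrations, not a real API.

orders = {}          # stands in for "the" database table
employee_inbox = []  # the appropriate employee's inbox

def place_order(customer_id: str, items: list) -> int:
    """Take the customer's input and pass it straight to an employee."""
    order_id = len(orders) + 1
    orders[order_id] = {
        "customer": customer_id,
        "items": items,
        "status": "received",   # everyone can immediately see the progress
    }
    employee_inbox.append(order_id)  # a direct, same-process call
    return order_id

order_id = place_order("c-42", ["widget"])
```

Nothing here can get lost in transit: placing the order and notifying the employee either both happen or the whole call fails, which is exactly the simplicity that erodes once the load grows.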
OK, bigger server to deal with more customers, right? But also more support for dealing with those errors, and maybe try to split up the ordering process so the customer doesn’t have to wait for all of the work we tried to fit in there. OK, we’re progressing into the Service-Oriented Architecture era here: rework and think about how the business processes actually work, and turn them into services that can be called. We also need to do something about waiting for an overloaded service, so let’s use an asynchronous message bus so we can “fire and forget”. A nagging worry now is that our IT support staff needs to grow as well, because of all the “moving parts” in our architecture. But still, manageable.
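The “fire and forget” idea can be sketched with an in-memory queue standing in for a real message broker (RabbitMQ, Kafka, and so on); the message contents and worker logic are hypothetical.

```python
# Fire-and-forget over an asynchronous message bus, sketched with an
# in-memory queue.Queue standing in for a real broker.
import queue
import threading

bus = queue.Queue()
processed = []

def worker() -> None:
    """Consumer side: drains the bus at its own pace."""
    while True:
        msg = bus.get()
        if msg is None:          # shutdown sentinel for this sketch
            break
        processed.append(msg)    # the slow back-office work happens here
        bus.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()

# The producer "fires and forgets": it enqueues and returns immediately,
# even if the consumer is overloaded or momentarily behind.
bus.put({"order_id": 1, "action": "reserve-stock"})
bus.put(None)
t.join()
```

The producer never waits on the overloaded service; it only waits on the (fast) enqueue. The flip side, of course, is that the producer no longer learns whether the work succeeded, which is where the error-hunting below begins.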
Errors playing hide-and-seek
If the problem is in our cloud-native solution, however, we have another hunt altogether, because first we must find the part that is failing. Because the front-end makes calls to several other services, and maybe even collects data from an asynchronous source, it isn’t immediately clear what the cause of our problem is. Sure, all our services produce logs, but we don’t want to hunt through tens (or hundreds; think positive) of logs to find the one error message at the root. Even if we were smart enough to gather all logs in a central location, a root-cause analysis can be a complicated hunt.
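A common remedy for that hunt is to tag every log line with a correlation ID that travels with the request from service to service, so all lines for one failing request can be found with a single search. A minimal sketch, using Python’s standard `logging` module (the handler function and header-forwarding step are hypothetical):

```python
# Correlation IDs: tag every log line with an ID that follows the request.
import logging
import uuid

# Include the correlation ID in every formatted log line.
logging.basicConfig(format="%(levelname)s [%(corr_id)s] %(message)s")
log = logging.getLogger("orders")

def handle_request(corr_id: str = "") -> str:
    """Hypothetical request handler; generates an ID if none was passed in."""
    corr_id = corr_id or str(uuid.uuid4())
    extra = {"corr_id": corr_id}
    log.warning("calling inventory service", extra=extra)
    # In a real service, the same corr_id would be forwarded in the
    # outgoing request headers, so downstream services log under it too.
    return corr_id

cid = handle_request()
```

With this in place, centralised logs stop being a pile of disconnected lines: one grep for the ID reconstructs the request’s path through the system.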
Next, when we identify the culprit, we need to think up a strategy for solving it that takes the distributed nature of our application into account. It may very well be that we need a combination of fixes across several services. What soon becomes clear is that “Breaking up the Monolith”, as Martin Fowler calls it, is only a small part of the difference. As we increase the number of “moving parts” in our application, the number of ways they can combine to cause the most magnificent of failures is huge. We have also replaced basic concepts like “call a procedure” or “store data in the database” with synchronous or asynchronous communication, where a call can fail without the called service ever being involved, a possibility that is not easily ignored. Now look at the mess I got you into!
Errors, faults, and failures
So basically, what I am arguing here is that microservices are hard because we’re confronted with a higher likelihood of errors resulting in failures. Whereas we’ve learned to selectively ignore error-return values, swallow exceptions, and trust the values passed to us, we now run in an unforgiving environment where such habits can cause spectacular (or worse: barely noticed) blow-ups. Yes, some of those failures are transient because they happen in the added communication layers, but we can no longer pass the buck upwards without making it an explicit action. In a monolith, you can sometimes just ignore an error, knowing it will be safely caught further down the process. Worse, we sometimes accept that manual intervention is needed to repair the situation, arguing that making the application deal with the problem correctly is more expensive.
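The exception-swallowing habit, and its explicit alternative, look like this in miniature (the catalogue lookup is a hypothetical example, not code from any real system):

```python
# Two ways to handle a missing price: swallow the error, or make it explicit.

def lookup_price_swallowing(catalogue: dict, sku: str) -> float:
    """The habit we learned in the monolith: eat the exception."""
    try:
        return catalogue[sku]
    except KeyError:
        return 0.0   # silently wrong: a "free" product slips into the flow

def lookup_price_explicit(catalogue: dict, sku: str) -> float:
    """Make the fault explicit, so the caller must decide what to do."""
    if sku not in catalogue:
        raise LookupError(f"unknown SKU {sku!r}: refusing to guess a price")
    return catalogue[sku]
```

In a monolith the first version produces an order someone can manually fix later; in a distributed flow, that bogus value may already have been forwarded to three other services before anyone notices.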
What we need to deal with this is to become ruthlessly fault-tolerant: deal with failures consistently, building resilience in from the start. Something we should have done in the first place, but were coddled (or “bludgeoned”? Some project managers can be pretty blunt about their priorities) into ignoring. If we transition to microservices without first adopting this approach to fault tolerance and resilience, we’re royally screwed.
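One small building block of that ruthlessness is retrying transient failures with exponential backoff instead of giving up on the first network blip, and failing loudly once the retries are exhausted. A minimal sketch, with a hypothetical flaky call for demonstration:

```python
# Retry with exponential backoff: one building block of fault tolerance.
import time

def call_with_retry(fn, attempts: int = 3, base_delay: float = 0.01):
    """Retry fn() on exception, doubling the delay after each attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise                      # exhausted: fail loudly, not silently
            time.sleep(base_delay * (2 ** attempt))

# A hypothetical remote call that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network blip")
    return "ok"

result = call_with_retry(flaky)
```

Retries alone are not resilience, of course; in production you would cap them and pair them with timeouts and a circuit breaker, so a struggling service isn’t hammered into a full outage.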
I know the “usual” cautions against microservices are more about organisation and process, but those are only the short-term enablers. Long-term success with microservices architectures requires a thorough understanding of software quality, and a serious dedication to its pursuit.