You need to be this tall to go from monolith to microservices: Part 3
Or what we would have liked to know before we started
Fourth axis: Production
We have reached the fourth and final axis of maturity: production. It’s the one that will challenge you the most in your transition from a monolithic architecture to microservices, especially if you skip steps.
In our transition to microservices, we started by deploying our applications on two existing instances (virtual machines) to ensure some resilience. However, one of the goals of a microservice architecture is a system that is resilient to failure, so this situation should only be temporary. To make your platform truly resilient, it’s best to have one VM per service instance (autonomy principle). The other advantage is the ability to scale transparently. In our case, we switched to a more flexible and elastic “private cloud” (with Docker Enterprise).
But this is where production meets its first real difficulty. When an error occurs, connecting to each instance to browse its log files is no longer an option. From the very beginning, you need a solution that lets you explore your logs in a centralized way. The Elastic Stack comes to mind, of course, but it could also be Splunk, Graylog or a SaaS solution like Logmatic.
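Centralized exploration is much easier when every instance emits logs in a structured form that says which service and which instance produced them. Here is a minimal sketch in Python using only the standard library; the field names and the `orders` / `orders-1` values are our own assumptions for illustration, not a format imposed by any of the tools above.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def __init__(self, service, instance):
        super().__init__()
        self.service = service      # hypothetical service name
        self.instance = instance    # hypothetical instance identifier

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "service": self.service,
            "instance": self.instance,
            "message": record.getMessage(),
        }
        # Keep the stack trace in its own field so it stays searchable.
        if record.exc_info:
            payload["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(service, instance, stream=sys.stdout):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(f"{service}.{instance}")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service, instance))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("orders", "orders-1")
    log.error("payment %s failed", "p-42")
```

A log shipper can then forward these lines as they are, and you can filter on `service`, `instance` or `level` without parsing free text.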
You should also start collecting metrics from your instances very early: CPU, memory, network and file system load, API response times or, for a Java application, garbage collection frequency. To do this, you can use agents (Metricbeat, Collectd, Telegraf…) or have the application send the metrics itself (Spring Boot Actuator makes this easy). These metrics can represent a large volume (especially if you want them in real time), and it may be worth storing them in a purpose-built time-series database (Prometheus, InfluxDB…).
Now you can update your dashboards (you already had some, didn’t you?) to monitor your system in real time. You can install a wide screen in the open space, taking care to display only the essential information (number of errors, connections, or the health of your builds and deployments). Grafana is well suited to this and lets you combine multiple data sources (e.g. Elasticsearch and Prometheus).
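To give an idea of what “sending the metrics directly by the application” can look like, here is a small Python sketch that records response times and garbage collection counts in memory. It is only an illustration under our own assumptions: `RESPONSE_TIMES`, `timed`, `snapshot` and `get_user` are hypothetical names, and a real application would hand the snapshot to an agent or a time-series database rather than keep it in a dict.

```python
import gc
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory store: endpoint name -> list of durations in seconds.
RESPONSE_TIMES = defaultdict(list)


def timed(endpoint):
    """Decorator that records how long each call to an endpoint takes."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even when the call raises.
                RESPONSE_TIMES[endpoint].append(time.perf_counter() - start)
        return wrapper
    return decorator


def snapshot():
    """Return the current metrics as a plain dict, ready to be shipped."""
    return {
        "timestamp": time.time(),
        # Total number of collections across all GC generations.
        "gc_collections": sum(gen["collections"] for gen in gc.get_stats()),
        "response_times": {
            name: {"count": len(values), "max_seconds": max(values)}
            for name, values in RESPONSE_TIMES.items()
            if values
        },
    }


@timed("get_user")
def get_user(user_id):
    """Hypothetical endpoint used only to demonstrate the decorator."""
    return {"id": user_id}


if __name__ == "__main__":
    get_user(7)
    print(snapshot())
```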
Alert, we have an error in prod!
Dashboards are a good way to track the status of your system, as long as you have them in front of you at all times! Alerts are therefore the better option. They can be delivered by e-mail, on your Slack channels or even by SMS for the most urgent ones. Tools such as ElastAlert or Watcher connect to Elasticsearch and let you define different alerts. You can, for example, define an alert:
- in case of error
- if the frequency of an event drops significantly. For example, if the number of connections per minute falls well below normal, your users may be having trouble connecting, which points to a problem on your platform
- if the frequency of an event increases sharply (spike). For example, if the number of GCs per minute increases significantly, your application may have a memory leak and will eventually run out of memory
- if a value changes between two events, such as an e-mail address change, to check that it is not a hacking attempt
So instead of you going to look for the information, the information comes to you.
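The drop and spike rules from the list above boil down to comparing the current frequency of an event with a recent baseline. Here is a rough Python sketch of that logic; the `classify_rate` function and its default thresholds are assumptions for illustration and do not reproduce how ElastAlert or Watcher are implemented.

```python
from statistics import mean


def classify_rate(history, current, drop_ratio=0.5, spike_ratio=2.0):
    """Compare the current per-minute count with the recent average.

    history: per-minute counts observed recently (the baseline window).
    current: count observed during the last minute.
    Returns "drop", "spike" or "ok".
    """
    if not history:
        return "ok"  # no baseline yet, nothing to compare with
    baseline = mean(history)
    if baseline == 0:
        # Any activity after total silence is treated as a spike here.
        return "spike" if current > 0 else "ok"
    ratio = current / baseline
    if ratio <= drop_ratio:
        return "drop"
    if ratio >= spike_ratio:
        return "spike"
    return "ok"


if __name__ == "__main__":
    # Connections per minute fall well below the usual level.
    print(classify_rate([100, 110, 90], 30))
    # GCs per minute triple compared with the usual level.
    print(classify_rate([4, 5, 6], 15))
```

The thresholds are the hard part in practice: too tight and you drown in alerts, too loose and you miss the incident.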
These alerts are especially useful for error logs and drastically improve the quality of your application. Nevertheless, alerting can be quite difficult to introduce on an existing application, because you quickly drown in alerts. Many of them are what can sarcastically be called “errors of good functioning”: errors that no one cares about because in the end “it works”. They will eventually have to be dealt with, either by fixing them (they are often signs of a real problem) or by lowering their criticality level. It is also the time to ask yourself real questions about the quality of your logs and their levels.
Examples of questions to ask:
- are your logs understandable for everyone?
- do they carry enough information for analysis, such as the exception stack trace?
- is this trace useful or does it affect readability?
- is this event an error or a warning? For example, if a user is not found in your system, is it a real error that requires an investigation and therefore an alert? Or is it a perfectly normal situation that can be logged as a warning, so that unsuccessful login attempts can still be tracked?
Improving the observability of your platform requires constant investment, just like code quality!
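To make the “error or warning?” question concrete, here is one possible answer sketched in Python. The `find_user` function and its `lookup` parameter are hypothetical; the point is only that an unknown user is logged as a warning without a stack trace, while an unreachable user store is logged as an error with the trace, which is the kind of event an alert should fire on.

```python
import logging

logger = logging.getLogger("auth")


def find_user(username, lookup):
    """Look up a user and choose the log level according to the cause.

    lookup is a hypothetical callable that returns a user record, returns
    None for an unknown user, or raises ConnectionError when the user
    store cannot be reached.
    """
    try:
        user = lookup(username)
    except ConnectionError:
        # Technical failure: worth an investigation, so ERROR with the trace.
        logger.exception("user store unreachable while looking up %s", username)
        raise
    if user is None:
        # Expected business case: WARNING, no stack trace, no alert,
        # but failed login attempts remain traceable.
        logger.warning("login attempt for unknown user %s", username)
    return user
```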
Schrödinger’s Distributed System
Another important element of production is the “healthcheck” strategy, especially in a “Container as a Service” environment. With a monolith, you easily know whether your application is alive or not. In a distributed system, it is more complex. Since each instance is autonomous, the system must be able to determine the status of each instance automatically. You can’t afford to wait for a user to notify you that the application has stopped working. Thanks to healthchecks, a service orchestration system can automatically restart an application in poor health.
In the Spring Boot ecosystem, the Actuator library lets you easily set up a /health endpoint that exposes the health status of your application. Health status here covers the microservice itself as well as its dependencies, such as its database or its message broker. A good healthcheck is therefore transitive, and that is where the difficulty lies. With a service orchestrator, restarting a microservice that has lost its database will not fix the problem… You will therefore have to adapt your recovery strategy to the actual cause of the malfunction or, if necessary, intervene manually.
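The difference between “the service itself is down” and “one of its dependencies is down” can be made explicit in the healthcheck result. The following Python sketch illustrates the idea only; it is not how Actuator is implemented, and the `check_health` function, the `action` field and its values are our own assumptions.

```python
def check_health(self_check, dependency_checks):
    """Aggregate the service's own check with those of its dependencies.

    self_check: callable returning True when the service itself is healthy.
    dependency_checks: dict mapping a dependency name to such a callable.
    A check that raises is treated as a failed check.
    """
    def run(check):
        try:
            return "UP" if check() else "DOWN"
        except Exception:
            return "DOWN"

    own = run(self_check)
    deps = {name: run(check) for name, check in dependency_checks.items()}
    failed = sorted(name for name, status in deps.items() if status == "DOWN")

    if own == "DOWN":
        action = "restart"    # restarting the instance may actually help
    elif failed:
        action = "escalate"   # a restart will not bring the database back
    else:
        action = "none"

    return {
        "status": "UP" if own == "UP" and not failed else "DOWN",
        "self": own,
        "dependencies": deps,
        "failed_dependencies": failed,
        "action": action,
    }


if __name__ == "__main__":
    def database_check():
        raise ConnectionError("database unreachable")

    print(check_health(lambda: True, {"database": database_check}))
```

With a result like this, an orchestrator (or the people on call) can restart only when a restart has a chance of helping.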
We can see that running microservices in production is much more demanding than running a typical monolith. Here again, a real DevOps culture is needed to avoid being overwhelmed. If you are in a company that still separates production and dev, the production team will quickly come to hate this microservice approach and all the trouble it causes them. Clearly, collaboration will have to be at its maximum, and you may even make the devs directly responsible for production (the famous “You build it, you run it” by Amazon). In that case, the production team refocuses on putting new deployment, monitoring and even testing tools in place, including practices such as “chaos engineering”. This practice aims to stress a distributed system by simulating possible failures, ranging from a server shutdown to a data center shutdown. This type of test is intended to build confidence in the system’s ability to withstand harsh production conditions.
In this article, we have identified many of the concepts and practices we believe are necessary to effectively manage a microservices architecture. Even if most of these elements (such as tests, continuous delivery or monitoring) are relevant for a monolith, the complexity and the energy needed to maintain the whole platform take on a completely different dimension.
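At a very small scale, the idea behind chaos engineering can be illustrated by injecting failures into calls to a dependency and observing how the caller copes. This is a toy Python sketch under our own assumptions (`with_chaos` and `InjectedFailure` are hypothetical names); real chaos experiments act on servers, networks or whole data centers rather than on function calls.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of the real call to simulate a failing dependency."""


def with_chaos(func, failure_rate, rng=None):
    """Wrap func so that a fraction of calls fails on purpose.

    failure_rate: probability between 0.0 and 1.0 that a call fails.
    rng: optional random.Random instance, useful for reproducible runs.
    """
    if rng is None:
        rng = random.Random()

    def wrapper(*args, **kwargs):
        # rng.random() returns a float in [0.0, 1.0).
        if rng.random() < failure_rate:
            raise InjectedFailure(f"simulated failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_price(product_id):
    """Hypothetical call to a remote pricing service."""
    return {"product_id": product_id, "price": 10}


if __name__ == "__main__":
    flaky_fetch = with_chaos(fetch_price, failure_rate=0.3)
    for product_id in range(5):
        try:
            print(flaky_fetch(product_id))
        except InjectedFailure as error:
            print("caller must cope with:", error)
```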
So do we have to do microservices? If we follow the trend then “yes”, and some will even say that you kill kittens if you don’t! (see this forgotten tweet by Josh Long, Spring Cloud Evangelist)
On our side, we are more moderate. We agree with the opinions of Sam Newman, Martin Fowler and Simon Brown. While the microservice architecture offers great advantages, particularly in terms of agility, scalability, learning and the distribution of development work, it seems more reasonable to us to start with a monolith that has a very clean architecture and allows services to be extracted quickly and independently (the emerging term is “modular monolith”). Microservices require a lot of techniques, practices and communication, as well as an organization that is up to the task. As Martin Fowler says, “you must be this tall to do microservices”.
Moreover, it is much riskier to start a microservice architecture from scratch than from an existing monolith.
This architecture will probably continue to play a major part in our daily work, and we hope that the advice we have given will help you handle it more easily.
Our opinion on this type of architecture is that it is not easy to write a distributed system and it is probably preferable to start with a modular monolith in order to better understand your business domains and technical needs. Setting up microservices can create many more problems for you than it solves. And let’s be honest, few business contexts can really justify the use of such an architecture.
The previous parts are here: