A small example: Company XYZ understood the benefits of microservices, automated testing, containers and CI/CD and implemented them. They decided to split their software monolith into Module A and Module B, which run as independent services. They also added a Module C with some new, independent functions, likewise running as its own service. These modules were packaged into containers and deployed on Kubernetes. When they started to run these modules in production, they noticed the first problems:
- Response times were inconsistent: some services occasionally took a very long time to process requests.
- Some services stopped responding under load.
- If Service B could not answer requests, Service A stopped answering as well, but only sometimes and only for certain requests.
- During deployment, bugs crept in that the automated tests had not caught. The deployments were carried out as blue/green deployments: the new version is started in completely separate containers (“green”), and once those containers are up, all traffic is switched over from the old (“blue”) containers to the new ones. XYZ had hoped that blue/green deployments would reduce the risk of deployments, but in practice they smelled like “big-bang” releases again, which is exactly what the company had wanted to avoid. At least that’s what the consultants said 😉
- The teams of Services A, B and C each took a completely different approach to security. Team A wanted secure connections with certificates and private keys. Team B favored its own implementation with servlet interceptors. Team C did not take any security measures at all because these were internal services behind a firewall.
XYZ got nervous and started to blame the consultants who had promised them a brave new world. But these problems are not specific to XYZ; they stem from challenges inherent in service-oriented systems:
- Preventing errors from spreading across isolation boundaries
- Building applications that can cope with “change” in their environment
- Building applications that still work when parts of the system have failed
- Understanding what is happening in an overall system that is subject to constant change
- Being able to change the runtime behavior of systems
- Implementing strong security in an environment where the attack surface is constantly growing
- Reducing the risk of changes to the system
- Enforcing policies about who or what may use the components in the system, and when
These are the main problems of building service-oriented architectures on cloud infrastructure. These challenges existed in the pre-cloud era, but the many moving parts of cloud infrastructures make the problems worse.
Cloud infrastructure is not reliable
Even though we as cloud users don’t see the actual hardware, cloud environments consist of a huge number of moving hardware and software parts. These components are virtualized compute, storage and network resources that we can provision via a self-service API. And each of these components can and will fail. In the past, we focused on making the infrastructure fail-safe and built our applications on top of it, relying on assumptions about its availability and reliability. In the cloud: no way! That doesn’t work. Components can and will fail, all the time. So what is the solution?
Let’s say Service A calls Service B and due to some problem Service B does not respond. This could be an infrastructure problem or something completely different. Service A cannot know what the problem is. Service A can try the following things:
- Client-side load balancing – the client knows several endpoints of a service and spreads its calls across them
- Service Discovery – a mechanism for regularly finding the endpoints that are currently working
- Circuit Breaking – stop sending requests to a service for a while if it has problems
- Bulkheading – limit the resources (for example, concurrent requests) spent on one service so that its problems cannot exhaust the caller
- Timeouts – time limits for connections, threads, sessions
- Retries – repeat a request if it failed
- Retry Budgets – limits on retries, e.g. 10 retries per minute
- Deadlines – pass along a context that specifies by when an answer must arrive to still be useful
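To make three of these mechanisms more concrete, here is a minimal Python sketch that combines retries, a retry budget and a deadline. All names (`RetryBudget`, `call_with_retries`, `attempt_timeout`) are hypothetical and chosen for illustration only; the assumption is that `request` is a callable that takes a timeout in seconds and raises `ConnectionError` or `TimeoutError` when the remote service cannot be reached.

```python
import time


class RetryBudget:
    """Allows at most `max_retries` retries within a sliding window of seconds."""

    def __init__(self, max_retries=10, window=60.0, clock=time.monotonic):
        self.max_retries = max_retries
        self.window = window
        self.clock = clock
        self._spent = []  # timestamps of retries already granted

    def try_spend(self):
        now = self.clock()
        # Forget retries that have dropped out of the window.
        self._spent = [t for t in self._spent if now - t < self.window]
        if len(self._spent) >= self.max_retries:
            return False
        self._spent.append(now)
        return True


def call_with_retries(request, budget, deadline, attempt_timeout=1.0,
                      clock=time.monotonic):
    """Call request(timeout) until it succeeds, the deadline has passed,
    or the retry budget is exhausted."""
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("deadline exceeded")
        try:
            # Never wait longer than the time left before the deadline.
            return request(min(attempt_timeout, remaining))
        except (ConnectionError, TimeoutError):
            if not budget.try_spend():
                raise  # budget exhausted: give up instead of adding load
```

The sketch retries immediately and shares one budget object between all calls to the same service; a real client would typically also wait between attempts, which is left out here to keep the example short.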
These mechanisms have a lot in common with those of a packet network such as TCP/IP, except that they are meant to take effect at the “message” level rather than at the packet level.
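Circuit breaking, for instance, can be sketched at the message level in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation of any particular library: the class name, the `failure_threshold` and `reset_timeout` parameters and the single trial call after the timeout are all choices made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self.clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, call rejected")
            # Reset timeout elapsed: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if (self._opened_at is not None
                    or self._failures >= self.failure_threshold):
                self._opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Service A would wrap every call to Service B in `breaker.call(...)`. While the circuit is open, Service A fails fast instead of tying up its own threads waiting for a service that is known to be in trouble.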
Understanding in real time
We also want to understand, in real time, what is going on in our service landscape: Which service is talking to which? What is a typical average response time for Service B? We need logs, metrics and traces to understand how services interact with each other.
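As a rough illustration of the metrics part only (logs and traces are not covered), the following Python sketch records response times per caller and callee and answers the two questions above. The class and method names are hypothetical, and everything is kept in memory; a real setup would export such measurements to a monitoring system.

```python
import time
from collections import defaultdict
from statistics import mean


class CallMetrics:
    """Collects response times per (caller, callee) pair."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self._durations = defaultdict(list)

    def timed_call(self, caller, callee, func, *args, **kwargs):
        start = self.clock()
        try:
            return func(*args, **kwargs)
        finally:
            # Record the duration whether the call succeeded or failed.
            self._durations[(caller, callee)].append(self.clock() - start)

    def average_response_time(self, callee):
        """Average over all calls made to `callee`, or None without data."""
        samples = [d for (_, target), ds in self._durations.items()
                   if target == callee for d in ds]
        return mean(samples) if samples else None

    def edges(self):
        """Which service talks to which: the observed (caller, callee) pairs."""
        return sorted(self._durations)
```

Note that every service has to route its calls through `timed_call` for the picture to be complete, which is exactly the kind of per-application effort discussed in the next section.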
Application libraries
The first companies to run their applications in the cloud were large Internet companies. They invested a lot of time and money in building software libraries that meet the above-mentioned requirements for services in the cloud. For example, Netflix built:
- Hystrix – Circuit Breaker
- Ribbon – Client Load Balancing
- Eureka – Service Discovery
- Zuul – Edge Proxy
These libraries are used in many Java-based applications in the cloud, but they have some disadvantages. They are usually not language-neutral: the libraries created by Netflix can only be used by Java applications, and they are not trivial to implement as a whole. And the word says it all: implement. You have to put the requirements into code using these libraries.
The idea now is to move these concerns to the infrastructure level: away from the developers and towards the DevOps colleagues, who configure them declaratively for the applications. And that brings us to the topic of service mesh. Implementations here are Istio and Linkerd. All of the problems mentioned can be addressed with a service mesh, transparently to the applications. The developer does not have to worry about these issues and can concentrate on business logic. Service meshes certainly deserve their own articles.
This brings this series of articles on digital transformation to an end on a technical level. If you need help with implementation or advice on the technologies and procedures mentioned, or on the introduction of “DevOps” in your company, please contact us.