Part 5 – Spinnaker for Continuous Delivery and Automated Canary Analysis
In the previous articles in this series, we discussed the advantages of adopting a microservices architecture, how to develop and deploy microservice-based applications using Docker and Kubernetes, how we created our Kubernetes clusters on AWS, and why and how we implemented the Istio service mesh.
As we discussed in the first article, one of the major advantages of adopting a microservices architecture is the ability to perform faster code deployments. This is possible because individual microservices have well-defined boundaries, and modifying the internals of a single microservice will not, in most cases, affect outside dependencies. This increases release velocity and adds a lot of agility to the engineering team.
Yahoo Small Business’s existing deployment infrastructure used a combination of Jenkins for continuous integration and Chef for the actual deployments to different environments. Continuous Integration, or CI, is the process that facilitates the frequent integration of code from multiple developers into one project using automation. This process ensures that frequent code commits land in the testing environment for quality checking and that continuous feedback is provided to developers about the quality of the software being developed. In our case, Jenkins was working well for this part of the software release process. But for continuous delivery – the ability to get changes of all types into production safely, quickly and in a sustainable way – the Chef recipes were not ideal.
Chef is a useful configuration management tool and is well suited to configuring servers. It follows a “mutable infrastructure” paradigm, in which it changes the state of a system after it is provisioned: the system is modified in place to deploy a new version of the software or to make a configuration change. But as we mentioned in the second part of our series, in the microservices and Kubernetes world we wanted an “immutable software delivery” paradigm, in which we use pre-baked infrastructure components when we want to make changes in our environment. In Kubernetes, this is done by creating new pods that house containers with the latest changes and switching traffic over to them before destroying the old pods.
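In Kubernetes terms, this immutable pattern is what a standard Deployment rolling update gives us. The sketch below is illustrative only – the service name and image are hypothetical – but it shows the mechanism: publishing a new pre-baked image and updating the `image` field causes Kubernetes to create new pods and retire the old ones, rather than mutating anything in place.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myservice              # hypothetical service name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0        # keep full capacity while new pods come up
      maxSurge: 1              # bring up one new pod at a time
  selector:
    matchLabels:
      app: myservice
  template:
    metadata:
      labels:
        app: myservice
    spec:
      containers:
      - name: myservice
        image: registry.example.com/myservice:v2   # new pre-baked image
```

Changing `image` to a new tag triggers the rollout; the old pods are destroyed only after their replacements report healthy.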
The Chef-based deployment system also did not provide an intuitive user interface for tracking the status of deployments. Chef also requires custom scripting, which had to be done by our developers, and it has no native Kubernetes support.
Another requirement for high-velocity deployments is safety. Even with extensive testing in our QA and stage environments, a new version of the code can contain bugs that are only exposed when it is deployed in the production environment and serving real traffic. To catch these kinds of issues, the industry is rapidly adopting a deployment methodology called canary deployment. As mentioned in the previous article, the word canary comes from the practice of using canary birds in coal mines. Coal miners brought canaries into the mines as an early-warning signal for toxic gases, primarily carbon monoxide. The birds, being more sensitive, would become sick before the miners, who would then have a chance to escape or put on protective respirators.
In the software deployment world, a canary release is a technique intended to reduce the risk of introducing a new software version into production by slowly rolling out the change to a small subset of users, with checkpoints along the way. This method allows us to examine how the change is performing before rolling it out to the entire user base. The quality of the canary version is assessed by comparing key metrics that describe the behavior of the old and new versions. If there is a significant degradation in these metrics, the canary is aborted and all of the traffic is routed to the stable version in an effort to minimize the impact of unexpected behavior.
Istio – the service mesh – provides a way to perform these gradual traffic shifts using its traffic routing capabilities. We can create Istio virtual services and destination rules with traffic percentages that send partial traffic to the new software version. But this requires manually modifying these parameters during the different stages of the deployment process. We wanted an external deployment tool to do this for us in an automated way.
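As a concrete illustration, a weighted Istio routing configuration might look like the following (the service and subset names are hypothetical). Shifting traffic during a rollout means editing the `weight` fields – which is exactly the manual step we wanted a deployment tool to automate.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: myservice
spec:
  hosts:
  - myservice
  http:
  - route:
    - destination:
        host: myservice
        subset: stable
      weight: 95               # most traffic stays on the current version
    - destination:
        host: myservice
        subset: canary
      weight: 5                # small slice goes to the new version
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: myservice
spec:
  host: myservice
  subsets:
  - name: stable
    labels:
      version: v1
  - name: canary
    labels:
      version: v2
```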
We started investigating the ideal solution to accomplish these requirements and came across a tool called Spinnaker.
Spinnaker is a continuous delivery platform developed by Netflix and open sourced in 2015. Netflix uses Spinnaker to perform thousands of deployments a day in its AWS infrastructure. Later, Google joined the open source project and started contributing to and using the tool, with a main focus on providing first-class support for Kubernetes in Spinnaker. There is now a very active open source community around the Spinnaker project.
Spinnaker provides the ability to deploy to multiple types of cloud environments. Recent versions provide native integration with Kubernetes – we can manage Kubernetes objects like pods and services directly from Spinnaker. It comes with many deployment best practices built in, since it is battle tested in production at many big companies. Spinnaker also allows us to automate the deployment pipeline end to end without any additional scripting.
Another major advantage of Spinnaker is its in-built support for multiple deployment methods including canary deployments. By integrating Spinnaker with Istio and any supported external monitoring systems, we can do automated canary analysis to make sure the new version of our microservice is safe before serving it to 100% of our users.
Spinnaker is itself built as a combination of multiple independent microservices. A microservice called Deck provides the browser-based user interface for Spinnaker. Another service, Gate, acts as an API gateway through which the UI and other external services talk to the rest of the Spinnaker backend microservices. Authorization to the Spinnaker cluster is handled by a service named Fiat. The Echo service sends notifications to communication channels like Slack or email. Integration with cloud infrastructures like Kubernetes or AWS is handled by the Clouddriver service, and integration with Jenkins is done through the Igor service. Orca is the orchestration engine; it stores pipeline details in a backend store such as Redis and distributes the workload evenly through a queue. Another major component, and one that is very relevant for us, is the Kayenta service, which provides automated canary analysis.
For the Spinnaker installation in our environment, we used the Helm template engine. We had already set up a management Kubernetes cluster, separate from our QA, stage and production clusters, so we set up the Spinnaker ecosystem there. We did some customization around SAML authentication and Jenkins access for triggering smoke jobs. We also configured it to talk to our non-production and production Kubernetes clusters and made sure it pulled Docker images from our internal Docker registry.
For the continuous integration part, we continued to use the existing Jenkins pipeline: a code commit from our developers triggers the pipeline, which compiles the code, executes the unit tests, runs the SonarQube quality checks and publishes the Docker image to our local Docker registry. The Spinnaker pipeline is designed to poll the Docker registry at short intervals; as soon as a new image appears in the registry, the pipeline is triggered. The pipeline deploys the new image to the QA cluster and triggers the relevant QA tests to verify that the image is good.
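In Spinnaker's pipeline JSON, this kind of registry polling is expressed as a Docker trigger. The fragment below is a minimal sketch – the account, organization and repository names are hypothetical, and the exact fields can vary by Spinnaker version – but it conveys the shape of the configuration:

```json
{
  "triggers": [
    {
      "enabled": true,
      "type": "docker",
      "account": "ysb-internal-registry",
      "organization": "ysb",
      "repository": "ysb/myservice",
      "tag": ".*"
    }
  ]
}
```

When a new tag matching the pattern is pushed, Spinnaker starts the pipeline with that image as its artifact.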
To allow the smoke job to test the new image, the Istio virtual service is configured to route 100% of the traffic to the new version. Once the QA tests pass, the same Docker image is tested in the stage environment before it gets to production. There are environment-specific smoke tests in each of these environments.
During deployment to the production environment, the automated canary analysis is activated. This is done by combining the efforts of the Spinnaker pipeline, Kayenta, Istio and a metrics-based monitoring tool called Prometheus. When we deploy a new version of a microservice into production, Spinnaker spins up one Kubernetes pod running the new version, which is treated as the canary. Spinnaker also spins up an additional pod containing the current production version, which is treated as the baseline – this acts as the reference point for comparing metrics against the canary. Spinning up a fresh baseline pod gives us an apples-to-apples comparison.

Once both pods are up, Spinnaker starts sending a small percentage of traffic to each of them using Istio traffic routing. The Prometheus monitoring infrastructure keeps pulling service-related metrics from both the baseline and the canary and feeds them to Kayenta, which is responsible for the automated canary analysis. The metrics to be evaluated are chosen in advance to determine whether the new system is operating at least as well as the old one. Kayenta runs its statistical analysis at regular intervals to evaluate the safety of the new software. The primary metric comparison algorithm in Kayenta is a nonparametric statistical test that checks for a significant difference between the canary and baseline metrics. Some of the metrics we use at Yahoo Small Business as inputs for Kayenta are HTTP response latency, HTTP error rate and hardware utilization.
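The kind of comparison Kayenta performs can be sketched with a simple rank-based (Mann-Whitney-style) test. The code below is an illustrative toy, not Kayenta's implementation; the threshold and metric samples are made up:

```python
def u_statistic(baseline, canary):
    """Mann-Whitney U: count the (canary, baseline) pairs where the
    canary observation exceeds the baseline one (ties count half)."""
    u = 0.0
    for c in canary:
        for b in baseline:
            if c > b:
                u += 1.0
            elif c == b:
                u += 0.5
    return u

def judge_metric(baseline, canary, high_is_bad=True, threshold=0.75):
    """Classify one canary metric against the baseline.

    effect is U normalised to [0, 1]: 0.5 means the two samples look
    alike; values near 1 mean the canary's observations are
    consistently higher. The 0.75 threshold is illustrative only.
    """
    effect = u_statistic(baseline, canary) / (len(baseline) * len(canary))
    if high_is_bad and effect > threshold:
        return "fail"   # e.g. canary latency significantly higher
    if not high_is_bad and effect < 1 - threshold:
        return "fail"   # e.g. canary throughput significantly lower
    return "pass"

# Example: canary latency (ms) clearly worse than baseline
baseline_latency = [102, 98, 101, 99, 100, 103]
canary_latency   = [140, 151, 138, 149, 145, 142]
print(judge_metric(baseline_latency, canary_latency))  # prints "fail"
```

A real system would run this per metric at regular intervals, weight and combine the per-metric verdicts into an overall score, and feed that score back to the pipeline.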
Automated canary analysis
Once the analysis is complete, Kayenta tells the Spinnaker pipeline whether the canary passed or failed. If it passed, we roll out the changes to all of our users in production; otherwise, 100% of the traffic is redirected back to the original version. A clean-up phase then removes the canary and baseline pods.
At Yahoo Small Business we are big proponents of shift-left software development practices, where we try our best to identify any problems with the software as early in the development lifecycle as possible. With this in mind, we have implemented automated canary analysis in the QA and stage environments as well. Once the canary deployment is done in these environments, we use Jenkins to send synthetic traffic, similar to our production traffic, to generate enough metrics for the automated canary analysis.
As you have seen above, one of the major components of this setup is the metrics-based monitoring system Prometheus. In the next part of the series, we will cover this tool in more detail. We will also touch upon the application log monitoring infrastructure we implemented for the new microservices environment using Splunk.