At Blibli, an Indonesian business-to-consumer e-commerce provider, we run most of our IT infrastructure on Google Kubernetes Engine (GKE), including both stateful and stateless applications such as Redis, RabbitMQ, Spring Boot, Jenkins, and Grafana.
GKE provides a scalable and reliable managed Kubernetes service. It integrates well with other Google Cloud services, and on GKE we’re now saving more than 30% of infrastructure costs. But like many companies with a lot on their plate, we were once too busy to focus on operations like cluster and node pool updates. Consequently, we fell way behind the current GKE version.
To comply with both service-provider and open-source software policies, you must stay on top of version updates. And since new Kubernetes minor versions arrive roughly every three months, this can be a challenge when running workloads in GKE. But recently, we updated our GKE cluster from version 1.13.x to 1.15.x and tested the same update across different clusters and environments, all without service interruptions.
You can read the release notes and changelogs of the version you plan to upgrade to, so I won’t belabor every detail of our update process. But read on to learn how we keep our GKE clusters up to date with newly released versions and how you can too.
Updating a GKE cluster without downtime
We manage our GKE cluster and everything related to it using Terraform and GitOps, which help to simplify the update process.
With a regional cluster, you can avoid control-plane downtime because GKE maintains replicas of the control plane across multiple zones in the region, so your cluster is resilient to a single-zone failure. Before you start, double-check that enough resources are available for this activity.
Updating a cluster is a two-step process: first the control plane, then the node pools. The node pools require a handful of critical network considerations.
The control plane in Kubernetes includes the Kubernetes API server, the scheduler, and the controller manager server. The control-plane upgrade is quite simple since GKE manages it for you with a simple click (in our case, changing the variable in Terraform). This update takes several minutes during which you won’t be able to change the cluster’s configuration. But your workloads will function perfectly.
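If you are not driving the cluster through Terraform, the same control-plane step can be sketched with gcloud. The function below only prints the command for review. The cluster name, region, and version are hypothetical placeholders, not values from our setup.

```shell
# Sketch only: prints the gcloud command instead of running it.
# The cluster name, region, and version passed below are hypothetical.
control_plane_upgrade_cmd() {
  cluster="$1"; region="$2"; version="$3"
  # --master limits the upgrade to the control plane; node pools stay untouched.
  printf 'gcloud container clusters upgrade %s --region %s --master --cluster-version %s\n' \
    "$cluster" "$region" "$version"
}

control_plane_upgrade_cmd my-cluster my-region 1.15.x
```

Printing first makes it easy to check the target version before anything changes. Pipe the output to sh once you are happy with it.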
By default, a cluster’s nodes have auto-update enabled, and Google Cloud recommends that you keep it that way. If you’ve opted for auto-update then GKE does the magic for you. You can just sit back and relax.
Unlike the control-plane update, the node pool update has a lot of visibility. Its duration also depends heavily on the total number of nodes in the cluster. For each node in the node pool, in sequence, the node is cordoned so that no new Pods are scheduled on it, its existing Pods are drained, and finally the node is updated.
If, like us, you need to carefully manage dependencies and qualifications, you may elect to manage your own upgrade workflow. Surge upgrades let you control the number of nodes GKE can update at a time and how disruptive updates are to your workloads. There are also several options when you decide to update the worker nodes. One obvious way is to manually trigger the update, which parallels the auto-update process except you decide when it occurs.
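To make that per-node sequence concrete, here is a rough sketch of the manual kubectl equivalents. GKE performs these steps for you, so this is an analogy rather than what GKE literally runs, and the node names are hypothetical.

```shell
# Sketch only: prints the manual equivalent of what happens to each node.
node_update_steps() {
  for node in "$@"; do
    # Cordon: mark the node unschedulable so no new Pods land on it.
    echo "kubectl cordon $node"
    # Drain: evict the Pods already running there (DaemonSet Pods are skipped).
    echo "kubectl drain $node --ignore-daemonsets"
  done
}

node_update_steps node-a node-b
```

Because nodes are handled one after another, the total time grows with the number of nodes in the pool.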
gcloud container clusters upgrade cluster-name --node-pool=node-pool-name --cluster-version cluster-version
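The surge upgrade settings mentioned above are configured per node pool. A minimal sketch, assuming hypothetical cluster and pool names and a default region or zone already set in your gcloud configuration:

```shell
# Sketch only: prints the gcloud command that sets surge upgrade limits.
surge_settings_cmd() {
  cluster="$1"; pool="$2"; max_surge="$3"; max_unavailable="$4"
  printf 'gcloud container node-pools update %s --cluster %s --max-surge-upgrade %s --max-unavailable-upgrade %s\n' \
    "$pool" "$cluster" "$max_surge" "$max_unavailable"
}

# Add one extra node at a time and never remove a node before its replacement exists.
surge_settings_cmd my-cluster my-pool 1 0
```

Raising the surge value speeds up the update at the cost of temporarily running more nodes.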
Fun fact: You can manually update a node pool version to match the version of the control plane or a previous version that is still available and compatible with the control plane. The Kubernetes version and version skew support policy guarantees that control planes are compatible with nodes up to two minor versions older than the control plane. For example, Kubernetes 1.13.x control planes are compatible with Kubernetes 1.11.x nodes.
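The two-minor-version rule quoted above is simple arithmetic, and a small helper can check it before you pick a node version. This is a sketch that assumes both versions share major version 1. The function names and the MAX_SKEW variable are made up for illustration.

```shell
# Succeeds when the nodes are at most MAX_SKEW minor versions behind the
# control plane and not ahead of it. Assumes both versions are 1.x.
MAX_SKEW=2

skew_ok() {
  cp_minor="${1#*.}"; cp_minor="${cp_minor%%.*}"        # 1.13.7 -> 13
  node_minor="${2#*.}"; node_minor="${node_minor%%.*}"  # 1.11.10 -> 11
  gap=$((cp_minor - node_minor))
  [ "$gap" -ge 0 ] && [ "$gap" -le "$MAX_SKEW" ]
}

report() {
  if skew_ok "$1" "$2"; then
    echo "$1 / $2: compatible"
  else
    echo "$1 / $2: not compatible"
  fi
}

report 1.13.7 1.11.10   # prints "1.13.7 / 1.11.10: compatible"
report 1.15.0 1.12.0    # prints "1.15.0 / 1.12.0: not compatible"
```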
Although GKE can update large clusters quickly, we worried that with more than 100 nodes and no surge upgrades, draining and upgrading every node might take forever. So our strategy was to rely on the continuous deployment of our applications while still avoiding downtime.
Here comes the interesting implementation part you were looking for. Rather than updating the current nodes of a node pool, we created a new node pool with the updated version, but with a twist: the new node pool had different taints from the node pool with the old version. Can you guess the next step? It’s simple: We deployed our applications with tolerations matching the taints of the node pool with the new version. Again, you can too.
But (there’s always a “but”) to prevent downtime, you need to ensure that your update strategy is a rolling one. And you must confirm a couple of other things before deploying the applications.
Since the node pool you just created has a different taint, the first thing to check once its nodes are spawned is that your DaemonSets tolerate that taint and that their Pods are deployed and running on the new nodes.
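Here is a minimal sketch of both halves of that trick: the command for a tainted node pool and the Pod-spec fragment that tolerates the taint. The pool name, taint key, and taint value are hypothetical, and both functions only print their output.

```shell
# Sketch only: prints the command that creates a tainted node pool.
new_pool_cmd() {
  cluster="$1"; pool="$2"; version="$3"; taint="$4"
  # --node-taints takes key=value:effect; NoSchedule keeps non-tolerating Pods away.
  printf 'gcloud container node-pools create %s --cluster %s --node-version %s --node-taints %s\n' \
    "$pool" "$cluster" "$version" "$taint"
}

# Prints the Pod-spec fragment that lets an application run on the tainted pool.
toleration_yaml() {
  key="$1"; value="$2"
  cat <<EOF
tolerations:
- key: "$key"
  operator: "Equal"
  value: "$value"
  effect: "NoSchedule"
EOF
}

new_pool_cmd my-cluster pool-v115 1.15.x pool-generation=v115:NoSchedule
toleration_yaml pool-generation v115
```

Because the old pool carries a different taint, Pods that tolerate only the new one can no longer be scheduled on the old nodes. Each rollout therefore moves that application to the new pool.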
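One way to run that check is sketched below with hypothetical pool and node names. The function prints kubectl commands that list the new pool's nodes, show DaemonSet status, and show which Pods actually landed on a given node.

```shell
# Sketch only: prints kubectl commands for checking DaemonSets on a new pool.
daemonset_check_cmds() {
  pool="$1"; node="$2"
  # GKE labels every node with the name of its node pool.
  echo "kubectl get nodes -l cloud.google.com/gke-nodepool=$pool"
  # Desired versus ready counts for every DaemonSet in the cluster.
  echo "kubectl get daemonsets --all-namespaces"
  # Pods running on one specific node, to confirm the DaemonSet Pods are there.
  echo "kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=$node"
}

daemonset_check_cmds pool-v115 my-new-node
```

The per-node Pod listing is the important one. A DaemonSet that does not tolerate the new taint never counts the new nodes as desired, so it can look healthy while being absent from them.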
Pod Disruption Budget (PDB)
A PDB limits how many Pods of an application (as a number or percentage) can be down at the same time because of voluntary disruptions such as node drains. Since the number of replicas and the PDB go hand in hand, we set the PDB for our workloads to maxUnavailable: 1. This gives us confidence that at most one Pod of a workload is evicted at a time, so with two or more replicas at least one Pod is always running.
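A PDB with that setting can be sketched as follows. The PDB name and the app label are hypothetical, and the manifest uses the policy/v1 API, which is available from Kubernetes 1.21. On older clusters such as the 1.15 discussed here, the API group is policy/v1beta1 instead.

```shell
# Sketch only: prints a PodDisruptionBudget manifest with maxUnavailable: 1.
# The name and label passed in are hypothetical.
pdb_manifest() {
  name="$1"; app_label="$2"
  cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: $name
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: $app_label
EOF
}

pdb_manifest my-app-pdb my-app
# To apply it: pdb_manifest my-app-pdb my-app | kubectl apply -f -
```

With maxUnavailable: 1, an application that runs only one replica can still lose its single Pod to an eviction, so pair the PDB with at least two replicas.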