Kubernetes Canary Deployments with GitHub Actions

In the previous post, we looked at some of the GitHub Actions you can use with Microsoft Azure. One of those actions is the azure/k8s-deploy action which is currently at v1.4 (January 2021). To use that action, include the following snippet in your workflow:

- uses: azure/k8s-deploy@v1.4
  with:
    namespace: go-template
    manifests: ${{ steps.bake.outputs.manifestsBundle }}
    images: |
      ghcr.io/gbaeke/go-template:${{ env.IMAGE_TAG }}

The above snippet uses baked manifests from an earlier azure/k8s-bake action that uses kustomize as the render engine. This is optional and you can use individual manifests without kustomize. It also replaces the image it finds in the baked manifest with an image that includes a specific tag that is set as a variable at the top of the workflow. Multiple images can be replaced if necessary.

The azure/k8s-deploy action supports different styles of deployments, defined by the strategy action input:

  • none: if you do not specify a strategy, a standard Kubernetes rolling update is performed
  • canary: deploy a new version and direct a part of the traffic to the new version; you need to set a percentage action input to control the traffic split; in essence, a percentage of your users will use the new version
  • blue-green: deploy a new version next to the old version; after testing the new version, you can switch traffic to the new version

In this post, we will only look at the canary deployment. If you read the description above, it should be clear that we need a way to split the traffic. There are several ways to do this, via the traffic-split-method action input:

  • pod: the default value; by tweaking the amount of “old version” and “new version” pods, the standard load balancing of a Kubernetes service will approximate the percentage you set; pod uses standard Kubernetes features so no additional software is needed
  • smi: you will need to implement a service mesh that supports TrafficSplit; the traffic split is controlled at the request level by the service mesh and will be precise

Although pod traffic split is the easiest to use and does not require additional software, it is not very precise. In general, I recommend using TrafficSplit in combination with a service mesh like linkerd, which is very easy to implement. Other options are Open Service Mesh and of course, Istio.

With this out of the way, let’s see how we can implement it on a standard Azure Kubernetes Service (AKS) cluster.

Installing linkerd

Installing linkerd is easy. First install the cli on your system:

curl -sL https://run.linkerd.io/install | sh

Alternatively, use brew to install it:

brew install linkerd

Next, with kubectl configured to connect to your AKS cluster, run the following commands:

linkerd check --pre
linkerd install | kubectl apply -f -
linkerd check

Preparing the manifests

We will use three manifests, in combination with kustomize. You can find them on GitHub. In namespace.yaml, the linkerd.io/inject annotation ensures that the entire namespace is meshed. Every pod you create will get the linkerd sidecar injected, which is required for traffic splitting.

In the GitHub workflow, the manifests will be “baked” with kustomize. The result will be one manifest file:

- uses: azure/k8s-bake@v1
  with:
    renderEngine: kustomize
    kustomizationPath: ./deploy/
  id: bake

The action above requires an id. We will use that id to refer to the resulting manifest later with:

${{ steps.bake.outputs.manifestsBundle }}

Important note: I had some trouble using the baked manifest and later switched to using the individual manifests; I also deployed namespace.yaml in one action and then deployed service.yaml and deployment.yaml is a separate action; to deploy multiple manifests, use the following syntax:

- uses: azure/k8s-deploy@v1.4
  with:
    namespace: go-template
    manifests: |
      ./deploy/service.yaml
      ./deploy/deployment.yaml

First run

We got to start somewhere so we will deploy version 0.0.1 of the ghcr.io/gbaeke/go-template image. In the deployment workflow, we set the IMAGE_TAG variable to 0.0.1 and have the following action:

- uses: azure/k8s-deploy@v1.4
  with:
    namespace: go-template
    manifests: ${{ steps.bake.outputs.manifestsBundle }}
    # or use individual manifests in case of issues 🙂
    images: |
      ghcr.io/gbaeke/go-template:${{ env.IMAGE_TAG }}
    strategy: canary
    traffic-split-method: smi
    action: deploy #deploy is the default; we will later use this to promote/reject
    percentage: 20
    baseline-and-canary-replicas: 2

Above, the action inputs set the canary strategy, using the smi method with 20% of traffic to the new version. The deploy action is used which results in “canary” pods of version 0.0.1. It’s not actually a canary because there is no stable deployment yet and all traffic goes to the “canary”.

This is what gets deployed:

Initial, canary-only release

There only is a canary deployment with 2 canary pods in the deployment (we asked for 2 explicitly in the action). There are four services: the main go-template-service and then a service for baseline, canary and stable. Instead of deploy, you can use promote directly (action: promote) to deploy a stable version right away.

If we run linkerd dashboard we can check the namespace and the Traffic Split:

TrafficSplit in linkerd; all traffic to canary

Looked at in another way:

TrafficSplit

All traffic goes to the canary. In the Kubernetes TrafficSplit object, the weight is actually set to 1000m which is shown as 1 above.

Promotion

We can now modify the pipeline, change the action input action of azure/k8s-deploy to promote and trigger the workflow to run. This is what the action should look like:

- uses: azure/k8s-deploy@v1.4
  with:
    namespace: go-template
    manifests: ${{ steps.bake.outputs.manifestsBundle }}
    images: |
      ghcr.io/gbaeke/go-template:${{ env.IMAGE_TAG }}
    strategy: canary
    traffic-split-method: smi
    action: promote  #deploy is the default; we will later use this to promote/reject
    percentage: 20
    baseline-and-canary-replicas: 2

This is the result of the promotion:

After promotion

As expected, we now have 5 pods of the 0.0.1 deployment. This is the stable deployment. The canary pods have been removed. We get five pods because that is the number of replicas in deployment.yaml. The baseline-and-canary-replicas action input is not relevant now as there are no canary and baseline deployments.

The TrafficSplit now directs 100% of traffic to the stable service:

All traffic to “promoted” stable service

Deploying v0.0.2 with 20% split

Now we can deploy a new version of our app, version 0.0.2. The action is the same as the initial deploy but IMAGE_TAG is set to 0.0.2:

- uses: azure/k8s-deploy@v1.4
  with:
    namespace: go-template
    manifests: ${{ steps.bake.outputs.manifestsBundle }}
    images: |
      ghcr.io/gbaeke/go-template:${{ env.IMAGE_TAG }}
    strategy: canary
    traffic-split-method: smi
    action: deploy #deploy is the default; we will later use this to promote/reject
    percentage: 20
    baseline-and-canary-replicas: 2

Running this action results in:

Canary deployment of 0.0.2

The stable version still has 5 pods but canary and baseline pods have been added. More info about baseline below.

TrafficSplit is now:

TrafficSplit: 80% to stable and 20% to baseline & canary

Note that the baseline pods uses the same version as the stable pods (here 0.0.1). The baseline should be used to compare metrics with the canary version. You should not compare the canary to stable because factors such as caching might influence the comparison. This also means that, instead of 20%, only 10% of traffic goes to the new version.

Wait… I have to change the pipeline to promote/reject?

Above, we manually changed the pipeline and ran it manually from VS Code or the GitHub website. This is possible with triggers such as repository_dispatch and workflow_dispatch. There are (or should be) some ways to automate this better:

  • GitHub environments: with Azure DevOps, it is possible to run jobs based on environments and the approve/reject status; I am still trying to figure out if this is possible with GitHub Actions but it does not look like it (yet); if you know, drop it in the comments; I will update this post if there is a good solution
  • workflow_dispatch inputs: if you do want to run the workflow manually, you can use workflow_dispatch inputs to approve/reject or do nothing

Should you use this?

While I think the GitHub Action works well, I am not in favor of driving all this from GitHub, Azure DevOps and similar solutions. There’s just not enough control imho.

Solutions such as flagger or Argo Rollouts are progressive delivery operators that run inside the Kubernetes cluster. They provide more operational control, are fully automated and can be integrated with Prometheus and/or service mesh metrics. For an example, check out one of my videos. When you need canary and/or blue-green releases and you are looking to integrate the progression of your release based on metrics, surely check them out. They also work well for manual promotion via a CLI or UI if you do not need metrics-based promotion.

Conclusion

In this post we looked at the “mechanics” of canary deployments with GitHub Actions. An end-to-end solution, fully automated and based on metrics, in a more complex production application is quite challenging. If your particular application can use simpler deployment methods such as standard Kubernetes deployments or even blue-green, then use those!

A look at GitHub Actions for Azure and AKS deployments

In the past, I wrote about using Azure DevOps to deploy an AKS cluster and bootstrap it with Flux v2, a GitOps solution. In an older post, I also described bootstrapping the cluster with Helm deployments from the pipeline.

In this post, we will take a look at doing the above with GitHub Actions. Along the way, we will look at a VS Code extension for GitHub Actions, manually triggering a workflow from VS Code and GitHub and manifest deployment to AKS.

Let’s dive is, shall we?

Getting ready

What do you need to follow along:

Deploying AKS

Although you can deploy Azure Kubernetes Service (AKS) in many ways (manual, CLI, ARM, Terraform, …), we will use ARM and the azure/arm-deploy@v1 action in a workflow we can trigger manually. The workflow (without the Flux bootstrap section) is shown below:

name: deploy

on:
  repository_dispatch:
    types: [deploy]
  workflow_dispatch:
    

env:
  CLUSTER_NAME: CLUSTERNAME
  RESOURCE_GROUP: RESOURCEGROUP
  KEYVAULT: KVNAME
  GITHUB_OWNER: GUTHUBUSER
  REPO: FLUXREPO


jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v2

      - uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      - uses: azure/arm-deploy@v1
        with:
          subscriptionId: ${{ secrets.SUBSCRIPTION_ID }}
          resourceGroupName: rg-gitops-demo
          template: ./aks/deploy.json
          parameters: ./aks/deployparams.json

      - uses: azure/setup-kubectl@v1
        with:
          version: 'v1.18.8'

      - uses: azure/aks-set-context@v1
        with:
          creds: '${{ secrets.AZURE_CREDENTIALS }}'
          cluster-name: ${{ env.CLUSTER_NAME }}
          resource-group: ${{ env.RESOURCE_GROUP }}

To create this workflow, add a .yml file (e.g. deploy.yml) to the .github/workflows folder of the repository. You can add this directly from the GitHub website or use VS Code to create the file and push it to GitHub.

The above workflow uses several of the Azure GitHub Actions, starting with the login. The azure/login@v1 action requires a GitHub secret that I called AZURE_CREDENTIALS. You can set secrets in your repository settings. If you use an organization, you can make it an organization secret.

GitHub Repository Secrets

If you have the GitHub Actions VS Code extension, you can also set them from there:

Setting and reading the secrets from VS Code

If you use the gh command line, you can use the command below from the local repository folder:

gh secret set SECRETNAME --body SECRETVALUE

The VS Code integration and the gh command line make it easy to work with secrets from your local system rather than having to go to the GitHub website.

The secret should contain the full JSON response of the following Azure CLI command:

az ad sp create-for-rbac --name "sp-name" --sdk-auth --role ROLE \
     --scopes /subscriptions/SUBID

The above command creates a service principal and gives it a role at the subscription level. That role could be contributor, reader, or other roles. In this case, contributor will do the trick. Of course, you can decide to limit the scope to a lower level such as a resource group.

After a successful login, we can use an ARM template to deploy AKS with the azure/arm-deploy@v1 action:

      - uses: azure/arm-deploy@v1
        with:
          subscriptionId: ${{ secrets.SUBSCRIPTION_ID }}
          resourceGroupName: rg-gitops-demo
          template: ./aks/deploy.json
          parameters: ./aks/deployparams.json

The action’s parameters are self-explanatory. For an example of an ARM template and parameters to deploy AKS, check out this example. I put my template in the aks folder of the GitHub repository. Of course, you can deploy anything you want with this action. AKS is merely an example.

When the cluster is deployed, we can download a specific version of kubectl to the GitHub runner that executes the workflow. For instance:

     - uses: azure/setup-kubectl@v1
        with:
          version: 'v1.18.8'

Note that the Ubuntu GitHub runner (we use ubuntu-latest here) already contains kubectl version 1.19 at the time of writing. The azure/setup-kubectl@v1 is useful if you want to use a specific version. In this specific case, the azure/setup-kubectl@v1 action is not really required.

Now we can obtain credentials to our AKS cluster with the azure/aks-set-context@v1 task. We can use the same credentials secret, in combination with the cluster name and resource group set as a workflow environment variable:

      - uses: azure/aks-set-context@v1
        with:
          creds: '${{ secrets.AZURE_CREDENTIALS }}'
          cluster-name: ${{ env.CLUSTER_NAME }}
          resource-group: ${{ env.RESOURCE_GROUP }}

In this case, the AKS API server has a public endpoint. When you use a private endpoint, run the GitHub workflow on a self-hosted runner with network access to the private API server.

Bootstrapping with Flux v2

To bootstrap the cluster with tools like nginx and cert-manager, Flux v2 is used. The commands used in the original Azure DevOps pipeline can be reused:

- name: Flux bootstrap
        run: |
          export GITHUB_TOKEN=${{ secrets.GH_TOKEN }}
          msi="$(az aks show -n ${{ env.CLUSTER_NAME }} -g ${{ env.RESOURCE_GROUP }} --query identityProfile.kubeletidentity.objectId -o tsv)"
          az keyvault set-policy --name ${{ env.KEYVAULT }} --object-id $msi --secret-permissions get
          curl -s https://toolkit.fluxcd.io/install.sh | bash
          flux bootstrap github --owner=${{ env.GITHUB_OWNER }} --repository=${{ env.REPO }} --branch=main --path=demo-cluster --personal

For an explanation of these commands, check this post.

Running the workflow manually

As noted earlier, we want to be able to run the workflow from the GitHub Actions extension in VS Code and the GitHub website instead of pushes or pull requests. The following triggers make this happen:

on:
  repository_dispatch:
    types: [deploy]
  workflow_dispatch:

The VS Code extension requires the repository_dispatch trigger. Because I am using multiple workflows in the same repo with this trigger, I use a unique event type per workflow. In this case, the type is deploy. To run the workflow, just right click on the workflow in VS Code:

Running the workflow from VS Code

You will be asked for the event to trigger and then the event type:

Selecting the deploy event type

The workflow will now be run. Progress can be tracked from VS Code:

Tracking workflow runs

Update Jan 7th 2021: after writing this post, the GitHub Action extension was updated to also support workflow_dispatch which means you can use workflow_dispatch to trigger the workflow from both VS Code and the GitHub website ⬇⬇⬇

To run the workflow from the GitHub website, workflow_dispatch is used. On GitHub, you can then run the workflow from the web UI:

Running the workflow from GitHub

Note that you can specify input parameters to workflow_dispatch. See this doc for more info.

Deploying manifests

As shown above, deploying AKS from a GitHub workflow is rather straightforward. The creation of the ARM template takes more effort. Deploying a workload from manifests is easy to do as well. In the repo, I created a second workflow called app.yml with the following content:

name: deployapp

on:
  repository_dispatch:
    types: [deployapp]
  workflow_dispatch:

env:
  CLUSTER_NAME: clu-gitops
  RESOURCE_GROUP: rg-gitops-demo
  IMAGE_TAG: 0.0.2

jobs:
  deployapp:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v2

      - uses: azure/aks-set-context@v1
        with:
          creds: '${{ secrets.AZURE_CREDENTIALS }}'
          cluster-name: ${{ env.CLUSTER_NAME }}
          resource-group: ${{ env.RESOURCE_GROUP }}

      - uses: azure/container-scan@v0
        with:
          image-name: ghcr.io/gbaeke/go-template:${{ env.IMAGE_TAG }}
          run-quality-checks: true

      - uses: azure/k8s-bake@v1
        with:
          renderEngine: kustomize
          kustomizationPath: ./deploy/
        id: bake

      - uses: azure/k8s-deploy@v1
        with:
          namespace: go-template
          manifests: ${{ steps.bake.outputs.manifestsBundle }}
          images: |
            ghcr.io/gbaeke/go-template:${{ env.IMAGE_TAG }}   
          

In the above workflow, the following actions are used:

  • actions/checkout@v2: checkout the code on the GitHub runner
  • azure/aks-set-context@v1: obtain credentials to AKS
  • azure/container-scan@v0: scan the container image we want to deploy; see https://github.com/Azure/container-scan for the types of scan
  • azure/k8s-bake@v1: create one manifest file using kustomize; note that the action uses kubectl kustomize instead of the standalone kustomize executable; the action should refer to a folder that contains a kustomization.yaml file; see this link for an example
  • azure/k8s-deploy@v1: deploy the baked manifest (which is an output from the task with id=bake) to the go-template namespace on the cluster; replace the image to deploy with the image specified in the images list (the tag can be controlled with the workflow environment variable IMAGE_TAG)

Note that the azure/k8s-deploy@v1 task supports canary and blue/green deployments using several techniques for traffic splitting (Kubernetes, Ingress, SMI). In this case, a regular Kubernetes deployment is used, equivalent to kubectl apply -f templatefile.yaml.

Conclusion

I only touched upon a few of the Azure GitHub Actions such as azure/login@v1 and azure/k8s-deploy@v1. There are many more actions available that allow you to deploy to Azure Container Instances, Azure Web App and more. We have also looked at running the workflows from VS Code and the GitHub website, which is easy to do with the repository_dispatch and workflow_dispatch triggers.