Kubernetes Canary Deployments with GitHub Actions

In the previous post, we looked at some of the GitHub Actions you can use with Microsoft Azure. One of those actions is the azure/k8s-deploy action which is currently at v1.4 (January 2021). To use that action, include the following snippet in your workflow:

- uses: azure/k8s-deploy@v1.4
  with:
    namespace: go-template
    manifests: ${{ steps.bake.outputs.manifestsBundle }}
    images: |
      ghcr.io/gbaeke/go-template:${{ env.IMAGE_TAG }}

The above snippet uses baked manifests from an earlier azure/k8s-bake action that uses kustomize as the render engine. This is optional and you can use individual manifests without kustomize. It also replaces the image it finds in the baked manifest with an image that includes a specific tag that is set as a variable at the top of the workflow. Multiple images can be replaced if necessary.

The azure/k8s-deploy action supports different styles of deployments, defined by the strategy action input:

  • none: if you do not specify a strategy, a standard Kubernetes rolling update is performed
  • canary: deploy a new version and direct a part of the traffic to the new version; you need to set a percentage action input to control the traffic split; in essence, a percentage of your users will use the new version
  • blue-green: deploy a new version next to the old version; after testing the new version, you can switch traffic to the new version

In this post, we will only look at the canary deployment. If you read the description above, it should be clear that we need a way to split the traffic. There are several ways to do this, via the traffic-split-method action input:

  • pod: the default value; by tweaking the amount of “old version” and “new version” pods, the standard load balancing of a Kubernetes service will approximate the percentage you set; pod uses standard Kubernetes features so no additional software is needed
  • smi: you will need to implement a service mesh that supports TrafficSplit; the traffic split is controlled at the request level by the service mesh and will be precise

Although pod traffic split is the easiest to use and does not require additional software, it is not very precise. In general, I recommend using TrafficSplit in combination with a service mesh like linkerd, which is very easy to implement. Other options are Open Service Mesh and of course, Istio.
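
For reference, the TrafficSplit object that ends up in the cluster looks something like the sketch below. This is illustrative only; the exact API version and service names depend on what linkerd and the action create for you:

apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: go-template-service
  namespace: go-template
spec:
  # the "root" service that clients connect to
  service: go-template-service
  backends:
  # stable gets 80% of requests, baseline and canary 10% each
  - service: go-template-service-stable
    weight: 800m
  - service: go-template-service-baseline
    weight: 100m
  - service: go-template-service-canary
    weight: 100m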

With this out of the way, let’s see how we can implement it on a standard Azure Kubernetes Service (AKS) cluster.

Installing linkerd

Installing linkerd is easy. First install the cli on your system:

curl -sL https://run.linkerd.io/install | sh

Alternatively, use brew to install it:

brew install linkerd

Next, with kubectl configured to connect to your AKS cluster, run the following commands:

linkerd check --pre
linkerd install | kubectl apply -f -
linkerd check

Preparing the manifests

We will use three manifests, in combination with kustomize. You can find them on GitHub. In namespace.yaml, the linkerd.io/inject annotation ensures that the entire namespace is meshed. Every pod you create will get the linkerd sidecar injected, which is required for traffic splitting.
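
For clarity, namespace.yaml boils down to something like the snippet below; the linkerd.io/inject annotation is the important part (the rest is a plain namespace definition):

apiVersion: v1
kind: Namespace
metadata:
  name: go-template
  annotations:
    linkerd.io/inject: enabled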

In the GitHub workflow, the manifests will be “baked” with kustomize. The result will be one manifest file:

- uses: azure/k8s-bake@v1
  with:
    renderEngine: kustomize
    kustomizationPath: ./deploy/
  id: bake

The action above requires an id. We will use that id to refer to the resulting manifest later with:

${{ steps.bake.outputs.manifestsBundle }}

Important note: I had some trouble using the baked manifest and later switched to the individual manifests. I also deployed namespace.yaml in one action and then deployed service.yaml and deployment.yaml in a separate action. To deploy multiple manifests, use the following syntax:

- uses: azure/k8s-deploy@v1.4
  with:
    namespace: go-template
    manifests: |
      ./deploy/service.yaml
      ./deploy/deployment.yaml

First run

We have to start somewhere, so we will deploy version 0.0.1 of the ghcr.io/gbaeke/go-template image. In the deployment workflow, we set the IMAGE_TAG variable to 0.0.1 and use the following action:

- uses: azure/k8s-deploy@v1.4
  with:
    namespace: go-template
    manifests: ${{ steps.bake.outputs.manifestsBundle }}
    # or use individual manifests in case of issues πŸ™‚
    images: |
      ghcr.io/gbaeke/go-template:${{ env.IMAGE_TAG }}
    strategy: canary
    traffic-split-method: smi
    action: deploy #deploy is the default; we will later use this to promote/reject
    percentage: 20
    baseline-and-canary-replicas: 2

Above, the action inputs set the canary strategy, using the smi method with 20% of traffic to the new version. The deploy action is used which results in “canary” pods of version 0.0.1. It’s not actually a canary because there is no stable deployment yet and all traffic goes to the “canary”.

This is what gets deployed:

Initial, canary-only release

There is only a canary deployment with 2 canary pods (we explicitly asked for 2 in the action). There are four services: the main go-template-service plus a service each for baseline, canary and stable. Instead of deploy, you can use promote directly (action: promote) to deploy a stable version right away.

If we run linkerd dashboard we can check the namespace and the Traffic Split:

TrafficSplit in linkerd; all traffic to canary

Viewed another way:

TrafficSplit

All traffic goes to the canary. In the Kubernetes TrafficSplit object, the weight is actually set to 1000m which is shown as 1 above.

Promotion

We can now modify the pipeline, change the action input of azure/k8s-deploy from deploy to promote, and trigger the workflow again. This is what the action should look like:

- uses: azure/k8s-deploy@v1.4
  with:
    namespace: go-template
    manifests: ${{ steps.bake.outputs.manifestsBundle }}
    images: |
      ghcr.io/gbaeke/go-template:${{ env.IMAGE_TAG }}
    strategy: canary
    traffic-split-method: smi
    action: promote  # promote the canary to a stable deployment
    percentage: 20
    baseline-and-canary-replicas: 2

This is the result of the promotion:

After promotion

As expected, we now have 5 pods of the 0.0.1 deployment. This is the stable deployment. The canary pods have been removed. We get five pods because that is the number of replicas in deployment.yaml. The baseline-and-canary-replicas action input is not relevant now as there are no canary and baseline deployments.

The TrafficSplit now directs 100% of traffic to the stable service:

All traffic to “promoted” stable service

Deploying v0.0.2 with 20% split

Now we can deploy a new version of our app, version 0.0.2. The action is the same as the initial deploy but IMAGE_TAG is set to 0.0.2:

- uses: azure/k8s-deploy@v1.4
  with:
    namespace: go-template
    manifests: ${{ steps.bake.outputs.manifestsBundle }}
    images: |
      ghcr.io/gbaeke/go-template:${{ env.IMAGE_TAG }}
    strategy: canary
    traffic-split-method: smi
    action: deploy #deploy is the default; we will later use this to promote/reject
    percentage: 20
    baseline-and-canary-replicas: 2

Running this action results in:

Canary deployment of 0.0.2

The stable version still has 5 pods but canary and baseline pods have been added. More info about baseline below.

TrafficSplit is now:

TrafficSplit: 80% to stable and 20% to baseline & canary

Note that the baseline pods use the same version as the stable pods (here 0.0.1). The baseline should be used to compare metrics with the canary version. You should not compare the canary to stable because factors such as caching might influence the comparison. This also means that, instead of 20%, only 10% of traffic goes to the new version.

Wait… I have to change the pipeline to promote/reject?

Above, we manually changed the pipeline and then ran it from VS Code or the GitHub website. This is possible with triggers such as repository_dispatch and workflow_dispatch. There are (or should be) some ways to automate this better:

  • GitHub environments: with Azure DevOps, it is possible to run jobs based on environments and the approve/reject status; I am still trying to figure out if this is possible with GitHub Actions but it does not look like it (yet); if you know, drop it in the comments; I will update this post if there is a good solution
  • workflow_dispatch inputs: if you do want to run the workflow manually, you can use workflow_dispatch inputs to approve, reject or do nothing; see the sketch after this list
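
As promised, here is a minimal sketch of a workflow_dispatch trigger with an input you could feed to the action input of azure/k8s-deploy (the input name and values are my own):

on:
  workflow_dispatch:
    inputs:
      canaryAction:
        description: 'Canary action to perform: deploy, promote or reject'
        required: true
        default: 'deploy'

# later, in the azure/k8s-deploy step:
#   action: ${{ github.event.inputs.canaryAction }}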

Should you use this?

While I think the GitHub Action works well, I am not in favor of driving all this from GitHub, Azure DevOps and similar solutions. There’s just not enough control imho.

Solutions such as Flagger or Argo Rollouts are progressive delivery operators that run inside the Kubernetes cluster. They provide more operational control, are fully automated and can be integrated with Prometheus and/or service mesh metrics. For an example, check out one of my videos. When you need canary and/or blue-green releases and you want the progression of your release driven by metrics, definitely check them out. They also work well for manual promotion via a CLI or UI if you do not need metrics-based promotion.

Conclusion

In this post, we looked at the “mechanics” of canary deployments with GitHub Actions. Building a fully automated, metrics-based end-to-end solution for a more complex production application is quite challenging. If your particular application can use simpler deployment methods such as standard Kubernetes deployments or even blue-green, then use those!

A look at GitHub Actions for Azure and AKS deployments

In the past, I wrote about using Azure DevOps to deploy an AKS cluster and bootstrap it with Flux v2, a GitOps solution. In an older post, I also described bootstrapping the cluster with Helm deployments from the pipeline.

In this post, we will take a look at doing the above with GitHub Actions. Along the way, we will look at a VS Code extension for GitHub Actions, manually triggering a workflow from VS Code and GitHub and manifest deployment to AKS.

Let’s dive in, shall we?

Getting ready

What you need to follow along:

  • an Azure subscription
  • a GitHub account and a repository to hold the workflows
  • the Azure CLI and, optionally, the gh command line
  • optionally, VS Code with the GitHub Actions extension

Deploying AKS

Although you can deploy Azure Kubernetes Service (AKS) in many ways (manual, CLI, ARM, Terraform, …), we will use ARM and the azure/arm-deploy@v1 action in a workflow we can trigger manually. The workflow (without the Flux bootstrap section) is shown below:

name: deploy

on:
  repository_dispatch:
    types: [deploy]
  workflow_dispatch:

env:
  CLUSTER_NAME: CLUSTERNAME
  RESOURCE_GROUP: RESOURCEGROUP
  KEYVAULT: KVNAME
  GITHUB_OWNER: GITHUBUSER
  REPO: FLUXREPO


jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v2

      - uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      - uses: azure/arm-deploy@v1
        with:
          subscriptionId: ${{ secrets.SUBSCRIPTION_ID }}
          resourceGroupName: rg-gitops-demo
          template: ./aks/deploy.json
          parameters: ./aks/deployparams.json

      - uses: azure/setup-kubectl@v1
        with:
          version: 'v1.18.8'

      - uses: azure/aks-set-context@v1
        with:
          creds: '${{ secrets.AZURE_CREDENTIALS }}'
          cluster-name: ${{ env.CLUSTER_NAME }}
          resource-group: ${{ env.RESOURCE_GROUP }}

To create this workflow, add a .yml file (e.g. deploy.yml) to the .github/workflows folder of the repository. You can add this directly from the GitHub website or use VS Code to create the file and push it to GitHub.

The above workflow uses several of the Azure GitHub Actions, starting with the login. The azure/login@v1 action requires a GitHub secret that I called AZURE_CREDENTIALS. You can set secrets in your repository settings. If you use an organization, you can make it an organization secret.

GitHub Repository Secrets

If you have the GitHub Actions VS Code extension, you can also set them from there:

Setting and reading the secrets from VS Code

If you use the gh command line, you can use the command below from the local repository folder:

gh secret set SECRETNAME --body SECRETVALUE

The VS Code integration and the gh command line make it easy to work with secrets from your local system rather than having to go to the GitHub website.

The secret should contain the full JSON response of the following Azure CLI command:

az ad sp create-for-rbac --name "sp-name" --sdk-auth --role ROLE \
     --scopes /subscriptions/SUBID

The above command creates a service principal and gives it a role at the subscription level. That role could be contributor, reader, or other roles. In this case, contributor will do the trick. Of course, you can decide to limit the scope to a lower level such as a resource group.
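
If you use the gh command line anyway, you can pipe the JSON output straight into the secret. A sketch (the service principal name, role and scope are placeholders):

az ad sp create-for-rbac --name "sp-gitops-demo" --sdk-auth --role Contributor \
     --scopes /subscriptions/SUBID | gh secret set AZURE_CREDENTIALS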

After a successful login, we can use an ARM template to deploy AKS with the azure/arm-deploy@v1 action:

      - uses: azure/arm-deploy@v1
        with:
          subscriptionId: ${{ secrets.SUBSCRIPTION_ID }}
          resourceGroupName: rg-gitops-demo
          template: ./aks/deploy.json
          parameters: ./aks/deployparams.json

The action’s parameters are self-explanatory. For an example of an ARM template and parameters to deploy AKS, check out this example. I put my template in the aks folder of the GitHub repository. Of course, you can deploy anything you want with this action. AKS is merely an example.

When the cluster is deployed, we can download a specific version of kubectl to the GitHub runner that executes the workflow. For instance:

      - uses: azure/setup-kubectl@v1
        with:
          version: 'v1.18.8'

Note that the Ubuntu GitHub runner (we use ubuntu-latest here) already contains kubectl version 1.19 at the time of writing. The azure/setup-kubectl@v1 action is useful if you want to pin a specific version. In this specific case, the action is not really required.

Now we can obtain credentials to our AKS cluster with the azure/aks-set-context@v1 task. We can use the same credentials secret, in combination with the cluster name and resource group set as a workflow environment variable:

      - uses: azure/aks-set-context@v1
        with:
          creds: '${{ secrets.AZURE_CREDENTIALS }}'
          cluster-name: ${{ env.CLUSTER_NAME }}
          resource-group: ${{ env.RESOURCE_GROUP }}

In this case, the AKS API server has a public endpoint. When you use a private endpoint, run the GitHub workflow on a self-hosted runner with network access to the private API server.

Bootstrapping with Flux v2

To bootstrap the cluster with tools like nginx and cert-manager, Flux v2 is used. The commands used in the original Azure DevOps pipeline can be reused:

      - name: Flux bootstrap
        run: |
          export GITHUB_TOKEN=${{ secrets.GH_TOKEN }}
          msi="$(az aks show -n ${{ env.CLUSTER_NAME }} -g ${{ env.RESOURCE_GROUP }} --query identityProfile.kubeletidentity.objectId -o tsv)"
          az keyvault set-policy --name ${{ env.KEYVAULT }} --object-id $msi --secret-permissions get
          curl -s https://toolkit.fluxcd.io/install.sh | bash
          flux bootstrap github --owner=${{ env.GITHUB_OWNER }} --repository=${{ env.REPO }} --branch=main --path=demo-cluster --personal

For an explanation of these commands, check this post.

Running the workflow manually

As noted earlier, we want to be able to run the workflow from the GitHub Actions extension in VS Code and the GitHub website instead of pushes or pull requests. The following triggers make this happen:

on:
  repository_dispatch:
    types: [deploy]
  workflow_dispatch:

The VS Code extension requires the repository_dispatch trigger. Because I am using multiple workflows in the same repo with this trigger, I use a unique event type per workflow. In this case, the type is deploy. To run the workflow, just right click on the workflow in VS Code:

Running the workflow from VS Code

You will be asked for the event to trigger and then the event type:

Selecting the deploy event type

The workflow will now be run. Progress can be tracked from VS Code:

Tracking workflow runs

Update Jan 7th 2021: after writing this post, the GitHub Action extension was updated to also support workflow_dispatch which means you can use workflow_dispatch to trigger the workflow from both VS Code and the GitHub website ⬇⬇⬇

To run the workflow from the GitHub website, workflow_dispatch is used. On GitHub, you can then run the workflow from the web UI:

Running the workflow from GitHub

Note that you can specify input parameters to workflow_dispatch. See this doc for more info.

Deploying manifests

As shown above, deploying AKS from a GitHub workflow is rather straightforward. The creation of the ARM template takes more effort. Deploying a workload from manifests is easy to do as well. In the repo, I created a second workflow called app.yml with the following content:

name: deployapp

on:
  repository_dispatch:
    types: [deployapp]
  workflow_dispatch:

env:
  CLUSTER_NAME: clu-gitops
  RESOURCE_GROUP: rg-gitops-demo
  IMAGE_TAG: 0.0.2

jobs:
  deployapp:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v2

      - uses: azure/aks-set-context@v1
        with:
          creds: '${{ secrets.AZURE_CREDENTIALS }}'
          cluster-name: ${{ env.CLUSTER_NAME }}
          resource-group: ${{ env.RESOURCE_GROUP }}

      - uses: azure/container-scan@v0
        with:
          image-name: ghcr.io/gbaeke/go-template:${{ env.IMAGE_TAG }}
          run-quality-checks: true

      - uses: azure/k8s-bake@v1
        with:
          renderEngine: kustomize
          kustomizationPath: ./deploy/
        id: bake

      - uses: azure/k8s-deploy@v1
        with:
          namespace: go-template
          manifests: ${{ steps.bake.outputs.manifestsBundle }}
          images: |
            ghcr.io/gbaeke/go-template:${{ env.IMAGE_TAG }}

In the above workflow, the following actions are used:

  • actions/checkout@v2: checkout the code on the GitHub runner
  • azure/aks-set-context@v1: obtain credentials to AKS
  • azure/container-scan@v0: scan the container image we want to deploy; see https://github.com/Azure/container-scan for the types of scan
  • azure/k8s-bake@v1: create one manifest file using kustomize; note that the action uses kubectl kustomize instead of the standalone kustomize executable; the action should refer to a folder that contains a kustomization.yaml file; see this link for an example
  • azure/k8s-deploy@v1: deploy the baked manifest (which is an output from the task with id=bake) to the go-template namespace on the cluster; replace the image to deploy with the image specified in the images list (the tag can be controlled with the workflow environment variable IMAGE_TAG)

Note that the azure/k8s-deploy@v1 task supports canary and blue/green deployments using several techniques for traffic splitting (Kubernetes, Ingress, SMI). In this case, a regular Kubernetes deployment is used, equivalent to kubectl apply -f templatefile.yaml.

Conclusion

I only touched upon a few of the Azure GitHub Actions such as azure/login@v1 and azure/k8s-deploy@v1. There are many more actions available that allow you to deploy to Azure Container Instances, Azure Web App and more. We have also looked at running the workflows from VS Code and the GitHub website, which is easy to do with the repository_dispatch and workflow_dispatch triggers.

Azure Application Gateway and Cloudflare

I often work with customers that build web applications on cloud platforms like Azure, AWS or Digital Ocean. The web application is usually built by a third party that specializes in e-commerce, logistics or industrial applications in a wide range of industries. More often than not, these applications use Cloudflare for DNS, caching, and security.

In this post, we will take a look at such a case with the application running in containers on Azure Kubernetes Service (AKS). I have substituted the application with one of my own, the go-realtime app.

There’s also a video:

The big picture

Sketch of the “architecture”

The application runs in containers on an AKS cluster. Although we could expose the application using an Azure load balancer, a layer 7 load balancer such as Azure Application Gateway, referred to as AG below, is more appropriate here because it allows routing based on URLs and paths and much more.

Because Kubernetes is a dynamic environment, a component is required that configures AG automatically. Application Gateway Ingress Controller (AGIC) plays that part. AGIC configures the AG based on the ingresses we create in the cluster. In essence, that will result in a listener on the public IP that is associated with AG.

In Cloudflare, we will need to configure DNS records that use proxying. The records will point to the IP address of the AG. Below is an example of a DNS record with proxying turned on (orange cloud):

A record at Cloudflare with proxying; blurred out address of AG

Let’s look at these components in a bit more detail.

Application Gateway

Microsoft has a lot of documentation on AG, including the AGIC component. There are many options and approaches when it comes to using AG together with AKS. Some are listed below:

  • Install AKS, AG and AGIC in one step: see the docs for more information; in general, I would not follow this approach and use the next option
  • Install AKS and AG separately: you can find an example here; this allows you to deploy AKS and AG (plus its public IP) using your automation tools of choice such as ARM, Terraform or Pulumi

In most cases, we deploy AKS with Azure CNI networking. This requires a virtual network (VNet) with a subnet specifically for your AKS cluster. Only one cluster should be in the subnet.

AG also requires a subnet. You can create that subnet in the same VNet and size it according to the documentation. In virtually all cases, you should go for AG v2.

In the video above, I install AG with Azure CLI. Once AKS and AG are deployed, you will need to deploy the AGIC component.

Application Gateway Ingress Controller

You basically have two options to install AGIC:

  • Install via an AKS addon: discussed further
  • Install with a Helm chart: see Helm greenfield and Helm brownfield deployment for more information

Although the installation via an AKS addon is preferred, at the time of writing (October 2020), this method is in preview. After configuring your subscription to enable this feature and after installing the aks-preview addon for Azure CLI, you can use the following command to install AGIC:

appgwId=$(az network application-gateway show -n AGname -g AGresourcegroup -o tsv --query "id")
az aks enable-addons -n AKSclustername -g AKSResourcegroup -a ingress-appgw --appgw-id $appgwId

Indeed, you first need to find the id of the AG you deployed. This id can be found in the portal or with the first command above, which saves the result in a variable (Linux shell). The az aks enable-addons command is the command to install any addon in AKS, including the AGIC addon. The AGIC addon is called ingress-appgw.
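
For completeness, registering the preview feature and installing the aks-preview extension looked roughly like the commands below at the time of writing. The feature flag name is from memory, so double-check the docs before you rely on it:

az extension add --name aks-preview
az feature register --namespace Microsoft.ContainerService --name AKS-IngressApplicationGatewayAddon
# when the feature shows as Registered, propagate it
az provider register --namespace Microsoft.ContainerService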

Installation via the addon is preferred because that makes AGIC part of AKS and part of the managed service for maintenance and upgrades. If you install AGIC via Helm, you are responsible for maintaining and upgrading it. In addition, the Helm deployment requires AAD pod identity, which complicates matters further. From the moment the addon is GA (generally available), I would recommend using it exclusively, as long as your scenario supports it.

That last sentence is important because there are quite a few differences between AGIC installed with Helm and AGIC installed with the addon. Those differences should disappear over time though.

Required access rights for AGIC

AGIC configures AG via ARM (Azure Resource Manager). As such, AGIC requires read and write access to AG. To check whether AGIC has the correct access, use the AGIC pod logs.

Indeed, the AGIC installation results in a pod in the kube-system namespace. On my system, it looks like this (output from kubectl get pods -n kube-system):

ingress-appgw-deployment-7dd969fddb-jfps5 1/1 Running 0 6h50m

When you check the logs of that pod, you should see output like below:

AGIC logs displayed via the wonderful K9S tool πŸ‘

The logs show that AGIC can connect to AG properly. If, however, you get 403 errors, AGIC does not have the correct access rights. That can easily be fixed by granting the Contributor role on your AG to the user-assigned managed identity used by AGIC (if the AKS addon was used). In my case, that is the following account:

User Assigned Managed Identity ingressapplicationgateway-clustername
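
A role assignment along the lines below fixes the 403 errors. It's a sketch: the identity lives in the AKS node resource group and the names will differ in your environment (the $appgwId variable is the one we retrieved earlier):

# look up the AGIC user-assigned managed identity and grant it Contributor on AG
identityId=$(az identity show -g <aks-node-resource-group> \
  -n ingressapplicationgateway-<clustername> --query principalId -o tsv)
az role assignment create --assignee-object-id $identityId \
  --assignee-principal-type ServicePrincipal --role Contributor --scope $appgwId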

Configuring AG via Ingresses

Now that AG and AGIC are installed and AGIC has read and write access to AG, we can create Kubernetes Ingress objects like we usually do. Below is an example:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: realtimeapp-ingress
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
    appgw.ingress.kubernetes.io/appgw-ssl-certificate: "origin"
spec:
  rules:
  - host: rt.baeke.info
    http:
      paths:
      - path: /
        backend:
          serviceName: realtimeapp
          servicePort: 80

This is a regular Ingress definition. The ingress.class annotation is super important because it tells AGIC to do its job. The second annotation is part of our use case because we want AG to create an HTTPS listener and we want to use a certificate that is already installed on AG. That certificate contains a Cloudflare origin certificate valid for *.baeke.info and expiring somewhere in 2035. I must make sure I update that certificate at that time! πŸ˜‰

Note that this is just one way of configuring the certificate. You can also save the certificate as a Kubernetes secret and refer to it in your Ingress definition. AGIC will then push that certificate to AG. AGIC also supports Let’s Encrypt, with some help from certmgr. I will let you have some fun with that though! Tell me how it went!

From the moment we create the Ingress, AGIC will pick it up and configure AG. Here’s the listener for instance:

AG listener as created by AGIC

By the way, to create the certificate in AG, use the command below with a cert.pfx file containing the certificate and private key in the same folder:

az network application-gateway ssl-cert create -g resourcegroupname --gateway-name AGname -n origin --cert-file cert.pfx --cert-password SomePassword123

Of course, you can choose any name you like for the -n parameter.

CloudFlare Configuration

As mentioned before, you need to create proxied A or CNAME records. The user connection will go to Cloudflare, Cloudflare will do its thing and then connect to the public IP of AG, returning the results to the user.

To enforce end-to-end encryption, set the mode to Full (strict):

Cloudflare Full (strict) SSL/TLS encryption

For your edge certificate (used at the Cloudflare edge locations), you have several options. One of those options is to use a Cloudflare Universal SSL certificate, which is free. Another option is to use the Advanced Certificate Manager, which comes at an extra cost. On higher plans, you can upload your own certificates. In my case, I have Universal SSL applied, but we mostly use the other two options in production scenarios:

Cloudflare Universal SSL

Via the edge certificate, users can connect securely to a Cloudflare edge location. Cloudflare itself needs to connect securely to AG. We now need to generate an origin certificate we can install on AG:

Creating the origin certificate

The questions that follow are straightforward and not discussed in detail here. This is what they look like:

Generating the origin cert

After clicking next, you will get your certificate and private key in PEM format (default), which you can use to create the .pfx file. You can use the openssl tool as discussed here. Just copy and paste the certificate and private key to separate text files, for example cert.pem and cert.key, and use them as input to the openssl command. Once you have the .pfx file, use the command shown earlier to upload it to AG.
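
A single openssl command does the trick; the file names and password below match the ssl-cert command shown earlier:

openssl pkcs12 -export -in cert.pem -inkey cert.key -out cert.pfx -password pass:SomePassword123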

In the Edge Certificates section, it is recommended to also enable Always Use HTTPS, which redirects HTTP traffic to HTTPS.

Redirect HTTP to HTTPS

Restricting AG to Cloudflare traffic

Application Gateway v2 is automatically deployed with a public IP. You can restrict access to that IP address with an NSG.

It is important to understand how the NSG works before you start creating it. The documentation provides all the information you need, but be aware the steps are different for AG v2 compared to AG v1.

Here is a screenshot of my NSG inbound rules, outbound rules were left at the default:

NSG on the AG subnet

Note that the second rule only allows access on port 443 from Cloudflare addresses as found here.
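
To give you an idea, the Cloudflare rule can be created with a command like the one below. This is a sketch: substitute your own names and the current list of ranges from cloudflare.com/ips, and remember that AG v2 also needs the documented inbound rule for ports 65200-65535 from the GatewayManager service tag:

az network nsg rule create -g <resource-group> --nsg-name <nsg-name> -n AllowCloudflare443 \
  --priority 200 --direction Inbound --access Allow --protocol Tcp \
  --source-address-prefixes <ip ranges from cloudflare.com/ips> \
  --destination-port-ranges 443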

Let’s check if only Cloudflare has access with curl. First I run the following command without the NSG applied:

curl --header "Host: rt.baeke.info" https://52.224.72.167 --insecure

The above command responds with:

Response from curl command

When I apply the NSG and give it some time, the curl command times out.

Conclusion

In this post, we looked at using Application Gateway Ingress Controller, which configures Application Gateway based on Kubernetes Ingress definitions. We have also looked at combining Application Gateway with Cloudflare, by using Cloudflare proxying in combination with an Azure Network Security Group that only allows access to Application Gateway from well-known IP addresses. I hope you liked this and if you have any remarks or spotted errors, let me know!

HashiCorp Waypoint Image Tagging

Recently (October, 2020) I posted an introduction to HashiCorp Waypoint on my YouTube channel. It shows how to build, push, deploy and release applications to Kubernetes with a single waypoint up command. If you want to check out that video first, see below ⬇⬇⬇

After watching that video, it should be clear that you drive the process from a file called waypoint.hcl. The waypoint.hcl to deploy the Azure Function app in the video, is shown below:

project = "wptest-hello"

app "wptest-hello" {
  labels = {
    "service" = "wptest-hello",
    "env" = "dev"
  }

  build {
    use "docker" {}
    registry {
        use "docker" {
          image = "gbaeke/wptest-hello"
          tag = "latest"
          local = false
        }
    }
  }

  deploy {
    use "kubernetes" {
        service_port = 80
        probe_path = "/"
    }
  }

  release {
    use "kubernetes" {
       load_balancer =  true
    }
  }
}

In the build stanza, use “docker” tells Waypoint to build the container image from a local Dockerfile. With registry, we push that image to, in this case, Docker Hub. Instead of Docker Hub, other registries can be used as well. Before the image is pushed to the registry, it is first tagged with the tag you specify. Here, that is the latest tag. Although that is easy, you should not use that tag in your workflow because you will not get different images per application version. And you certainly want that when you do multiple deploys based on different code.

To make the tag unique, you can replace “latest” with the gitrefpretty() function, as shown below:

build {
    use "docker" {}
    registry {
        use "docker" {
          image = "gbaeke/wptest-hello"
          tag = gitrefpretty()
          local = false
        }
    }
  }

Assuming you work with git and commit your code changes πŸ˜‰, gitrefpretty() will return the git commit sha at the time of build.

You can check the commit sha of each commit with git log:

git log showing each commit with its sha-1 checksum

When you use gitrefpretty() and you issue the waypoint build command, the images will be tagged with the sha-1 checksum. In Docker Hub, that is clearly shown:

Image with commit sha tag pushed to Docker Hub

That’s it for this quick post. If you have further questions, just hit me up on Twitter or leave a comment!

Azure Private Link and DNS

When you are just starting out with Azure Private Link, it can be hard figuring out how name resolution works and how DNS has to be configured. In this post, we will take a look at some of the internals and try to clear up some of the confusion. If you end up even more confused then I’m sorry in advance. Drop me your questions in the comments if that happens. πŸ˜‰ I will illustrate the inner workings with a Cosmos DB account. It is similar for other services.

Wait! What is Private Link?

Azure Private Link provides private IP addresses for services such as Cosmos DB, Azure SQL Database and many more. You choose where the private IP address comes from by specifying a VNET and subnet. Without Private Link, these services are normally accessed via a public IP address or via virtual network service endpoints (still the public IP, but over the Azure network and restricted to selected subnets). There are several issues or shortcomings with those options:

  • for most customers, accessing databases and other services over the public Internet is just not acceptable
  • although network service endpoints provide a solution, this only works for systems that run inside an Azure Virtual Network (VNET)

When you want to access a service like Cosmos DB from on-premises networks and keep the traffic limited to your on-premises networks and Azure virtual networks, Azure Private Link is the way to go. In addition, you can filter the traffic with Azure Firewall or a virtual appliance, typically installed in a hub site. Now let’s take a look at how this works with Cosmos DB.

Azure Private Link for Cosmos DB

I deployed a Cosmos DB account in East US and called it geba-cosmos. To access this account and work with collections, I can use the following name: https://geba-cosmos.documents.azure.com:443/. As explained before, geba-cosmos.documents.azure.com resolves to a public IP address. Note that you can still control who can connect to this public IP address. Below, only my home IP address is allowed to connect:

Cosmos DB configured to allow access from selected networks

In order to connect to Cosmos DB using a private IP address in your Azure Virtual Network, just click Private Endpoint Connections below Firewall and virtual networks:

Private Endpoint Connections for a Cosmos DB account with one private endpoint configured

To create a new private endpoint, click + Private Endpoint and follow the steps. The private endpoint is a resource on its own which needs a name and region. It should be in the same region as the virtual network you want to grab an IP address from. In the second screen, you can select the resource you want the private IP to point to (can be in a different region):

Private endpoint that will connect to a Cosmos DB account in my directory (target sub-resource indicates the Cosmos DB API, here the Core SQL API is used)

In the next step, you select the virtual network and subnet you want to grab an IP address from:

VNET and subnet to grab the IP address for the private endpoint

In this third step (Configuration), you will be asked if you want Private DNS integration. The default is Yes but I will select No for now.

Note: it is not required to use a Private DNS zone with Private Link

When you finish the wizard and look at the created private endpoint, it will look similar to the screenshot below:

Private endpoint configured

In the background, a network interface was created and attached to the selected virtual network. Above, the network interface is pe-geba-cosmos.nic.a755f7ad-9d54-4074-996c-8a14e9434898. The network interface screen will look like the screenshot below:

Network interface attached to subnet servers in VNET vnet-us1; it grabbed the next available IP of 10.1.0.5 as primary (but also 10.1.0.6 as secondary; click IP configurations to see that)

The interesting part is the Custom DNS Settings. How can you resolve the name geba-cosmos.documents.azure.com to 10.1.0.5 when a client (either in Azure or on-premises) requests it? Let’s look at DNS resolution next…

DNS Resolution

Let’s use dig to check what a request for a Cosmos DB account returns without Private Link. I have another account, geba-test, that I can use for that:

dig with a Cosmos DB account without private link

The above DNS request was made on my local machine, using public DNS servers. The response from Microsoft DNS servers for geba-test.documents.azure.com is a CNAME to a cloudapp.net name which results in IP address 40.78.226.8.

The response from the DNS server will be different when private link is configured. When I resolve geba-cosmos.documents.azure.com, I get the following:

Resolving the Cosmos DB hostname with private link configured

As you can see, the Microsoft DNS servers respond with a CNAME of accountname.privatelink.documents.azure.com but, by default, that CNAME goes to a cloudapp.net name that resolves to a public IP.

This means that, if you don’t take specific action to resolve accountname.privatelink.documents.azure.com to the private IP, you will just end up with the public IP address. In most cases, you will not be able to connect because you will restrict public access to Cosmos DB. It’s important to note that you do not have to restrict public access and that you can enable both private and public access. Most customers I work with though, restrict public access.

Resolving to the private IP address

Before continuing, it’s important to state that developers should connect to https://accountname.documents.azure.com (if they use the gateway mode). In fact, Cosmos DB expects you to use that name. Don’t try to connect with the IP address or some other name because it will not work. This is similar for services other than Cosmos DB. In the background though, we will make sure that accountname.documents.azure.com goes to the internal IP. So how do we make that happen? In what follows, I will list a couple of solutions. I will not discuss using a hosts file on your local pc, although it is possible to make that work.

Create privatelink DNS zones on your DNS servers
Note: this approach is not recommended; it can present problems later when you need to connect to private link enabled services that are not under your control

With this approach, we create a zone for privatelink.documents.azure.com on our own DNS servers and add the following records:

  • geba-cosmos.privatelink.documents.azure.com. IN A 10.1.0.5
  • geba-cosmos-eastus.privatelink.documents.azure.com. IN A 10.1.0.6

Note: use a low TTL like 10s (similar to Azure Private DNS; see below)

When the DNS server has to resolve geba-cosmos.documents.azure.com, it will get the CNAME response of geba-cosmos.privatelink.documents.azure.com and will be able to answer authoritatively that the address is 10.1.0.5.

If you use this solution, you need to make sure that you register the custom DNS settings listed by the private endpoint resource manually. If you want to try this yourself, you can easily do this with a Windows virtual machine with the DNS role or a Linux VM with bind.
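
A quick check from a machine that uses those DNS servers should then return the private IP; something like:

dig +short geba-cosmos.documents.azure.com
# expected: the privatelink CNAME followed by 10.1.0.5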

Use Azure Private DNS zones
If you do not want to register the custom DNS settings of the private endpoint manually in your own DNS servers, you can use Azure Private DNS. You can create the private DNS zone during the creation of the private endpoint. An internal zone for privatelink.documents.azure.com will be created and Azure will automatically add the DNS records the private endpoint requires:

Azure Private DNS with automatic registration of the required Cosmos DB A records

This is great for systems running in Azure virtual networks that are associated with the private DNS zone and that use the DNS servers provided by Azure but you still need to integrate your on-premises DNS servers with these private DNS zones. The way to do that is explained in the documentation. In particular, the below diagram is important:

On-premises forwarding to Azure DNS
Source: Microsoft docs

The example above is for Azure SQL Database but it is similar to our Cosmos DB example. In essence, you need the following:

  • DNS forwarder in the VNET (above, that is 10.5.0.254): this is an extra (!!!) Windows or Linux VM configured as a DNS forwarder; it should forward to 168.63.129.16 which points to the Azure-provided DNS servers; if the virtual network of the VM is integrated with the private DNS zone that hosts privatelink.documents.azure.com, the A records in that zone can be resolved properly
  • To allow the on-premises server to return the privatelink A records, set up conditional forwarding for documents.azure.com to the DNS forwarder in the virtual network

What should you do?

That’s always difficult to answer but most customers I work with initially tend to go for option 1. They want to create a zone for privatelink.x.y.z and register the records manually. Although that could be automated, it’s often a manual step. In general, I do not recommend using it.

I prefer the private DNS method because of the automatic registration of the records and the use of conditional forwarding. Although I don’t like the extra DNS servers, they will not be needed most of the time because customers tend to work with the hub/spoke model and the hub already contains DNS servers. Those DNS servers can then be configured to enable the resolution of the privatelink zones via forwarding.

Azure Security Center and Azure Kubernetes Service

Quick post and note to self today… Azure Security Center checks many of your resources for vulnerabilities or attacks. For a while now, it also does so for Azure Kubernetes Service (AKS). In my portal, I saw the following:

Attacked resources?!? Now what?

There are many possible alerts. These are the ones I got:

Some of the alerts for AKS in Security Center

The first one, for instance, reports that a container has mounted /etc/kubernetes/azure.json on the AKS worker node where it runs. That is indeed a sensitive path because azure.json contains the credentials of the AKS service principal. In this case, it’s Azure Key Vault Controller that has been configured to use this principal to connect to Azure Key Vault.

Another useful one is the alert for new high privilege roles. In my case, these alerts are the result of installing Helm charts that include such a role. For example, the helm-operator chart includes a role which uses a ClusterRoleBinding for [{"resources":["*"],"apiGroups":["*"],"verbs":["*"]}]. Yep, that's high privilege indeed.

Remember, you will need Azure Security Center Standard for these capabilities. Azure Kubernetes Service protection is charged per AKS core at $2/VM core/month in the preview (according to what I see in the portal).

Security Center Standard pricing in preview for AKS

Be sure to include Azure Security Center Standard when you are deploying Azure resources (not just AKS). The alerts you get are useful. In most cases, you will also learn a thing or two about the software you are deploying! πŸ˜†

Creating Kubernetes secrets from Key Vault

If you do any sort of development, you often have to deal with secrets. There are many ways to deal with secrets; one of them is retrieving the secrets from a secure system in your own code. When your application runs on Kubernetes and your code (or 3rd party code) cannot be configured to retrieve the secrets directly, you have several options. This post looks at one such solution: Azure Key Vault to Kubernetes from Sparebanken Vest, Norway.

In short, the solution connects to Azure Key Vault and does one of two things:

  • sync Key Vault secrets to regular Kubernetes secrets (the controller used in this post)
  • inject Key Vault secrets directly into your pods as environment variables (the Env Injector, mentioned at the end of this post)

In my scenario, I just wanted regular secrets to use in a KEDA project that processes IoT Hub messages. The following secrets were required:

  • Connection string to a storage account: AzureWebJobsStorage
  • Connection string to IoT Hub’s event hub: EventEndpoint

In the YAML that deploys the pods that are scaled by KEDA, the secrets are referenced as follows:

env:
 - name: AzureFunctionsJobHost__functions__0
   value: ProcessEvents
 - name: FUNCTIONS_WORKER_RUNTIME
   value: node
 - name: EventEndpoint
   valueFrom:
     secretKeyRef:
       name: kedasample-event
       key: EventEndpoint
 - name: AzureWebJobsStorage
   valueFrom:
     secretKeyRef:
       name: kedasample-storage
       key: AzureWebJobsStorage

Because the YAML above is deployed with Flux from a git repo, we need to get the secrets from an external system. That external system in this case, is Azure Key Vault.

To make this work, we first need to install the controller. That is very easy to do with the Helm chart. By default, this Helm chart will work well on Azure Kubernetes Service as long as you give the AKS service principal read access to Key Vault:

Access policies in Key Vault (azure-cli-2019-… is the AKS service principal here)
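
If you prefer the CLI over the portal, the access policy above boils down to something like this (the appId of the AKS service principal is a placeholder):

az keyvault set-policy -n gebakv --spn <appId-of-aks-service-principal> --secret-permissions get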

Next, define the secrets in Key Vault:

Secrets in Key Vault

With the access policies in place and the secrets defined in Key Vault, the controller installed by the Helm chart can do its work with the following YAML:

apiVersion: spv.no/v1alpha1
kind: AzureKeyVaultSecret
metadata:
  name: eventendpoint
  namespace: default
spec:
  vault:
    name: gebakv
    object:
      name: EventEndpoint
      type: secret
  output:
    secret: 
      name: kedasample-event
      dataKey: EventEndpoint
      type: opaque
---
apiVersion: spv.no/v1alpha1
kind: AzureKeyVaultSecret
metadata:
  name: azurewebjobsstorage
  namespace: default
spec:
  vault:
    name: gebakv
    object:
      name: AzureWebJobsStorage
      type: secret
  output:
    secret: 
      name: kedasample-storage
      dataKey: AzureWebJobsStorage
      type: opaque     

The above YAML defines two objects of kind AzureKeyVaultSecret. In each object we specify the Key Vault secret to read (vault) and the Kubernetes secret to create (output). The above YAML results in two Kubernetes secrets:

Two regular secrets

When you look inside such a secret, you will see:

Inside the secret

To double-check the secret, just run echo RW5K… | base64 -d to see the decoded value and verify that it matches the secret stored in Key Vault. You can now reference the secret with valueFrom as shown earlier in this post.
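
You can also grab and decode the value in one go with kubectl, using the secret and key names defined above:

kubectl get secret kedasample-event -n default -o jsonpath='{.data.EventEndpoint}' | base64 -d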

Conclusion

If you want to turn Azure Key Vault secrets into regular Kubernetes secrets for use in your manifests, give the solution from Sparebanken Vest a go. It is very easy to use. If you do not want regular Kubernetes secrets, opt for the Env Injector instead, which injects the environment variables directly in your pod.

Creating a Kubernetes operator on Windows and WSL

I have always wanted to create a Kubernetes operator with the operator framework and tried to give that a go on my Windows 10 system. Note that the emphasis is on creating an operator, not necessarily writing a useful one 😁. All I am doing is using the boilerplate that is generated by the framework. If you have never even seen how this is done, then this post is for you. 👍

An operator is an application-specific controller. A controller is a piece of software that implements a control loop, watching the state of the Kubernetes cluster via the API. It makes changes to the state to drive it towards the desired state.

An operator uses Kubernetes to create and manage complex applications. Many operators can be found here: https://operatorhub.io/. The Cassandra operator for instance, has domain-specific knowledge embedded in it, that knows how to deploy and configure this database. That’s great because that means some of the burden is shifted from you to the operator.

Installation

I installed the Operator SDK CLI from the GitHub releases in WSL, Windows Subsystem for Linux. I am using WSL 1, not WSL 2 as I am not running a Windows Insiders release. The commands to run:

RELEASE_VERSION=v0.13.0 

curl -LO https://github.com/operator-framework/operator-sdk/releases/download/${RELEASE_VERSION}/operator-sdk-${RELEASE_VERSION}-x86_64-linux-gnu 

chmod +x operator-sdk-${RELEASE_VERSION}-x86_64-linux-gnu && sudo mkdir -p /usr/local/bin/ && sudo cp operator-sdk-${RELEASE_VERSION}-x86_64-linux-gnu /usr/local/bin/operator-sdk && rm operator-sdk-${RELEASE_VERSION}-x86_64-linux-gnu 

You should now be able to run operator-sdk in WSL 1.

Creating an operator

In WSL, you should have installed Go. I am using version 1.13.5. Although not required, I used my Go path on Windows to generate the operator and not the %GOPATH set in WSL. My working directory was:

/mnt/c/Users/geert/go/src/github.com/baeke.info

To create the operator, I ran the following commands (the operator-sdk command is a single line):

export GO111MODULE=on

operator-sdk new fun-operator --repo github.com/baeke.info/fun-operator

This creates a folder, fun-operator, under baeke.info and sets up the project:

Project structure in VS Code

Before continuing, cd into fun-operator and run go mod tidy. Now we can run the following command:

operator-sdk add api --api-version=fun.baeke.info/v1alpha1 --kind FunOp

This creates a new CRD (Custom Resource Definition) API called FunOp. The API version is fun.baeke.info/v1alpha1, which you choose yourself. With the above, you can create custom resources like the one below that the operator acts upon:

apiVersion: fun.baeke.info/v1alpha1
kind: FunOp
metadata:
  name: example-funop 

Now we can add a controller that watches for the above CRD resource:

operator-sdk add controller --api-version=fun.baeke.info/v1alpha1 --kind=FunOp

The above will generate a file, funop_controller.go, that contains some boilerplate code that creates a busybox pod. The Reconcile function is responsible for doing this work:

Reconcile function in the controller (incomplete)

As stated above, I will just use the boilerplate code and build the project:

operator-sdk build gbaeke/fun-operator

In WSL 1, you cannot run Docker so the above command will build the operator from the Go code but fail while building the container image. Can’t wait for WSL 2! The build creates the following artifact:

fun-operator in _output/bin

The supplied Dockerfile can be used to build the container images in Windows. In Windows, copy the Dockerfile from the build folder to the root of the operator project (in my case C:\Users\geert\go\src\github.com\baeke.info\fun-operator) and run docker build and push:

docker build -t gbaeke/fun-operator .

docker push gbaeke/fun-operator

Deploying the operator

The project folder structure contains a bunch of yaml in the deploy folder:

Great! Some YAML to deploy

The service account, role and role binding make sure your code can create (or delete/update) resources in the cluster. The operator.yaml actually deploys the operator on your cluster. You just need to update the container spec with the name of your image (here gbaeke/fun-operator).

Before you deploy the operator, make sure you deploy the CRD manifest (here fun.baeke.info_funops_crd.yaml).

As always, just use kubectl apply -f with the above YAML files.
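
In my case, that came down to something like the commands below; the file names are the ones generated by the SDK, so double-check your deploy folder:

kubectl apply -f deploy/crds/fun.baeke.info_funops_crd.yaml   # the CRD first
kubectl apply -f deploy/service_account.yaml
kubectl apply -f deploy/role.yaml
kubectl apply -f deploy/role_binding.yaml
kubectl apply -f deploy/operator.yaml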

Testing the operator

With the operator deployed, create a resource based on the CRD. For instance:

apiVersion: fun.baeke.info/v1alpha1
kind: FunOp
metadata:
  name: example-funop  

From the moment you create this resource with kubectl apply, a pod will be created by the operator.

pod created upon submitting the custom resource

When you delete example-funop, the pod will be removed by the operator.

That’s it! We created a Kubernetes operator with the boilerplate code supplied by the operator-sdk cli. Another time, maybe we’ll create an operator that actually does something useful! πŸ˜‰

Use a Power Automate Button to start an Azure DevOps build on the go

In a previous post, we built a pipeline to deploy AKS using Azure DevOps. Because it can take a while to deploy, it can be handy to start the deployment at any time without having to log on to Azure DevOps. There are many ways to achieve this, but one of the easiest is Power Automate.

Microsoft has made it easy to create such a flow because Azure DevOps is supported out of the box. The flow looks like this:

Flow to trigger an Azure DevOps build

The flow uses a manual trigger which allows you to start the flow from the iOS app using a button:

Button in the iOS app that triggers the flow

As simple as that… πŸ‘

Trying Civo’s Kubernetes Service

In a previous post I talked about k3sup, a tool to easily install k3s on any system available over SSH. If you don’t know what k3s is, it’s a lightweight version of Kubernetes. It also runs on ARMv7 and ARM64 processors. That means it’s also compatible with a Raspberry Pi.

If I am not mistaken, Civo is the first cloud provider that offers a managed k3s service. Just like the other Civo services it is very easy to use. At this point in time, the service is in beta and you need to be accepted to participate.

Deploying the cluster

The cluster can be deployed via the portal, CLI or the REST API. Portal deployment is very simple:

  • set a name
  • set the size of the nodes
  • set the number of nodes

Creating a new cluster

After deployment, you will see the cluster as follows:

Yes, a deployed cluster

Marketplace

Kubernetes on Civo comes with a marketplace of Kubernetes apps to install during or after cluster deployment. By default, Traefik is selected but you can add other apps. I added Helm for instance:

My installed apps plus a view on the marketplace

Getting your Kubeconfig

You can use the portal to grab the Kubeconfig file:

Downloading Kubeconfig

Then, in your shell, set the KUBECONFIG environment variable to the path where you downloaded the file. Alternatively, you can use the Civo CLI to obtain the Kubeconfig file.
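
For example (the path to the downloaded file is obviously your own):

export KUBECONFIG=~/Downloads/civo-kubeconfig
kubectl get nodes   # quick check that the cluster responds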

Deploying an application

Let’s install my image classifier app to the cluster and expose it via Traefik. Let’s look at the Traefik service in the cluster:

Traefik service in the cluster

If you look closely, you will see that the Traefik service is exposed on each node. Currently, there is no integration with Civo’s load balancers. You do get a DNS name that uses round robin over the IP addresses of the nodes. The DNS name is something like 232b548e-897f-41d3-86f6-1a2a38516a58.k8s.civo.com.

Let’s install and expose my image classifier with the following basic YAML:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: nasnet-ingress
  namespace: default
  annotations:
    kubernetes.io/ingress.class: traefik
spec:
  rules:
  - host: IPADDRESS.nip.io
    http:
      paths:
      - path: /
        backend:
          serviceName: nasnet-svc
          servicePort: 80
---
kind: Service
apiVersion: v1
metadata:
  name: nasnet-svc
spec:
  selector:
    app: nasnet
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9090
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nasnet-app
  labels:
    app: nasnet
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nasnet
  template:
    metadata:
      labels:
        app: nasnet
    spec:
      containers:
      - name: nasnet
        image: gbaeke/nasnet
        resources:
          limits:
            cpu: "0.5"
          requests:
            cpu: "0.2"
        ports:
        - containerPort: 9090

In the above YAML, replace IPADDRESS with one of the IP addresses of your nodes. With a little help from nip.io, that name will resolve to the IP address you specify.

The result:

Creepy but it works!

Conclusion

This was just a quick look at Civo’s Kubernetes service. It is easy to install and comes with an easy-to-use marketplace to quickly get you started. Civo got this service up and running in a relatively short time, and I am sure it will rapidly evolve into a strong contender among the other managed Kubernetes services out there.