Azure SQL Database High Availability

Creating a SQL Database in Azure is a simple matter: create the database and server, get your connection string and off you go! Before starting though, spend some time thinking about the level of high availability (HA) that you want:

  • What is the required level of HA within the deployment region (e.g. West Europe)?
  • Do you require failover to another region (e.g. from West Europe to North Europe)?

HA in a single region

To achieve the highest level of availability in a region, do the following:

  • Use the Premium (DTU) or Business Critical tier (vCore): Azure will use the premium availability model for your database
  • Enable Availability Zone support if the region supports it: copies of your database will be spread over the zones

The diagram below illustrates the premium availability model (from the Microsoft docs):

Premium Availability Model

The region will contain one primary read/write replica and several secondary replicas. Each replica uses local SSD storage. The database is replicated synchronously and failover is swift and without data loss. If required, you can enable a read replica and specify you want to connect to the read replica by adding ApplicationIntent=ReadOnly to the connection string. Great for reporting and BI!

Spreading the databases over multiple zones is as simple as checking a box. Availability zone support comes at no extra cost but can increase the write commit latency because the nodes are a few kilometers apart. The option to enable zone support is in the Configure section as shown below:

Enabling zone redundancy

To read more about high availability, including the standard availability model for other tiers, check out the docs.

For critical applications, we typically select the Premium/Business Critical model as it provides good performance coupled to the highest possible availability in a region.

Geo-replication

The geo-replication feature replicates a database asynchronously to another region. Suppose you have a database and server in West Europe that you want to replicate to France Central. In the portal, navigate to the database (not the server) and select Geo-Replication. Then check the region, in this case France Central. The following questions show up:

Geo-Replication

A database needs a logical server object that contains the database. To replicate the database to France Central, you need such a server object in that region. The UI above allows you to create that server.

Note that the databases need to use the same tier although the secondary can be configured with less DTUs or vCores. Doing so is generally not recommended.

After configuration, the UI will show the active replication. In this case, I am showing replication from North Europe to West Europe (and not France Central):

Geo-replication is easy to configure but in practice, we recommend to use Failover Groups. A Failover Group uses geo-replication under the hood but gives you some additional features such as:

  • Automated failover (vs manual with just geo-replication)
  • Two servers to use in your connection string that use CNAMEs that are updated in case of failover; one CNAME always points to the read/write replica, the other to the read replica

Failover groups are created at the server level instead of the database level:

Failover group

Below, there is a failover group aks-fo with the primary server in North Europe and the secondary in West Europe:

Failover group details

You can manually fail over the database if needed:

Failover and forced failover

Failover allows you to failover the database without data loss if both databases are still active. Forced failover performs a failover even if the primary is down, which might lead to data loss.

Note: when you configure a Failover Group, geo-replication will automatically be configured for you.

Connecting from an application

To connect to a database configured in a failover group, first get the failover group server names:

Read/write and read-only listener endpoints

Next, in your application, use the appropriate connection string. For instance, in Go:

var sqldb *sql.DB
var server = "aks-fo.database.windows.net"
var port = 1433
var user = "USERNAME"
var password = "PASSWORD"
var database = "DBNAME"

func init() {
    // Build connection string
    connString := fmt.Sprintf("server=%s;user id=%s;password=%s;port=%d;database=%s;",
        server, user, password, port, database)

    var err error

    // Create connection pool
    sqldb, err = sql.Open("sqlserver", connString)
    if err != nil {
        log.Fatal("Error creating connection pool: ", err.Error())
    }
    ctx := context.Background()

    //above commands actually do not connect to SQL but the ping below does
    err = sqldb.PingContext(ctx)
    if err != nil {
        log.Fatal(err.Error())
    }
    log.Printf("Connected!\n")
}

During a failover, there will be an amount of time that the database is not available. When that happens, the connection will fail. The error is shown below:

[db customers]: invalid response code 500, body: {"name":"fault","id":"7JPhcikZ","message":"Login error: mssql: Database 'aksdb' on server 'akssrv-ne' is not currently available.  Please retry the connection later.  If the problem persists, contact customer support, and provide them the session tracing ID of '6D7D70C3-D550-4A74-A69C-D689E6F5CDA6'.","temporary":false,"timeout":false,"fault":true}

Note: the Failover Group uses a CNAME record aks-fo.database.windows.net which resolves to the backend servers in either West or North Europe. Make sure you allow connections to these servers in the firewall or you will get the following error:

db customers]: invalid response code 500, body: {"name":"fault","id":"-p9TwZkm","message":"Login error: mssql: Cannot open server 'akssrv-ne' requested by the login. Client with IP address 'IP ADDRESS' is not allowed to access the server.  To enable access, use the Windows Azure Management Portal or run sp_set_firewall_rule on the master database to create a firewall rule for this IP address or address range.  It may take up to five minutes for this change to take effect.","temporary":false,"timeout":false,"fault":true} 

Conclusion

For the highest level of availability, use the regional premium availability model with Availability Zone support. In addition, use a Failover Group to enable replication of the database to another region. A Failover Group automatically connects your application to the primary (read/write) or secondary replica (read) via a CNAME and can failover automatically after some grace period.

Azure Front Door and multi-region deployments

In the previous post, we looked at publishing and securing an API with Azure Front Door and Azure Web Application Firewall. The API ran on Kubernetes, exposed by Kong and Kong Ingress Controller. Kong was configured to require an API key to call the /users API, allowing us to identify the consumer of the API. The traffic flow was as follows:

Consumer -- HTTPS --> Azure Front Door with WAF policy -- HTTPS --> Kong (exposed with Azure Load Balancer) -- HTTP --> API Kubernetes service --> API pods 

Although Kubernetes makes the API(s) highly available, you might want to take extra precautions such as deploying the API in multiple regions. In this post, we will take a look at doing so. That means we will deploy the API in both West and North Europe, in two distinct Kubernetes clusters:

  • we-clu: Kubernetes cluster in West Europe
  • ne-clu: Kubernetes cluster in North Europe

The flow is very similar of course:

Consumer -- HTTPS --> Azure Front Door with WAF policy -- HTTPS --> Kong (exposed with Azure Load Balancer; region to connect to depends on Front Door configuration and health probes) -- HTTP --> API Kubernetes service --> API pods  

Let’s take a look at the configuration! By the way, the supporting files to deploy the Kubernetes objects are here: https://github.com/gbaeke/api-kong/tree/master. To deploy Kong, check out this post.

Kubernetes

We deploy a Kubernetes cluster in each region, install Helm, deploy Kong, deploy our API and configure ingresses and related Kong custom resource definitions (CRDs). The result is an external IP address in each region that leads to the Kong proxy. Search for “kong” on this blog to find posts with more details about this deployment.

Note that the API deployment specifies an environment variable that will contain the string WE or NE. This environment variable will be displayed in the output when we call the API. Here is the API deployment for West Europe:

apiVersion: v1
kind: Service
metadata:
  name: func
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: func
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: func
spec:
  replicas: 2
  selector:
    matchLabels:
      app: func
  template:
    metadata:
      labels:
        app: func
    spec:
      containers:
      - name: func
        image: gbaeke/ingfunc
        env:
        - name: REGION
          value: "WE"
        ports:
        - containerPort: 80 

When we call the API and Azure Front Door uses the backend in West Europe, the result will be:

WE included in the response from the West Europe cluster

Origin APIs

The origin APIs need to be exposed on the public Internet using a DNS name. Azure Front Door requires a public backend to connect to. Naturally, the backend can be configured to only accept incoming requests from Front Door. In our case, the APIs are available on the public IP of the Kong proxy. The following names were used:

  • api-o-we.baeke.info: Kong proxy in West Europe
  • api-o-ne.baeke.info; Kong proxy in North Europe

Both endpoints are configured to accept TLS connections only, and use a Let’s Encrypt wildcard certificate for *.baeke.info.

Front Door Configuration

The Front Door designer looks the same as in the previous post:

Front Door Designer

However, the backend pool api-o now has two backends:

Two backend hosts, both enabled with same priority and weight

To determine the health of the backend, Front Door needs to be configured with a health probe that returns status code 200. If we were to specify the probe below, the health probe would fail:

Errrrrrr, this won’t work

The health probe would hit Kong’s proxy and return a 404 (Not Found). We did not create a route for /, only for /users. With Azure Front Door, when all health probes fail, all backends are considered healthy. Yes, you read that right.

Although we could create a route called /health that returns a 200, we will use the following probe just to make it work:

Fixing the health probe (quick and dirty fix); just can the /users API

If you are exposing multiple APIs on each cluster, the health probe above would not make sense. Also note that the purpose of the health probe is to determine if the cluster is up or not. It will not fix one API behaving badly or being removed accidentally!

You can check the health probes from the portal:

Yep! Health probes in West and North Europe are at 100%

Connection test

When I connect from my home laptop in Belgium, I get the following response:

Connection to West Europe cluster

When I connect from my second home in Dublin 🤷‍♀️ I get:

Connection from a VM in North Europe (I was kidding about the second home)

If you enable logging to Log Analytics, you can check this in the FrontDoorAccessLog:

Connection from home via Brussels (West Europe would show DB in Tenant_x)

When I remove the /users API in West Europe (kubectl delete deploy func), my home laptop will connect to North Europe as expected:

I didn’t fake this! It’s 100% real, my laptop now connects to the North Europe cluster via Front Door as expected

Note that the calls will not fail from the moment you delete the /users API (the health probe here). That depends on the following setting (backends in Front Door designer):

When should the backend be determined healthy or unhealthy; decrease sample size and or samples to make it go faster

The backend health percentage graph indicates the probe failure as well:

We’ve lost West Europe folks!

Conclusion

When you are going for a multi-region deployment of services, Azure Front Door is one of the options. Of course, there is much more to a multi-region deployment than the “front-end stuff” described in this post. What do you do with databases for instance? Can you use active-active write regions (e.g. Cosmos DB) or does your database only support active/passive with read replicas?

As in other load balancing and fail over solutions, proper health probes are crucial in the design. Think about what a good health probe can be and what it means when it is not available. One option is to just write a health probe exposed via an endpoint such as /health that merely returns a 200 status code. But your health probe could also be designed to connect to backend systems such as databases or queues to determine the health of the system.

Hopefully, this post gives you some ideas to start! Follow me on Twitter for updates.