Improving an Azure Function that writes IoT Hub data to TimescaleDB

In an earlier post, I used an Azure Function to write data from IoT Hub to a TimescaleDB hypertable on PostgreSQL. Although that function works for demo purposes, there are several issues. Two of those issues will be addressed in this post:

  1. the INSERT INTO statement used the NOW() function instead of the enqueuedTimeUtc field; that field is provided by IoT Hub and represents the time the message was enqueued
  2. the INSERT INTO query does not use upsert functionality; if for some reason you need to process the IoT Hub data again, you will end up with duplicate data; you code should be idempotent

Using enqueuedTimeUtc

Using the time the event was enqueued means we need to retrieve that field from the message that our Azure Function receives. The Azure Function receives outside information via two parameters: context and eventHubMessage. The enqueuedTimeUtc field is retrieved via the context variable: context.bindingData.enqueuedTimeUtc.

In the INSERT INTO statement, we need to use TIMESTAMP ‘UCT time’. In JavaScript, that results in the following:

'insert into conditions(time, device, temperature, humidity) values(TIMESTAMP \'' + context.bindingData.enqueuedTimeUtc + '\',\'' + eventHubMessage.device + '\' ...

Using upsert functionality

Before adding upsert functionality, add a unique constraint to the hypertable like so (via pgAdmin):

CREATE UNIQUE INDEX on conditions (time, device); 

It needs to be on time and device because the time field on its own is not guaranteed to be unique. Now modify the INSERT INTO statement like so:

'insert into conditions(time, device, temperature, humidity) values(TIMESTAMP \'' + context.bindingData.enqueuedTimeUtc + '\',\'' + eventHubMessage.device + '\',' + eventHubMessage.temperature + ',' + eventHubMessage.humidity + ') ON CONFLICT DO NOTHING'; 

Notice the ON CONFLICT clause? When any constraint is violated, we do nothing. We do not add or modify data, we leave it all as it was.

The full Azure Function code is below:

Azure Function code with IoT Hub enqueuedTimeUtc and upsert

Conclusion

The above code is a little bit better already. We are not quite there yet but the two changes make sure that the date of the event is correct and independent from when the actual processing is done. By adding the constraint and upsert functionality, we make sure we do not end up with duplicate data when we reprocess data from IoT Hub.

Azure Functions with Consumption Plan on Linux

In a previous post, I talked about saving time-series data to TimescaleDB, which is an extension on top of PostgreSQL. The post used an Azure Function with an Event Hub trigger to save the data in TimescaleDB with a regular INSERT INTO statement.

The Function App used the Windows runtime which gave me networking errors (ECONNRESET) when connecting to PostgreSQL. I often encounter those issues with the Windows runtime. In general, for Node.js, I try to stick to the Linux runtime whenever possible. In this post, we will try the same code but with a Function App that uses the Linux runtime in a Consumption Plan.

Make sure Azure CLI is installed and that you are logged in. First, create a Storage Account:

az storage account create --name gebafuncstore --location westeurope --resource-group funclinux --sku Standard_LRS

Next, create the Function App. It references the storage account you created above:

az functionapp create --resource-group funclinux --name funclinux --os-type Linux --runtime node --consumption-plan-location westeurope --storage-account gebafuncstore

You can also use a script to achieve the same results. For an example, see
https://docs.microsoft.com/en-us/azure/azure-functions/scripts/functions-cli-create-serverless.

Now, in the Function App, set the following Application Settings. These settings will be used in the code we will deploy later.

  • host: hostname of the PostgreSQL server (e.g. servername.postgres.database.azure.com)
  • user: user name (e.g. user@servername)
  • password
  • database: name of the PostgreSQL database
  • EH: connection string to the Event Hub interface of your IoT Hub; if your are unsure how to set this, see this post

You can set the above values from the Azure Portal:

Application Settings of the Function App

The function uses the first four Application Settings in the function code via process.env:

Using Application Settings in JavaScript

The application setting EH is used to reference the Event Hub in function.json:

function.json with Event Hub details such as the connection, cardinality and the consumerGroup

Now let’s get the code from my GitHub repo in the Azure Function. First install Azure Function Core Tools 2.x. Next, create a folder called funcdemo. In that folder, run the following commands:

git clone https://github.com/gbaeke/pgfunc.git
cd pgfunc
npm install
az login
az account show

The npm install command installs the pg module as defined in package.json. The last two commands log you in and show the active subscription. Make sure that subscription contains the Function App you deployed above. Now run the following command:

func init

Answer the questions: we use Node and JavaScript. You should now have a local.settings.json file that sets the FUNCTIONS_WORKER_RUNTIME to node. If you do not have that, the next command will throw an error.

Now issue the following command to package and deploy the function to the Function App we created earlier:

func azure functionapp publish funclinux

This should result in the following feedback:

Feedback from function deployment

You should now see the function in the Function App:

Deployed function

To verify that the function works as expected, I started my IoT Simulator with 100 devices that send data every 5 seconds. I also deleted all the existing data from the TimescaleDB hypertable. The Live Metrics stream shows the results. In this case, the function is running smoothly without connection reset errors. The consumption plan spun up 4 servers:

Live Metrics Stream of IoT Hub to PostgreSQL function

IoT with Azure Database for PostgreSQL and TimescaleDB

In IoT projects, the same question always comes up: “Where do we store our telemetry data?”. As usual, the answer to that question is not straightforward. We have seen all kinds of solutions in the wild:

  • save directly to a relational database (SQL Server, MySQL, …)
  • save to a data lake and/or SQL
  • save to Cosmos DB or similar (e.g. MongoDB)
  • save to Azure Table Storage or similar
  • save to Time Series Insights

Saving the data to a relational database is often tempting. It fits in existing operational practices and it is easy to extract, transform and visualize the data. In practice, I often recommend against this approach except in the simplest of use cases. The reason is clear: these databases are not optimized for fast ingestion of time-series data. Instead, you should use a time-series database which is optimized for fast ingest and efficient processing of time-series data.

TimescaleDB

TimescaleDB is a an open-source time-series databases optimized for fast ingest even when the amount of data stored becomes large. It does not stand on its own, as it runs on PostgreSQL as an extension. Note that you can store time-series in a regular table or as a TimescaleDB hypertable. The graphic below (from this post), shows the difference:

A comparison on Azure PostgreSQL with and without TimescaleDB and observed degradation in insert performance over time.
Test on general purpose compute Gen 5 with 8 vCores, 45GB RAM with Premium Storage

The difference is clear. With a regular table, the insert rate is lowered dramatically when the amount of data becomes large.

The TimescaleDB extension can easily be installed on Azure Database for PostgreSQL. Let’s see how that goes shall we?

Installing TimescaleDB

To create an Azure Database for PostgreSQL instance, I will use the Azure CLI with the db-up extension:

az postgres up -g RESOURCEGROUP -s SERVERNAME -d DBNAME -u USER -p PASSWORD 

The server name you provide should result in a unique URL for your database (e.g. servername.postgres.database.azure.com).

Tip: do not use admin as the user name ūüĎć

When the server has been provisioned, modify the server confguration for the TimescaleDB extension:

az postgres server configuration set --resource-group RESOURCEGROUP ‚Äď‚Äďserver-name SERVERNAME --name shared_preload_libraries --value timescaledb

Now you need to actually install the extension. Install pgAdmin and issue the following query:

CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;

In pgAdmin, you should see extra schemas for TimescaleDB:

Extra schemas for TimescaleDB

Creating a hypertable

A hypertable uses partitioning to optimize writing and reading time-series data. Creating such a table is straightforward. You start with a regular table:

CREATE TABLE conditions (   
time TIMESTAMPTZ NOT NULL,
location TEXT NOT NULL,
temperature DOUBLE PRECISION NULL,
humidity DOUBLE PRECISION NULL );

Next, convert to a hypertable:

SELECT create_hypertable('conditions', 'time');

The above command partitions the data by time, using the values in the time column. By default, the time interval for partitioning is set to 7 days, starting from version 0.11.0 of TimescaleDB. You can override this by setting chunk_time_interval when creating the hypertable. You should make sure that the chunk belonging to the most recent interval can fit into memory. According to best practices, such a chunk should not use more than 25% of main memory.

Now that we have the hypertable, we can write time-series data to it. One advantage of being built on top of a relational database such as PostgreSQL is that you can use standard SQL INSERT INTO statements. For example:

const query = 'insert into conditions(time, device, temperature, humidity) values(NOW(),\'' + eventHubMessage.device + '\',' +         + eventHubMessage.temperature + ',' + eventHubMessage.humidity + ');';

The example above is from an Azure Function we will look at in a moment. In extracts values from a message received via IoT Hub and inserts them into the hypertable via an INSERT INTO query.

Let’s take a look at the Azure Function next.

Azure Function: from IoT Hub to the Hypertable

The Azure Function is kept bare bones to focus on the essentials. Note that you will need to open the console and install the pg module with the following command:

npm install pg

The image below shows the Azure Function (based on this although the article does not use a hypertable and stores the telemetry as JSON).

Bare bones Azure Function to write IoT Hub data to the hypertable

Naturally, the Azure Function above requires an Azure Event Hubs trigger. In this case, event hub cardinality was set to One. More information here. Note that you should NOT use the NOW() function to set the time. It’s only used here for demo purposes. Instead, you should take the timestamp sent by the device or the time the data was queued at the Event Hub!

Naturally, you will also need an IoT Hub where you send your data. In this case, I created a standard IoT Hub and used the IoT Hub Visual Studio Code extension to generate code (1 in image below) to send sample messages. I modified the code somewhat to include the device name (2 in image below):

Visual Studio toolkit used to create the device, generate code and modify the code

Now we can run the code (saved as sender.js) with:

node sender.js

Note: do not forget to first run npm install azure-iot-device

Data is being sent:

Data sent to IoT Hub

Data processed by the Azure Function as viewed in Application Insights Live Metrics stream:

Application Insights Live Metrics stream

With only one device sending data, there isn’t that much to do! In pgAdmin, you should see connections from at least one of the Azure Function hosts that are active:

Connections to PostgreSQL

Note: I encountered some issues with ECONNRESET errors under higher load; take a look at this post which runs the same function on a Linux Consumption Plan

Querying the data

TimescaleDB, with help from PostgreSQL, has a rich query language especially when compared to some other offerings. Yes, I am looking at you Cosmos DB! ūüėČ Below are some examples (based on the documentation at https://docs.timescale.com/v1.2/using-timescaledb/reading-data:

SELECT COUNT(*) FROM conditions   WHERE time > NOW() - interval '1 minute';

The above query simply counts the messages in the last minute. Notice the flexibility in expressing the time which is what we want from time-series databases.

SELECT time_bucket('1 minute', time) AS one_min, device, COUNT(*),     MAX(temperature) AS max_temp, MAX(humidity) AS max_hum FROM conditions   WHERE time > NOW() - interval '10 minutes' GROUP BY one_min, device   ORDER BY one_min DESC, max_temp DESC;

The above query displays the following result:

Conclusion

When dealing with time-series data, it is often beneficial to use a time-series database. They are optimized to ingest time-series data at high speed and greater efficiency than general purpose SQL or NoSQL databases. The fact that TimescaleDB is built on PostgreSQL means that it can take advantage of the flexibility and stability of PostgreSQL. Although there are many other time-series databases, TimescaleDB is easy to use when coupled with PaaS (platform-as-a-service) PostgreSQL offerings such as Azure Database for PostgreSQL.

Microsoft Face API with a local container

A few days ago, I obtained access to the Face container. It provides access to the Face API via a container you can run where you want: on your pc, at the network edge or in your datacenter. You should allocate 6 GB or RAM and 2 cores for the container to run well. Note that you still need to create a Face API resource in the Azure Portal. The container needs to be associated with the Azure Face API via the endpoint and access key:

Face API with a West Europe (Amsterdam) endpoint

I used the Standard tier, which charges 0.84 euros per 1000 calls. As noted, the container will not function without associating it with an Azure Face API resource.

When you gain access to the container registry, you can pull the container:

docker pull containerpreview.azurecr.io/microsoft/cognitive-services-face:latest

After that, you can run the container as follows (for API billing endpoint in West Europe):

docker run --rm -it -p 5000:5000 --memory 6g --cpus 2 containerpreview.azurecr.io/microsoft/cognitive-services-face Eula=accept Billing=https://westeurope.api.cognitive.microsoft.com/face/v1.0 ApiKey=YOUR_API_KEY

The container will start. You will see the output (–it):

Running Face API container

And here’s the spec:

API spec Face API v1

Before showing how to use the detection feature, note that the container needs Internet access for billing purposes. You will not be able to run the container in fully offline scenarios.

Over at https://github.com/gbaeke/msface-go, you can find a simple example in Go that uses the container. The Face API can take a byte stream of an image or a URL to an image. The example takes the first approach and loads an image from disk as specified by the -image parameter. The resulting io.Reader is passed to the getFace function which does the actual call to the API (uri = http://localhost:5000/face/v1.0/detect):

request, err := http.NewRequest("POST", uri+"?returnFaceAttributes="+params, m)
request.Header.Add("Content-Type", "application/octet-stream")

// Send the request to the local web service
resp, err := client.Do(request)
if err != nil {
    return "", err
}

The response contains a Body attribute and that attribute is unmarshalled to a variable of type interface. That one is marshalled with indentation to a byte slice (b) which is returned by the function as a string:

var response interface{}
err = json.Unmarshal(respBody, &response)
if err != nil {
    return "", err
}
b, err := json.MarshalIndent(response, "", "\t")

Now you can use a picture like the one below:

Is he smiling?

Here are some parts of the input, following the command
detectface -image smiling.jpg

Emotion is clearly happiness with additional features such as age, gender, hair color, etc…

[
{
"faceAttributes": {
"accessories": [],
"age": 33,
"blur": {
"blurLevel": "high",
"value": 1
},
"emotion": {
"anger": 0,
"contempt": 0,
"disgust": 0,
"fear": 0,
"happiness": 1,
"neutral": 0,
"sadness": 0,
"surprise": 0
},
"exposure": {
"exposureLevel": "goodExposure",
"value": 0.71
},
"facialHair": {
"beard": 0.6,
"moustache": 0.6,
"sideburns": 0.6
},
"gender": "male",
"glasses": "NoGlasses",
"hair": {
"bald": 0.26,
"hairColor": [
{
"color": "black",
"confidence": 1
}],
"faceId": "b6d924c1-13ef-4d19-8bc9-34b0bb21f0ce",
"faceRectangle": {
"height": 1183,
"left": 944,
"top": 167,
"width": 1183
}
}
]

That’s it! Give the Face API container a go with the tool. You can get it here: https://github.com/gbaeke/msface-go/releases/tag/v0.0.1 (Windows)

Infrastructure as Code: exploring Pulumi

Image: from the Pulumi website

In my Twitter feed, I often come across Pulumi so I decided to try it out. Pulumi is an Infrastructure as Code solution that allows you to use familiar development languages such as JavaScript, Python and Go. The idea is that you define your infrastructure in the language that you prefer, versus some domain specific language. When ready, you merely use pulumi¬†up to deploy your resources (and pulumi¬†update, pulumi¬†destroy, etc…). The screenshot below shows the deployment of an Azure resource group, storage account, file share and a container group on Azure Container Instances. The file share is mapped as a volume to one of the containers in the container group:

Deploying infrastructure with pulumi up

Installation is extremely straightforward. I chose to write the code in JavaScript as I had all the tools already installed on my Windows box. It is also more polished than the Go option (for now). I installed Pulumi per their instructions over at https://pulumi.io/quickstart/install.html.

Next, I used their cloud console to create a new project. Eventually, you will need to run a pulumi new command on your local machine. The cloud console will provide you with the command to use which is handy when you are just getting started. The cloud console provides a great overview of all your activities:

Nice and green (because I did not include the failed ones ūüėČ)

In Resources, you can obtain a graph of the deployed resources:

Don’t you just love pretty graphs like this?

Let’s take a look at the code. The complete code is in the following gist: https://gist.github.com/gbaeke/30ae42dd10836881e7d5410743e4897c.

Resource group, storage account and share

The above code creates the resource group, storage account and file share. It is so straightforward that there is no need to explain it, especially if you know how it works with ARM. The simplicity of just referring to properties of resources you just created is awesome!

Next, we create a container group with two containers:

Creating the container group

If you have ever created a container group with a YAML file or ARM template, the above code will be very familiar. It defines a DNS label for the group and sets the type to Linux (ACI also supports Windows). Then two containers are added. The realtime-go container uses CertMagic to obtain Let’s Encrypt certificates. The certificates should be stored in persistent storage and that is what the Azure File Share is used for. It is mounted on /.local/share/certmagic because that is where the files will be placed in a scratch container.

I did run into a small issue with the container group. The realtime-go container should expose both port 80 and 443 but the port setting is a single numeric value. In YAML or ARM, multiple ports can be specified which makes total sense. Pulumi has another cross-cloud option to deploy containers which might do the trick.

All in all, I am pleasantly surprised with Pulumi. It’s definitely worth a more in-depth investigation!

Azure API Management Consumption Tier

In the previous post, I talked about a personal application I use to deploy Azure resources to my lab subscription. The architecture is pretty straightforward:

After obtaining an id token from Azure Active directory (v1 endpoint), API calls go to API Management with the token in the authorization HTTP header.

API Management is available in several tiers:

API Management tiers

The consumption tier, with its 1.000.000 free calls per month per Azure subscription naturally is the best fit for this application. I do not need virtual network support or multi-region support or even Active Directory support. And I don’t want the invoice either! ūüėČ Note that the lack of Active Directory support has nothing to do with the ability to verify the validity of a JWT (JSON Web Token).

I created an instance in West Europe but it gave me errors while adding operations (like POSTs or GETs). It complained about reaching the 1000 operations limit. Later, I created an instance in North Europe which had no issues.

Define a product

A product contains one or more APIs and has some configuration such as quotas. You can read up on API products here. You can also add policies at the product level. One example of a policy is a JWT check, which is exactly what I needed. Another example is adding basic authentication to the outgoing call:

Policies at the product level

The first policy, authentication, configures basic authentication and gets the password from the BasicAuthPassword named value:

Named values in API Management

The second policy is the JWT check. Here it is in full:

JWT Policy

The policy checks the validity of the JWT and returns a 401 error if invalid. The openid-config url points to a document that contains useful information to validate the JWT, including a pointer to the public keys that can be used to verify the JWT’s signature (https://login.microsoftonline.com/common/discovery/keys). Note that I also check for the name claim to match mine.

Note that Active Directory is also configured to only issue a token to me. This is done via Enterprise Applications in https://aad.portal.azure.com.

Creating the API

With this out of the way, let’s take a look at the API itself:

Azure Deploy API and its defined operations

The operations are not very RESTful but they do the trick since they are an exact match with the webhookd server’s endpoints.

To not end up with CORS errors, All Operations has a CORS policy defined:

CORS policy at the All operations level

Great! The front-end can now authenticate to Azure AD and call the API exposed by API management. Each call has the Azure AD token (a JWT) in the authorization header so API Management van verify the token’s validity and pass along the request to webhookd.

With the addition of the consumption tier, it makes sense to use API Management in many more cases. And not just for smaller apps like this one!

Azure Front Door in front of a static website

In the previous post, I wrote about hosting a simple static website on an Azure Storage Account. To enable a custom URL such as https://blog.baeke.info, you can add Azure CDN. If you use the Verizon Premium tier, you can configure rules such as a http to https redirect rule. This is similar to hosting static sites in an Amazon S3 bucket with Amazon CloudFront although it needs to be said that the http to https redirect is way simpler to configure there.

On Twitter, Karim Vaes reminded me of the Azure Front Door service, which is currently in preview. The tagline of the Azure Front Door service is clear:¬†“scalable¬†and¬†secure¬†entry¬†point¬†for¬†fast¬†delivery¬†of¬†your¬†global¬†applications”.

Azure Front Door Service Preview

The Front Door service is quite advanced and has features like global HTTP load balancing with instant failover, SSL offload, application acceleration and even application firewalling and DDoS protection. The price is lower that the Verizon Premium tier of Azure CDN. Please note that preview pricing is in effect at this moment.

Configuring a Front Door with the portal is very easy with the Front Door Designer. The screenshot below shows the designer for the same website as the previous post but for a different URL:

Front Door Designer

During deployment, you create a name that ends in azurefd.net (here geba.azurefd.net). Afterwards you can add a custom name like deploy.baeke.info in the above example. Similar to the Azure CDN, Front Door will give you a Digicert issued certificate if you enable HTTPS and choose Front Door managed:

Front Door managed SSL certificate

Naturally, the backend pool will refer to the https endpoint of the static website of your Azure Storage Account. I only have one such endpoint, but I could easily add another copy and start load balancing between the two.

In the routing rule, be sure you select the frontend host that matches the custom domain name you set up in the frontend hosts section:

Routing rule

It is still not as easy as in CloudFront to redirect http to https. For my needs, I can allow both http and https to Front Door and redirect in the browser:

if(window.location.href.substr(0,5) !== 'https'){
window.location.href = window.location.href.replace('http', 'https');
}

Not as clean as I would like it but it does the job for now. I can now access https://deploy.baeke.info via Front Door!