use of the Event Hub enqueuedTime timestamp instead of NOW() in the INSERT statement (yes, I know, using NOW() did not make sense)
make the code idempotent to handle duplicates (basically do nothing when a unique constraint is violated; see the sketch below)
In general, I prefer to use application time (time at the event publisher) versus the time the message was enqueued. If you don’t have that timestamp, enqueuedTime is the next best thing.
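In SQL terms, the two changes boil down to a statement along these lines. This is just a sketch: the table and column names, the timestamp and the values are examples, and ON CONFLICT DO NOTHING assumes a unique constraint exists on the hypertable.

-- example values; duplicates are detected via the table's unique constraint
INSERT INTO conditions (time, device, temperature, humidity)
VALUES ('2019-04-01T12:00:00Z', 'device01', 23.5, 60)
ON CONFLICT DO NOTHING;

ON CONFLICT DO NOTHING simply skips rows that would violate the unique constraint, which makes retries and duplicate deliveries harmless.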
How can we optimize the function even further? Read on about the cardinality setting!
Event Hub trigger cardinality setting
Our JavaScript Azure Function has its settings in function.json. For reference, here is its content:
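It looks more or less like this (a sketch based on the settings discussed below; your file may contain a few more properties):

{
  "bindings": [
    {
      "type": "eventHubTrigger",
      "name": "IoTHubMessages",
      "direction": "in",
      "eventHubName": "hub-pg",
      "connection": "EH",
      "consumerGroup": "pg",
      "cardinality": "one"
    }
  ]
}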
Clearly, the function uses the eventHubTrigger for an Event Hub called hub-pg. In connection, EH refers to an Application Setting which contains the connection string to the Event Hub. Yes, I excel at naming stuff! A consumer group called pg is defined on the Event Hub, and that is the one we use in this function.
The cardinality setting is currently set to “one”, which means that the function can only process one message at a time. As a best practice, you should use a cardinality of “many” in order to process batches of messages. A setting of “many” is the default.
To make the required change, modify function.json and set cardinality to “many”. You will also have to modify the Azure Function to process a batch of messages versus only one:
Processing batches of messages
With cardinality set to many, the IoTHubMessages parameter of the function is now an array. To retrieve the enqueuedTime from the messages, grab it from the enqueuedTimeUtcArray array using the index of the current message. Notice I also switched to JavaScript template literals to make the query a bit more readable.
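Put together, the batch version of the function could look something like the sketch below. The connection pool setup and the message fields (device, temperature, humidity) are assumptions based on the earlier posts; treat it as an approximation rather than the exact code.

const { Pool } = require('pg');

// Connection details come from the Application Settings (see the earlier post).
const pool = new Pool({
  host: process.env.host,
  user: process.env.user,
  password: process.env.password,
  database: process.env.database,
  port: 5432,
  ssl: true
});

module.exports = async function (context, IoTHubMessages) {
  // With cardinality set to "many", IoTHubMessages is an array of messages.
  for (let i = 0; i < IoTHubMessages.length; i++) {
    const message = IoTHubMessages[i];
    // enqueuedTimeUtcArray contains the enqueued time of each message, by index.
    const enqueuedTime = context.bindingData.enqueuedTimeUtcArray[i];
    const query = `INSERT INTO conditions(time, device, temperature, humidity)
      VALUES('${enqueuedTime}', '${message.device}', ${message.temperature}, ${message.humidity})
      ON CONFLICT DO NOTHING`;
    await pool.query(query);
  }
};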
The number of messages in a batch is controlled by maxBatchSize in host.json. By default, it is set to 64. Another setting, prefetchCount, determines how many messages are retrieved and cached before being sent to your function. When you change maxBatchSize, it is recommended to set prefetchCount to twice the maxBatchSize value. For instance:
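In a version 2.x host.json, that could look like this (the values are merely an example that keeps prefetchCount at twice maxBatchSize):

{
  "version": "2.0",
  "extensions": {
    "eventHubs": {
      "eventProcessorOptions": {
        "maxBatchSize": 128,
        "prefetchCount": 256
      }
    }
  }
}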
It’s great to have these options, but how should you set them? As always, the answer is in this book.
A great resource to get a feel for what these settings do is this article. It also comes with a Power BI report that allows you to set the parameters to see the results of load tests.
Conclusion
In this post, we used the function.json cardinality setting of “many” to process a batch of messages per function call. By default, Azure Functions will use batches of 64 messages without prefetching. With the host.json settings of maxBatchSize and prefetchCount, that can be changed to better handle your scenario.
In a previous post, I talked about saving time-series data to TimescaleDB, which is an extension on top of PostgreSQL. The post used an Azure Function with an Event Hub trigger to save the data in TimescaleDB with a regular INSERT INTO statement.
The Function App used the Windows runtime, which gave me networking errors (ECONNRESET) when connecting to PostgreSQL. I often encounter those issues with the Windows runtime. In general, for Node.js, I try to stick to the Linux runtime whenever possible. In this post, we will try the same code but with a Function App that uses the Linux runtime in a Consumption Plan.
Make sure Azure CLI is installed and that you are logged in. First, create a Storage Account:
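For example, with placeholder names and an example region:

az storage account create --name STORAGEACCT --resource-group RESOURCEGROUP --location westeurope --sku Standard_LRS

Next, create the Function App itself on a Linux Consumption Plan (the name funclinux matches the publish command used later in this post; the exact flags are a sketch):

az functionapp create --name funclinux --resource-group RESOURCEGROUP --storage-account STORAGEACCT --consumption-plan-location westeurope --os-type Linux --runtime node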
Now, in the Function App, set the following Application Settings. These settings will be used in the code we will deploy later.
host: hostname of the PostgreSQL server (e.g. servername.postgres.database.azure.com)
user: user name (e.g. user@servername)
password
database: name of the PostgreSQL database
EH: connection string to the Event Hub interface of your IoT Hub; if you are unsure how to set this, see this post
You can set the above values from the Azure Portal:
Application Settings of the Function App
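If you prefer the command line, the same can be done with the Azure CLI (a sketch with placeholder values):

az functionapp config appsettings set --name funclinux --resource-group RESOURCEGROUP --settings host=SERVERNAME.postgres.database.azure.com user=USER@SERVERNAME password=PASSWORD database=DBNAME EH="EVENTHUBCONNECTIONSTRING"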
The function uses the first four Application Settings in the function code via process.env:
Using Application Settings in JavaScript
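In code, that boils down to something like this (a sketch; the port and ssl values are assumptions that match Azure Database for PostgreSQL defaults):

const { Pool } = require('pg');

const pool = new Pool({
  host: process.env.host,         // e.g. servername.postgres.database.azure.com
  user: process.env.user,         // e.g. user@servername
  password: process.env.password,
  database: process.env.database,
  port: 5432,                     // default PostgreSQL port (assumption)
  ssl: true                       // Azure Database for PostgreSQL enforces SSL by default
});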
The application setting EH is used to reference the Event Hub in function.json:
function.json with Event Hub details such as the connection, cardinality and the consumerGroup
Now let’s get the code from my GitHub repo in the Azure Function. First install Azure Function Core Tools 2.x. Next, create a folder called funcdemo. In that folder, run the following commands:
git clone https://github.com/gbaeke/pgfunc.git
cd pgfunc
npm install
az login
az account show
The npm install command installs the pg module as defined in package.json. The last two commands log you in and show the active subscription. Make sure that subscription contains the Function App you deployed above. Now run the following command:
func init
Answer the questions: we use Node and JavaScript. You should now have a local.settings.json file that sets the FUNCTIONS_WORKER_RUNTIME to node. If you do not have that, the next command will throw an error.
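The generated file should look roughly like this (the storage value can stay empty for this scenario):

{
  "IsEncrypted": false,
  "Values": {
    "FUNCTIONS_WORKER_RUNTIME": "node",
    "AzureWebJobsStorage": ""
  }
}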
Now issue the following command to package and deploy the function to the Function App we created earlier:
func azure functionapp publish funclinux
This should result in the following feedback:
Feedback from function deployment
You should now see the function in the Function App:
Deployed function
To verify that the function works as expected, I started my IoT Simulator with 100 devices that send data every 5 seconds. I also deleted all the existing data from the TimescaleDB hypertable. The Live Metrics stream shows the results. In this case, the function is running smoothly without connection reset errors. The consumption plan spun up 4 servers:
Live Metrics Stream of IoT Hub to PostgreSQL function
In IoT projects, the same question always comes up: “Where do we store our telemetry data?”. As usual, the answer to that question is not straightforward. We have seen all kinds of solutions in the wild:
save directly to a relational database (SQL Server, MySQL, …)
save to a data lake and/or SQL
save to Cosmos DB or similar (e.g. MongoDB)
save to Azure Table Storage or similar
save to Time Series Insights
Saving the data to a relational database is often tempting. It fits in with existing operational practices, and it is easy to extract, transform and visualize the data. In practice, I often recommend against this approach except in the simplest of use cases. The reason is clear: these databases are not optimized for fast ingestion of time-series data. Instead, you should use a time-series database, which is optimized for fast ingest and efficient processing of that kind of data.
TimescaleDB
TimescaleDB is an open-source time-series database optimized for fast ingest, even when the amount of stored data becomes large. It does not stand on its own: it runs on PostgreSQL as an extension. Note that you can store time-series data in a regular table or in a TimescaleDB hypertable. The graphic below (from this post) shows the difference:
Test on general purpose compute Gen 5 with 8 vCores, 45GB RAM with Premium Storage
The difference is clear. With a regular table, the insert rate drops dramatically once the amount of data becomes large.
The TimescaleDB extension can easily be installed on Azure Database for PostgreSQL. Let’s see how that goes, shall we?
Installing TimescaleDB
To create an Azure Database for PostgreSQL instance, I will use the Azure CLI with the db-up extension:
az postgres up -g RESOURCEGROUP -s SERVERNAME -d DBNAME -u USER -p PASSWORD
The server name you provide should result in a unique URL for your database (e.g. servername.postgres.database.azure.com).
Tip: do not use admin as the user name
When the server has been provisioned, modify the server configuration for the TimescaleDB extension:
az postgres server configuration set --resource-group RESOURCEGROUP --server-name SERVERNAME --name shared_preload_libraries --value timescaledb
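Because shared_preload_libraries is a static parameter, the change only takes effect after a server restart, which you can also do from the CLI:

az postgres server restart --resource-group RESOURCEGROUP --name SERVERNAME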
Now you need to actually install the extension. Install pgAdmin and issue the following query:
CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;
In pgAdmin, you should see extra schemas for TimescaleDB:
Extra schemas for TimescaleDB
Creating a hypertable
A hypertable uses partitioning to optimize writing and reading time-series data. Creating such a table is straightforward. You start with a regular table:
CREATE TABLE conditions (
   time TIMESTAMPTZ NOT NULL,
   location TEXT NOT NULL,
   temperature DOUBLE PRECISION NULL,
   humidity DOUBLE PRECISION NULL
);
Next, convert to a hypertable:
SELECT create_hypertable('conditions', 'time');
The above command partitions the data by time, using the values in the time column. By default, the time interval for partitioning is set to 7 days, starting from version 0.11.0 of TimescaleDB. You can override this by setting chunk_time_interval when creating the hypertable. You should make sure that the chunk belonging to the most recent interval can fit into memory. According to best practices, such a chunk should not use more than 25% of main memory.
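For example, if you wanted daily chunks instead, you could have created the hypertable like this (illustrative only):

-- partition by day instead of the default 7 days
SELECT create_hypertable('conditions', 'time', chunk_time_interval => interval '1 day');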
Now that we have the hypertable, we can write time-series data to it. One advantage of being built on top of a relational database such as PostgreSQL is that you can use standard SQL INSERT INTO statements. For example:
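A fragment along these lines (a sketch: message is the parsed IoT Hub message, pool is the pg connection pool created elsewhere in the function, and the field names are assumptions):

// Build the INSERT from the incoming message; NOW() is discussed below the function.
const query = "INSERT INTO conditions(time, device, temperature, humidity) VALUES(NOW(), '" +
  message.device + "', " + message.temperature + ", " + message.humidity + ")";
await pool.query(query);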
The example above is from an Azure Function we will look at in a moment. It extracts values from a message received via IoT Hub and inserts them into the hypertable via an INSERT INTO query.
Let’s take a look at the Azure Function next.
Azure Function: from IoT Hub to the Hypertable
The Azure Function is kept bare bones to focus on the essentials. Note that you will need to open the console and install the pg module with the following command:
npm install pg
The image below shows the Azure Function (based on this although the article does not use a hypertable and stores the telemetry as JSON).
Bare bones Azure Function to write IoT Hub data to the hypertable
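In essence, the function looks something like the sketch below (the connection settings come from Application Settings; the message fields device, temperature and humidity are assumptions, so treat this as an approximation rather than the exact code):

const { Pool } = require('pg');

// Connection pool; host, user, password and database come from Application Settings.
const pool = new Pool({
  host: process.env.host,
  user: process.env.user,
  password: process.env.password,
  database: process.env.database,
  port: 5432,
  ssl: true
});

// Event Hub trigger with cardinality "one": the function receives a single message.
module.exports = async function (context, IoTHubMessage) {
  // NOW() is used for the time column here; see the note below about why you should not do this.
  const query = "INSERT INTO conditions(time, device, temperature, humidity) VALUES(NOW(), '" +
    IoTHubMessage.device + "', " + IoTHubMessage.temperature + ", " + IoTHubMessage.humidity + ")";
  await pool.query(query);
};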
Naturally, the Azure Function above requires an Azure Event Hubs trigger. In this case, the Event Hub trigger cardinality was set to “one”. More information here. Note that you should NOT use the NOW() function to set the time. It is only used here for demo purposes. Instead, you should take the timestamp sent by the device or the time the data was queued at the Event Hub!
Naturally, you will also need an IoT Hub where you send your data. In this case, I created a standard IoT Hub and used the IoT Hub Visual Studio Code extension to generate code (1 in image below) to send sample messages. I modified the code somewhat to include the device name (2 in image below):
Visual Studio toolkit used to create the device, generate code and modify the code
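The generated code, with the small device-name tweak, boils down to something like this sketch (the device connection string is a placeholder, the payload fields and interval are examples, and the MQTT transport from the azure-iot-device-mqtt package is used):

'use strict';
const { Client, Message } = require('azure-iot-device');
const { Mqtt } = require('azure-iot-device-mqtt');

// Placeholder: use the connection string of the device you created in IoT Hub.
const connectionString = 'HostName=...;DeviceId=...;SharedAccessKey=...';
const client = Client.fromConnectionString(connectionString, Mqtt);

setInterval(() => {
  // Example payload; the device name is included so it can be stored in the hypertable.
  const data = JSON.stringify({
    device: 'device01',
    temperature: 20 + Math.random() * 10,
    humidity: 40 + Math.random() * 20
  });
  client.sendEvent(new Message(data), err => {
    if (err) {
      console.error('Send error: ' + err.toString());
    } else {
      console.log('Message sent: ' + data);
    }
  });
}, 5000);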
Now we can run the code (saved as sender.js) with:
node sender.js
Note: do not forget to first run npm install azure-iot-device
Data is being sent:
Data sent to IoT Hub
Data processed by the Azure Function as viewed in Application Insights Live Metrics stream:
Application Insights Live Metrics stream
With only one device sending data, there isn’t that much to do! In pgAdmin, you should see connections from at least one of the Azure Function hosts that are active:
Connections to PostgreSQL
Note: I encountered some issues with ECONNRESET errors under higher load; take a look at this post, which runs the same function on a Linux Consumption Plan.
Querying the data
TimescaleDB, with help from PostgreSQL, has a rich query language, especially when compared to some other offerings. Yes, I am looking at you, Cosmos DB! Below are some examples (based on the documentation at https://docs.timescale.com/v1.2/using-timescaledb/reading-data):
SELECT COUNT(*) FROM conditions WHERE time > NOW() - interval '1 minute';
The above query simply counts the messages received in the last minute. Notice the flexibility in expressing the time range, which is exactly what we want from a time-series database.
SELECT time_bucket('1 minute', time) AS one_min,
       device,
       COUNT(*),
       MAX(temperature) AS max_temp,
       MAX(humidity) AS max_hum
FROM conditions
WHERE time > NOW() - interval '10 minutes'
GROUP BY one_min, device
ORDER BY one_min DESC, max_temp DESC;
The above query displays the following result:
Conclusion
When dealing with time-series data, it is often beneficial to use a time-series database. Such databases are optimized to ingest time-series data at high speed and with greater efficiency than general-purpose SQL or NoSQL databases. Because TimescaleDB is built on PostgreSQL, it can take advantage of the flexibility and stability of that platform. Although there are many other time-series databases, TimescaleDB is easy to use when coupled with PaaS (platform-as-a-service) PostgreSQL offerings such as Azure Database for PostgreSQL.