Storing and querying for embeddings with Redis

In a previous post, we wrote about using vectorized search and cosine similarity to quickly query a database of blog posts and retrieve the most relevant content to a natural language query. This is achieved using OpenAI’s embeddings API, Pinecone (a vector database), and OpenAI ChatCompletions. For reference, here’s the rough architecture:

Vectorized search with Pinecone

The steps above do the following:

  1. A console app retrieves blog post URLs from an RSS feed and reads all the posts one by one
  2. For each post, create an embedding with OpenAI which results in a vector of 1536 dimensions to store in Pinecone
  3. After the embedding is created, store the embedding in a Pinecone index; we created the index from the Pinecone portal
  4. A web app asks the user for a query (e.g., “How do I create a chat bot?”) and creates an embedding for the query
  5. Perform a vectorized search, finding the closest post vectors to the query vector using cosine similarity and keep the one with the highest score
  6. Use the ChatCompletion API and submit the same query but add the highest scoring post as context to the user question. The post text is injected into the prompt
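
To make step 5 concrete, here is a minimal sketch of cosine similarity and of picking the closest post. It uses only the standard library and toy 3-dimensional vectors; the function names and the post keys are made up for illustration, and a vector database performs this search far more efficiently.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def closest_post(query_vector, post_vectors):
    """Return (key, score) of the post whose vector is most similar to the query."""
    scored = [(key, cosine_similarity(query_vector, vec)) for key, vec in post_vectors.items()]
    return max(scored, key=lambda item: item[1])

# Toy vectors; real text-embedding-ada-002 embeddings have 1536 dimensions.
posts = {
    "post-a": [1.0, 0.0, 0.0],
    "post-b": [0.0, 1.0, 0.0],
}
best_key, best_score = closest_post([0.9, 0.1, 0.0], posts)
```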

ℹ️ See Pinecone and OpenAI magic: A guide to finding your long lost blog posts with vectorized search and ChatGPT for more information.

We can replace Pinecone with Redis, a popular open-source, in-memory data store that can be used as a database, cache, and message broker. Redis is well-suited for this task as it can also store vector representations of our blog posts and has the capability to perform vector queries efficiently.

You can easily run Redis with Docker for local development. In addition, Redis is available in Azure, although you will need the Enterprise version. Only Azure Cache for Redis Enterprise supports the RediSearch functionality and that’s what we need here! Note that the Enterprise version is quite costly.

By leveraging Redis for vector storage and querying, we can harness its high performance, flexibility, and reliability in our solution while maintaining the core functionality of quickly querying and retrieving the most relevant blog post content using vectorized search and similarity queries.

ℹ️ The code below shows snippets. Full samples (yes, samples 😀) are on GitHub: check to upload posts to a local Redis instance and to test the query functionality.

Run Redis with Docker

If you have Docker on your machine, use the following command:

docker run --name redis-stack-server -p 6380:6379 redis/redis-stack-server:latest

ℹ️ I already had another instance of Redis running on port 6379 so I mapped port 6380 on localhost to port 6379 of the redis-stack-server container.

If you want a GUI to explore your Redis instance, install RedisInsight. The screenshot below shows the blog posts after uploading them as Redis hashes.

RedisInsight in action

Let’s look at creating the hashes next!

Storing post data in Redis hashes

We will create several Redis hashes, one for each post. Hashes are records structured as collections of field-value pairs. Each hash we store has the following fields:

  • url: url to the blog post
  • embedding: embedding of the blog post (a vector), created with the OpenAI embeddings API and the text-embedding-ada-002 model

We need the URL to retrieve the entire post after a closest match has been found. In Pinecone, the URL would be metadata to the vector. In Redis, it’s just a field in a hash, just like the vector itself.

In RedisInsight, a hash is shown as below:

Redis hash for post 0 with url and embedding fields

The embedding field in the hash has no special properties. The vector is simply stored as a series of bytes. To store the urls and embeddings of posts, we can use the following code:

import redis
import openai
import os
import requests
from bs4 import BeautifulSoup
import feedparser
import numpy as np

# OpenAI API key
openai.api_key = os.getenv('OPENAI_API_KEY')

# Redis connection details
redis_host = os.getenv('REDIS_HOST')
redis_port = os.getenv('REDIS_PORT')
redis_password = os.getenv('REDIS_PASSWORD')

# Connect to the Redis server
conn = redis.Redis(host=redis_host, port=redis_port, password=redis_password, encoding='utf-8', decode_responses=True)

# URL of the RSS feed to parse
url = ''

# Parse the RSS feed with feedparser
feed = feedparser.parse(url)

p = conn.pipeline(transaction=False)
for i, entry in enumerate(feed.entries[:50]):
    # report progress
    print("Create embedding and save for entry ", i, " of ", len(feed.entries))

    r = requests.get(entry.link)
    soup = BeautifulSoup(r.text, 'html.parser')
    article = soup.find('div', {'class': 'entry-content'}).text

    # vectorize with OpenAI text-embedding-ada-002
    embedding = openai.Embedding.create(input=article, model="text-embedding-ada-002")

    # extract the embedding (length = 1536)
    vector = embedding["data"][0]["embedding"]

    # convert to numpy array and bytes
    vector = np.array(vector).astype(np.float32).tobytes()

    # Create a new hash with url and embedding
    post_hash = {
        "url": entry.link,
        "embedding": vector
    }

    # create hash
    conn.hset(name=f"post:{i}", mapping=post_hash)


In the above code, note the following:

  • The OpenAI embeddings API returns a JSON document that contains the embedding for each post; the embedding is retrieved with vector = embedding["data"][0]["embedding"]
  • The resulting vector is converted to bytes with vector = np.array(vector).astype(np.float32).tobytes(); serializing the vector this way is required to store the vector in the Redis hash
  • The Redis hset command is used to store the field-value pairs (these pairs are in a Python dictionary called post_hash) with a key that is prefixed with post: followed by the document number. The prefix will be used later by the search index we will create
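
To see what that serialization does, here is a small round-trip sketch. It uses the standard library's struct module as a stand-in for NumPy (an assumption made to keep the example dependency-free); on little-endian machines the packed bytes should match what astype(np.float32).tobytes() produces. The helper names are hypothetical.

```python
import struct

DIM = 1536  # dimensions of a text-embedding-ada-002 vector

def to_float32_bytes(vector):
    """Pack a list of floats as little-endian 32-bit floats."""
    return struct.pack(f"<{len(vector)}f", *vector)

def from_float32_bytes(blob):
    """Unpack bytes produced by to_float32_bytes back into a list of floats."""
    count = len(blob) // 4  # 4 bytes per float32
    return list(struct.unpack(f"<{count}f", blob))

vector = [0.25] * DIM          # 0.25 is exactly representable as a float32
blob = to_float32_bytes(vector)
restored = from_float32_bytes(blob)
```

A 1536-dimension vector therefore takes 1536 × 4 = 6144 bytes in the hash, which is a handy number to check when something looks off.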

Now that we have our post information in Redis hashes, we want to use RediSearch functionality to match an input query with one or more of our posts. RediSearch supports vector similarity search. For such a search to work, we need to create an index that knows there is a vector field. On such indexes, we can perform vector similarity searches.

Creating an index

To create an index with Python code, check the code below:

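import os
import redis
from redis.commands.search.field import VectorField, TextField
from redis.commands.search.query import Query
from redis.commands.search.indexDefinition import IndexDefinition, IndexType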

# Redis connection details
redis_host = os.getenv('REDIS_HOST')
redis_port = os.getenv('REDIS_PORT')
redis_password = os.getenv('REDIS_PASSWORD')

# Connect to the Redis server
conn = redis.Redis(host=redis_host, port=redis_port, password=redis_password, encoding='utf-8', decode_responses=True)

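# Define the fields to index: the url as text and the embedding as a vector
SCHEMA = [
    TextField("url"),
    VectorField("embedding", "HNSW", {"TYPE": "FLOAT32", "DIM": 1536, "DISTANCE_METRIC": "COSINE"}),
]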

# Create the index
try:
    conn.ft("posts").create_index(fields=SCHEMA, definition=IndexDefinition(prefix=["post:"], index_type=IndexType.HASH))
except Exception as e:
    print("Index already exists")

When creating an index, you define the fields to index based on a schema. Above, we include both the text field (url) and the vector field (embedding). The VectorField class is used to construct the vector field and takes several parameters:

  • Name: the name of the field (“embedding” here but could be anything)
  • Algorithm: “FLAT” or “HNSW”; use “FLAT” when search quality is of high priority and search speed is less important; “HNSW” gives you faster querying; for more information see this article
  • Attributes: a Python dictionary that specifies the data type, the number of dimensions of the vector (1536 for text-embedding-ada-002) and the distance metric; here we use COSINE for cosine similarity, which is recommended by OpenAI with their embedding model

ℹ️ It’s important to get the dimensions right or your index will fail to build properly. It will not be immediately clear that it failed, unless you run FT.INFO <indexname> with redis-cli.
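
If you prefer to check this from Python, redis-py exposes FT.INFO through conn.ft("posts").info(), which returns a dictionary. The sketch below only interprets such a dictionary, so it runs without a Redis server. The field names num_docs and hash_indexing_failures are what I would expect FT.INFO to report, but treat them as an assumption and verify them against your own output; the sample values are made up.

```python
def index_health(info):
    """Summarize an FT.INFO-style dictionary (values may arrive as strings)."""
    num_docs = int(info.get("num_docs", 0))
    failures = int(info.get("hash_indexing_failures", 0))
    return {
        "num_docs": num_docs,
        "failures": failures,
        "ok": failures == 0 and num_docs > 0,
    }

# Hypothetical FT.INFO output for an index created with the wrong DIM:
# every hash failed to index, so no documents are searchable.
broken = index_health({"num_docs": "0", "hash_indexing_failures": "50"})

# Hypothetical output for a healthy index.
healthy = index_health({"num_docs": "50", "hash_indexing_failures": "0"})
```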

With the schema out of the way, we can now create the index with:

conn.ft("posts").create_index(fields=SCHEMA, definition=IndexDefinition(prefix=["post:"], index_type=IndexType.HASH))

The index we create is called posts. We index the fields defined in SCHEMA and only index hashes with a key prefix of post:. The hashes we created earlier all have this prefix, so Redis adds them to the index once it is created. Check in RedisInsight that the index is populated:

posts index populated with hashes that were added earlier

Redis vector queries

With the hashes and the index created, we can now perform a similarity search. We will ask the user for a query string (use natural language) and then check the posts that are similar to the query string. The query string will need to be vectorized as well. We will return several posts and rank them.

import numpy as np
from redis.commands.search.query import Query
import redis
import openai
import os

openai.api_key = os.getenv('OPENAI_API_KEY')

def search_vectors(query_vector, client, top_k=5):
    base_query = "*=>[KNN 5 @embedding $vector AS vector_score]"
    query = Query(base_query).return_fields("url", "vector_score").sort_by("vector_score").dialect(2)

    try:
        results = client.ft("posts").search(query, query_params={"vector": query_vector})
    except Exception as e:
        print("Error calling Redis search: ", e)
        return None

    return results

# Redis connection details
redis_host = os.getenv('REDIS_HOST')
redis_port = os.getenv('REDIS_PORT')
redis_password = os.getenv('REDIS_PASSWORD')

# Connect to the Redis server
conn = redis.Redis(host=redis_host, port=redis_port, password=redis_password, encoding='utf-8', decode_responses=True)

# Check the connection
if conn.ping():
    print("Connected to Redis")

# Enter a query
query = input("Enter your query: ")

# Vectorize the query using OpenAI's text-embedding-ada-002 model
print("Vectorizing query...")
embedding = openai.Embedding.create(input=query, model="text-embedding-ada-002")
query_vector = embedding["data"][0]["embedding"]

# Convert the vector to a numpy array of float32 values and then to bytes
query_vector = np.array(query_vector).astype(np.float32).tobytes()

# Perform the similarity search
print("Searching for similar posts...")
results = search_vectors(query_vector, conn)

if results:
    print(f"Found {results.total} results:")
    for i, post in enumerate(results.docs):
        score = 1 - float(post.vector_score)
        print(f"\t{i}. {post.url} (Score: {round(score ,3) })")
else:
    print("No results found")

In the above code, the following happens:

  • Set OpenAI API key: needed to create the embedding for the query typed by the user
  • Connect to Redis based on the environment variables and check the connection with ping().
  • Ask the user for a query
  • Create the embedding from the query string and convert the array to bytes
  • Call the search_vectors function with the vectorized query string and Redis connection as parameters

The search_vectors function uses RediSearch capabilities to query over our hashes and calculate the 5 nearest neighbors to our query vector. Querying is explained in detail in the Redis documentation but it can be a bit dense. You start with the base query:

 base_query = "*=>[KNN 5 @embedding $vector AS vector_score]"

This is just a string in the query format that Redis expects; we pass it to the Query class in the next step. We are looking for the 5 nearest neighbors of $vector in the embedding field of the hashes. You use @ to denote the embedding field and $ to denote the vector we will pass in later. That vector is our vectorized query string. With AS vector_score, we give the computed distance a name so we can return it and sort on it. Because the index uses the COSINE metric, this value is a cosine distance (lower means closer), which is why the code above converts it to a similarity score with 1 - vector_score.
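
The number of neighbors is hardcoded in that string. As a sketch, a small helper can build the same query for any top_k. The function name and its parameters are hypothetical and not part of redis-py; with its defaults it returns exactly the base query shown above.

```python
def build_knn_query(top_k=5, field="embedding", param="vector", score_alias="vector_score"):
    """Build a RediSearch KNN query string (requires query dialect 2)."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    # @field is the indexed vector field, $param is filled in via query_params
    return f"*=>[KNN {top_k} @{field} ${param} AS {score_alias}]"

default_query = build_knn_query()
top_three_query = build_knn_query(top_k=3)
```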

The actual query is built with the Query class (one line):

query = Query(base_query).return_fields("url", "vector_score").sort_by("vector_score").dialect(2)    

We return the url and the vector_score and sort on this score. Dialect is just the version of the query language. Here we use dialect 2 as that matches the query syntax. Using an earlier dialect would not work here.

Of course, this still does not pass the query vector to the query. That only happens when we run the query in Redis with:

results = client.ft("posts").search(query, query_params={"vector": query_vector})

The above code performs a search query on the posts index. In the call to the search method, we pass the query we built earlier and a dictionary of query parameters. We only have one parameter, the vector parameter ($vector in base_query), and the value for this parameter is the embedding created from the user query string.

When I query for bot, I get the following results:

Our 5 query results

The results are ranked with the closest match first. We could use that match to grab the post from the URL and send the query to OpenAI ChatCompletion API to answer the question more precisely. For better results, use a better query like “How do I build a chat bot in Python with OpenAI?”. To get an idea of how to do that, check my previous post.
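
As a rough sketch of the ranking step, the helper below turns the distances Redis returns into similarity scores and sorts them with the best match first. It works on plain (url, distance) pairs instead of real search results so it runs without Redis; the function name and the URLs are made up for illustration.

```python
def to_similarity(rows):
    """Convert (url, cosine_distance) pairs to (url, similarity), best match first.

    Redis may return the distance as a string, so it is converted with float().
    """
    scored = [(url, round(1 - float(distance), 3)) for url, distance in rows]
    return sorted(scored, key=lambda item: item[1], reverse=True)

ranked = to_similarity([
    ("https://example.com/post-a", "0.25"),
    ("https://example.com/post-b", "0.125"),
])
```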


In this post we discussed storing embeddings in Redis and querying embeddings with a similarity search. If you combine this with my previous post, you can use Redis instead of Pinecone as the vector database and query engine. This can be useful for Azure customers because Azure has Azure Cache for Redis Enterprise, a fully managed service that supports the functionality discussed in this post. In addition, it is useful for local development purposes because you can easily run Redis with Docker.

Pinecone and OpenAI magic: A guide to finding your long lost blog posts with vectorized search and ChatGPT

Searching through a large database of blog posts can be a daunting task, especially if there are thousands of articles. However, using vectorized search and cosine similarity, you can quickly query your blog posts and retrieve the most relevant content.

In this blog post, we’ll show you how to query a list of blog posts (from this blog) using a combination of vectorized search with cosine similarity and OpenAI ChatCompletions. We’ll be using OpenAI’s embeddings API to vectorize the blog post articles and Pinecone, a vector database, to store and query the vectors. We’ll also show you how to retrieve the contents of the article, create a prompt using the ChatCompletion API, and return the result to a web page.

ℹ️ Sample code is on GitHub:

ℹ️ If you want an introduction to embeddings and cosine similarity, watch the video on YouTube by Part Time Larry.

Setting Up Pinecone

Before we can start querying our blog posts, we need to set up Pinecone. Pinecone is a vector database that makes it easy to store and query high-dimensional data. It’s perfect for our use case since we’ll be working with high-dimensional vectors.

ℹ️ Using a vector database is not strictly required. The GitHub repo contains, which uses scikit-learn to create the vectors and perform a cosine similarity search. Many other approaches are possible. Pinecone just makes storing and querying the vectors super easy.

ℹ️ If you want more information about Pinecone and the concept of a vector database, watch this introduction video.

First, we’ll need to create an account with Pinecone and get the API key and environment name. In the Pinecone UI, you will find these as shown below. There will be a Show Key and Copy Key button in the Actions section next to the key.

Key and environment for Pinecone

Once we have an API key and the environment, we can use the Pinecone Python library to create and use indexes. Install the Pinecone library with pip install pinecone-client.

Although you can create a Pinecone index from code, we will create the index in the Pinecone portal. Go to Indexes and select Create Index. Create the index using cosine as metric and 1536 dimensions:

blog-index in Pinecone

The embedding model we will use to create the vectors, text-embedding-ada-002, outputs vectors with 1536 dimensions. For more info see OpenAI’s blog post of December 15, 2022.

To use the Pinecone index from code, look at the snippet below:

import pinecone

pinecone_api = "<your_api_key>"
pinecone_env = "<your_environment>"

pinecone.init(api_key=pinecone_api, environment=pinecone_env)

index = pinecone.Index('blog-index')

We create an instance of the Index class with the name “blog-index” and store this in index. This index will be used to store our blog post vectors or to perform searches on.

Vectorizing Blog Posts with OpenAI’s Embeddings API

Next, we’ll need to vectorize our blog post articles. We’ll be using OpenAI’s embeddings API to do this. The embeddings API takes a piece of text and returns a high-dimensional vector representation of that text. Here’s an example of how to do that for one article or string:

import openai

openai.api_key = "<your_api_key>"

article = "Some text from a blog post"

vector = openai.Embedding.create(input=article, model="text-embedding-ada-002")

We create a vector representation of our blog post article by calling the Embedding class’s create method. We pass in the article text as input and the text-embedding-ada-002 model, which is a pre-trained language model that can generate high-quality embeddings.

Storing Vectors in Pinecone

Once we have the vector representations of our blog post articles, we can store them in Pinecone. Instead of storing the vectors one by one, we can use upsert to store a list of vectors. The code below uses the feed of this blog to grab the URLs for 50 posts. Every post is vectorized and the vector is added to a Python list of tuples, as expected by the upsert method. The list is then added to Pinecone in one go. The tuple that Pinecone expects is:

(id, vector, metadata dictionary)

e.g. ("0", vector for post 1, {"url": url to post 1})
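
The sketch below builds such tuples from plain lists and splits them into batches. The helper names and the example URLs are hypothetical, and the batch size of 100 is an assumption: for 50 posts a single upsert works fine, but with many more vectors you would probably want to send them in smaller chunks.

```python
def build_upsert_tuples(urls, vectors):
    """Pair each vector with a string id and its url as metadata."""
    if len(urls) != len(vectors):
        raise ValueError("urls and vectors must have the same length")
    return [(str(i), vec, {"url": url}) for i, (url, vec) in enumerate(zip(urls, vectors))]

def batched(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

tuples = build_upsert_tuples(
    ["https://example.com/post-1", "https://example.com/post-2"],
    [[0.1, 0.2], [0.3, 0.4]],
)
batches = list(batched(list(range(5)), size=2))
```

Each batch could then be passed to index.upsert(vectors=batch) in a loop.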

Here is the code that uploads the first 50 posts from the feed to Pinecone. You need to set the Pinecone key and environment and the OpenAI key as environment variables. The code uses feedparser to grab the blog feed, and BeautifulSoup to parse the retrieved HTML. The code serves as an example only. It is not very robust when it comes to error checking.

import feedparser
import os
import pinecone
import numpy as np
import openai
import requests
from bs4 import BeautifulSoup

# OpenAI API key
openai.api_key = os.getenv('OPENAI_API_KEY')

# get the Pinecone API key and environment
pinecone_api = os.getenv('PINECONE_API_KEY')
pinecone_env = os.getenv('PINECONE_ENVIRONMENT')

pinecone.init(api_key=pinecone_api, environment=pinecone_env)

# set index; must exist
index = pinecone.Index('blog-index')

# URL of the RSS feed to parse
url = ''

# Parse the RSS feed with feedparser
feed = feedparser.parse(url)

# get number of entries in feed
entries = len(feed.entries)
print("Number of entries: ", entries)

post_texts = []
pinecone_vectors = []
for i, entry in enumerate(feed.entries[:50]):
    # report progress
    print("Processing entry ", i, " of ", entries)

    r = requests.get(entry.link)
    soup = BeautifulSoup(r.text, 'html.parser')
    article = soup.find('div', {'class': 'entry-content'}).text

    # vectorize with OpenAI text-embedding-ada-002
    embedding = openai.Embedding.create(input=article, model="text-embedding-ada-002")

    # extract the embedding (length = 1536)
    vector = embedding["data"][0]["embedding"]

    # append tuple to pinecone_vectors list
    pinecone_vectors.append((str(i), vector, {"url": entry.link}))

# all vectors can be upserted to Pinecone in one go
upsert_response = index.upsert(vectors=pinecone_vectors)

print("Vector upload complete.")

Querying Vectors with Pinecone

Now that we have stored our blog post vectors in Pinecone, we can start querying them. We’ll use cosine similarity to find the closest matching blog post. Here is some code that does just that:

query_vector = <vector representation of query>  # vector created with OpenAI as well

def get_highest_score_url(items):
    highest_score_item = max(items, key=lambda item: item["score"])

    # only accept the best match if it is similar enough
    if highest_score_item["score"] > 0.8:
        return highest_score_item["metadata"]['url']
    else:
        return ""

search_response = index.query(vector=query_vector, top_k=5, include_metadata=True)

url = get_highest_score_url(search_response['matches'])

We create a vector representation of our query (you don’t see that here but it’s the same code used to vectorize the blog posts) and pass it to the query method of the Pinecone Index class. We set top_k=5 to retrieve the top 5 matching blog posts. We also set include_metadata=True to include the metadata associated with each vector in our response. That way, we also have the URL of the top 5 matching posts.

The query method returns a dictionary that contains a matches key. The matches value is a list of dictionaries, with each dictionary representing a matching blog post. The score key in each dictionary represents the cosine similarity score between the query vector and the blog post vector. We use the get_highest_score_url function to find the blog post with the highest cosine similarity score.

The function contains some code to only return the highest scoring URL if the score is > 0.8. It’s of course up to you to accept lower matching results. There is a potential for the vector query to deliver an article that’s not highly relevant which results in an irrelevant context for the OpenAI ChatCompletion API call we will do later.
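
One edge case is worth a sketch: max() raises a ValueError on an empty list, so a query that returns no matches would crash the function. The variant below, with a hypothetical name and a configurable threshold, returns an empty string in that case. The sample matches only mimic the shape of the Pinecone response described above.

```python
def best_match_url(matches, min_score=0.8):
    """Return the url of the highest scoring match, or "" if none is good enough."""
    if not matches:
        return ""
    best = max(matches, key=lambda match: match["score"])
    if best["score"] > min_score:
        return best["metadata"]["url"]
    return ""

# Made-up matches in the shape of search_response['matches']
sample_matches = [
    {"id": "3", "score": 0.78, "metadata": {"url": "https://example.com/post-3"}},
    {"id": "7", "score": 0.86, "metadata": {"url": "https://example.com/post-7"}},
]
url = best_match_url(sample_matches)
```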

Retrieving the Contents of the Blog Post

Once we have the URL of the closest matching blog post, we can retrieve the contents of the article using the Python requests library and the BeautifulSoup library.

import requests
from bs4 import BeautifulSoup

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

article = soup.find('div', {'class': 'entry-content'}).text

We send a GET request to the URL of the closest matching blog post and retrieve the HTML content. We use the BeautifulSoup library to parse the HTML and extract the contents of the <div> element with the class “entry-content”.
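
Note that soup.find returns None when the page has no such element, in which case .text raises an AttributeError. The sketch below shows the same extraction with that case handled. It swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages; the class and function names are hypothetical, and with BeautifulSoup a simple check for None before reading .text achieves the same thing.

```python
from html.parser import HTMLParser

class EntryContentParser(HTMLParser):
    """Collect the text inside the first <div class="entry-content">."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while inside the target div
        self.found = False  # True once the target div has been seen
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # a nested div inside the target div
        elif not self.found and "entry-content" in (dict(attrs).get("class") or "").split():
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

def extract_article(html):
    """Return the article text, or None if there is no entry-content div."""
    parser = EntryContentParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip() if parser.found else None

page = ('<html><body><div class="entry-content"><p>Hello <b>world</b></p>'
        '<div>nested</div></div><div>footer</div></body></html>')
article = extract_article(page)
```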

Creating a Prompt for the ChatCompletion API

Now that we have the contents of the blog post, we can create a prompt for the ChatCompletion API. The crucial part here is that our OpenAI query should include the blog post we just retrieved!

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        { "role": "system", "content": "You are a polite assistant" },
        { "role": "user", "content": "Based on the article below, answer the following question: " + your_query +
            "\nAnswer as follows:" +
            "\nHere is the answer directly from the article:" +
            "\nHere is the answer from other sources:" +
             "\n---\n" + article }
    ]
)


We use the ChatCompletion API with the gpt-3.5-turbo model to ask our question. This is the same as using ChatGPT on the web with that model. At the time of writing, the GPT-4 model was not yet available.

Instead of one prompt, we send a number of dictionaries in a messages list. The first item in the list sets the system message. The second item is the actual user question. We ask to answer the question based on the blog post we stored in the article variable and we provide some instructions on how to answer. We add the contents of the article to our query.
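
To keep the prompt text separate from the API call, the messages list can be built by a small function. This is only a sketch: the function name and its system_message parameter are hypothetical, while the prompt wording is the one used above.

```python
def build_messages(question, article, system_message="You are a polite assistant"):
    """Build the messages list: a system message plus the question with the article as context."""
    user_content = (
        "Based on the article below, answer the following question: " + question +
        "\nAnswer as follows:" +
        "\nHere is the answer directly from the article:" +
        "\nHere is the answer from other sources:" +
        "\n---\n" + article
    )
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_content},
    ]

messages = build_messages("How do I create a chat bot?", "Text of the best matching post")
```

The resulting list can be passed as the messages argument of openai.ChatCompletion.create.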

If the article is long, you run the risk of using too many tokens. If that happens, the ChatCompletion call will fail. You can use the tiktoken library to count the tokens and prevent the call from happening in the first place. Or you can catch the exception and tell the user. In the above code, there is no error handling. We only include the core code that’s required.
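
A third option is to shorten the article before sending it. The sketch below does this with a crude heuristic of about 4 characters per token. That ratio is an assumption that roughly holds for English text; counting real tokens with tiktoken is more accurate. The function name is hypothetical.

```python
def truncate_to_token_budget(text, max_tokens, chars_per_token=4):
    """Cut text to roughly max_tokens, ending on a word boundary where possible."""
    max_chars = max_tokens * chars_per_token
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    last_space = cut.rfind(" ")
    # drop a possibly half-cut last word
    return cut[:last_space] if last_space > 0 else cut

short_text = truncate_to_token_budget("a b c", max_tokens=100)
long_text = truncate_to_token_budget("word " * 100, max_tokens=5)
```

When you pick the budget, leave room for the question, the instructions and the answer itself, because they all count toward the model's limit.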

Returning the Result to a Web Page

If you are running the search code in an HTTP handler as the result of the user typing a query in a web page, you can return the result to the caller:

return jsonify({
    'url': url,
    'response': response_text
})

The full example, including an HTML page and Flask code can be found on GitHub.

The result could look like this:

Query results in the closest URL using vectorized search and ChatGPT answering the question based on the contents the URL points at


Using vectorized search and cosine similarity, we can quickly query a database of blog posts and retrieve the most relevant post. By combining OpenAI’s embeddings API, Pinecone, and the ChatCompletion API, we can create a powerful tool for searching and retrieving blog post content using natural language.

Note that there are some potential issues as well. The code we show is merely a starting point:

  • Limitations of cosine similarity: it does not take into account all properties of the vectors, which can lead to misleading results
  • Prompt engineering: the prompt we use works but there might be prompts that just work better. Experimentation with different prompts is crucial!
  • Embeddings: OpenAI embeddings are trained on a large corpus of text, which may not be representative of the domain-specific language in the posts
  • Performance might not be sufficient if the size of the database grows large. For my blog, that’s not really an issue. 😀

Step-by-Step Guide: How to Build Your Own Chatbot with the ChatGPT API

In this blog post, we will be discussing how to build your own chat bot using the ChatGPT API. It’s worth mentioning that we will be using the OpenAI APIs directly and not the Azure OpenAI APIs, and the code will be written in Python. A crucial aspect of creating a chat bot is maintaining context in the conversation, which we will achieve by storing and sending previous messages to the API at each request. If you are just starting with AI and chat bots, this post will guide you through the step-by-step process of building your own simple chat bot using the ChatGPT API.

Python setup

Ensure Python is installed. I am using version 3.10.8. For editing code, I am using Visual Studio Code as the editor. For the text-based chat bot, you will need the following Python packages:

  • openai: make sure the version is 0.27.0 or higher; earlier versions do not support the ChatCompletion APIs
  • tiktoken: a library to count the number of tokens of your chat bot messages

Install the above packages with your package manager. For example: pip install openai.

All code can be found on GitHub.

Getting an account at OpenAI

We will write a text-based chat bot that asks for user input indefinitely. The first thing you need to do is sign up for API access at Access is not free but for personal use, while writing and testing the chat bot, the price will be very low. Here is a screenshot from my account:

Oh no, $0.13 dollars

When you have your account, generate an API key from Click the Create new secret key button and store the key somewhere.

Writing the bot

Now create a new Python file called and add the following lines:

import os
import openai
import tiktoken

openai.api_key = os.getenv("OPENAI_KEY")

We already discussed the openai and tiktoken libraries. We will also use the builtin os library to read environment variables.

In the last line, we read the environment variable OPENAI_KEY. If you use Linux, in your shell, use the following command to store the OpenAI key in an environment variable: export OPENAI_KEY=your-OpenAI-key. We use this approach to avoid storing the API key in your code and accidentally uploading it to GitHub.

To implement the core chat functionality, we will use a Python class. I was following a Udemy course about ChatGPT and it used a similar approach, which I liked. By the way, I can highly recommend that course. Check it out here.

Let’s start with the class constructor:

class ChatBot:

    def __init__(self, message):
        self.messages = [
            { "role": "system", "content": message }
        ]

In the constructor, we define a messages list and set the first item in that list to a configurable dictionary: { "role": "system", "content": message }. In the ChatGPT API calls, the messages list provides context to the API because it contains all the previous messages. With this initial system message, we can instruct the API to behave in a certain way. For example, later in the code, you will find this code to create an instance of the ChatBot class:

bot = ChatBot("You are an assistant that always answers correctly. If not sure, say 'I don't know'.")

But you could also do:

bot = ChatBot("You are an assistant that always answers wrongly. Always contradict the user")

In practice, ChatGPT does not follow the system instruction too strongly. User messages are more important. So it could be that, after some back and forth, the answers will not follow the system instruction anymore.

Let’s continue with another method in the class, the chat method:

    def chat(self):
        prompt = input("You: ")
        # add the user input to the conversation
        self.messages.append(
            { "role": "user", "content": prompt}
        )
        response = openai.ChatCompletion.create(
            model = "gpt-3.5-turbo",
            messages = self.messages,
            temperature = 0.8
        )
        answer = response.choices[0]['message']['content']
        print(answer)
        # add the answer to the conversation to keep context
        self.messages.append(
            { "role": "assistant", "content": answer}
        )

        tokens = self.num_tokens_from_messages(self.messages)
        print(f"Total tokens: {tokens}")

        if tokens > 4000:
            print("WARNING: Number of tokens exceeds 4000. Truncating messages.")
            self.messages = self.messages[2:]

The chat method is where the action happens. It does the following:

  • It prompts the user to enter some input.
  • The user’s input is stored in a dictionary as a message with a “user” role and appended to a list of messages called self.messages. If this is the first input, we now have two messages in the list, a system message and a user message.
  • It then creates a response using OpenAI’s gpt-3.5-turbo model, passing in the self.messages list and a temperature of 0.8 as parameters. We use the ChatCompletion API versus the Completion API that you use with other models such as text-davinci-003.
  • The generated response is stored in a variable named answer. The full response contains a lot of information. We are only interested in the first response (there is only one) and grab the content.
  • The answer is printed to the console.
  • The answer is also added to the self.messages list as a message with an “assistant” role. If this is the first input, we now have three messages in the list: a system message, the first user message (the input) and the assistant’s response.
  • The total number of tokens in the self.messages list is computed using a separate function called num_tokens_from_messages() and printed to the console.
  • If the number of tokens exceeds 4000, a warning message is printed and the self.messages list is truncated to remove the first two messages. We will talk about these tokens later.

It’s important to realize we are using the Chat completions API here. You can find more information about Chat completions here.

If you did not quite get how the text response gets extracted, here is an example of a full response from the Chat completion API:

{
 'id': 'chatcmpl-6p9XYPYSTTRi0xEviKjjilqrWU2Ve',
 'object': 'chat.completion',
 'created': 1677649420,
 'model': 'gpt-3.5-turbo',
 'usage': {'prompt_tokens': 56, 'completion_tokens': 31, 'total_tokens': 87},
 'choices': [
   {
    'message': {
      'role': 'assistant',
      'content': 'The 2020 World Series was played in Arlington, Texas at the Globe Life Field, which was the new home stadium for the Texas Rangers.'},
    'finish_reason': 'stop',
    'index': 0
   }
  ]
}

The response is indeed in choices[0]['message']['content'].

To make this rudimentary chat bot work, we will repeatedly call the chat method like so:

bot = ChatBot("You are an assistant that always answers correctly. If not sure, say 'I don't know'.")
while True:
    bot.chat()

Every time you input a question, the API answers and both the question and the answer are added to the messages list. Of course, that makes the messages list grow larger and larger, up to a point where it gets too large. The question is: “What is too large?”. Let’s answer that in the next section.

Counting tokens

A language model does not work with text as humans do. Instead, it uses tokens. It’s not important how this works exactly, but it is important to know that you get billed based on these tokens. You pay per token.

In addition, the model we use here (gpt-3.5-turbo) has a maximum limit of 4096 tokens. This might change in the future. With our code, we cannot keep adding messages to the messages list because, eventually, we will pass the limit and the API call will fail.

To have an idea about the tokens in our messages list, we have this function:

    def num_tokens_from_messages(self, messages, model="gpt-3.5-turbo"):
        try:
            encoding = tiktoken.encoding_for_model(model)
        except KeyError:
            # fall back to the encoding used by gpt-3.5-turbo
            encoding = tiktoken.get_encoding("cl100k_base")
        if model == "gpt-3.5-turbo":  # note: future models may deviate from this
            num_tokens = 0
            for message in messages:
                num_tokens += 4  # every message follows <im_start>{role/name}\n{content}<im_end>\n
                for key, value in message.items():
                    num_tokens += len(encoding.encode(value))
                    if key == "name":  # if there's a name, the role is omitted
                        num_tokens += -1  # role is always required and always 1 token
            num_tokens += 2  # every reply is primed with <im_start>assistant
            return num_tokens
        else:
            raise NotImplementedError(f"""num_tokens_from_messages() is not presently implemented for model {model}.""")

The above function comes from the OpenAI cookbook on GitHub. In my code, the function is used to count tokens in the messages list and, if the number of tokens is above a certain limit, we remove the first two messages from the list. The code also prints the tokens so you know how many you will be sending to the API.
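
Removing the first two messages also removes the system message. As a variation (not what the code in this post does), the sketch below keeps the first message and drops the oldest other messages until the conversation fits. It assumes the first message is the system message, and it takes the token counter as a parameter so it can be tried without tiktoken; the word-count counter in the example is only a stand-in.

```python
def trim_history(messages, count_tokens, max_tokens=4000):
    """Drop the oldest non-system messages until the token count fits."""
    system, rest = messages[:1], list(messages[1:])
    while rest and count_tokens(system + rest) > max_tokens:
        rest.pop(0)  # remove the oldest user or assistant message
    return system + rest

def word_count(messages):
    """Stand-in counter: counts words instead of real tokens."""
    return sum(len(m["content"].split()) for m in messages)

history = [
    {"role": "system", "content": "be nice"},
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five"},
    {"role": "user", "content": "six"},
]
trimmed = trim_history(history, word_count, max_tokens=5)
```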

The function contains references to <im_start> and <im_end>. This is ChatML and is discussed here. Because you use the ChatCompletion API, you do not have to worry about this. You just use the messages list and the API will transform it all to ChatML. But when you count tokens, ChatML needs to be taken into account for the total token count.
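
To see how that overhead adds up, here is the same counting logic with the encoder passed in as a parameter. This is only an illustration: str.split stands in for a real tokenizer, so the numbers are not real token counts, and the function name is hypothetical.

```python
def count_message_tokens(messages, encode, per_message=4, reply_priming=2):
    """Count tokens for a messages list, including the ChatML overhead."""
    total = 0
    for message in messages:
        total += per_message  # overhead for the <im_start>/<im_end> wrapping
        for key, value in message.items():
            total += len(encode(value))
            if key == "name":  # if there's a name, the role is omitted
                total -= 1
    return total + reply_priming  # the reply is primed with <im_start>assistant

sample = [
    {"role": "system", "content": "You are helpful"},
    {"role": "user", "content": "Hi there"},
]
# First message: 4 + 1 (role) + 3 (content); second: 4 + 1 + 2; plus 2 for priming
total_tokens = count_message_tokens(sample, str.split)
```

With tiktoken installed, you could pass encoding.encode instead of str.split to get realistic counts.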

Note that the Microsoft examples for Azure OpenAI do use ChatML in the prompt, in combination with the default Completion APIs. See Microsoft Learn for more information. You will quickly see that using the ChatCompletion API with the messages list is much simpler.

To see, and download, the full code, see GitHub.

Running the code

To run the code, just run On my system, I need to use python3 I set the system message to You are an assistant that always answers wrongly. Contradict the user. 😀

Here’s an example conversation:

Although, at the start, the responses follow the system message, the assistant starts to correct itself and answers correctly. As stated, user messages eventually carry more weight.


In this post, we discussed how to build a chat bot using the ChatGPT API and Python. We went through the setup process, created an OpenAI account, and wrote the chat bot code using the OpenAI API. The bot uses the ChatCompletion API and maintains context in the conversation by storing and sending previous messages to the API at each request. We also discussed counting tokens and truncating the message list to avoid exceeding the maximum token limit for the model. The full code is available on GitHub, and we provided an example conversation between the bot and the user. The post aims to guide beginning developers and beginners in AI and chat bot development through the step-by-step process of building their own chat bot with the ChatGPT API, while keeping things as simple as possible.

Hope you liked it!
