ChatGPT is awesome, but it has the significant drawbacks that its knowledge cuts off in 2021 and it was trained only on publicly available or specifically licensed data. As a result, it is not aware of the vast amount of data found in private repositories, databases, and knowledge bases. But a new approach to overcome this deficiency is rapidly gaining popularity. This approach leverages ChatGPT’s ability to learn from in-context data, which consists of additional data that a user can provide in the prompt as additional context to the question.
Using this technique, you can take advantage of ChatGPT’s ability to query your data using natural language and allow your users to converse with ChatGPT about your data. The general approach is as follows:
- Split the documents and other content in your knowledge base into smaller, more manageable sections or chunks.
- Use OpenAI’s API to generate embeddings for each of the chunks.
- Store the embeddings in a database with support for vector searches, enabling efficient storage and retrieval of the chunks.
- For each user query, generate an embedding for the query and perform a similarity search on the database to find the most relevant chunks in the knowledge base.
- Inject the relevant chunks into the prompt to provide contextual information to ChatGPT.
Let’s delve into the details.
Split Your Knowledge Base into Chunks
Documents should first be split up into smaller semantic chunks that can be individually stored and searched. How you split up your content into chunks depends on the type of content. Binary file types, such as PDF files, must first be converted to text before being broken down into smaller sections. Content already in text format only needs to be split into smaller sections using a parser. The parser should split the content into semantically meaningful chunks. For instance, in HTML and markdown pages, you can create smaller sections based on the semantic elements in the page like headers and sections.
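As a rough illustration of heading-based splitting, here is a minimal Python sketch for Markdown content. The function name `split_markdown`, the `max_chars` threshold, and the returned dictionary shape are all assumptions made for this example, not part of any particular library. It treats every heading as the start of a new chunk and falls back to paragraph boundaries when a section is too long.

```python
import re

# Matches Markdown ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown(text, max_chars=1000):
    """Split Markdown text into chunks, one per heading section.

    Sections longer than max_chars are split further on blank lines.
    A single paragraph longer than max_chars is kept whole.
    """
    sections = []
    title, lines = "", []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the previous section (if it has content).
            if any(l.strip() for l in lines):
                sections.append((title, "\n".join(lines).strip()))
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        sections.append((title, "\n".join(lines).strip()))

    chunks = []
    for title, body in sections:
        if len(body) <= max_chars:
            chunks.append({"title": title, "text": body})
            continue
        # Oversized section: pack paragraphs greedily up to max_chars.
        current = ""
        for para in re.split(r"\n\s*\n", body):
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append({"title": title, "text": current})
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append({"title": title, "text": current})
    return chunks
```

Keeping the heading title with each chunk is a design choice: it can be stored as metadata or prepended to the chunk text so that the chunk still makes sense when read in isolation.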
Generate Embeddings for the Chunks
Once the content is divided into smaller chunks, you can generate embeddings for each chunk using OpenAI’s embeddings API. An embedding is a numerical representation of the text that captures the underlying semantic information of the text in a vector space. Embeddings are important because they allow us to perform similarity searches on text.
For example, a simple model that generates 5-dimensional embeddings would represent the sentence “I love apples” as a sequence of 5 numbers, such as [-0.123, 0.586, 0.234, -0.567, 0.897]. The OpenAI “text-embedding-ada-002” model generates embeddings with 1536 dimensions.
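To make "similarity" concrete, the sketch below compares embeddings with cosine similarity, a measure commonly used for this purpose. The first vector is the illustrative one from the example above; the other two vectors and their sentences are made up for this sketch and do not come from a real model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Toy 5-dimensional embeddings (hypothetical values, for illustration only).
apples = [-0.123, 0.586, 0.234, -0.567, 0.897]   # "I love apples"
pears = [-0.10, 0.60, 0.20, -0.50, 0.90]         # "I like pears"
server = [0.80, -0.20, 0.50, 0.30, -0.40]        # "The server crashed"

print(cosine_similarity(apples, pears))   # close to 1: similar meaning
print(cosine_similarity(apples, server))  # much lower: unrelated meaning
```

Values near 1 mean the vectors point in nearly the same direction, which is what a similarity search looks for.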
Store the Embeddings in a Vector Database
The next step is to store these text chunks alongside their embeddings in a vector database, that is, a database that supports vector searches. Also known as similarity searches, these find the items in a collection most semantically related to a given query, based on their vector representations. Various vector databases are available, including Pinecone (commercial cloud-based), Supabase (open source Firebase alternative), PostgreSQL (both Supabase and PostgreSQL require the pgvector extension), Weaviate (open source cloud-based), Milvus (open source self-hosted), and Chroma (open source self-hosted with cloud planned). Redis also added support for Vector Similarity Search in RediSearch 2.4.
Generate User Query Embeddings and Find Relevant Chunks
When a user submits a query, generate an embedding for the text in the query the same way it was done for the chunks, using OpenAI’s embeddings API. The embedding is then used to query the vector database for the chunks of text most similar to the user’s query. This process can become quite complex, as the user may want to query only for a specific range of documents in your knowledge base, such as by date, or specify another type of filtering in natural language that would have to be interpreted. To address this, we should organize the data in the vector database into namespaces that can be queried individually and selectively combined. More importantly, we must be able to interpret the user’s natural language into an actual filter we can use when querying the knowledge base to restrict the chunks that are searched for similarity. We can, of course, leverage ChatGPT for that purpose too.
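The combination of filtering and similarity search can be sketched in plain Python. This is a brute-force, in-memory stand-in for what a vector database would do with its own indexes and metadata filters. The `Chunk` fields, the `search` signature, and the toy 3-dimensional embeddings are all assumptions for this example. In a real system the filter values (`namespaces`, `since`) would come from interpreting the user's request.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    text: str
    embedding: list
    namespace: str
    published: date


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def search(chunks, query_embedding, k=3, namespaces=None, since=None):
    """Return the k chunks most similar to the query, after filtering."""
    # Step 1: restrict the candidates using the metadata filters.
    candidates = [
        c for c in chunks
        if (namespaces is None or c.namespace in namespaces)
        and (since is None or c.published >= since)
    ]
    # Step 2: rank the remaining candidates by similarity to the query.
    ranked = sorted(
        candidates,
        key=lambda c: cosine(c.embedding, query_embedding),
        reverse=True,
    )
    return ranked[:k]


# Toy data with made-up 3-dimensional embeddings.
chunks = [
    Chunk("Q1 sales report", [0.9, 0.1, 0.0], "finance", date(2023, 1, 15)),
    Chunk("Q4 sales report", [0.8, 0.2, 0.1], "finance", date(2022, 10, 1)),
    Chunk("Onboarding guide", [0.1, 0.9, 0.2], "hr", date(2023, 2, 1)),
]
query = [1.0, 0.0, 0.0]

for hit in search(chunks, query, k=2, since=date(2023, 1, 1)):
    print(hit.namespace, hit.text)
```

Filtering before ranking keeps out-of-scope chunks from crowding relevant ones out of the top results.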
Inject the Relevant Chunks into the Prompt
Finally, we inject the most similar chunks retrieved from the vector database into the prompt, alongside the actual user’s query, when we forward the user’s query to ChatGPT. These chunks will provide the additional context that ChatGPT can use to respond to the query. However, here we can also run into issues pretty quickly due to the limitations in ChatGPT’s maximum prompt size.
GPT-3.5 has a limit of 4,096 tokens and GPT-4 of 8,192 tokens (though OpenAI plans to raise that to 32,000 tokens). This limit is shared between the prompt and the generated response. Tokens are the individual units of text that a language model such as GPT uses for understanding and generating text. Because tokens can represent words, punctuation marks, or even sub-words, the number of words you can fit in a prompt is noticeably smaller than the token limit. I have experienced errors in ChatGPT with GPT-4 at prompt sizes larger than 3,000 words, even though the limit is roughly 8,000 tokens. One rule of thumb I have seen mentioned frequently is that one token is roughly four characters, or about 0.75 of an English word.
To avoid an error, we must tokenize the content to be submitted and calculate the number of tokens. This allows us to limit the number of tokens in the query and fine-tune the amount of context passed to ChatGPT, ensuring it stays within the token limit.
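A minimal sketch of this budgeting step is shown below. To stay dependency-free, it estimates token counts with the four-characters-per-token rule of thumb, which is only an approximation. An actual tokenizer such as OpenAI's tiktoken library would give exact counts. The function names, the prompt wording, and the default budget of 3,000 tokens are assumptions chosen for illustration.

```python
import math

SEPARATOR = "\n---\n"


def estimate_tokens(text):
    """Approximate the token count as one token per four characters."""
    return math.ceil(len(text) / 4)


def build_prompt(question, chunks, max_prompt_tokens=3000):
    """Assemble a prompt from the most relevant chunks that fit the budget.

    `chunks` is expected to be ordered from most to least relevant.
    """
    header = "Answer the question using only the context below.\n\nContext:\n"
    footer = f"\n\nQuestion: {question}\nAnswer:"
    # Reserve room for the fixed parts of the prompt first.
    budget = max_prompt_tokens - estimate_tokens(header) - estimate_tokens(footer)

    selected = []
    for chunk in chunks:
        cost = estimate_tokens(chunk + SEPARATOR)
        if cost > budget:
            break  # stop at the first chunk that does not fit
        selected.append(chunk)
        budget -= cost
    return header + SEPARATOR.join(selected) + footer
```

Setting `max_prompt_tokens` well below the model's limit leaves room for the response, since the prompt and the answer share the same token limit.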
Harness Personalized AI for Boundless Data Exploration
We have witnessed the remarkable efficiency gains ChatGPT has brought to our daily tasks, using just the general knowledge from its training. Now imagine ChatGPT tapping into your organization’s data, enabling users to engage in natural language conversations with your data. The possibilities are endless as users explore, analyze, and dissect their data without limitations. This will unleash a whole new level of AI-driven productivity and user experience, truly transforming the way we interact with information.