A QUICK SUMMARY – FOR THE BUSY ONES
Context & Problem:
Scroll down to explore the implementation of vector embeddings in more detail, step by step.
TABLE OF CONTENTS
In this series, we’re looking into the implementation of a vector index built from the contents of our company Notion pages that allow us not only to search for relevant information but also to enable a language model to directly answer our questions with Notion as its knowledge base. In this article, we will focus on the theory of vector embeddings.
Numbers, vectors, and charts are real data unless stated otherwise
Notion, with its infinite flexibility, is perfect for keeping unstructured notes, structured databases, and anything in between. Thanks to that flexibility, adding stuff is easy. It’s so easy in fact, that we add, and add, and add, then add some more until we have 5000 pages of meeting-notes, temporary-notes and god-knows-what-notes.
Do you have that problem at your company? Tens of thousands of pages in Notion, created by dozens of people with different ideas about naming and structuring data. You’re adding new pages regularly just to keep things from being lost and then… they are lost. You try to search for something and you get pages that are kind-of on-topic, but don’t answer the question, some completely unrelated stuff, a random note from 2016, and an empty page for good meassure. Notion truly works in mysterious ways.
We can divide companies into two groups: those before their failed Notion reorganization attempt, and those after. We can try to clean it up, reorganize, and add tags but trying to clean up thousands of pages, with new ones being added daily, is doomed to fail. So we’ve decided to solve this problem with the power of neural networks. Our goal was to create a separate index that would allow us to efficiently search through all the pages in Notion and find ones that are actually relevant to our query. And we did it! (with some caveats - more on that at the end of the series)
But before we get there, let’s dive into the technologies we’re going to need.
Let’s consider a simple neural network that accepts text as input. It could be a classification network for example. Let’s say that you want to train the network to detect texts about foxes. How can you represent a text so that a neural network can work on it?
The quick brown fox jumps over the lazy dog
Neural networks understand and can work on vectors.
Well, actually, while a vector is just a one-dimensional tensor and simple neural networks can have vectors as input, more complex architectures, including most language models, work on higher-dimensional tensors. We’re simplifying a bit just to get to the point.
A n-dimensional vector is just a one-dimensional array with length n. The text about the brown fox is definitely not a vector. Let’s do something about it. One way of converting the text to numbers is to create a dictionary and assign each word a unique ID, like this (these are actual IDs from the dictionary used by OpenAI’s GPT models):
You may have noticed that the words 'the' and ' the' (with space at the beginning) have different IDs. Why do you think it’s useful to have those as separate entries in the dictionary?
So we figured out how to convert text into a vector, great! There are issues with it though. First, even though our example is straightforward and small, real texts may be much longer. Our vector changes dimensionality based on the length of the input text (which in itself may be an issue for some simple architectures of neural networks) so for long texts, we need to deal with huge vectors. Additionally, it’s harder to successfully train the neural network on data that contains big numbers (without going into details, while neural networks do calculations on big numbers just fine, the issue is the scale of the numbers and the potential for numerical instability during training when using gradient-based optimization methods). We could try one-hot encoding (by taking all entries from a dictionary, assigning 0 to entries that do not appear in our text, and 1 to ones that do) to have a constant-length vector with only ones and zeros, but that doesn’t really help us that much. We still end up with a large vector - this time it’s huge no matter the length of the input text and additionally, we lose the information about word position. And last but not least: what about the words we don’t have in our dictionary? We need syllables or even separate letters to be assigned IDs in some texts which makes the resulting vector even longer and messier.
The dictionary-based vectors are used in practical NLP applications e.g. as inputs to neural networks. One-hot encoding is also used but for different purposes (e.g. when the dictionary is small, or for categorization problems).
There is one thing that our vector could tell us about the input text, but doesn’t - the meaning. We could decide if the text is about foxes just by checking if there is an element 21831 in it, but what about the texts about "omnivorous mammals belonging to several genera of the family Canidae” or about "Vulpes bengalensis”? Meaning can be conveyed in many more ways than we can include in a simple algorithm or a whitelist of words (I bet you wouldn’t think of adding “bengalensis” to your dictionary, would you?). For that reason, to know for sure, we need to process the whole text, even if thousands of words long. The only way we can reliably decide if the text is about foxes is to pass the vector on to the advanced, and expensive, language model to process it and understand what those words, and the relationships between them, mean. Or is it?
Embed… what? In simple terms, embedding is a way of representing data in a lower-dimensional space while preserving some relationship between data points. In other words, embedding is a smaller vector representation of a larger vector (or more general: a tensor), that still contains some important information about it. It can be calculated algorithmically but in cases like ours, this is done with the help of neural networks.
There are many ways to reduce the dimensionality of the data, like principal component analysis which is fast, deterministic, and used in data compression, noise reduction, and many other areas, or t-SNE which is used mainly in high dimensional data visualization and in the next article of this series ;)
The simplest example would be to represent our huge vector with potentially thousands of dimensions as just one number, representing how much about foxes the text is. For the sentence about jumping fox, we would probably get something like this:
Not very exciting. In reality, embeddings are a little bigger - depending on the use case they may have hundreds or thousands of dimensions.
You may now think: "Wait a moment. Thousands of dimensions? It is certainly not less than those few numbers representing our jumping fox, isn't it?". In our case yes - the embedding will actually be bigger than the dictionary-based vector. But that vector is also an embedding of a huge, practically infinite-dimensional vector in the space of all possible texts. In that sense, we have a dictionary-based embedding which can be small for short texts but is not very useful, and a powerful embedding that may be a little bigger but gives us way more easily extractable information about the text.
Embeddings can be calculated from many different types of data: images, sounds, texts, etc. Each of those is just a big vector, or actually, a tensor. You can think about an image as a 2-dimensional tensor (a matrix) of color values, as big, as many pixels there are in an image. A sound file is often represented as a spectrogram which is also a 2-dimensional tensor.
In each case, each dimension of the, often smaller, embedding vector encodes a different aspect of the item’s meaning.
It’s also worth mentioning that, while in principle, each dimension of the embedding encodes some aspect of the original, in most cases, it is not possible to determine what exactly each number in the embedding means. The idea of “foxness”, while may be preserved in the embedding in some way, is foreign to the embedding model and is unlikely encoded in a single dimension but rather in a much more complicated and nuanced way.
Because vectors are also points in n-dimensional spaces, we can think about the relative positions in those spaces as semantic relationships. The closer the embedding vectors are, the more similar the data points are. In the case of image embeddings, images of dogs would be very close to each other, quite close to images of foxes but far away from images of spaceships. The same applies to text embeddings.
By “closer” we don’t necessarily mean the lowest euclidian distance but rather the closest angle, measured by cosine similarity. The reason for that is that in high-dimensional space the distances or the magnitudes of vectors become much less informative. The actual semantic differences are encoded in the angles, or directions of vectors. The magnitudes may also contain useful information such as the emphasis or frequency, but they’re harder to use due to "the curse of dimensionality" - a set of phenomena resulting from increasing the number of dimensions like rapid increase in space volume.
While all this is true of the types of embeddings we talk about (and care about) in this series, in reality not all embedding methods encode semantic information about the original. What is encoded in the embedding values, and what are the relationships between datapoints depend on the specific training process of the embedding model and its purpose. You can easily imagine an embedding calculated from an image that captures some aspects of color palette, or an audio embedding that encodes its atmosphere instead of semantic contents etc.
As mentioned earlier, we calculate embeddings like this with the help of a neural network. There are many models that can do this from simple and small models you can run on your computer to large, multi-language, commercially available models like OpenAI's ADA. In our case, for the simplicity of use, and to get the best performance possible we’ve decided to use the latter. Below is a fragment of text embedding of our fox text calculated using OpenAI's ADA embeddings model. The particular version of the ADA model we’ve used creates 1.5k-long vectors:
0.0035436812, 0.0083188545, -0.014174725, …, -0.031340215, 0.018409548, -0.041342948
A long vector like this is not terribly useful for us by itself. We can’t really “read” any interesting information from it nor we can determine if it’s about foxes just by looking at it. However, we can use the relationships between embeddings of different texts to our advantage, for example, to find foxes or to implement efficient semantic search. But this is a topic for the next article.
Every year, Brainhub helps 750,000+ founders, leaders and software engineers make smart tech decisions. We earn that trust by openly sharing our insights based on practical software engineering experience.
Popular this month
Get smarter in engineering and leadership in less than 60 seconds.
Join 300+ founders and engineering leaders, and get a weekly newsletter that takes our CEO 5-6 hours to prepare.
previous article in this collection
It's the first one.
next article in this collection
It's the last one.