
A vector database is a type of database that stores and manages unstructured data, such as text, images, or audio, in vector embeddings (high-dimensional vectors) to make it easy to find and retrieve similar objects quickly.

How do you make that more tangible to a software engineer?

Some parts to answering this question:

  • What is meant by "high dimensional" vectors? Does that just mean the array of numbers is a big array? Why is that important / highlighted?
  • How is text, for example, converted to a vector of numbers (at a high level)?
  • Why do they call them vector "embeddings"?
  • I don't need to learn the theory of vectors; how can I understand a vector database's practical function and use (as someone who wants to build an AI assistant) without learning vector theory exactly?

I am an experienced software developer with experience with relational databases like PostgreSQL, and other databases like MongoDB (a document database) or Neo4j (a graph database). So how does a vector database work, at a high level? I would just like a practical sense of how it works, so I know what I'm saving when I save to Pinecone, what I'm querying, and roughly how a query works.

Relational databases are easy to understand. You just look at a spreadsheet, and the columns have names. Graphs are also easy to understand: you have your named models/types, and they have links between them. But vector databases? Is it literally a hashmap of key = something and value = array of numbers? Or what is it? How do I get a practical sense of it, like with these other database types?

References I've used to get a slightly better understanding so far:

Lance
    You know how when you go to an actual physical library, and you look a book up, and go to the shelf it's on by using the Library of Congress number, or Dewey decimal number, and the books near it are on very similar subjects? Well, it's like that, but with a whole lot more numbers than LoC or Dewey (thus "high-dimensional vectors"). – JonathanZ Aug 11 '23 at 14:30

3 Answers


When trying to understand vector databases, it is useful to start with the low-dimensional vectors your brain is really good at understanding: map coordinates.

Imagine you are Google Maps, and you have a database of every business on the planet; millions and millions of latitude/longitude pairs. Now you want to run a query: "Find restaurants near 38.8977° N, 77.0365° W". You could calculate the distance of every restaurant to that point using the distance formula you learned in grade school, √((x2 – x1)² + (y2 – y1)²), then sort by the result. That would be slow though. You would have to do this calculation for every restaurant. What you want is to be able to build an index for this query, the same way you would build a hash-table index for an exact-match query or a b-tree index for an inequality query.
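That brute-force scan can be sketched in a few lines of Python (the restaurant names and coordinates are illustrative, not exact):

```python
import math

# Hypothetical restaurant coordinates (latitude, longitude) -- illustrative only.
restaurants = {
    "Ben's Chili Bowl":  (38.9172, -77.0283),
    "Old Ebbitt Grill":  (38.8979, -77.0331),
    "Founding Farmers":  (38.9003, -77.0443),
}

query = (38.8977, -77.0365)  # the point from the query above

def distance(a, b):
    # The grade-school distance formula: sqrt((x2-x1)^2 + (y2-y1)^2).
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

# Brute force: compute the distance to every restaurant, then sort.
nearest = sorted(restaurants, key=lambda name: distance(restaurants[name], query))
print(nearest[0])
```

This is exactly the work a spatial index saves you: it answers the same query without touching every row.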

Indexes that can do this for points* on a map are called spatial indexes. A database that supports these index types might be called a spatial database, but if you are willing to stretch your brain a little bit you might find applications for this type of index that don't have to do with maps. For example, consider this graph:

graph of birthrate vs GDP

Suppose you want to know how a particular policy might impact a country, and you think GDP and birthrate are significant factors. You might ask yourself "what other countries are similar in these 2 ways that I have deemed significant". You would be looking for other countries that were your nearest neighbors, not on a physical map, but on the graph.

You can extend this idea to 3 dimensions. Maybe you're NASA and you need to find the closest objects in space to a given point. Because you are in 3d, your points will be 3d vectors (x, y, z). Maybe you want to do this for a graph, but you have 3 significant factors rather than 2. Again, it's the same principle just with 3d points rather than 2d points.

You can extend this further, into 4d, 5d, or 512d graphs. The problem is just that you quit being able to draw those graphs in our 3d reality. The math still works though.
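The distance formula itself doesn't change as dimensions are added; each extra dimension just contributes one more term under the square root. A quick sketch:

```python
import math
import random

def distance(a, b):
    # Euclidean distance in any number of dimensions:
    # sqrt((a1-b1)^2 + (a2-b2)^2 + ... + (ad-bd)^2)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# A 3-d example: same formula, one more term under the square root.
print(distance((0, 0, 0), (1, 2, 2)))  # 3.0

# A 512-d example: impossible to draw, trivial to compute.
random.seed(0)
p = [random.random() for _ in range(512)]
q = [random.random() for _ in range(512)]
print(distance(p, q))
```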

One of the AI examples that is easiest to think about with this is facial recognition. You need to reduce a face to a bunch of numerical properties, like eye color, hair color, distance between the eyes, nose size, ear size, skin tone, etc. Then you need to find other faces that are "close" to this one.

* or other shapes, like squares or circles

NOTE: I have been explaining "distance" as euclidean distance. If you aren't trying to learn too much about vectors, don't worry about it, but there are alternative ways of measuring distance.

9072997

What is meant by "high dimensional" vectors? Does that just mean the array of numbers is a big array? Why is that important / highlighted?

Yes, an array of numbers (floats) of a fixed length: for example 256, 512, 1024, or 12288.

How is text, for example, converted to a vector of numbers (at a high level)?

Neural networks learn to map tokens, words, or texts into arrays of numbers by performing stochastic gradient descent optimization on certain simple problems (for example, predicting the next word in a text).

Why do they call them vector "embeddings"?

In mathematics, one of the meanings of this word is simply a mapping of an object to a point in some space; in this case, for example, the space $\mathbb{R}^{512}$.

I don't need to learn the theory of vectors; how can I understand a vector database's practical function and use (as someone who wants to build an AI assistant) without learning vector theory exactly?

You can view them as traditional databases that store embeddings in tables with $d$ columns, one per dimension, plus an identifier for the object and perhaps some extra metadata.

The main advertised advantage is their support for similarity search. In machine learning and information retrieval, when you have a vector (an array of numbers) representing an object or a query, there is often a need to find another vector, representing a document, that is most similar to the query vector. The neural networks that learn to produce those embeddings from texts or other objects are often such (or can be made such by certain techniques) that if objects x and y are similar in a certain semantic sense (humans would label them as similar), then for some simple mathematical function similarity, the value of similarity(embedding(x), embedding(y)) is very large. I know you don't want to learn the theory, but the terms to look up are cosine similarity and Euclidean distance.
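Both functions just named fit in a few lines of Python. The vectors below are toy 3-dimensional examples, not real embeddings:

```python
import math

def cosine_similarity(a, b):
    # Dot product of the vectors, divided by the product of their lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 3-d "embeddings"; real ones have hundreds or thousands of dimensions.
cat    = [0.90, 0.10, 0.20]
kitten = [0.85, 0.15, 0.25]
car    = [0.10, 0.90, 0.80]

print(cosine_similarity(cat, kitten))  # close to 1.0
print(cosine_similarity(cat, car))     # much smaller
```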

The problem that similarity search indices solve is this: in order to find a document $y$ that maximizes similarity(embedding(x), embedding(y)) you would have to scan the whole table, which would be much too slow for applications that operate with millions of objects or more.

Recently, a number of practically fast and very different methods have become popular that solve this "nearest neighbour" search problem not exactly but approximately, guaranteeing, for example, that the most similar object will be found for 95% of queries: HNSW (a graph-based algorithm with an open-source library), product quantization (a method based on certain mathematical properties of vectors, implemented by several companies both commercially and as open source, e.g. FAISS), Annoy (an open-source method developed by an ex-developer at Spotify), and others.

There is also another (older) group of methods, known as random projections / locality-sensitive hashing, developed by computer scientists; they are simple and have been proved optimal in theory, but they don't seem to be as successful in practice for this task.
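For a feel of the random-projection idea (a toy sketch, not a production LSH implementation): hash each vector by recording which side of a few random hyperplanes it falls on. Nearby vectors tend to land on the same side of every plane, so they tend to share a hash bucket.

```python
import random

random.seed(42)
d = 8          # vector dimension
n_planes = 4   # number of random hyperplanes, so hashes are 4 bits

# Each hyperplane is defined by a random normal vector.
planes = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_planes)]

def lsh_hash(v):
    # One bit per hyperplane: which side of the plane does v fall on?
    return tuple(int(sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0)
                 for p in planes)

v = [random.random() for _ in range(d)]
w = [x + random.gauss(0, 0.001) for x in v]  # a near-duplicate of v
print(lsh_hash(v), lsh_hash(w))  # nearby vectors usually share a bucket
```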

So, while I have never used them, I assume that what vector databases do is implement those indices and similarity search so that they work with very large-scale, distributed data, and support insertion, deletion, updates, persistence to disk, and other essential database operations.

Valentas

I don't need to learn the theory of vectors; how can I understand a vector database's practical function and use (as someone who wants to build an AI assistant) without learning vector theory exactly?

  1. You have some data — say, a closed internal corporate help system — that the AI model wasn't trained on. If you ask the AI "how do I blurb my squirms using my Foobarr(tm) platform", you won't get a meaningful answer.

  2. You take the text from your help articles, and run a special function on every piece of text. This function:

    1. Takes a text string as input1.
    2. Returns a tuple of numbers as output.

    Such a function is called an embedding model. It's a deterministic function2. For the same input it will always return the same output. Some popular examples of embedding models are ADA2 and Instructor, but there are many more.

  3. You store these tuples of numbers (called embeddings) in a vector database, along with the key of your article (usually its URL or numeric id in the CRM). Some examples of vector databases are Pinecone and Weaviate, but there are many others.

  4. The embedding model math is wired in such a way that strings of semantically close text return arrays of numbers with high cosine similarity3.

    This is where the magic happens. The embedding model transforms texts that are close to each other semantically to tuples (fixed-length arrays) of numbers that are close to each other numerically.

    Semantical closeness of phrases might be a very vague concept, but numerical closeness of vectors has a certain, strict mathematical definition which computers can work with.

  5. When you get a query for your chat bot, you will first call this function on the user's prompt, "how do I blurb my squirms using my Foobarr(tm) platform", and get a tuple of numbers back. You save it in a variable prompt_embedding.

  6. You issue a query to your vector database to the following effect: "Here's a vector. Give me top three records, ordered by descending cosine similarity with this vector, as long as it's more than 0.8". You would use code similar to this4:

    client.query
        .get("help_articles", ["content", "key"])
        .with_near_vector({
            "vector": prompt_embedding,
            "certainty": 0.8
        })
        .with_limit(3)
        .with_additional("certainty")
        .do()
    
  7. Because of the way the embedding model works, there is a high chance that returned records will correspond to the articles relevant to blurbing the squirms.

  8. You retrieve the text of the relevant articles using the key returned by the vector database5.

  9. You feed the text of the articles to your AI model, usually in the system portion of the prompt6. On Llama2, your prompt to the model will look something like this:

    <s>[INST] <<SYS>>
    {{

    You are a virtual assistant for Foobarr. You cover the following topics: Foobarr platform help. Use the articles below to answer the question. Condense the most relevant article into no more than thirty sentences. Append its hyperlink. Append the hyperlinks to other relevant articles. Output in Markdown format.

    Url: http://foobarr.example.com/help/blurbing-the-squirms Content: Efficient Squirm Blurbing is our specialty! Foobarr(tm) platform will help you blurb your squirms in no time. Log in to the app using your mobile phone, point the camera to your squirm or squirms, and click "Blurb". The squirm gujjuggins will get extraburbated in the cloud…

    Url: http://foobarr.example.com/help/blurb-efficiency Content: What to do if gujjuggins come out underextraburbated? First thing to check is the squirm is sufficiently dimarquidated. To do this, open the main menu, click "Marquidation", then go to…

    Url: http://foobarr.example.com/help/squirms-and-gujjuggins Content: Gujjuggins is what makes the squirms blurbable. They are tiny pieces of…

    }} <</SYS>>

    {{ how do I blurb my squirms using my Foobarr(tm) platform? }} [/INST]

    Note that the text in the articles is semantically relevant to your user's prompt.

  10. The AI model will formulate the answer based on the user's prompt, domain knowledge you just gave it, its own knowledge of things and its own internal command of human language; and give it back to the user.
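Put together, steps 2 through 10 can be sketched end to end in a few dozen lines of Python. Everything here is a toy stand-in: fake_embedding is just a hash (deterministic with a fixed-length output, like a real embedding model, but capturing no semantics at all), ToyVectorStore is a brute-force in-memory dict rather than a real indexed database, and the llm parameter is a placeholder for whatever model you call in step 9.

```python
import hashlib
import math

def fake_embedding(text, dims=8):
    # Stand-in for a real embedding model (e.g. ADA2): deterministic,
    # fixed-length tuple of floats. Unlike a real model, NO semantics.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return tuple(digest[i % len(digest)] / 256 for i in range(dims))

class ToyVectorStore:
    # Stand-in for a vector database: key -> vector, brute-force query.
    def __init__(self):
        self.records = {}

    def upsert(self, key, vector):
        self.records[key] = vector

    def query(self, vector, top_k=3):
        # Rank every stored vector by cosine similarity to the query.
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)
        ranked = sorted(self.records,
                        key=lambda k: cos(self.records[k], vector),
                        reverse=True)
        return ranked[:top_k]

def answer(user_prompt, articles, llm):
    # Steps 2-3: embed every article and store it under its URL.
    store = ToyVectorStore()
    for url, text in articles.items():
        store.upsert(url, fake_embedding(text))
    # Step 5: embed the user's prompt with the same model.
    prompt_embedding = fake_embedding(user_prompt)
    # Steps 6-8: get the keys of the most similar articles, then their text.
    keys = store.query(prompt_embedding, top_k=2)
    context = "\n\n".join(articles[k] for k in keys)
    # Steps 9-10: hand the articles to the LLM as context for the prompt.
    return llm(system="Use the articles below.\n\n" + context,
               prompt=user_prompt)

articles = {
    "http://foobarr.example.com/help/blurbing-the-squirms":
        "Efficient Squirm Blurbing is our specialty!",
    "http://foobarr.example.com/help/blurb-efficiency":
        "What to do if gujjuggins come out underextraburbated?",
}
print(answer("how do I blurb my squirms", articles,
             llm=lambda system, prompt: f"{system}\n---\n{prompt}"))
```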

Some cloud products (which might be referred to as "large language models") bundle this functionality and will do it for you with a single HTTP call. They are still doing the same or a similar thing under the hood, so the LLM proper is only a part of this bundle.

The embedding model, the vector database and the large language model are separate things, not coupled to each other. The (ultimate) artefact of the embedding model and the vector database is the list of human-language texts relevant to your prompt, which you can feed to any LLM to give it domain knowledge.

There are several practical considerations you'll have to keep in mind when using this approach:

  1. LLMs have a limited prompt size. You will have to come up with a way to shrink the domain-specific portion of the prompt so that it fits into the limit.

  2. Since the LLM doesn't have any memory, you will need to feed it the history of the conversation, complete with the domain-specific data, with every new prompt, if you want to maintain the conversation context. At some point you will need to throw away some of the history because it will not fit into the LLM any more. Deciding what to keep and what to throw away is a challenging task.

  3. If the conversation takes a sharp turn (when you jump from one topic to another), most LLMs will start to hallucinate or return responses irrelevant to the prompt. Detecting this is another challenging task.

  4. Some prompts are of meta-nature (for instance, saying "Translate it for me into Russian" or "Expand the abbreviations"). Such prompts are usually prone to yielding false positive matches from the vector database, because they use generic text that has high chance to be semantically relevant to something in your articles.

Embedding scoring can help you with that as well. You can use it to decide which parts of the history are still relevant to the latest prompts, so that you can wipe some of the context, or even restart the conversation altogether.
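A crude but workable starting point for the "what to keep" decision from consideration 2 above is a sliding window over the most recent messages. This is a sketch only: real systems count tokens rather than characters, and often summarize older messages instead of dropping them outright; the 300-character budget below is arbitrary.

```python
def trim_history(messages, max_chars=4000):
    # Keep the most recent messages whose combined size fits the budget,
    # dropping the oldest ones first.
    kept, total = [], 0
    for msg in reversed(messages):
        if total + len(msg) > max_chars:
            break
        kept.append(msg)
        total += len(msg)
    return list(reversed(kept))

history = ["first message " * 20, "second message " * 20, "latest message"]
print(trim_history(history, max_chars=300))
```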


1 Under the hood, the string of text is first converted to an array of numbers called tokens using another function called "tokenizer". Most popular embedding models come with libraries that will hide this step from you. From the user's perspective, it's a string in and a tuple of numbers out.
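To make the tokenizer idea concrete, here is a toy whole-word tokenizer. Real tokenizers use subword schemes such as BPE or WordPiece rather than whole words, but the interface is the same: string in, sequence of integer ids out.

```python
# Toy tokenizer: maps each distinct word to an integer id.
# Real tokenizers (BPE, WordPiece) split text into subword units instead.
vocab = {}

def tokenize(text):
    ids = []
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)  # assign the next free id
        ids.append(vocab[word])
    return ids

print(tokenize("how do I blurb my squirms"))  # -> [0, 1, 2, 3, 4, 5]
```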

2 This function is doing a combination of matrix multiplications and some other freshman year level math. It has some moving pieces:

  • In addition to the explicit parameter (a string of text), it takes a lot of numbers as implicit parameters. They usually live in a folder on your HDD, and get pre-loaded to the memory when you initialize the model library.

  • There might be also some additional explicit parameters in the function call, which you usually hardcode in your program and never change.

  • Differences in CPU and GPU architectures can slightly affect the precision of the math on different machines.

  • Some embedding models (like ADA2) you can only access as a remote function in the cloud, which means that they can replace any of that on their backend any time and the same HTTP call will give you a different answer.

But as long as all these moving pieces stay fixed, an embedding model is as deterministic as the sine or the cosine.

3 The exact formula for cosine similarity is easy to look up. You don't have to know how it works to build the chat bot, but you do have to know whether that's the algorithm the model uses. The distance algorithm is mentioned in the model's description. Some models use distance algorithms other than cosine similarity, which your vector database should know of and support, if that's the case.

4 The vector database is able to execute this query efficiently, usually using the indexing algorithm known as Hierarchical Navigable Small Worlds. You don't have to know how it works internally to work with it.

5 You don't necessarily have to use the vector database for getting the text: you can just query your help CRM directly, or download it via HTTP straight from the website. Most vector databases are able to store extra data along with the vectors, letting you get the relevant text in a single call to the database.

6 ChatGPT and its friends have a special slot for it in the JSON format that they accept; Llama2 has a slot in its prompt structure format for it; for other models you might just insert it as a part of human-language prompt. Note that all models have limited prompt size; if your text is too long you will have to come up with a way to shorten it so that it fits in the prompt.

Quassnoi