Introduction to Vector Databases: A Way of Storing High-Dimensional Data

AI Club
10 min read · May 31, 2024


Written by: Umair Shakeel and Muhammad Awwab Khan

What is a Vector Database?

In simple terms, a vector database efficiently stores and indexes vector embeddings to enable quick retrieval and similarity search. It offers functionalities like CRUD operations (Create, Read, Update, Delete), metadata filtering, horizontal scaling for handling large volumes of data, and serverless deployment options.

In the recent tide of AI, efficient data processing has become more crucial than ever for applications that involve large language models, generative AI, and semantic search.

The foundation of all these new applications is vector embeddings: a kind of vector data representation that carries the semantic information an AI needs to understand its inputs and to maintain the long-term memory it draws on when carrying out complex tasks.

But before diving into the details of a vector database, we need to understand some basic terms that often come up when discussing vector databases.

What is a Vector?

Well, that’s easy: a vector is just an array of numbers, where each element represents a specific feature or attribute of the data.

vector = [0, -2, …, 4]

But what’s cool about vectors is that they can represent more complex objects, like words, sentences, images, or audio files, as points in a continuous, high-dimensional space. Such a representation is called an embedding.

When we look at a group of vectors in one space, we can say that some are closer together and others are further apart. Some vectors appear to cluster together, while others may be sparsely distributed across the space.

Image credits https://www.pinecone.io/
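As a minimal illustration in plain NumPy (the vectors here are toy values, not real embeddings), we can measure which vectors sit close together and which sit far apart:

import numpy as np

# Three toy 3-dimensional vectors
cat = np.array([0.9, 0.8, 0.1])
kitten = np.array([0.85, 0.75, 0.2])
airplane = np.array([0.1, 0.2, 0.95])

# Euclidean distance: smaller means closer in the space
print(np.linalg.norm(cat - kitten))    # small: these two cluster together
print(np.linalg.norm(cat - airplane))  # large: far apart in the space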

But our data is rarely represented as vectors. Here is where vector embedding comes into play. It is a technique that allows us to represent almost any data type as vectors.

It’s not as easy as just turning data into vectors, though. We want to make sure that we can work with this transformed data without losing the meaning of it as it was before. When we want to compare two sentences, we don’t just want to look at the words they use; we want to see if they mean the same thing. If we want to keep the meaning of the data, we need to know how to make vectors where the relationships between them make sense.

We need something called an embedding model to do this. So let’s first understand the embedding model.

Embedding Model

Many modern embedding models are built by feeding a large amount of labeled data into a neural network. You may have heard of neural networks before; they are a popular tool for solving a wide range of complex problems. To put it simply, neural networks are made up of layers of nodes connected by functions, and we train them to perform a variety of tasks. With supervised learning, we feed the network a large set of training data consisting of pairs of inputs and labeled outputs. The network adjusts the parameters of each layer with every training cycle. Eventually, even for an input it has never seen before, it can predict what the output label should be.

This neural network, with the final layer removed, is essentially the embedding model. We receive a vector embedding for an input rather than a particular labeled value.
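As a rough sketch of this idea, we can take a stock image classifier in PyTorch and chop off its final classification layer; the choice of ResNet-18 here is just an assumption for illustration, not a prescribed model:

import torch
import torchvision.models as models

# A network trained for classification (1000 ImageNet labels)
classifier = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Drop the final fully connected layer; what remains maps an image
# to a 512-dimensional feature vector -- our embedding model
embedder = torch.nn.Sequential(*list(classifier.children())[:-1])
embedder.eval()

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)   # stand-in for a real image
    embedding = embedder(image).flatten()  # shape: (512,)
print(embedding.shape)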

The widely used word2vec embedding model is a fantastic example and can be applied to a wide range of text-based tasks. See how semantically similar words are close together, while semantically different words are far apart.

Embeddings generated using https://projector.tensorflow.org/
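You can poke at word2vec yourself with the gensim library; this is a quick sketch, the pretrained weights (about 1.6 GB) download on first use, and the exact neighbors and scores vary with the model you load:

import gensim.downloader as api

# Load pretrained word2vec vectors trained on Google News
model = api.load("word2vec-google-news-300")

# Semantically similar words sit close together in the space
print(model.most_similar("king", topn=3))

# Cosine similarity: related words score high, unrelated words low
print(model.similarity("king", "queen"))     # high
print(model.similarity("king", "cucumber"))  # low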

Embeddings map semantically similar words, or similar features in virtually any other data type, close together. These embeddings can then be used for things like recommendation systems, search engines, and even text generation like ChatGPT. But once you have your embeddings, the question becomes: where do you store them, and how do you query them quickly? That’s where vector databases come in.

In a relational database you have rows and columns; in a document database you have documents and collections. But in a vector database you have arrays of numbers, clustered together based on similarity, that can be queried with ultra-low latency, making it an ideal choice for AI-driven applications.
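To make that concrete, here is a minimal sketch using Chroma, one open-source vector database; the collection name and documents are made up for illustration, and other vector databases expose similar operations.

import chromadb

client = chromadb.Client()  # in-memory instance, good for experiments
collection = client.create_collection(name="articles")

# Store a few documents; Chroma embeds them with its default model
collection.add(
    ids=["a1", "a2", "a3"],
    documents=[
        "Vector databases index embeddings for similarity search.",
        "Relational databases store data in rows and columns.",
        "HNSW is a graph-based index for nearest-neighbor search.",
    ],
)

# Query by meaning rather than by exact keyword match
results = collection.query(
    query_texts=["how do I search embeddings?"], n_results=2
)
print(results["documents"])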

How do vector databases work?

Each vector in a vector database represents an object or item, whether it is a word, an image, a video, a movie, a document, or another type of data. These vectors are likely to be long and complex, expressing each object’s location across dozens, possibly hundreds, of dimensions.

A vector database of movies, for instance, can locate movies according to criteria like length, genre, year of release, rating for parental guidance, number of common actors, number of common viewers, and so forth. It is likely that similar movies will be grouped together in the vector database if these vectors are made correctly.

With a vector database, we can add knowledge to our AIs, like semantic information retrieval, long-term memory, and more. The diagram below explains the role of vector databases in this type of application:

Image credits https://www.pinecone.io/

User Query:

  • You type a query or request into the ChatGPT application.

Embedding Creation:

  • The application transforms your input into a compact numerical form called a vector embedding.
  • This embedding provides the mathematical representation of your query.

Database Comparison:

  • The input vector embedding is then compared with other embeddings already stored in the vector database.
  • Similarity-measuring algorithms (more on them below) help identify the most related embeddings based on content.

Output Generation:

  • The database generates a response composed of embeddings closely matching your query’s meaning.

User Response:

  • The response, containing relevant information linked to the identified embeddings, is returned to you.

Follow-up Queries:

  • When you make more queries, the embedding model creates new embeddings.
  • These new embeddings are then used to locate similar embeddings in the database and connect them to the original content.
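Putting those steps together, here is a schematic sketch of the whole loop; embed_text, vector_db, and llm are hypothetical placeholders standing in for whatever embedding model, database client, and language model you use, not a real API:

def answer_query(user_query, vector_db, embed_text, llm):
    # Schematic query loop; every name here is a stand-in.
    # Steps 1-2: turn the user's query into a vector embedding
    query_embedding = embed_text(user_query)

    # Step 3: similarity search against the stored embeddings
    matches = vector_db.search(query_embedding, top_k=5)

    # Steps 4-5: generate a response grounded in the retrieved content
    context = "\n".join(m.text for m in matches)
    return llm(f"Context:\n{context}\n\nQuestion: {user_query}")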

So far, this has been only a brief overview of how vector databases work in general. Let’s now discuss the details of indexing, querying, and post-processing, and their relevant algorithms.

In traditional databases, we usually query for rows with values that exactly match our query. In vector databases, we use a similarity metric to determine which vector is most similar to our query.

Many algorithms are used in vector databases, and they all contribute to Approximate Nearest Neighbor (ANN) searches. These algorithms use graph-based search, quantization, or hashing to optimize the search. The pipeline created by combining these algorithms enables the quick and precise retrieval of a query vector’s neighbors. Since the vector database yields approximations, accuracy versus speed is the main trade-off we take into account. The more precise the result, the slower the query will be.

Image credits https://www.pinecone.io/

Indexing: The vector database indexes vectors using one of several algorithms, including PQ, LSH, and HNSW. This step converts the vectors to a data structure that will make searching faster.

Querying: The vector database compares the query vector to the indexed vectors in the dataset to find the nearest neighbors (using the similarity metric used by that index).

Post-processing: In some cases, the vector database retrieves the final nearest neighbors from the dataset and post-processes them before returning the final results. This step may include reranking the nearest neighbors using a different similarity metric.
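Here is a compact sketch of all three stages using FAISS, an open-source similarity-search library; the dimensions and data are arbitrary placeholders:

import numpy as np
import faiss

dim = 64
database = np.random.rand(10_000, dim).astype("float32")
query = np.random.rand(1, dim).astype("float32")

# Indexing: build an HNSW graph index (32 neighbors per node)
index = faiss.IndexHNSWFlat(dim, 32)
index.add(database)

# Querying: approximate nearest-neighbor search for the top 10
distances, ids = index.search(query, 10)

# Post-processing: rerank the candidates with exact cosine similarity
candidates = database[ids[0]]
cosine = (candidates @ query[0]) / (
    np.linalg.norm(candidates, axis=1) * np.linalg.norm(query[0])
)
reranked = ids[0][np.argsort(-cosine)]
print(reranked)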

Algorithms

Hierarchical Navigable Small World (HNSW)

HNSW builds a hierarchical, graph-based structure that resembles a tree, with each node representing a collection of vectors. The edges connecting the nodes represent the similarity between the vectors. The algorithm starts by creating a set of nodes, each holding a small number of vectors. This can be done randomly, or by clustering the vectors with an algorithm such as k-means and turning each cluster into a node.

After analyzing each node’s vectors, the algorithm creates an edge connecting it to the nodes whose vectors are most similar to its own.

When we query an HNSW index, it uses this graph to navigate the hierarchy, stopping at the nodes that are most likely to contain the vectors closest to the query vector.

Image credits https://www.pinecone.io/
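To make the navigation step concrete, here is a toy sketch of the greedy routing HNSW performs within a single layer; a real implementation maintains multiple layers and candidate lists, so this plain adjacency-dict version captures only the core idea:

import numpy as np

def greedy_layer_search(graph, vectors, entry, query):
    # Walk the graph greedily: hop to whichever neighbor is
    # closer to the query, stopping when no neighbor improves.
    current = entry
    best = np.linalg.norm(vectors[current] - query)
    improved = True
    while improved:
        improved = False
        for neighbor in graph[current]:
            dist = np.linalg.norm(vectors[neighbor] - query)
            if dist < best:
                best, current, improved = dist, neighbor, True
    return current  # used as the entry point for the next, finer layer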

Random Projection

Random projection works by projecting high-dimensional vectors into a lower-dimensional space using a random projection matrix. We generate a matrix of random numbers whose shape determines the target, lower dimensionality. We then calculate the dot product of the input vectors and the matrix, which yields projected vectors that have fewer dimensions than the originals but still preserve their similarity.

When we query, we use the same projection matrix to map the query vector to a lower-dimensional space. The projected query vector is then compared to the database’s projected vectors to determine the nearest neighbors.

Image credits https://www.pinecone.io/

Because the dimensionality of the data is reduced, the search process is much faster than searching the entire high-dimensional space.
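A minimal NumPy sketch of the idea follows; the dimensions are arbitrary, and real systems choose the target dimension and scaling more carefully:

import numpy as np

rng = np.random.default_rng(0)
high_dim, low_dim = 1024, 64

# Random projection matrix, generated once and reused for every vector
projection = rng.normal(size=(high_dim, low_dim)) / np.sqrt(low_dim)

vectors = rng.normal(size=(1000, high_dim))
query = rng.normal(size=(high_dim,))

# Project the stored vectors and the query with the same matrix
low_vectors = vectors @ projection
low_query = query @ projection

# Nearest neighbor in the cheaper, lower-dimensional space
nearest = np.argmin(np.linalg.norm(low_vectors - low_query, axis=1))
print(nearest)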

Other notable indexing algorithms include Product Quantization (PQ) and Locality-Sensitive Hashing (LSH). Let’s now move on to similarity-measuring algorithms.

Similarity Measures

Similarity measures are mathematical methods for determining the similarity between two vectors in a vector space. Similarity measures are used in vector databases to compare the vectors stored in the database and determine which ones are most similar to a specific query vector.

Cosine similarity: It computes the cosine of the angle between two vectors in a vector space. It ranges from -1 to 1, with 1 representing identical vectors, 0 representing orthogonal vectors, and -1 representing diametrically opposed vectors.

Euclidean Distance: It measures the straight-line distance between two vectors in a vector space. It ranges from 0 to infinity, with 0 representing identical vectors and higher values representing increasingly dissimilar vectors.

Dot Product: It is defined as the product of the magnitudes of two vectors and the cosine of their angle. It ranges from -∞ to ∞, with positive values representing vectors pointing in the same direction, zero representing orthogonal vectors, and negative values representing vectors pointing in opposite directions.
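All three measures take only a few lines of NumPy; the vectors below are toy values chosen so the results are easy to check by hand:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the length

dot = np.dot(a, b)                                      # 28.0
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0: same direction
euclidean = np.linalg.norm(a - b)                       # ~3.74

print(dot, cosine, euclidean)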

Advantages of Vector Databases

Speed and performance: Vector databases use a variety of indexing techniques to facilitate faster searching. Vector indexing, combined with similarity measuring algorithms such as nearest neighbor search, is especially useful for searching for relevant results across millions, if not billions, of data points while maintaining optimal performance.

Scalability: Vector databases can store and manage massive amounts of unstructured data by scaling horizontally, ensuring performance even as query demands and data volumes increase.

Cost Control: Vector databases offer a cost-effective alternative to retraining or fine-tuning foundation models, which lowers both the cost and the latency of running inference with them.

Flexibility: Vector databases are designed to handle complex data such as images, videos, and other multidimensional formats. Vector databases can be used for a variety of purposes, from semantic search to conversational AI applications, and can be tailored to your business and AI needs.

Long term memory of LLMs: Organizations can begin with general-purpose models, such as IBM Watsonx.ai’s Granite series models, Meta’s Llama-3, or Google’s Gemini, and then add their own data in a vector database to improve the output of the models and AI applications, an approach central to retrieval-augmented generation (RAG).

Use Cases of Vector Databases

Vector databases have numerous applications that are constantly expanding. Some key use cases are:

Semantic search: Perform searches based on the meaning or context of a query to produce more precise and relevant results. Because words and phrases can be represented as vectors, semantic vector search functionality understands user intent better than general keywords.

Similarity search and applications: Easily find similar images, text, audio, or video data for content retrieval, such as advanced image and speech recognition, natural language processing, and more.

Recommendation engines: Vector databases and vectors can be used to represent customer preferences and product attributes on e-commerce sites. This allows them to suggest items similar to previous purchases based on vector similarity, which improves user experience and retention.

Music and multimedia streaming platforms: Platforms like Spotify use vector databases to store music as vectors based on notes, rhythm, and melody. When you listen to classical music, the vector database finds tracks with similar rhythms and genres, which lets the platform offer playlist suggestions that match your taste.

In this article on vector databases, we explored how these specialized systems efficiently store and retrieve high-dimensional data using vector embeddings. We learned about the fundamental concepts of vectors and embeddings, and how vector databases leverage them to enable fast similarity searches. We also discussed the architecture of vector databases and their applications in recommendation systems and image retrieval. With features like CRUD operations, metadata filtering, and horizontal scaling, vector databases offer a powerful solution for managing complex data effectively.


AI Club

The AI Club was founded by the students of NEDUET with the primary motive of providing opportunities and a networking medium for students, in the domain of AI.