Embedding and Vector Database

Published: at 01:28 AM

Table of Contents

Open Table of Contents

Vector

What is a vector? In mathematics, a vector represents an element of a vector space $\mathbb{R}^n$. For example, a 2D vector can be written as (x, y), a 3D vector as (x, y, z), a 4D vector as (x, y, z, w), and in general an n-dimensional vector as (x1, x2, ..., xn).

$$\vec{v} = (v_1, v_2, \dots, v_n) \in \mathbb{R}^n$$

However, we may be more familiar with its visual representation:

vector-visualization

In computer science, arrays can be used to represent vectors in a vector space, where the length of the array corresponds to the dimension of the vector space.
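For instance, a minimal sketch in Python, where a plain list stands in for the array:

```python
import math

# A vector in R^2 represented as a fixed-length array (a Python list here);
# the length of the array is the dimension of the vector space.
v = [3.0, 4.0]
dimension = len(v)  # 2

# The Euclidean norm (magnitude) follows directly from the components.
magnitude = math.sqrt(sum(x * x for x in v))  # 5.0
```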

Embedding

Similarity Metrics of Vectors

vector-visualization

| Metric | Formula | Description |
| --- | --- | --- |
| Euclidean Distance | $d(\vec{a}, \vec{b}) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$ | The straight-line distance between two vectors. |
| Cosine Similarity | $\cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \|\vec{b}\|}$ | The cosine of the angle between two vectors, indicating their directional similarity. |
| Dot Product | $\vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i b_i$ | An unnormalized similarity measure; larger values indicate vectors pointing in similar directions. |
| Manhattan Distance | $d(\vec{a}, \vec{b}) = \sum_{i=1}^{n} \lvert a_i - b_i \rvert$ | The sum of absolute differences between corresponding elements of two vectors. |
| Hamming Distance | $d(\vec{a}, \vec{b}) = \sum_{i=1}^{n} (a_i \neq b_i)$ | The number of positions at which the corresponding elements of two vectors differ. |
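All of these metrics reduce to a few lines of code. A minimal sketch of each, using only the standard library:

```python
import math

def euclidean(a, b):
    # Straight-line distance between two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot(a, b):
    # Sum of element-wise products.
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between the two vectors (direction only).
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def manhattan(a, b):
    # Sum of absolute element-wise differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    # Number of positions where the elements differ.
    return sum(x != y for x, y in zip(a, b))

a, b = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
print(dot(a, b))        # 25.0
```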

Indexing

We can use the KNN (k-nearest neighbors) algorithm to find the nearest vectors in a space for a given query vector. The k here is a hyperparameter that denotes how many nearest neighbors we want to retrieve. We run KNN over the vectors of our data and retrieve the neighbors closest to the query vector according to a distance metric. However, the query time of exact KNN grows linearly with the size of the dataset, so approximate methods are used to speed it up:

Approximate Nearest Neighbors (ANN) algorithms improve lookup times by building index structures that reduce the number of distance calculations. ANN algorithms fall into three broad categories: trees, hashes, and graphs. Hierarchical Navigable Small World (HNSW) graphs are among the top-performing indexes for vector similarity search, and most vector databases use HNSW to index their vectors.
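For small datasets, the exact brute-force KNN described above remains a useful baseline. A minimal sketch (Euclidean distance assumed as the metric):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(query, vectors, k):
    """Return indices of the k vectors closest to `query` (brute force)."""
    # One distance computation per stored vector: O(n * d) per query,
    # which is exactly the linear scaling that ANN indexes avoid.
    distances = [(euclidean(query, v), i) for i, v in enumerate(vectors)]
    distances.sort()
    return [i for _, i in distances[:k]]

data = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]]
neighbors = knn([0.0, 0.1], data, k=2)  # indices of the 2 closest vectors
```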

Vector Database

Emil Fröberg has a great comparison review of select vector databases in his article Picking a vector database: a comparison and guide for 2023.

I just want to add one more vector database that I'm interested in: turbopuffer, a serverless vector database.

AI Applications

images

  1. Prepare: Start with your raw data objects, which might be text, images, or audio.
  2. Embed Contents: Feed that data into an embedding model, which converts each input object into a vector.
  3. Store Embeddings: Save those vectors and their metadata in a vector database, which builds an index over them.
  4. Query: When the application needs to search for similar objects, it creates a query.
  5. Embed Query: The query is converted into a vector using the same embedding model.
  6. Search: The vector database compares the embedded query against the indexed vectors to identify its "nearest neighbors".
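The whole pipeline can be sketched end to end. Note the embedding model below is a deliberately toy one (character-frequency vectors), and the in-memory list stands in for a real vector database; both are illustrative assumptions, not a production design:

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def embed(text):
    # Toy "embedding model": a 26-dimensional character-frequency vector.
    # A real application would call an actual embedding model here.
    counts = Counter(c for c in text.lower() if c in ALPHABET)
    return [counts[c] for c in ALPHABET]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-3: prepare, embed, and store (a list stands in for the database).
documents = ["the cat sat", "a dog barked", "the kitten sat"]
index = [(doc, embed(doc)) for doc in documents]

# Steps 4-6: embed the query with the SAME model, then rank by similarity.
query_vec = embed("cat sat")
best_doc, _ = max(index, key=lambda item: cosine_similarity(query_vec, item[1]))
```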