Embedding and Vector Database

Published: at 01:28 AM

Table of Contents

Open Table of Contents

Vector

What is a vector? In mathematics, a vector represents an element of a vector space $\mathbb{R}^n$. For example, a 2D vector can be written as (x, y), a 3D vector as (x, y, z), a 4D vector as (x, y, z, w), and in general an n-dimensional vector as (x1, x2, ..., xn).

$$\vec{v} = (v_1, v_2, \dots, v_n) \in \mathbb{R}^n$$

However, we may be more familiar with its visual representation:

vector-visualization

In computer science, arrays can be used to represent vectors in a vector space, where the length of the array corresponds to the dimension of the vector space.
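For instance, a minimal sketch in Python, where a plain list stands in for the array:

```python
import math

# A vector in R^2 represented as a fixed-length array (a Python list here);
# the length of the array is the dimension of the vector space.
v = [3.0, 4.0]
dimension = len(v)  # 2

# The Euclidean norm (magnitude) follows directly from the components.
magnitude = math.sqrt(sum(x * x for x in v))  # 5.0
```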

Embedding

Similarity Metrics of Vectors

vector-visualization

| Metric | Formula | Description |
| --- | --- | --- |
| Euclidean Distance | $d(\vec{a}, \vec{b}) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$ | The straight-line distance between two vectors. |
| Cosine Similarity | $\cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \|\vec{b}\|}$ | The cosine of the angle between two vectors, indicating their directional similarity. |
| Dot Product | $\vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i b_i$ | An unnormalized similarity measure; larger values indicate vectors pointing in similar directions. |
| Manhattan Distance | $d(\vec{a}, \vec{b}) = \sum_{i=1}^{n} \lvert a_i - b_i \rvert$ | The sum of absolute differences between corresponding elements of two vectors. |
| Hamming Distance | $d(\vec{a}, \vec{b}) = \sum_{i=1}^{n} (a_i \neq b_i)$ | The number of positions at which the corresponding elements of two vectors differ. |
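All of these metrics reduce to a few lines of code. A minimal sketch of each, using only the standard library:

```python
import math

def euclidean(a, b):
    # Straight-line distance between two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot(a, b):
    # Sum of element-wise products.
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between the two vectors (direction only).
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def manhattan(a, b):
    # Sum of absolute element-wise differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    # Number of positions where the elements differ.
    return sum(x != y for x, y in zip(a, b))

a, b = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
print(dot(a, b))        # 25.0
```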

Indexing

We can use the KNN (k-nearest neighbors) algorithm to find the nearest vectors in a space for a given query vector. The k here is a hyperparameter that denotes how many nearest neighbors we want to retrieve. We run KNN over the vectors of our data and retrieve the neighbors closest to the query vector according to a distance metric. However, the query time of exact KNN grows linearly with the size of the dataset, so approximate methods are used to speed it up:

Approximate Nearest Neighbors (ANN) algorithms improve lookup times by building index structures that reduce the number of distance calculations. ANN algorithms fall into three broad categories: trees, hashes, and graphs. Hierarchical Navigable Small World (HNSW) graphs are among the top-performing indexes for vector similarity search, and most vector databases use HNSW to index their vectors.
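For small datasets, the exact brute-force KNN described above remains a useful baseline. A minimal sketch (Euclidean distance assumed as the metric):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(query, vectors, k):
    """Return indices of the k vectors closest to `query` (brute force)."""
    # One distance computation per stored vector: O(n * d) per query,
    # which is exactly the linear scaling that ANN indexes avoid.
    distances = [(euclidean(query, v), i) for i, v in enumerate(vectors)]
    distances.sort()
    return [i for _, i in distances[:k]]

data = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]]
neighbors = knn([0.0, 0.1], data, k=2)  # indices of the 2 closest vectors
```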

Vector Database

Emil Fröberg has a great comparison review of select vector databases in his article Picking a vector database: a comparison and guide for 2023.

I just want to add one more vector database that I'm interested in: turbopuffer, a serverless vector database.

AI Applications

images

  1. Prepare: Start with your raw data objects, which might be text, images, or audio.
  2. Embed Contents: Feed that data into an embedding model, which converts each input object into a vector.
  3. Store Embeddings: Save those vectors and their metadata in a vector database, which builds an index over them.
  4. Query: When the application needs to search for similar objects, it creates a query.
  5. Embed Query: The query is converted into a vector using the same embedding model.
  6. Search: The vector database compares the embedded query against the indexed vectors to identify its "nearest neighbors".
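The whole pipeline can be sketched end to end. Note the embedding model below is a deliberately toy one (character-frequency vectors), and the in-memory list stands in for a real vector database; both are illustrative assumptions, not a production design:

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def embed(text):
    # Toy "embedding model": a 26-dimensional character-frequency vector.
    # A real application would call an actual embedding model here.
    counts = Counter(c for c in text.lower() if c in ALPHABET)
    return [counts[c] for c in ALPHABET]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-3: prepare, embed, and store (a list stands in for the database).
documents = ["the cat sat", "a dog barked", "the kitten sat"]
index = [(doc, embed(doc)) for doc in documents]

# Steps 4-6: embed the query with the SAME model, then rank by similarity.
query_vec = embed("cat sat")
best_doc, _ = max(index, key=lambda item: cosine_similarity(query_vec, item[1]))
```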