Vector
What is a vector? In mathematics, a vector represents an element of a vector space. For example, a 2D vector can be written as (x, y), a 3D vector as (x, y, z), a 4D vector as (x, y, z, w), and an nD vector as (x1, x2, ..., xn).
However, we are probably more familiar with its visual representation.
In computer science, arrays can be used to represent vectors in a vector space, where the length of the array corresponds to the dimension of the vector space.
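As a minimal sketch in Python, a vector is just a fixed-length array of numbers, and the array's length is the dimension of the vector space:

```python
# A vector is just an array of numbers; its length is the dimension.
v2 = [1.0, 2.0]        # 2D vector (x, y)
v3 = [1.0, 2.0, 3.0]   # 3D vector (x, y, z)
vn = [0.5] * 128       # a 128-dimensional vector, as an embedding model might produce

print(len(v3))  # dimension of the vector space: 3
```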
Embedding
- OpenAI Embeddings provides an API to convert text into vectors.
- jinaai/jina-clip-v1: an English multimodal (text-image) embedding model that can be used via Hugging Face’s transformers library, or via the transformers.js library for browser-based inference.
Similarity Metrics of Vectors
| Metric | Formula | Description |
|---|---|---|
| Euclidean Distance | `Σ (a_i − b_i)^2` | The squared Euclidean distance between two vectors. |
| Cosine Distance | `1 − (a · b) / (‖a‖ ‖b‖)` | Measures the cosine of the angle between two vectors, indicating their directional similarity. |
| Dot Product | `Σ a_i b_i` | A measure of vector multiplication, indicating similarity in the direction of vectors. |
| Manhattan Distance | `Σ \|a_i − b_i\|` | Measures the absolute difference between corresponding elements of two vectors. |
| Hamming Distance | `Σ [a_i ≠ b_i]` | Counts the number of positions at which the corresponding elements of two vectors differ. |
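These metrics are all a few lines of Python each. A minimal sketch (note that the Euclidean function below takes the square root; some vector databases skip it and use the squared variant, since it preserves ordering):

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between the vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def manhattan(a, b):
    """Sum of absolute differences per dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions where the vectors differ."""
    return sum(x != y for x, y in zip(a, b))

a, b = [1.0, 0.0], [0.0, 1.0]
print(euclidean(a, b))        # sqrt(2)
print(cosine_distance(a, b))  # orthogonal vectors: distance 1.0
print(dot_product(a, b))      # orthogonal vectors: 0.0
```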
Indexing
We can use the KNN (k-nearest neighbors) algorithm to find the nearest vectors in a space for a given query vector. K is a hyperparameter we set, denoting how many nearest neighbors we want to retrieve. We run KNN over our data vectors and rank them by their distance to the query vector. However, KNN's runtime grows linearly with the size of the dataset. To optimize KNN, there are two main approaches:
- Reduce the cost of each similarity calculation (for example, by reducing vector dimensionality);
- Reduce the number of similarity calculations performed (for example, by indexing).
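The brute-force KNN baseline that the approaches above try to improve on can be sketched in a few lines of Python:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(query, points, k):
    """Brute-force k-nearest neighbors: computes the distance from the
    query to every point, so cost grows with the size of the dataset."""
    return sorted(points, key=lambda p: euclidean(query, p))[:k]

points = [(0, 0), (1, 1), (5, 5), (2, 2)]
print(knn((0, 0), points, 2))  # the 2 points closest to the origin
```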
Approximate Nearest Neighbor (ANN) algorithms improve lookup times by building index structures that reduce the number of calculations. ANN algorithms can be divided into three distinct categories: trees, hashes, and graphs. Hierarchical Navigable Small World (HNSW) graphs are among the top-performing indexes for vector similarity search, and most vector databases use HNSW to index their vectors.
Vector Database
Emil Fröberg has a great comparison of selected vector databases in his article Picking a vector database: a comparison and guide for 2023.
I just want to add one more vector database that I'm interested in: turbopuffer, a serverless vector database.
AI Applications
- Prepare: Start with your raw data objects, which might be text, images, or audio.
- Embed Contents: Feed that data into an embedding model, which converts each input object into a vector.
- Store Embeddings: Save those vectors, along with their metadata, in a vector database. The database builds an index over these vectors.
- Query: When the application needs to search for similar objects, create a query.
- Embed Query: The query is converted into a vector using the same embedding model.
- Search: The vector database compares the embedded query against the indexed vectors in the dataset to identify its ‘nearest neighbors’.