Vector Databases 📦

A vector database is a type of database optimized for storing, indexing, and querying high-dimensional vectors. These vectors often represent data in numerical form, such as embeddings generated by machine learning models from text, images, audio, or other unstructured data. Vector databases are designed to support similarity search and nearest neighbor search efficiently, which are critical for applications like recommendation systems, semantic search, web crawlers, anomaly detection, and natural language processing.

  1. Similarity Search concept
  • Visualizes the process of vector similarity search

  • Shows query vector, closest matches and ranked results

  • Highlights different similarity metrics used in vector databases

  2. Vector Embedding Representation
  • Demonstrates how different types of data (text, image, audio) are converted to vector embeddings

  • Illustrates semantic search and similarity matching

  • Highlights connections between different vector database concepts

How vector databases work, step by step

Image credits to v7labs

1) Data conversion (Embedding)

  • Take raw data (text, images, etc.)

  • Use AI models to convert this data into numerical vectors

  • Think of it like translating words or images into a language of numbers

  • Each vector represents the core meaning or features of the original data
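
To make the "data in, vector out" step concrete, here is a minimal sketch. It is not a real embedding model: `toy_embed` is a hypothetical stand-in that hashes each word into a slot of a fixed-length vector, so it captures word overlap but not meaning. A real system would call a trained model here instead.

```python
import hashlib
import math

def toy_embed(text, dim=16):
    """Hypothetical stand-in for an embedding model: hash words into slots."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # hashlib gives a stable hash across runs (the built-in hash() does not).
        digest = hashlib.md5(word.encode("utf-8")).digest()
        slot = int.from_bytes(digest[:4], "big") % dim
        vec[slot] += 1.0
    # Scale to unit length so vectors of long and short texts are comparable.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

vector = toy_embed("vector databases store embeddings")
print(len(vector))  # 16
```

The point is only the shape of the operation: any input text becomes a list of numbers of the same fixed length.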

2) Vector Creation

  • These vectors are typically high-dimensional (imagine a list of 100 to 1,000 numbers)

  • Each number captures some learned characteristic or feature of the data, although a single number is usually not meaningful to a human on its own

  • Similar items will have similar vector representations

  • Example: “cat” and “kitten” would have very close vector values
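
The "cat" and "kitten" idea can be sketched with a few hand-picked vectors. The numbers below are invented for illustration and do not come from a real model; real embeddings have far more dimensions.

```python
import math

# Hand-picked 4-dimensional vectors, purely illustrative (not model output).
vectors = {
    "cat":    [0.90, 0.80, 0.10, 0.05],
    "kitten": [0.85, 0.75, 0.20, 0.05],
    "car":    [0.10, 0.05, 0.90, 0.80],
}

def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

cat_kitten = euclidean(vectors["cat"], vectors["kitten"])
cat_car = euclidean(vectors["cat"], vectors["car"])

# Related items sit closer together than unrelated ones.
print(cat_kitten < cat_car)  # True
```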

3) Indexing

  • Store these numerical vectors in a special database

  • Create smart indexes to make searching faster

  • Use approximate nearest neighbor (ANN) algorithms, such as HNSW or IVF, to organize vectors efficiently

  • Think of it like creating a super-smart, hyper-organized library
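
As a rough illustration of what an index does, the sketch below uses random-hyperplane locality-sensitive hashing, one simple technique among several; production databases may use different methods. The function names and the toy vectors are assumptions made for this example. Each vector gets a short bit pattern, and vectors with the same pattern land in the same bucket, so a query only needs to inspect one bucket instead of every stored vector.

```python
import random
from collections import defaultdict

def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplanes (as normal vectors) with a fixed seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]

def bucket_key(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )

def build_index(items, hyperplanes):
    """Group item ids by bucket key."""
    index = defaultdict(list)
    for item_id, vector in items.items():
        index[bucket_key(vector, hyperplanes)].append(item_id)
    return index

# Illustrative vectors, not model output.
items = {
    "cat":    [0.90, 0.80, 0.10, 0.05],
    "kitten": [0.85, 0.75, 0.20, 0.05],
    "car":    [0.10, 0.05, 0.90, 0.80],
}
planes = make_hyperplanes(num_planes=4, dim=4)
index = build_index(items, planes)

# Look up only the bucket the query falls into.
query = items["cat"]
candidates = index.get(bucket_key(query, planes), [])
print(candidates)
```

The trade-off is speed for exactness: a true nearest neighbor can fall into a different bucket, which is why this family of methods is called approximate.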

4) Similarity search
  • When you search, convert your query into a vector

  • Compare this vector with stored vectors

  • Use mathematical techniques to find the closest matches

  • Common methods:

    • Cosine similarity

    • Euclidean distance

    • Dot product
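
The three metrics listed above can be written in a few lines of plain Python. This is a minimal sketch with made-up vectors; it shows how the metrics can disagree. Here `b` points in the same direction as `a` but is twice as long, so cosine similarity calls them identical while Euclidean distance does not.

```python
import math

def dot(a, b):
    """Dot product: larger when vectors align and have large magnitudes."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between vectors: ignores magnitude."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]  # same direction as a, twice the length

print(dot(a, b))                 # 18.0
print(euclidean_distance(a, b))  # 3.0
print(cosine_similarity(a, b))   # 1.0
```

Which metric fits best depends on how the embedding model was trained, so it is worth checking the model's documentation before choosing one.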

5) Retrieval
  • Rank results based on how close they are to the query vector

  • Return the most similar items

  • Can be used for:

    • Semantic search

    • Recommendation systems

    • Image/text similarity detection
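
Ranking and returning the most similar items can be sketched as a brute-force top-k search. The stored vectors and the query are invented for illustration, and `top_k` is a hypothetical helper name; a real database would combine this scoring step with an index rather than scanning every vector.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, stored, k=2):
    """Score every stored vector against the query and keep the k best."""
    scored = (
        (cosine_similarity(query, vector), item_id)
        for item_id, vector in stored.items()
    )
    # nlargest returns (score, item_id) pairs sorted from best to worst.
    return heapq.nlargest(k, scored)

# Illustrative vectors, not model output.
stored = {
    "cat":    [0.90, 0.80, 0.10, 0.05],
    "kitten": [0.85, 0.75, 0.20, 0.05],
    "car":    [0.10, 0.05, 0.90, 0.80],
}
query = [0.88, 0.79, 0.12, 0.05]

results = top_k(query, stored, k=2)
for score, item_id in results:
    print(item_id, round(score, 3))
```

With these vectors the two animal entries are returned and "car" is left out, since it points in a very different direction from the query.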

N-dimensional Vectors

  • A list of n numbers representing a point in n-dimensional space

  • Each number is a coordinate in that space

  • "n" can be any number (10, 100, 512, 1024 are common)

  • Use neural networks to convert data into vectors

  • Different models for different data types:

    • Text: Word2Vec, BERT, GPT embeddings

    • Images: Convolutional Neural Networks (CNNs)

    • Audio: Specialized audio embedding models

Widely used databases

1) Pinecone: a managed, closed-source service, commonly used through its Python client

2) Chroma: open source, commonly used from Python

3) Qdrant: open source, written in Rust

