Vector database
Vector databases are specialized storage and retrieval systems optimized for handling high-dimensional numerical representations (vectors) of unstructured data[1]. In the context of large language models (LLMs), they enable efficient semantic search and are a key component of modern artificial intelligence systems, particularly in the retrieval-augmented generation (RAG) architecture.
Unlike traditional relational databases, which are oriented toward exact matches, vector databases specialize in approximate nearest neighbor (ANN) search, finding semantically similar objects in a high-dimensional space[2].
Fundamentals of Vector Databases
Vector Representations (Embeddings)
Vector representations (embeddings) are numerical representations of text, images, audio, and other data types in the form of vectors. The key principle is that semantically similar objects (e.g., words with similar meanings) are located close to each other in this vector space[3].
Modern text embeddings are created using models based on the Transformer architecture, which apply self-attention mechanisms to capture context. The dimensionality of such representations typically ranges from 256 to 1024, and some models produce larger vectors[4].
Similarity Metrics
To measure the "distance" or similarity between vectors, various metrics are used:
- Cosine similarity: Measures the cosine of the angle between two vectors. It is particularly effective for text embeddings as it considers the direction of the vectors, not their magnitude[5].
- Euclidean distance (L2): The standard straight-line distance between two points in space.
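- Dot product: The sum of element-wise products of two vectors. Unlike cosine similarity it is not normalized, so it depends on vector magnitude as well as direction; for unit-length vectors the two are equal[6].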
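The three metrics above can be compared on small hand-made vectors. The following is a minimal sketch in plain Python; the function names and the sample vectors are illustrative assumptions, not part of any particular database API.

```python
import math

def dot(a, b):
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (ignores magnitude)."""
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    """Straight-line (L2) distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the length
c = [3.0, -1.0, 0.5]  # a different direction

print(cosine_similarity(a, b))   # close to 1: the direction is identical
print(dot(a, b))                 # grows with vector length
print(euclidean_distance(a, b))  # non-zero: the magnitudes differ
# On unit-length vectors the dot product matches cosine similarity
# (up to floating-point rounding).
print(dot(normalize(a), normalize(c)), cosine_similarity(a, c))
```

This is why many systems normalize embeddings once at write time and then use the cheaper dot product at query time.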
Indexing Algorithms
Specialized ANN algorithms are used for fast search in high-dimensional spaces.
HNSW (Hierarchical Navigable Small World)
The HNSW algorithm uses the "small world" concept and a multi-layered hierarchical graph structure. The upper layers are sparse and contain long-range links for quickly traversing the space (coarse search), while the lower layers are denser and contain short-range links for precise neighbor finding. A search starts at the top layer and descends, so its cost scales roughly logarithmically with the number of vectors N, often written O(log N). HNSW is the default or preferred index in many modern vector databases[7].
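The layered descent can be illustrated with a deliberately simplified sketch. It assumes a brute-force graph construction and a plain greedy walk (a beam of width 1), whereas the published algorithm inserts points incrementally and keeps a candidate list during search. The names `build_layers` and `greedy_search` are hypothetical; only the level-sampling rule, floor(-ln(U) * mL) with mL = 1/ln(M), follows the HNSW paper.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_layers(points, m=4, seed=0):
    """Assign each point a random top level, then link each node to its
    m nearest nodes inside every layer it belongs to (brute force)."""
    rng = random.Random(seed)
    level_mult = 1.0 / math.log(m)
    # Exponentially decaying level distribution: most points stay on layer 0.
    levels = [int(-math.log(1.0 - rng.random()) * level_mult) for _ in points]
    layers = []
    for lvl in range(max(levels) + 1):
        members = [i for i, top in enumerate(levels) if top >= lvl]
        graph = {i: set() for i in members}
        for i in members:
            nearest = sorted((j for j in members if j != i),
                             key=lambda j: dist(points[i], points[j]))[:m]
            for j in nearest:
                graph[i].add(j)   # undirected edges keep the graph navigable
                graph[j].add(i)
        layers.append(graph)
    entry = levels.index(max(levels))  # a node present in every layer
    return layers, entry

def greedy_search(points, layers, entry, query):
    """Walk from the top layer down, moving to any closer neighbor."""
    current = entry
    for graph in reversed(layers):
        improved = True
        while improved:
            improved = False
            for nb in graph[current]:
                if dist(points[nb], query) < dist(points[current], query):
                    current, improved = nb, True
    return current

rng = random.Random(42)
points = [(rng.random(), rng.random()) for _ in range(200)]
layers, entry = build_layers(points)
query = (0.5, 0.5)
found = greedy_search(points, layers, entry, query)
exact = min(range(len(points)), key=lambda i: dist(points[i], query))
print(found, exact)  # often equal, but greedy search is only approximate
```

Because the walk can stop in a local minimum, the result is not guaranteed to be the true nearest neighbor; production implementations widen the search with a candidate list to raise recall.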
IVF (Inverted File)
The IVF algorithm partitions the space into clusters using k-means clustering and stores, for each cluster, a list of the vectors assigned to it. A query is compared only against the vectors in a limited number of the nearest clusters, which significantly speeds up the search at some cost in recall. A common heuristic sets the number of clusters to about √N, where N is the total number of vectors in the dataset[8].
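A minimal sketch of the idea, assuming a small in-memory dataset and a basic k-means loop. The function names and the `nprobe` parameter (how many clusters to scan) are illustrative choices for this example rather than a specific library's interface.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_centroid(centroids, p):
    return min(range(len(centroids)), key=lambda c: dist(centroids[c], p))

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's algorithm with randomly sampled initial centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[nearest_centroid(centroids, p)].append(p)
        for c, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster emptied out
                centroids[c] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

def build_ivf(points, k):
    """Return centroids plus one inverted list of point indices per cluster."""
    centroids = kmeans(points, k)
    lists = [[] for _ in range(k)]
    for i, p in enumerate(points):
        lists[nearest_centroid(centroids, p)].append(i)
    return centroids, lists

def ivf_search(points, centroids, lists, query, nprobe=2):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: dist(centroids[c], query))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    if not candidates:
        return None
    return min(candidates, key=lambda i: dist(points[i], query))

rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(400)]
k = round(math.sqrt(len(points)))  # the sqrt(N) heuristic: 20 clusters
centroids, lists = build_ivf(points, k)
query = (0.3, 0.7)

approx = ivf_search(points, centroids, lists, query, nprobe=3)
full = ivf_search(points, centroids, lists, query, nprobe=k)  # scans everything
exact = min(range(len(points)), key=lambda i: dist(points[i], query))
print(approx, full, exact)
```

Raising `nprobe` trades speed for recall: scanning every cluster degenerates into an exact brute-force search.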
LSH (Locality-Sensitive Hashing)
The LSH algorithm uses a family of hash functions that are likely to produce the same hash for nearby vectors. This allows for the rapid grouping of similar objects[9].
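One well-known hash family for angular similarity is random hyperplane hashing, sketched below. Note that this differs from the p-stable scheme cited above, which targets Euclidean distance; it is used here only because it is short. All names are illustrative assumptions.

```python
import random
from collections import defaultdict

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (normal vectors) through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_hash(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )

def build_buckets(vectors, hyperplanes):
    """Group vector indices by hash code."""
    buckets = defaultdict(list)
    for i, v in enumerate(vectors):
        buckets[lsh_hash(v, hyperplanes)].append(i)
    return buckets

planes = make_hyperplanes(dim=3, n_bits=8)
v0 = [1.0, 0.9, 0.1]
v1 = [2.0 * x for x in v0]   # same direction as v0, different length
v2 = [-1.0, 0.2, -0.9]       # a very different direction
buckets = build_buckets([v0, v1, v2], planes)

# Vectors pointing the same way always share a code; vectors separated by a
# small angle share it with high probability, distant ones rarely do.
print(lsh_hash(v0, planes) == lsh_hash(v1, planes))
print(dict(buckets))
```

At query time only the bucket matching the query's hash (and, in practice, buckets from several independent hash tables) needs to be scanned.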
Popular Vector Databases
- Pinecone: A fully managed, cloud-native vector database with a serverless architecture.
- Qdrant: A high-performance database written in Rust, with support for advanced payload filtering and a write-ahead log for durable updates.
- Milvus: A scalable, open-source database with a cloud-native architecture. It supports multiple index types, including GPU-accelerated variants.
- Weaviate: An open-source vector database with a GraphQL API and support for knowledge graphs.
- Chroma: A lightweight, open-source database optimized for rapid prototyping and experimentation.
- FAISS: A library from Meta rather than a full-fledged database. It provides high-performance indexing algorithms and is best suited to largely static data.
Application with LLMs: The RAG Architecture
Retrieval-Augmented Generation (RAG) is an architecture where an LLM is supplemented with an external knowledge base through vector search. RAG systems consist of two main components[10]:
- Retriever: The search component that uses a vector database to find relevant information based on the user's query.
- Generator: The LLM that uses the original query and the information found by the retriever to generate a response.
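The interaction between the two components can be sketched as follows. This is a toy illustration under loud assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the document list stands in for a vector database, and `generate` only builds the prompt that would be sent to an LLM instead of calling one.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts (a real system would call an embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    shared = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return shared / (na * nb) if na and nb else 0.0

def retrieve(query, documents, k=2):
    """Retriever: rank documents by similarity to the query, keep the top k."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def generate(query, context):
    """Generator placeholder: assemble the prompt an LLM would receive."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

documents = [
    "HNSW builds a layered graph for approximate nearest neighbor search",
    "IVF partitions vectors into clusters with k-means",
    "BM25 is a lexical ranking function based on term frequency",
]
query = "how does IVF use k-means clusters"
context = retrieve(query, documents, k=1)
prompt = generate(query, context)
print(prompt)
```

The key design point is that the generator never sees the whole knowledge base, only the few passages the retriever selected for this query.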
To improve retrieval quality, RAG systems often use hybrid search, which combines semantic (vector) search with lexical search (keyword-based, e.g., BM25). The combination tends to return more accurate and relevant results than either method alone.
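The text does not say how the two result lists are merged. One common option is reciprocal rank fusion (RRF), shown here as a sketch; the document IDs are made up, and the constant k = 60 is the value conventionally used with this method, not a requirement.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores 1 / (k + rank) per list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

vector_hits = ["doc_b", "doc_a", "doc_d"]    # hypothetical semantic results
keyword_hits = ["doc_a", "doc_c", "doc_b"]   # hypothetical BM25 results
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # documents found by both searches rise to the top
```

RRF uses only rank positions, so it avoids having to put cosine scores and BM25 scores on a common scale.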
Trends and Future Development
The vector database market is growing rapidly: one market report projects an increase from $1.98 billion in 2023 to $7.13 billion by 2029 (CAGR 23.7%)[11]. Key development trends include:
- Multimodal systems: Support for simultaneous search across text, images, audio, and video within a unified vector space.
- Automatic optimization: Using ML for the automatic selection of optimal indexes and parameters.
- Edge computing: Development of compact solutions for mobile and IoT devices.
- Quantum computing: A speculative, long-term possibility of faster similarity search; practical speedups have not yet been demonstrated.
- Neuromorphic chips: Mimicking brain function for ultra-low power consumption during search operations.
Links
Further Reading
- Malkov, Y.A.; Yashunin, D.A. (2016). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. arXiv:1603.09320.
- Johnson, J.; Douze, M.; Jégou, H. (2017). Billion-Scale Similarity Search with GPUs. arXiv:1702.08734.
- Datar, M. et al. (2004). Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. SoCG 2004 paper.
- Guo, N. et al. (2020). ScaNN: Efficient Vector Similarity Search at Scale. In: Proc. ACM SIGKDD 2020, pp. 1571-1580. DOI:10.1145/3394486.3403339.
- Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401.
- Wang, X. et al. (2021). Milvus: A Purpose-Built Vector Data Management System. In: SIGMOD 2021. DOI:10.1145/3448016.3457550.
- Lee, J. et al. (2022). OOD-DiskANN: Efficient and Scalable Graph ANNS for Out-of-Distribution Queries. arXiv:2211.12850.
- Fan, D. et al. (2023). Survey of Vector Database Management Systems. arXiv:2310.14021.
- Ren, R. et al. (2025). Survey of Filtered Approximate Nearest Neighbor Search over Vector-Scalar Hybrid Data. arXiv:2505.06501.
- Zhao, H. et al. (2024). Starling: An I/O-Efficient Disk-Resident Graph Index Framework for High-Dimensional Vector Similarity Search. arXiv:2401.02116.
- Liu, Y. et al. (2025). Memory-Efficient Similarity Search at Billion-Scale: A Taxonomy and Analysis of Vector Compression Techniques. ResearchGate preprint.
Notes
- [1] "What Is a Vector Database?". CloudRaft.
- [2] "What is a Vector Database?". Qdrant Blog.
- [3] "What Are Vector Embeddings?". LakeFS.
- [4] "What are embeddings?". Zilliz.
- [5] Sahoo, A., Maiti, J. "A Comparative Study of Similarity Metrics for Textual Embeddings". arXiv:2501.01234.
- [6] "Vector search and dense vector fields". Elastic.
- [7] Malkov, Y. A., Yashunin, D. A. "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs". arXiv:1603.09320.
- [8] "The index IVF". FAISS Wiki.
- [9] Datar, M., et al. "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions". Symposium on Computational Geometry.
- [10] Lewis, P., et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". arXiv:2005.11401.
- [11] "Vector Database Global Market Report 2024". The Business Research Company.