03 June 2023
Before we jump into what vector databases are, we need to understand what a vector is. In mathematics and computer science, a vector quantity is defined as an entity that has both a magnitude and a direction, for example velocity, acceleration or force. More loosely, measures such as employee productivity or revenue growth can be treated the same way when both their size and their direction of change matter.
The counterparts of vector quantities are scalar quantities, which have a magnitude but no direction. They include things like mass, time, distance and speed, as well as single-valued fields such as salary or address. The databases we are accustomed to usually store and process these scalar values.
A vector database is a database that stores data in the form of high-dimensional vectors. In this context a vector is a mathematical representation of features or attributes, and the number of dimensions it contains varies with the complexity and level of detail of the data.
Vectors are usually represented as lists of numbers and can be manipulated using linear algebra. Their dimensionality can range from a handful of dimensions to thousands or even millions.
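To make the "list of numbers" idea concrete, here is a minimal sketch in plain Python. The helper names (`add`, `dot`, `magnitude`, `direction`) are our own for illustration and do not come from any particular library; real systems would normally use an optimized numerical library instead.

```python
import math

def add(u, v):
    """Element-wise sum of two vectors with the same number of dimensions."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    return [a + b for a, b in zip(u, v)]

def dot(u, v):
    """Dot product: multiply matching components and sum the results."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(a * b for a, b in zip(u, v))

def magnitude(v):
    """Length of the vector (its Euclidean norm)."""
    return math.sqrt(dot(v, v))

def direction(v):
    """Unit vector pointing the same way as v."""
    m = magnitude(v)
    if m == 0:
        raise ValueError("the zero vector has no direction")
    return [a / m for a in v]

velocity = [3.0, 4.0]              # a made-up 2-dimensional vector
print(magnitude(velocity))         # 5.0
print(direction(velocity))         # [0.6, 0.8]
print(add([1, 2, 3], [4, 5, 6]))   # [5, 7, 9]
print(dot([1, 2, 3], [4, 5, 6]))   # 32
```

The same functions work unchanged for vectors with hundreds or thousands of dimensions, only more slowly.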
A vector database allows us to store, query and manipulate such data. One thing to note is that queries are approximate rather than exact: instead of returning only the records that match exactly, they return the records most similar to the query.
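The difference between an exact lookup and a similarity lookup can be sketched as follows. The item names and their two-dimensional vectors are made up for illustration, and the brute-force scan below is only a stand-in for the indexing a real vector database would use.

```python
import math

# Hypothetical stored items, each with a hand-picked 2-dimensional vector.
items = {
    "cat": [0.9, 0.1],
    "kitten": [0.85, 0.2],
    "truck": [0.1, 0.95],
}

def exact_lookup(key):
    """Classic lookup: either the key exists or nothing is returned."""
    return items.get(key)

def nearest(query_vector, k=2):
    """Similarity lookup: return the k item names closest to the query vector."""
    ranked = sorted(items, key=lambda name: math.dist(items[name], query_vector))
    return ranked[:k]

print(exact_lookup("feline"))   # None
print(nearest([0.88, 0.15]))    # ['cat', 'kitten']
```

The exact lookup fails because nothing is stored under that key, while the similarity lookup still returns the closest stored items.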
Embeddings are numerical representations, in the form of vectors or arrays, that capture the meaning and contextual information of the tokens processed and generated by a model. Embeddings help the model and the database gain a semantic and syntactic understanding of the tokens.
Vectors are typically generated by applying a transformation or embedding function to raw data, including text, images, audio, video and other types of information. The embedding function can be a machine learning model, a word-embedding model, a feature extraction algorithm or even a large language model (LLM). Once the embeddings have been computed, the resulting vectors are stored in the database. When a query arrives, the same embedding function is applied to it, and results are returned according to vector distance or similarity in the vector space.
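The embed, store and query flow described above can be sketched in a few lines. Everything here is an assumption made for illustration: `embed` is a toy word-count function over a fixed vocabulary rather than a real embedding model, and `ToyVectorStore` is a hypothetical in-memory class, not the API of any actual product.

```python
import math
from collections import Counter

# Hypothetical fixed vocabulary; each word is one dimension of the vector.
VOCABULARY = ["vector", "database", "similarity", "search", "image", "audio"]

def embed(text):
    """Toy embedding: count how often each vocabulary word appears."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCABULARY]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

class ToyVectorStore:
    """In-memory store that keeps each text next to its vector."""

    def __init__(self):
        self.records = []

    def add(self, text):
        self.records.append((text, embed(text)))

    def query(self, text, k=1):
        # The query goes through the same embedding function as the data.
        query_vector = embed(text)
        ranked = sorted(
            self.records,
            key=lambda record: cosine_similarity(query_vector, record[1]),
            reverse=True,
        )
        return [stored_text for stored_text, _ in ranked[:k]]

store = ToyVectorStore()
store.add("vector database similarity search")
store.add("image and audio files")
store.add("database backup")

print(store.query("similarity search in a vector database"))
# ['vector database similarity search']
```

The important design point is that data and queries share one embedding function; vectors produced by different functions live in different spaces and cannot be compared meaningfully.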
The similarity measure can be based on various metrics, such as cosine similarity, Euclidean distance, Manhattan distance, Hamming distance, Jaccard similarity, Minkowski distance etc. You can read about the main vector similarity search types here.
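For reference, the metrics named above can each be written in a line or two. This is a plain-Python sketch using the standard textbook definitions; the function names are our own, and production systems would typically rely on optimized library implementations.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def minkowski_distance(u, v, p):
    """General form: p=1 gives Manhattan distance, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

def euclidean_distance(u, v):
    """Straight-line distance between two points."""
    return minkowski_distance(u, v, 2)

def manhattan_distance(u, v):
    """Sum of absolute differences along each dimension."""
    return minkowski_distance(u, v, 1)

def hamming_distance(u, v):
    """Number of positions at which the two sequences differ."""
    return sum(a != b for a, b in zip(u, v))

def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(cosine_similarity([1, 0, 0], [0, 1, 0]))          # 0.0
print(euclidean_distance([3, 4], [0, 0]))               # 5.0
print(manhattan_distance([3, 4], [0, 0]))               # 7.0
print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))     # 2
print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```

Note that cosine and Jaccard are similarities (higher means closer), while the others are distances (lower means closer), so a ranking has to sort in the matching direction.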
Standalone similarity search indexes such as Facebook AI Similarity Search (Faiss) also exist, but they generally lack the scalability, security and data management features that vector databases provide.
Vector databases make it easy to store, index and retrieve multi-dimensional vector data. This data can include features and attributes representing the characteristics of text, images, geospatial data, audio or video, or any other data that can be represented numerically, such as genomics data or scientific simulations.
Vector databases can also store machine learning models, represented as vectors of weights and biases.
The stored vector data can be queried to run similarity searches and nearest neighbor searches on large vector datasets. Applications include, but are not limited to:
Another feature of vector databases is that they provide long-term memory for large language models: information is stored as vectors, and the stored information most relevant to a user's prompt is retrieved and supplied along with it to refine the results returned. By using the vector database as a source of long-term memory, the language model can improve its performance and provide more relevant, personalized and context-aware responses.
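The long-term memory idea can be sketched as "retrieve the most relevant stored facts, then put them in front of the prompt". The code below is a simplified illustration under several assumptions: `embed` is a toy word-count stand-in for a real embedding model, `remember` and `build_prompt` are hypothetical helpers, the stored facts are invented, and no actual language model is called.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: sparse word counts."""
    cleaned = text.lower().replace("?", "").replace(".", "")
    return Counter(cleaned.split())

def cosine_similarity(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[word] * v[word] for word in u if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    norm = norm_u * norm_v
    return dot / norm if norm else 0.0

memory = []  # list of (text, vector) pairs acting as the long-term memory

def remember(text):
    """Store a piece of information together with its vector."""
    memory.append((text, embed(text)))

def build_prompt(user_prompt, k=1):
    """Prepend the k most similar stored facts to the user's prompt."""
    query_vector = embed(user_prompt)
    ranked = sorted(
        memory,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    context = [text for text, _ in ranked[:k]]
    return "Context:\n" + "\n".join(context) + "\n\nQuestion: " + user_prompt

remember("The user prefers answers in Python.")
remember("The user's project uses PostgreSQL.")
remember("The user lives in Lisbon.")

print(build_prompt("Which database does my project use?"))
```

With these invented facts, the PostgreSQL sentence is the only one sharing a word with the question, so it is the one placed in the context. The combined text is what would then be sent to the language model.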
A few of the vector databases on the market include:
In conclusion, vector databases offer a powerful solution for storing and manipulating high-dimensional vector data. By leveraging mathematical representations of features and attributes, vector databases enable efficient storage, indexing, and retrieval of multi-dimensional data.