A vector embedding converts text, images, or any other data into a list of numbers that preserves its semantic meaning. Two sentences with the same meaning end up close in that mathematical space; two unrelated sentences end up far apart. Semantic search, RAG, and modern recommendation systems are all built on this principle.

Key takeaways

  • An embedding is a real-number vector in a high-dimensional space (from 128 to 3072 dimensions).
  • Closeness between vectors measures semantic similarity, computed via cosine similarity or dot product.
  • Current models learn these representations by training on billions of pairs of similar texts.
  • Postgres with pgvector can store and query embeddings without a separate vector database.
  • Model choice depends on language, domain, and latency budget.

How embeddings are generated

A language model receives a sequence of tokens and produces, in its final layer, a real-number vector that captures the semantic context of the input text. If the model has learned that "dog" and "canine" appear in similar contexts, their vectors will be close together.

Transformer-based models produce contextual embeddings: the same token yields different vectors depending on the surrounding text. The foundational work by Mikolov et al. on Word2Vec (2013)[1] showed that word meaning could be encoded in dense vectors and that semantic relationships emerged arithmetically. As the authors put it: "Each relationship is characterized by a relation-specific vector offset. For example, if we denote the word vector for word ‘Paris’ as vec(‘Paris’), then the following relationship holds: vec(‘Paris’) – vec(‘France’) + vec(‘Italy’) is close to vec(‘Rome’)." (Mikolov et al., 2013)

The OpenAI embeddings documentation states it plainly: "An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness."

Cosine similarity: the standard measure

The standard way to compare two embeddings is cosine similarity: the cosine of the angle between the two vectors. It equals 1 when they point in the same direction (maximum similarity) and 0 when perpendicular (no relation).

Vector databases optimise this operation with HNSW or IVFFlat indexes. If you already run Postgres, you do not need extra infrastructure: the article on RAG with Postgres and pgvector covers that setup in detail.

Models available in 2026

The choice has narrowed to three families. The table below lists the most widely used models with their actual dimensions and the scenario where each fits best:

Model Dimensions Best for
text-embedding-3-small (OpenAI) 1 536 General semantic search, low cost
text-embedding-3-large (OpenAI) 3 072 Maximum precision in English, demanding RAG
all-MiniLM-L6-v2 (Sentence Transformers) 384 Fast prototypes, local execution
paraphrase-multilingual-MiniLM-L12-v2 384 Spanish and other multilingual texts

OpenAI’s proprietary models support dimension reduction (1 536 down to 512 for text-embedding-3-small, 3 072 down to 1 536 for text-embedding-3-large) without significant quality loss, courtesy of Matryoshka Representation Learning. For projects that prioritise privacy or inference cost, all-MiniLM-L6-v2 with 384 dimensions is the usual starting point. The OpenAI embeddings documentation[2] details the pricing and maximum context for each variant.

When to use embeddings

Embeddings underpin four categories of system:

  1. Semantic search. Users search by meaning rather than by exact words. Useful for documentation search engines, knowledge bases, or product catalogues.
  2. RAG. The language model retrieves relevant chunks from a knowledge base before generating a response. The Anthropic SDK agent guide covers integrating this pattern into production systems.
  3. Recommendation. Two products, articles, or songs with nearby embeddings are natural candidates to recommend to each other.
  4. Anomaly detection and observability. A point that sits far from its neighbours in embedding space may indicate fraud, spam, or a production failure. For teams monitoring AI agents, pairing embeddings with trace metrics is covered in the agent observability with OpenTelemetry article.

Frequently asked questions

How many dimensions does an embedding need to work well?

It depends on the use case. At 384 dimensions, models like all-MiniLM-L6-v2 handle general semantic search well for medium-sized projects. From 1 536 dimensions upward, OpenAI’s models capture finer nuances and suit demanding RAG or complex classification tasks. More dimensions mean higher storage cost and search latency, so it is worth benchmarking on your own data before committing.

Is an embedding the same thing as a large language model?

No. A large language model generates text; an embedding model transforms text into a numerical vector. Technically, embedding models tend to be smaller networks trained on similarity objectives rather than token prediction. Many LLMs use embeddings internally as an intermediate representation, but what they expose to users is text, not an exportable vector.

Can I generate embeddings locally without relying on external APIs?

Yes. Tools like Ollama let you run embedding models locally on Apple Silicon or a dedicated GPU. all-MiniLM-L6-v2 weighs under 100 MB and generates embeddings in milliseconds on a modern laptop. Latency is higher than an optimised API, but the cost is zero and data never leaves your infrastructure.

Conclusion

A vector embedding is a coordinate in a space where distance equals semantic difference. Understanding this mechanism is the starting point for working with RAG, semantic search, or any system that needs to relate data by meaning rather than by exact words. The logical next step is to choose a model, measure latency, and decide whether to store vectors in Postgres or in a dedicated solution.

Sources

  1. Word2Vec (2013)
  2. OpenAI embeddings documentation
  3. Hugging Face embeddings guide