Text Embeddings: Turning Words Into Useful Vectors
Table of contents
- Key takeaways
- What an Embedding Really Is
- Relevant Models
- OpenAI Embeddings (text-embedding-ada-002)
- Sentence Transformers (all-MiniLM, all-mpnet, etc.)
- BGE Models (BAAI General Embedding)
- How to Choose Between Them
- Use Cases Where They Add Value
- Cases Where They Aren’t the Best Option
- The Chunking Detail
- Conclusion
Updated: 2026-05-03
Embeddings are the numerical representation of text that lets AI models “understand” semantic similarity. They sit behind semantic search, RAG systems, modern text classification, and almost every current NLP product. This article unpacks exactly what they are, how to choose a model, and the cases where they genuinely add value over simpler alternatives.
Key takeaways
- An embedding is a vector of N dimensions (typically 384, 768, or 1536) where semantically similar texts produce nearby vectors measured by cosine distance.
- Three families dominate practical choice: OpenAI ada-002 (managed, simple), Sentence Transformers (open source, privacy), BGE (open source, maximum quality in English).
- Chunking matters more than the model chosen: bad chunking is the number-one cause of underperforming RAG.
- Embeddings aren’t the right tool for exact search or metadata filtering — for that there’s keyword search and SQL.
- Corpus quality and chunking outweigh the difference between top-3 models in practical impact.
What an Embedding Really Is
An embedding is a vector of N dimensions (typically 384, 768, or 1536) representing the meaning of a text. The key property: texts with similar meaning produce nearby vectors measured by cosine distance.
Conceptual example:
"dog" → [0.21, -0.43, 0.88, ...]
"hound" → [0.19, -0.41, 0.86, ...] # close
"car" → [-0.55, 0.71, 0.04, ...] # farModels that generate embeddings are neural networks trained on large corpora so this property holds. Today there are dozens, with different trade-offs in quality, dimension, speed, and cost.
Relevant Models
Three families dominate practical choice:
OpenAI Embeddings (text-embedding-ada-002)
- Dimension: 1536.
- Managed API: simple, no infrastructure, pay per token.
- Cost: around $0.10 per million tokens — very low.
- Quality: good for general English cases and reasonable multilingual.
- Lock-in: can’t run locally; data passes through OpenAI.
It’s the “reasonable default” if you don’t want to think about infra or models. For many small-medium projects, it’s the sensible choice.
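A minimal sketch of what a call looks like with the official Python client, assuming the v1 openai package and an OPENAI_API_KEY environment variable:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(
    model="text-embedding-ada-002",
    input="How do I reset my password?",
)
vector = resp.data[0].embedding
print(len(vector))  # 1536 dimensions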
Sentence Transformers (all-MiniLM, all-mpnet, etc.)
- Open source, downloadable from Hugging Face[1].
- Dimension: 384 (MiniLM) or 768 (mpnet) typically.
- Speed: MiniLM is very fast; mpnet slower but better quality.
- Cost: inference is local, you pay only compute (CPU works, GPU accelerates).
- Privacy: nothing leaves your infra.
Excellent when privacy matters or you want to avoid third-party dependence.
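A minimal local sketch with the sentence-transformers library, assuming the all-MiniLM-L6-v2 checkpoint from Hugging Face:

from sentence_transformers import SentenceTransformer

# Downloaded from Hugging Face on first use, then runs entirely locally
model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = model.encode(["dog", "hound", "car"])
print(embeddings.shape)  # (3, 384) → one 384-dimensional vector per text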
BGE Models (BAAI General Embedding)
- bge-large-en-v1.5 and bge-base-en-v1.5, published by the Beijing Academy of Artificial Intelligence (BAAI).
- Current leader in retrieval benchmarks (MTEB) in English.
- Open source, locally runnable.
- Dimension: 1024 (large) or 768 (base).
If you care about maximum retrieval quality and can run 1-2 GB models locally, BGE is probably the best free option.
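BGE models load through the same sentence-transformers API. A minimal retrieval sketch, assuming the bge-base-en-v1.5 checkpoint; the query-instruction prefix follows the recommendation in the model card:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# The BGE v1.5 model card suggests prefixing short retrieval queries with
# an instruction; passages are embedded as-is.
query = "Represent this sentence for searching relevant passages: how to renew a passport"
passages = ["Passport renewal requires submitting the form in person.",
            "Dogs are loyal animals."]

q_emb = model.encode(query, normalize_embeddings=True)
p_embs = model.encode(passages, normalize_embeddings=True)
print(p_embs @ q_emb)  # normalized vectors → dot product equals cosine similarity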
How to Choose Between Them
A reasonable decision tree:
- Privacy critical or API cost is a problem → Sentence Transformers or BGE local.
- English-only, maximum quality → BGE or E5.
- Multilingual (ES, EN, FR…) and simplicity → OpenAI ada-002 or multilingual-e5-base.
- Speed above all → MiniLM (encoding on CPU is reasonably fast).
- Don’t want to think about infra → OpenAI ada-002.
Quality differences among the top-3 are notable in benchmarks but smaller than expected in real apps — corpus quality and chunking usually matter more than the difference between top-tier models.
Use Cases Where They Add Value
Cases where embeddings are the right tool:
- Semantic search. “Find documents talking about X” where X can be expressed many ways. Much better than keyword search when users don’t use the exact corpus terms.
- RAG. Retrieve relevant context for an LLM before generating an answer. The central piece of 90% of applied LLM products. See also Chroma and pgvector as storage options.
- Zero-shot or few-shot classification. Categorise texts without training a classic classifier: embed the text and compare it against embedded labels (see the sketch after this list).
- Duplicate or near-duplicate detection. Find similar content (catalog products, blog posts, FAQs).
- Content-based recommendation. “Other related articles” based on text similarity, not click history.
- Textual anomaly detection. Identify text that strays from the corpus’s typical pattern.
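The zero-shot classification idea from the list above fits in a few lines. A minimal sketch with Sentence Transformers; the labels and the input text are invented for the example:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical labels, written as short descriptions so they embed well
labels = ["billing question", "technical problem", "feature request"]
label_embs = model.encode(labels, convert_to_tensor=True)

text = "The app crashes every time I open the settings screen"
text_emb = model.encode(text, convert_to_tensor=True)

# Assign the label whose embedding is closest to the text's embedding
scores = util.cos_sim(text_emb, label_embs)[0]
print(labels[int(scores.argmax())])  # → "technical problem"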
Cases Where They Aren’t the Best Option
Sometimes embeddings get applied by default when something else works better:
- Exact-match search. “I want documents with this exact word” → keyword search (Elasticsearch BM25, Postgres FTS) is better.
- Strict metadata filtering. “Posts by author X published in 2022” → SQL or a traditional index.
- Small corpus (<200 documents) → you can simply put them all in the LLM prompt (if they fit) and skip the vector DB.
- Hybrid search with complex filters → better a vector DB with native filters (Weaviate, Qdrant) than a homemade system.
The Chunking Detail
For long-document embeddings, you don’t embed the whole document — you split it into chunks. Decisions that matter:
- Size: 200-500 tokens per chunk is a typical range. Too small loses context; too large dilutes signal.
- Overlap: 10-20% of tokens shared between consecutive chunks to avoid cutting concepts in half.
- Structure: respect headers, paragraphs, and tables if possible. Cutting mid-table generates nonsensical chunks.
Bad chunking is the number-one cause of underperforming RAG — more than the choice of embedding model. This is why frameworks like LangChain include configurable splitters like RecursiveCharacterTextSplitter.
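A minimal sketch of those knobs with LangChain's splitter; note that chunk_size here counts characters rather than tokens, and the import path varies across LangChain versions:

# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Toy "long document": one sentence repeated to force multiple chunks
text = "Embeddings map text to vectors so similar meanings land nearby. " * 40

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # characters, not tokens; tune to your tokenizer
    chunk_overlap=75,  # ~15% overlap so concepts are not cut in half
)

chunks = splitter.split_text(text)
print(len(chunks), len(chunks[0]))  # several overlapping ~500-character chunks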
Conclusion
Embeddings are the most versatile piece of the modern NLP toolkit. For many cases, model choice matters less than it seems; what has the most impact is how you process the text beforehand (chunking, normalisation) and what you do with the vectors afterwards (search, ranking, filters). Start with the simplest option that meets your privacy and cost constraints, measure, and migrate only when needed.