Awesome Text Embeddings

An opinionated buyer's guide for text embeddings in production — RAG, search, classification.

Text embeddings convert text into dense vectors for semantic search, retrieval, clustering, and classification. This list helps you choose the right embedding model for your use case.

Last reviewed: January 2025 · Suggest an update

Quick Picks

Just want a recommendation? Start here:

Use Case	Model	Why
Best overall (API)	text-embedding-3-large	Highest quality, 8k context, adjustable dims
Best overall (open)	NV-Embed-v2	MTEB #1, 32k context ⚠️ CC-BY-NC
Best budget	text-embedding-3-small	$0.02/1M tokens, still good quality
Best local/private	nomic-embed-text-v2-moe	MoE architecture, multilingual, GGUF available
Best multilingual	multilingual-e5-large	100+ languages, MIT license
Best for code	voyage-code-2	Purpose-built, 16k context

⚠️ = Non-commercial license. Check before using in production.

How to Choose

Question	Recommendation
Need best quality, don't mind API costs?	OpenAI `text-embedding-3-large` or Cohere `embed-english-v3.0`
Want open source, good quality?	`gte-large-en-v1.5` or `bge-large-en-v1.5`
Need multilingual?	`multilingual-e5-large` or Cohere `embed-multilingual-v3.0`
Working with code?	`voyage-code-2`
Have very long documents?	`jina-embeddings-v2-base-en` (8k) or `NV-Embed-v2` (32k)
Running locally/edge?	`nomic-embed-text-v2-moe` or `nomic-embed-text-v1.5` (GGUF available)
Need on-prem / data privacy?	Open source models only — see Open Source section

Key tradeoffs:

Dimensions: Higher = more expressive but more storage/compute. 768-1024 is the sweet spot for most use cases.
Context length: Many models cap at 512 tokens; some go to 8k+. Longer = fewer chunks needed.
Open vs API: Open = privacy, cost control, on-prem; API = simplicity, no infrastructure.
Quality vs speed: Larger models score higher on benchmarks but have higher latency.
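
To make the dimensions tradeoff concrete, here is a rough sketch that estimates raw vector storage. It assumes float32 values (4 bytes each) and ignores index overhead and metadata; the function name `index_size_gb` and the one-million-vector corpus are illustrative assumptions, not figures from this list.

```python
def index_size_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB (10^9 bytes), excluding any index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9


# Compare a few common output sizes for a hypothetical corpus of 1M vectors.
for dims in (256, 768, 1024, 3072):
    size = index_size_gb(1_000_000, dims)
    print(f"{dims:>5} dims: {size:.2f} GB per 1M float32 vectors")
```

Storage grows linearly with dimensions, so a 3072-dim model costs about four times the raw space of a 768-dim one for the same corpus.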

Common Gotchas

Things that bite engineers in production:

Issue	What to watch for
Query/passage prefixes	E5 models require `query:` and `passage:` prefixes. Without them, quality drops significantly. Check model cards.
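Normalization	Some models output unit-normalized vectors, for which cosine similarity and dot product give the same ranking; others don't, so the two can disagree. Use the metric the model card specifies, and never mix normalized and unnormalized vectors in one index.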
Matryoshka dimensions	Models like OpenAI's and Nomic's support truncating dimensions (e.g., 3072→256). You must re-normalize after truncation.
License traps	CC-BY-NC (NV-Embed-v2, SFR-Embedding-2_R) = no commercial use. Check before deploying.
Context overflow	Tokens beyond max length are silently truncated. For long docs, chunk first or use long-context models.
Embedding drift	API providers may update models silently. Pin versions or re-embed periodically if using managed APIs.
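
The normalization and Matryoshka rows can be illustrated with a small dependency-free sketch. It truncates a vector to its first `dims` values and re-normalizes it, and shows that dot product equals cosine similarity once vectors have unit length. The helper names and the toy 3-dimensional vector are assumptions for illustration; real embeddings come from whichever model you use.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def truncate_embedding(vec, dims):
    """Keep the first `dims` values, then re-normalize to unit length."""
    if dims <= 0 or dims > len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    return l2_normalize(vec[:dims])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


full = l2_normalize([3.0, 4.0, 12.0])   # toy stand-in for a model output
naive = full[:2]                        # truncated but NOT re-normalized
fixed = truncate_embedding(full, 2)     # truncated and re-normalized

print(math.sqrt(dot(naive, naive)))     # less than 1: no longer unit length
print(math.sqrt(dot(fixed, fixed)))     # back to (approximately) 1
print(abs(dot(fixed, fixed) - cosine(fixed, fixed)) < 1e-9)  # dot == cosine on unit vectors
```

If the truncated vectors are not re-normalized, dot-product scores shrink with the discarded dimensions and stop being comparable across vectors.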

General Purpose

Open Source

Model	Provider	Dims	Max Tokens	MTEB Avg	License	Notes
NV-Embed-v2	NVIDIA	4096	32768	72.3	CC-BY-NC-4.0	Current MTEB #1, very long context
Llama-Embed-Nemotron-8B	NVIDIA	4096	8192	69.6	Llama 3.1	Open weights, MMTEB leader, multilingual
stella-en-1.5B-v5	NovaSearch	1024	512	66.9	MIT	Strong quality, moderate size
gte-large-en-v1.5	Alibaba	1024	8192	65.4	Apache 2.0	Long context, top tier
mxbai-embed-large-v1	Mixedbread	1024	512	64.7	Apache 2.0	Strong MTEB performer
snowflake-arctic-embed-l	Snowflake	1024	512	64.5	Apache 2.0	Strong retrieval
bge-large-en-v1.5	BAAI	1024	512	64.2	MIT	Widely adopted, battle-tested
gte-base-en-v1.5	Alibaba	768	8192	64.1	Apache 2.0	Smaller + long context
SFR-Embedding-2_R	Salesforce	4096	8192	67.5	CC-BY-NC-4.0	Strong retrieval, long context
bge-base-en-v1.5	BAAI	768	512	63.5	MIT	Good speed/quality balance
nomic-embed-text-v2-moe	Nomic	768	8192	65.8	Apache 2.0	MoE, multilingual, Matryoshka dims
nomic-embed-text-v1.5	Nomic	768	8192	62.3	Apache 2.0	Lighter option, GGUF for local
e5-large-v2	Microsoft	1024	512	62.2	MIT	Requires "query:" / "passage:" prefixes
e5-base-v2	Microsoft	768	512	61.5	MIT	Smaller variant
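
For the E5 rows, a minimal sketch of the prefix convention: queries get `query: ` and documents get `passage: ` prepended before encoding. The helper name `with_e5_prefix` is hypothetical, and the commented-out sentence-transformers call is an assumed usage pattern; check the model card for the exact recommendation.

```python
def with_e5_prefix(texts, kind):
    """Prepend the E5 role prefix ("query: " or "passage: ") to each text."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {text.strip()}" for text in texts]


queries = with_e5_prefix(["how do I re-normalize a vector?"], "query")
passages = with_e5_prefix(["Divide the vector by its L2 norm."], "passage")
print(queries)
print(passages)

# Assumed usage with sentence-transformers (not executed here):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("intfloat/e5-large-v2")
# passage_vectors = model.encode(passages, normalize_embeddings=True)
```

Keeping the prefix logic in one helper makes it harder to embed documents at indexing time with one convention and queries at search time with another.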

API Services

Model	Provider	Dims	Max Tokens	Pricing (per 1M tokens)	Notes
text-embedding-3-large	OpenAI	3072	8191	$0.13	Best quality, adjustable dims (Matryoshka)
gemini-embedding-001	Google	3072	8192	$0.00 (free tier)	MTEB leader, task-type parameter
voyage-large-2	Voyage AI	1536	16000	$0.12	Longest context
embed-english-v3.0	Cohere	1024	512	$0.10	Strong retrieval
embed-large-v1	Mixedbread	1024	512	$0.05	Good quality/price
embedding-001	Google	768	2048	$0.025	Vertex AI
text-embedding-3-small	OpenAI	1536	8191	$0.02	Best budget option
jina-embeddings-v2-base-en	Jina AI	768	8192	$0.02	Open weights also available
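
A back-of-the-envelope sketch for comparing API costs, using the per-1M-token prices from the table above. Those prices may be out of date, and the corpus size (100,000 documents of about 500 tokens each) is purely an assumption for illustration.

```python
# Prices in USD per 1M tokens, copied from the table above (may be stale).
PRICE_PER_1M_TOKENS = {
    "text-embedding-3-large": 0.13,
    "voyage-large-2": 0.12,
    "text-embedding-3-small": 0.02,
}


def embedding_cost_usd(total_tokens: int, price_per_million: float) -> float:
    """One-off cost of embedding `total_tokens` tokens at a flat per-token price."""
    return total_tokens / 1_000_000 * price_per_million


# Hypothetical corpus: 100,000 documents, ~500 tokens each.
corpus_tokens = 100_000 * 500

for model, price in PRICE_PER_1M_TOKENS.items():
    print(f"{model}: ${embedding_cost_usd(corpus_tokens, price):.2f}")
```

Remember that re-embedding after a model change incurs the full cost again, and query traffic adds a recurring cost on top.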

Specialized

Multilingual

Model	Provider	Dims	Languages	Max Tokens	Notes
bge-m3	BAAI	1024	100+	8192	Hybrid dense+sparse, long context
multilingual-e5-large	Microsoft	1024	100+	512	Best open multilingual
EmbeddingGemma-300M	Google	768	100+	2048	Top multilingual under 500M params, Matryoshka dims
multilingual-e5-base	Microsoft	768	100+	512	Smaller variant
embed-multilingual-v3.0	Cohere	1024	100+	512	API, strong quality
paraphrase-multilingual-mpnet-base-v2	SBERT	768	50+	512	Sentence-transformers

Code Embeddings

Model	Provider	Dims	Languages	Notes
voyage-code-2	Voyage AI	1536	20+	Best code retrieval, 16k context
StarEncoder	BigCode	768	80+	StarCoder-based, open source
codebert-base	Microsoft	768	6	Open source, smaller
code-search-ada-002	OpenAI	1536	Multiple	Legacy but still used

Long-Context

Models supporting 4k+ tokens — useful for embedding full documents without chunking.

Model	Provider	Dims	Max Tokens	Notes
NV-Embed-v2	NVIDIA	4096	32768	Longest context (open), MTEB #1
voyage-large-2	Voyage AI	1536	16000	Longest context (API)
gte-large-en-v1.5	Alibaba	1024	8192	Top quality (open)
jina-embeddings-v2-base-en	Jina AI	768	8192	Open + API available
nomic-embed-text-v2-moe	Nomic	768	8192	MoE, multilingual, GGUF available
text-embedding-3-large	OpenAI	3072	8191	Adjustable dimensions
bge-m3	BAAI	1024	8192	Also multilingual
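
When a document still exceeds a model's limit, it has to be split before embedding. The sketch below cuts a token sequence into overlapping windows. Whitespace splitting stands in for the model's real tokenizer, and the function name, window size, and overlap are all illustrative assumptions.

```python
def chunk_tokens(tokens, max_tokens, overlap=0):
    """Split `tokens` into windows of at most `max_tokens`, sharing `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Whitespace "tokens" are a stand-in; use the model's tokenizer in practice.
text = "one two three four five six seven eight nine ten"
for chunk in chunk_tokens(text.split(), max_tokens=4, overlap=1):
    print(" ".join(chunk))
```

Counting with the model's own tokenizer matters because word counts and token counts can differ substantially, and anything past the limit is truncated.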

Domain-Specific

Model	Provider	Domain	Dims	Notes
legal-bert-base-uncased	NLP@AUEB	Legal	768	Trained on legal corpora
PubMedBERT	Microsoft	Biomedical	768	PubMed abstracts
SciBERT	Allen AI	Scientific	768	Scientific papers
finbert	FinBERT	Finance	768	Financial sentiment

Rerankers

Rerankers improve retrieval quality by rescoring initial results. Use after embedding-based retrieval.

Model	Provider	Type	Notes
rerank-english-v3.0	Cohere	API	Production-ready, easy to integrate
rerank-multilingual-v3.0	Cohere	API	100+ languages
bge-reranker-v2-m3	BAAI	Open	Multilingual, pairs with BGE embeddings
bge-reranker-large	BAAI	Open	English-focused, strong quality
ms-marco-MiniLM-L-12-v2	SBERT	Open	Lightweight, fast
jina-reranker-v2-base-multilingual	Jina AI	Open	100+ languages, 1k context
mxbai-rerank-large-v1	Mixedbread	Open	Strong quality

When to use a reranker:

You have more than ~20 candidates from initial retrieval
Quality matters more than latency
Your embedding model's ranking isn't precise enough
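
The retrieve-then-rerank flow can be sketched as follows. The `score_fn` parameter stands in for a real reranker (for example, a cross-encoder or a rerank API call). The word-overlap scorer here is a toy placeholder so the example runs without dependencies, not a recommendation.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Rescore first-stage candidates with `score_fn` and keep the best `top_n`."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query: str, doc: str) -> float:
    """Placeholder scorer: fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


# Pretend these came back from embedding-based retrieval.
candidates = [
    "cats purr when content",
    "how to normalize embedding vectors",
    "vector databases store embeddings",
]
for doc, score in rerank("normalize embedding vectors", candidates, toy_overlap_score, top_n=2):
    print(f"{score:.2f}  {doc}")
```

Because the reranker scores every query-document pair, latency grows with the number of candidates, which is why it is applied only to the first-stage shortlist.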

Horizon

🔭 Emerging approaches worth watching. These represent paradigm shifts or new capabilities that may reshape best practices.

Unified Generation + Embedding

Model	What's New	Link
GritLM	Single model does both text generation AND embeddings. No need for separate models. 7B params, competitive on MTEB while also being a capable LLM.	Paper ・ HuggingFace

Late Chunking

Traditional approach: chunk documents → embed each chunk independently.

Late chunking: embed the full document first (using long-context model), then extract chunk representations that retain document context. Reduces information loss at chunk boundaries.
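
The pooling step of late chunking can be sketched like this. It assumes you already have one embedding per token for the whole document, produced by a long-context model, plus token spans marking the chunk boundaries. Mean pooling is assumed, and the toy 2-dimensional vectors and the name `late_chunk` are illustrative only.

```python
def late_chunk(token_embeddings, spans):
    """Mean-pool document-level token embeddings over each (start, end) token span."""
    chunk_vectors = []
    for start, end in spans:
        window = token_embeddings[start:end]
        if not window:
            raise ValueError(f"empty span: ({start}, {end})")
        dims = len(window[0])
        chunk_vectors.append(
            [sum(token[d] for token in window) / len(window) for d in range(dims)]
        )
    return chunk_vectors


# Toy token embeddings for a 4-token document. In practice these come from a
# single forward pass over the full document, so each token vector already
# reflects context from outside its own chunk.
token_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
spans = [(0, 2), (2, 4)]  # chunk boundaries expressed as token offsets

print(late_chunk(token_embeddings, spans))
```

The difference from the traditional approach is only the order of operations: the model sees the whole document before any chunk boundary is applied.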

Resource	Description	Link
Jina Late Chunking	Original technique explanation + implementation	Blog
Contextual Retrieval	Anthropic's related approach using LLMs to add context	Blog

LLM-Based Embeddings

Using decoder-only LLMs as embedding models—often by pooling hidden states or clever prompting.

Approach	What's New	Link
Echo Embeddings	Repeat input text to simulate bidirectional attention in autoregressive LLMs. Simple trick, strong results.	Paper (ICLR 2025)
LLM2Vec	Convert any decoder LLM into an embedding model via bidirectional attention + masked next token prediction.	Paper ・ GitHub

Multimodal Embeddings

Embedding models that handle both text and images together—useful for document retrieval with figures, screenshots, slides.

Model	What's New	Link
Voyage Multimodal-3	Interleaved text + images. Strong on PDFs, slides, screenshots.	Docs
Jina CLIP v2	Open source text-image embeddings, 8k text context	HuggingFace

Benchmarks & Leaderboards

Benchmark	What it measures	Best for	Link
MTEB	8 task types (retrieval, classification, clustering, etc.) across 58 datasets, 112 languages	Overall embedding quality comparison	Leaderboard
BEIR	Zero-shot retrieval across 18 diverse datasets	Retrieval-focused evaluation	GitHub
MIRACL	Multilingual retrieval across 18 languages	Non-English retrieval	GitHub
C-MTEB	Chinese-specific embedding tasks	Chinese language models	Leaderboard

Note: MTEB scores are useful for comparison but don't always predict real-world performance. Test on your own data with tools like ragtune.
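
Testing on your own data can start with a simple retrieval metric. Below is a minimal recall@k sketch. It is not the interface of any tool named here, and the query IDs, document IDs, and relevance labels are made up for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Mean over queries of the fraction of relevant docs found in the top k results.

    retrieved: dict mapping query id -> ranked list of doc ids
    relevant:  dict mapping query id -> set of relevant doc ids
    """
    per_query = []
    for query_id, relevant_ids in relevant.items():
        if not relevant_ids:
            continue  # skip queries with no labeled relevant documents
        top_k = set(retrieved.get(query_id, [])[:k])
        per_query.append(len(top_k & relevant_ids) / len(relevant_ids))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Hypothetical results from one embedding model on two labeled queries.
retrieved = {"q1": ["d1", "d2", "d3"], "q2": ["d9", "d4", "d5"]}
relevant = {"q1": {"d1", "d3"}, "q2": {"d4"}}

print(recall_at_k(retrieved, relevant, k=2))
```

Running the same labeled queries through each candidate model and comparing the scores gives a direct signal that leaderboard averages cannot.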

Tools & Evaluation

Benchmarking & Comparison

Tool	Description	Link
ragtune	CLI for benchmarking RAG retrieval quality. Compare embedding models on your queries and documents.	GitHub
RAGatouille	Easy-to-use ColBERT retrieval. Late interaction for better precision than dense embeddings.	GitHub
MTEB	Official benchmark toolkit for evaluating embeddings on standard tasks	GitHub
sentence-transformers	Framework for using, comparing, and training embeddings	GitHub
Embeddings Projector	Visualize high-dimensional embeddings in 2D/3D	TensorFlow

Fine-tuning

Tool	Description	Link
sentence-transformers	Training custom embedding models with contrastive learning	Docs
FlagEmbedding	BAAI's toolkit for fine-tuning BGE models	GitHub
uniem	Unified embedding model training framework	GitHub

Local Inference

Tool	Description	Link
FastEmbed	Fast, lightweight embedding inference by Qdrant	GitHub
Infinity	High-throughput embedding server, OpenAI-compatible API	GitHub
Model2Vec	Distill sentence transformers to static embeddings — up to 500x faster, 50x smaller	GitHub
Ollama	Run embedding models locally (GGUF format)	Ollama
llama.cpp	C++ inference for quantized models	GitHub
TEI	Hugging Face's Text Embeddings Inference server	GitHub

Resources

Papers

Foundational:

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019) — Started modern sentence embeddings
MTEB: Massive Text Embedding Benchmark (2022) — The standard benchmark
Text and Code Embeddings by Contrastive Pre-Training (2022) — OpenAI's approach

Recent advances:

Improving Text Embeddings with Large Language Models (2024) — LLM-based embedding training (E5-mistral)
BGE M3-Embedding (2024) — Multi-lingual, multi-functionality, multi-granularity
Matryoshka Representation Learning (2022) — Flexible dimension embeddings
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models (2024) — NVIDIA's approach
Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings (2023) — Long-context embeddings

Understanding embeddings:

Text Embeddings Reveal (Almost) As Much As Text (2023) — Privacy implications
BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation (2021) — Retrieval benchmark

Tutorials

Sentence-Transformers Documentation — Comprehensive embedding guide
Hugging Face NLP Course — Includes embedding fundamentals
Choosing an Embedding Model — Pinecone's practical guide
Cohere Embed Guide — Good API-focused tutorial

Related Lists

For adjacent topics, see these curated lists:

awesome-vector-databases — Vector storage and retrieval
awesome-rag — Retrieval-augmented generation
awesome-semantic-search — Semantic search resources
awesome-local-ai — Local AI inference

Contributing

See CONTRIBUTING.md for guidelines on adding new models or tools.

License

To the extent possible under law, the authors have waived all copyright and related rights to this work.
