Awesome Text Embeddings

An opinionated buyer's guide for text embeddings in production — RAG, search, classification.

Text embeddings convert text into dense vectors for semantic search, retrieval, clustering, and classification. This list helps you choose the right embedding model for your use case.

Last reviewed: January 2025

Quick Picks

Just want a recommendation? Start here:

| Use Case | Model | Why |
| --- | --- | --- |
| Best overall (API) | text-embedding-3-large | Highest quality, 8k context, adjustable dims |
| Best overall (open) | NV-Embed-v2 | MTEB #1, 32k context ⚠️ CC-BY-NC |
| Best budget | text-embedding-3-small | $0.02/1M tokens, still good quality |
| Best local/private | nomic-embed-text-v2-moe | MoE architecture, multilingual, GGUF available |
| Best multilingual | multilingual-e5-large | 100+ languages, MIT license |
| Best for code | voyage-code-2 | Purpose-built, 16k context |

⚠️ = Non-commercial license. Check before using in production.


How to Choose

| Question | Recommendation |
| --- | --- |
| Need best quality, don't mind API costs? | OpenAI text-embedding-3-large or Cohere embed-v3 |
| Want open source, good quality? | gte-large-en-v1.5 or bge-large-en-v1.5 |
| Need multilingual? | multilingual-e5-large or Cohere embed-multilingual-v3 |
| Working with code? | voyage-code-2 |
| Have very long documents? | jina-embeddings-v2-base-en (8k) or NV-Embed-v2 (32k) |
| Running locally/edge? | nomic-embed-text-v2-moe or v1.5 (GGUF available) |
| Need on-prem / data privacy? | Open source models only; see Open Source section |

Key tradeoffs:

  • Dimensions: Higher = more expressive but more storage/compute. 768-1024 is the sweet spot for most use cases.
  • Context length: Most models max at 512 tokens; some go to 8k+. Longer = fewer chunks needed.
  • Open vs API: Open = privacy, cost control, on-prem; API = simplicity, no infrastructure.
  • Quality vs speed: Larger models score higher on benchmarks but have higher latency.
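
To make the dimension and context-length tradeoffs concrete, here is a rough back-of-the-envelope sketch. It assumes float32 storage (4 bytes per value), a flat index with no overhead, and non-overlapping chunks; the helper names are illustrative and not from any library.

```python
import math


def index_size_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw size of a flat vector index (float32 assumed), ignoring index overhead."""
    return num_vectors * dims * bytes_per_value


def chunks_needed(doc_tokens: int, max_tokens: int) -> int:
    """Minimum number of non-overlapping chunks needed to cover one document."""
    return math.ceil(doc_tokens / max_tokens)


# Storage for 1M vectors at a few common dimensionalities.
for dims in (256, 1024, 3072):
    size_gb = index_size_bytes(1_000_000, dims) / 1e9
    print(f"{dims:>5} dims: {size_gb:.3f} GB per 1M vectors")

# Chunks for a 20k-token document under a short and a long context window.
for max_tokens in (512, 8192):
    print(f"{max_tokens:>5}-token window: {chunks_needed(20_000, max_tokens)} chunks")
```

Under these assumptions, 1M vectors take about 4.1 GB at 1024 dims and about 12.3 GB at 3072 dims, before any index structures or replicas. Real deployments usually add chunk overlap, so treat the chunk count as a lower bound.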

Common Gotchas

Things that bite engineers in production:

| Issue | What to watch for |
| --- | --- |
| Query/passage prefixes | E5 models require query: and passage: prefixes. Without them, quality drops significantly. Check model cards. |
| Normalization | Some models output unit-normalized vectors, others don't. For normalized vectors, cosine and dot product rank identically; for unnormalized vectors they differ, so use the metric the model card specifies. Mixing the two breaks similarity scores. |
| Matryoshka dimensions | Models like OpenAI's and Nomic's support truncating dimensions (e.g., 3072→256). If you truncate vectors yourself, you must re-normalize afterward. |
| License traps | CC-BY-NC (NV-Embed-v2, SFR-Embedding) = no commercial use. Check before deploying. |
| Context overflow | Tokens beyond max length are often silently truncated. For long docs, chunk first or use long-context models. |
| Embedding drift | API providers may update models silently. Pin versions or re-embed periodically if using managed APIs. |
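
The prefix, normalization, and Matryoshka gotchas above can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's API: the function names are hypothetical, vectors are plain lists of floats, and the prefix strings follow the "query: " / "passage: " convention that E5 model cards describe (check the card for the model you actually use).

```python
import math
from typing import List, Sequence


def with_e5_prefix(text: str, is_query: bool) -> str:
    """Prepend the 'query: ' or 'passage: ' prefix expected by E5-style models."""
    return ("query: " if is_query else "passage: ") + text


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def truncate_embedding(vec: Sequence[float], dims: int) -> List[float]:
    """Keep the first `dims` components (Matryoshka-style), then re-normalize."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    return l2_normalize(vec[:dims])


a = [3.0, 4.0, 12.0]
b = [1.0, 2.0, 2.0]

# On unit vectors, dot product and cosine agree; on raw vectors they do not.
print(cosine(a, b), dot(l2_normalize(a), l2_normalize(b)), dot(a, b))

# Truncated vectors are no longer unit length until re-normalized.
print(truncate_embedding(a, 2))
print(with_e5_prefix("how do I reset my password?", is_query=True))
```

Truncation only makes sense for models trained with Matryoshka-style objectives; cutting dimensions from other models generally degrades quality more sharply.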

General Purpose

Open Source

| Model | Provider | Dims | Max Tokens | MTEB Avg | License | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| NV-Embed-v2 | NVIDIA | 4096 | 32768 | 72.3 | CC-BY-NC-4.0 | Current MTEB #1, very long context |
| Llama-Embed-Nemotron-8B | NVIDIA | 4096 | 8192 | 69.6 | Llama 3.1 | Open weights, MMTEB leader, multilingual |
| stella-en-1.5B-v5 | NovaSearch | 1024 | 512 | 66.9 | MIT | Strong quality, moderate size |
| gte-large-en-v1.5 | Alibaba | 1024 | 8192 | 65.4 | Apache 2.0 | Long context, top tier |
| mxbai-embed-large-v1 | Mixedbread | 1024 | 512 | 64.7 | Apache 2.0 | Strong MTEB performer |
| snowflake-arctic-embed-l | Snowflake | 1024 | 512 | 64.5 | Apache 2.0 | Strong retrieval |
| bge-large-en-v1.5 | BAAI | 1024 | 512 | 64.2 | MIT | Widely adopted, battle-tested |
| gte-base-en-v1.5 | Alibaba | 768 | 8192 | 64.1 | Apache 2.0 | Smaller + long context |
| SFR-Embedding-2_R | Salesforce | 4096 | 8192 | 67.5 | CC-BY-NC-4.0 | Strong retrieval, long context |
| bge-base-en-v1.5 | BAAI | 768 | 512 | 63.5 | MIT | Good speed/quality balance |
| nomic-embed-text-v2-moe | Nomic | 768 | 8192 | 65.8 | Apache 2.0 | MoE, multilingual, Matryoshka dims |
| nomic-embed-text-v1.5 | Nomic | 768 | 8192 | 62.3 | Apache 2.0 | Lighter option, GGUF for local |
| e5-large-v2 | Microsoft | 1024 | 512 | 62.2 | MIT | Requires "query:" and "passage:" prefixes |
| e5-base-v2 | Microsoft | 768 | 512 | 61.5 | MIT | Smaller variant |

API Services

| Model | Provider | Dims | Max Tokens | Pricing (per 1M tokens) | Notes |
| --- | --- | --- | --- | --- | --- |
| text-embedding-3-large | OpenAI | 3072 | 8191 | $0.13 | Best quality, adjustable dims (Matryoshka) |
| gemini-embedding-001 | Google | 3072 | 8192 | $0.00 (free tier) | MTEB leader, task-type parameter |
| voyage-large-2 | Voyage AI | 1536 | 16000 | $0.12 | Longest context |
| embed-english-v3.0 | Cohere | 1024 | 512 | $0.10 | Strong retrieval |
| embed-large-v1 | Mixedbread | 1024 | 512 | $0.05 | Good quality/price |
| embedding-001 | Google | 768 | 2048 | $0.025 | Vertex AI |
| text-embedding-3-small | OpenAI | 1536 | 8191 | $0.02 | Best budget option |
| jina-embeddings-v2-base-en | Jina AI | 768 | 8192 | $0.02 | Open weights also available |
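
For a quick sense of what the per-1M-token prices above mean for a whole corpus, the arithmetic can be sketched as follows. The 500M-token corpus is a made-up example and the helper name is illustrative; the prices are the ones listed in the table and may have changed since the last review.

```python
def embedding_cost_usd(total_tokens: int, price_per_million_usd: float) -> float:
    """One-off cost of embedding `total_tokens` at a given price per 1M tokens."""
    return total_tokens / 1_000_000 * price_per_million_usd


corpus_tokens = 500_000_000  # hypothetical corpus size

# Prices taken from the table above (USD per 1M tokens).
for model, price in (("text-embedding-3-large", 0.13), ("text-embedding-3-small", 0.02)):
    print(f"{model}: ${embedding_cost_usd(corpus_tokens, price):.2f}")
```

This covers only the initial embedding pass. Re-embedding after a model change and query-time embedding add to the total.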

Specialized

Multilingual

| Model | Provider | Dims | Languages | Max Tokens | Notes |
| --- | --- | --- | --- | --- | --- |
| bge-m3 | BAAI | 1024 | 100+ | 8192 | Hybrid dense+sparse, long context |
| multilingual-e5-large | Microsoft | 1024 | 100+ | 512 | Best open multilingual |
| EmbeddingGemma-300M | Google | 768 | 100+ | 2048 | Top multilingual under 500M params, Matryoshka dims |
| multilingual-e5-base | Microsoft | 768 | 100+ | 512 | Smaller variant |
| embed-multilingual-v3.0 | Cohere | 1024 | 100+ | 512 | API, strong quality |
| paraphrase-multilingual-mpnet-base-v2 | SBERT | 768 | 50+ | 512 | Sentence-transformers |

Code Embeddings

| Model | Provider | Dims | Languages | Notes |
| --- | --- | --- | --- | --- |
| voyage-code-2 | Voyage AI | 1536 | 20+ | Best code retrieval, 16k context |
| StarEncoder | BigCode | 768 | 80+ | StarCoder-based, open source |
| codebert-base | Microsoft | 768 | 6 | Open source, smaller |
| code-search-ada-002 | OpenAI | 1536 | Multiple | Legacy but still used |

Long-Context

Models supporting 4k+ tokens — useful for embedding full documents without chunking.

| Model | Provider | Dims | Max Tokens | Notes |
| --- | --- | --- | --- | --- |
| NV-Embed-v2 | NVIDIA | 4096 | 32768 | Longest context (open), MTEB #1 |
| voyage-large-2 | Voyage AI | 1536 | 16000 | Longest context (API) |
| gte-large-en-v1.5 | Alibaba | 1024 | 8192 | Top quality (open) |
| jina-embeddings-v2-base-en | Jina AI | 768 | 8192 | Open + API available |
| nomic-embed-text-v2-moe | Nomic | 768 | 8192 | MoE, multilingual, GGUF available |
| text-embedding-3-large | OpenAI | 3072 | 8191 | Adjustable dimensions |
| bge-m3 | BAAI | 1024 | 8192 | Also multilingual |

Domain-Specific

| Model | Provider | Domain | Dims | Notes |
| --- | --- | --- | --- | --- |
| legal-bert-base-uncased | NLP@AUEB | Legal | 768 | Trained on legal corpora |
| PubMedBERT | Microsoft | Biomedical | 768 | PubMed abstracts |
| SciBERT | Allen AI | Scientific | 768 | Scientific papers |
| finbert | FinBERT | Finance | 768 | Financial sentiment |

Rerankers

Rerankers improve retrieval quality by rescoring initial results. Use after embedding-based retrieval.

| Model | Provider | Type | Notes |
| --- | --- | --- | --- |
| rerank-english-v3.0 | Cohere | API | Production-ready, easy to integrate |
| rerank-multilingual-v3.0 | Cohere | API | 100+ languages |
| bge-reranker-v2-m3 | BAAI | Open | Multilingual, pairs with BGE embeddings |
| bge-reranker-large | BAAI | Open | English-focused, strong quality |
| ms-marco-MiniLM-L-12-v2 | SBERT | Open | Lightweight, fast |
| jina-reranker-v2-base-multilingual | Jina AI | Open | 100+ languages, 1k context |
| mxbai-rerank-large-v1 | Mixedbread | Open | Strong quality |

When to use a reranker:

  • You have more than ~20 candidates from initial retrieval
  • Quality matters more than latency
  • Your embedding model's ranking isn't precise enough
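
The retrieve-then-rerank flow can be sketched as a two-stage function. This is a toy illustration under stated assumptions: `rerank_score` is a placeholder callable standing in for a real reranker (typically a cross-encoder or an API call), `word_overlap` is a deliberately crude stand-in scorer, and the vectors are hand-written rather than produced by a model.

```python
from typing import Callable, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve_then_rerank(
    query: str,
    query_vec: Sequence[float],
    docs: List[str],
    doc_vecs: List[Sequence[float]],
    rerank_score: Callable[[str, str], float],
    num_candidates: int = 50,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap vector similarity over the whole corpus.
    ranked = sorted(
        range(len(docs)),
        key=lambda i: dot(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = ranked[:num_candidates]

    # Stage 2: expensive pairwise scoring, applied to the candidates only.
    rescored = [(docs[i], rerank_score(query, docs[i])) for i in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_k]


def word_overlap(query: str, doc: str) -> float:
    """Toy stand-in for a reranker: fraction of query words found in the doc."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / max(len(q), 1)


docs = ["reset your password", "billing and invoices", "password policy for admins"]
doc_vecs = [[0.8, 0.1], [0.1, 0.9], [0.9, 0.3]]  # hand-written toy embeddings

results = retrieve_then_rerank(
    "how to reset password", [1.0, 0.0], docs, doc_vecs,
    rerank_score=word_overlap, num_candidates=2, top_k=2,
)
print(results)
```

In this toy data the vector stage ranks the admin policy document first, and the second stage moves the password reset document to the top. The latency cost scales with `num_candidates`, which is why reranking is applied to a shortlist rather than the full corpus.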

Horizon

🔭 Emerging approaches worth watching. These represent paradigm shifts or new capabilities that may reshape best practices.

Unified Generation + Embedding

| Model | What's New | Link |
| --- | --- | --- |
| GritLM | Single model does both text generation AND embeddings. No need for separate models. 7B params, competitive on MTEB while also being a capable LLM. | Paper, HuggingFace |

Late Chunking

Traditional approach: chunk documents → embed each chunk independently.

Late chunking: embed the full document first (using long-context model), then extract chunk representations that retain document context. Reduces information loss at chunk boundaries.

| Resource | Description | Link |
| --- | --- | --- |
| Jina Late Chunking | Original technique explanation + implementation | Blog |
| Contextual Retrieval | Anthropic's related approach using LLMs to add context | Blog |
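
The pooling step of late chunking can be sketched as follows. This is a simplified illustration, not the reference implementation: it assumes `token_embeddings` already come from one forward pass of a long-context model over the full document (not shown), that chunk boundaries are given as token index spans, and that mean pooling is used per span.

```python
from typing import List, Sequence, Tuple


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dims)]


def late_chunk(
    token_embeddings: Sequence[Sequence[float]],
    spans: List[Tuple[int, int]],
) -> List[List[float]]:
    """Pool document-contextualized token embeddings per chunk span [start, end)."""
    chunk_vectors = []
    for start, end in spans:
        if not 0 <= start < end <= len(token_embeddings):
            raise ValueError(f"invalid span: ({start}, {end})")
        chunk_vectors.append(mean_pool(token_embeddings[start:end]))
    return chunk_vectors


# Four toy token embeddings, split into two chunks of two tokens each.
token_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
print(late_chunk(token_embeddings, [(0, 2), (2, 4)]))
```

The difference from traditional chunking is where the split happens: here every token vector was computed with the whole document in view, so each pooled chunk vector carries context from outside its own span.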

LLM-Based Embeddings

Using decoder-only LLMs as embedding models—often by pooling hidden states or clever prompting.

| Approach | What's New | Link |
| --- | --- | --- |
| Echo Embeddings | Repeat input text to simulate bidirectional attention in autoregressive LLMs. Simple trick, strong results. | Paper (ICLR 2025) |
| LLM2Vec | Convert any decoder LLM into an embedding model via bidirectional attention + masked next token prediction. | Paper, GitHub |

Multimodal Embeddings

Embedding models that handle both text and images together—useful for document retrieval with figures, screenshots, slides.

| Model | What's New | Link |
| --- | --- | --- |
| Voyage Multimodal-3 | Interleaved text + images. Strong on PDFs, slides, screenshots. | Docs |
| Jina CLIP v2 | Open source text-image embeddings, 8k text context | HuggingFace |

Benchmarks & Leaderboards

| Benchmark | What it measures | Best for | Link |
| --- | --- | --- | --- |
| MTEB | 8 task types (retrieval, classification, clustering, etc.) across 58 datasets, 112 languages | Overall embedding quality comparison | Leaderboard |
| BEIR | Zero-shot retrieval across 18 diverse datasets | Retrieval-focused evaluation | GitHub |
| MIRACL | Multilingual retrieval across 18 languages | Non-English retrieval | GitHub |
| C-MTEB | Chinese-specific embedding tasks | Chinese language models | Leaderboard |

Note: MTEB scores are useful for comparison but don't always predict real-world performance. Test on your own data with tools like ragtune.
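
Testing on your own data can start with a simple recall@k over a handful of labeled queries. The sketch below is tool-independent and does not reflect how ragtune or MTEB compute their metrics; the data layout (query id mapped to ranked doc ids, and query id mapped to the set of relevant doc ids) is an assumption chosen for illustration.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top-k retrieved docs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[str]],
    qrels: Dict[str, Set[str]],
    k: int,
) -> float:
    """Average recall@k over all labeled queries; missing runs count as zero."""
    scores = [recall_at_k(runs.get(qid, []), rel, k) for qid, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labels and one model's ranked results.
qrels = {"q1": {"d1", "d2"}, "q2": {"d3"}}
runs = {"q1": ["d1", "d9", "d2"], "q2": ["d7", "d8", "d3"]}

print(mean_recall_at_k(runs, qrels, k=2))
print(mean_recall_at_k(runs, qrels, k=3))
```

Running the same labeled queries through two candidate models and comparing the averages gives a first signal that is specific to your corpus, which leaderboard scores cannot provide.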


Tools & Evaluation

Benchmarking & Comparison

| Tool | Description | Link |
| --- | --- | --- |
| ragtune | CLI for benchmarking RAG retrieval quality. Compare embedding models on your queries and documents. | GitHub |
| RAGatouille | Easy-to-use ColBERT retrieval. Late interaction for better precision than dense embeddings. | GitHub |
| MTEB | Official benchmark toolkit for evaluating embeddings on standard tasks | GitHub |
| sentence-transformers | Framework for using, comparing, and training embeddings | GitHub |
| Embeddings Projector | Visualize high-dimensional embeddings in 2D/3D | TensorFlow |

Fine-tuning

| Tool | Description | Link |
| --- | --- | --- |
| sentence-transformers | Training custom embedding models with contrastive learning | Docs |
| FlagEmbedding | BAAI's toolkit for fine-tuning BGE models | GitHub |
| uniem | Unified embedding model training framework | GitHub |

Local Inference

| Tool | Description | Link |
| --- | --- | --- |
| FastEmbed | Fast, lightweight embedding inference by Qdrant | GitHub |
| Infinity | High-throughput embedding server, OpenAI-compatible API | GitHub |
| Model2Vec | Distill sentence transformers to static embeddings: up to 500x faster, up to 50x smaller | GitHub |
| Ollama | Run embedding models locally (GGUF format) | Ollama |
| llama.cpp | C++ inference for quantized models | GitHub |
| TEI | Hugging Face's Text Embeddings Inference server | GitHub |

Resources

Papers

Foundational:

Recent advances:

Understanding embeddings:

Tutorials


Related Lists

For adjacent topics, see these curated lists:


Contributing

See CONTRIBUTING.md for guidelines on adding new models or tools.


License

CC0

To the extent possible under law, the authors have waived all copyright and related rights to this work.
