A vector embedding encodes an input as a list of floating point numbers.
"dog" → [0.017198, -0.007493, -0.057982, 0.054051, -0.028336, 0.019245,…]
Different models output different embeddings, with varying lengths.
| Model | Encodes | Vector length |
| --- | --- | --- |
| word2vec | words | 300 |
| SBERT (Sentence-Transformers) | text (up to ~400 words) | 768 |
| OpenAI text-embedding-ada-002 | text (up to 8191 tokens) | 1536 |
| OpenAI text-embedding-3-small | text (up to 8191 tokens) | 256-1536 |
| OpenAI text-embedding-3-large | text (up to 8191 tokens) | 256-3072 |
| Azure AI Vision | image or text | 1024 |
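The variable lengths for the text-embedding-3 models come from Matryoshka Representation Learning (MRL): a full-length vector can be shortened by keeping only a prefix of its components and re-normalizing to unit length. A minimal sketch in plain Python (the example vector here is made up, not a real model output):

```python
import math

def truncate_embedding(vector, dim):
    """Keep the first `dim` components and re-normalize to unit length (MRL-style shortening)."""
    prefix = vector[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

full = [0.5, 0.5, 0.5, 0.5, 0.1, -0.1]  # hypothetical embedding
short = truncate_embedding(full, 4)     # a 4-dimensional unit vector
```

The OpenAI API exposes this directly via a `dimensions` parameter, so you can request shorter vectors without truncating them yourself.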
Vector embeddings are commonly used for similarity search, fraud detection, recommendation systems, and RAG (Retrieval-Augmented Generation).
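All of those use cases boil down to comparing vectors, most often with cosine similarity. A minimal sketch (the vectors are made-up illustrations, not real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

dog = [0.017, -0.007, -0.058]   # made-up vectors for illustration
puppy = [0.015, -0.005, -0.060]
truck = [-0.050, 0.040, 0.010]

cosine_similarity(dog, puppy)  # close to 1.0 for similar items
cosine_similarity(dog, truck)  # much lower for dissimilar items
```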
This repository contains a visual exploration of vectors, using several embedding models.
Before running the notebooks, install the requirements:
`pip install -r requirements.txt`
Then explore these notebooks:
- Generate new OpenAI text embeddings
- Compare OpenAI and Word2Vec embeddings
- Vector similarity
- Vector search
- Generate multimodal vectors for dataset
- Explore multimodal vectors
- Vector distance metrics
- Vector quantization
- Vector dimension reduction (MRL)
The "Generate…" notebooks are also provided, but you only need them if you want to generate new embeddings data yourself.
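As a taste of what the quantization notebook covers: scalar quantization maps each float component to an 8-bit integer, trading a little precision for a 4x memory saving. A hedged sketch in plain Python (the [-1, 1] input range is an assumption for illustration, not taken from the notebooks):

```python
def scalar_quantize(vector, lo=-1.0, hi=1.0):
    """Map each float in [lo, hi] to an int in [-128, 127] (8-bit scalar quantization)."""
    scale = 255 / (hi - lo)
    return [round((min(max(x, lo), hi) - lo) * scale) - 128 for x in vector]

def scalar_dequantize(ints, lo=-1.0, hi=1.0):
    """Approximately recover the original floats from the quantized ints."""
    scale = (hi - lo) / 255
    return [(i + 128) * scale + lo for i in ints]

vec = [0.017, -0.007, -0.058, 0.054]
q = scalar_quantize(vec)        # four small integers instead of four floats
approx = scalar_dequantize(q)   # close to vec, within the quantization step
```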
Each notebook has resources at the bottom to dig further into that topic. Here are some additional general resources: