This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

docs: add embedding page #759

Merged
merged 5 commits on Jul 13, 2023
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -37,6 +37,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Docs

- Add Jina embeddings documentation page. ([#759](https://github.com/jina-ai/finetuner/pull/759))


## [0.7.8] - 2023-06-08

3 changes: 1 addition & 2 deletions docs/get-started/installation.md
@@ -9,8 +9,7 @@ Make sure you have `Python 3.8+` installed on Linux/Mac/Windows:
```bash
pip install -U finetuner
```

If you want to encode your data locally with the {meth}`~finetuner.encode` function, you need to install `"finetuner[full]"`.
In this case, some extra dependencies are installed which are necessary to do the inference, e.g., torch, torchvision, and open clip:
If you want to submit a fine-tuning job on the cloud, please use:

```bash
pip install "finetuner[full]"
```
47 changes: 47 additions & 0 deletions docs/get-started/pretrained.md
@@ -0,0 +1,47 @@
(pretrained-models)=
# {octicon}`rocket` Jina Embeddings

Starting with Finetuner 0.7.9,
we have introduced a suite of pre-trained text embedding models licensed under Apache 2.0.
These models have a variety of use cases, including information retrieval, semantic textual similarity, text reranking, and more.
The suite consists of the following models:

- `jina-embedding-s-en-v1` **[Huggingface](https://huggingface.co/jinaai/jina-embedding-s-en-v1)**: This is a compact model with just 35 million parameters that performs lightning-fast inference while delivering impressive performance.
- `jina-embedding-b-en-v1` **[Huggingface](https://huggingface.co/jinaai/jina-embedding-b-en-v1)**: This model has 110 million parameters, performs fast inference, and delivers better performance than our smaller model.
- `jina-embedding-l-en-v1` **[Huggingface](https://huggingface.co/jinaai/jina-embedding-l-en-v1)**: This is a relatively large model with 330 million parameters that runs inference on a single GPU and delivers better performance than our other models.

## Usage

```python
import finetuner

# load the pre-trained embedding model by name
model = finetuner.build_model('jinaai/jina-embedding-s-en-v1')

# encode two sentences into embedding vectors
embeddings = finetuner.encode(
    model=model,
    data=['how is the weather today', 'What is the current weather like today?']
)

# print the cosine similarity between the two embeddings
print(finetuner.cos_sim(embeddings[0], embeddings[1]))
```
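
Since information retrieval is one of the use cases listed above, the same two calls extend naturally to a small retrieval loop. Below is a minimal sketch, assuming `encode` returns one embedding vector per input string (in order) and `cos_sim` returns a scalar similarity score; the corpus and query strings are purely illustrative.

```python
import finetuner

# rank a tiny corpus against a single query by cosine similarity
model = finetuner.build_model('jinaai/jina-embedding-s-en-v1')

corpus = [
    'The weather is sunny and warm today.',
    'Stock markets closed higher on Friday.',
    'Rain is expected later this evening.',
]
query = 'how is the weather today'

# encode the corpus and the query with the same model
corpus_embeddings = finetuner.encode(model=model, data=corpus)
query_embedding = finetuner.encode(model=model, data=[query])[0]

# score each document against the query, highest similarity first
scored = sorted(
    ((float(finetuner.cos_sim(query_embedding, emb)), doc)
     for emb, doc in zip(corpus_embeddings, corpus)),
    reverse=True,
)
for score, doc in scored:
    print(f'{score:.3f}  {doc}')
```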

## Training Data

Jina Embeddings is a suite of language models that have been trained using Jina AI's Linnaeus-Clean dataset.
This dataset consists of 380 million query-document sentence pairs, drawn from a variety of domains and carefully selected through a thorough cleaning process.
The Linnaeus-Full dataset, from which the Linnaeus-Clean dataset is derived, originally contained 1.6 billion sentence pairs.

## Characteristics

Each Jina embedding model can encode up to 512 tokens,
with any further tokens being truncated.
The models have different output dimensionalities, as shown in the table below:

| Name                   | Parameters | Context length (tokens) | Output dimension |
|------------------------|------------|-------------------------|------------------|
| jina-embedding-s-en-v1 | 35M        | 512                     | 512              |
| jina-embedding-b-en-v1 | 110M       | 512                     | 768              |
| jina-embedding-l-en-v1 | 330M       | 512                     | 1024             |
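
A quick way to sanity-check the dimensions above is to inspect an embedding directly. The snippet below is a small sketch, assuming as in the usage example that `encode` returns array-like vectors that support `len()`:

```python
import finetuner

# verify the output dimensionality listed in the table
model = finetuner.build_model('jinaai/jina-embedding-b-en-v1')
embeddings = finetuner.encode(model=model, data=['hello world'])
print(len(embeddings[0]))  # expected: 768 for the base model
```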

## Performance

Please refer to the [Huggingface](https://huggingface.co/jinaai/jina-embedding-s-en-v1) page.
1 change: 1 addition & 0 deletions docs/index.md
@@ -26,6 +26,7 @@

get-started/how-it-works
get-started/installation
get-started/pretrained
walkthrough/index
```
