
docs: fix semantic similarity description (cross-encoder -> bi-encoder) #1910


Merged

Conversation

Ayaka-mogumogu
Contributor

This PR updates the documentation to correctly describe the semantic similarity metric.

Issue

The documentation previously stated that a cross-encoder was used for computing the semantic similarity score. However, after reviewing the implementation, it is clear that the current approach follows a bi-encoder strategy:

  • The ground truth and response are encoded independently
  • Their embeddings are then compared using cosine similarity

A cross-encoder would typically process both texts together in a single forward pass (e.g., concatenating them before encoding), which is not the case in the current implementation.
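To make the distinction concrete, here is a toy sketch of the two call patterns. The encoder below is a hypothetical stand-in, not the actual ragas embeddings API; only the shape of the computation matters.

```python
import numpy as np

def toy_encode(text: str) -> np.ndarray:
    """Toy stand-in for an embedding model: hash characters into a vector."""
    vec = np.zeros(8)
    for i, ch in enumerate(text):
        vec[i % 8] += ord(ch)
    return vec

def bi_encoder_score(a: str, b: str) -> float:
    # Bi-encoder pattern: encode each text independently,
    # then compare the two embeddings (here, cosine similarity).
    e1, e2 = toy_encode(a), toy_encode(b)
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def cross_encoder_score(a: str, b: str) -> float:
    # Cross-encoder pattern: both texts go through a single forward
    # pass together (e.g., concatenated with a separator); the model
    # outputs a relevance score directly. The scoring here is a
    # placeholder for a learned scoring head.
    joint = toy_encode(a + " [SEP] " + b)
    return float(np.tanh(joint.mean() / 100.0))
```

In the bi-encoder pattern the texts never see each other; only their embeddings interact. In the cross-encoder pattern the model attends over the pair jointly, which is what the old documentation implied.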

Current Implementation

For example, in the current implementation:

import numpy as np

embedding_1 = np.array(await self.embeddings.embed_text(ground_truth))
embedding_2 = np.array(await self.embeddings.embed_text(answer))
# Normalization factors of the above embeddings
norms_1 = np.linalg.norm(embedding_1, keepdims=True)
norms_2 = np.linalg.norm(embedding_2, keepdims=True)
embedding_1_normalized = embedding_1 / norms_1
embedding_2_normalized = embedding_2 / norms_2
similarity = embedding_1_normalized @ embedding_2_normalized.T
score = similarity.flatten()

This code shows that the ground truth and response are encoded separately, and their similarity is computed using cosine similarity, which is characteristic of a bi-encoder approach.
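A minimal, runnable version of the same computation, using synthetic vectors in place of real `embed_text` output, shows why this is cosine similarity:

```python
import numpy as np

# Synthetic embeddings standing in for the embed_text() results.
embedding_1 = np.array([0.2, 0.1, 0.9])  # "ground truth" embedding
embedding_2 = np.array([0.2, 0.1, 0.9])  # identical "answer" embedding

# Normalize each vector independently, then take the dot product:
# dot product of unit vectors == cosine of the angle between them.
e1 = embedding_1 / np.linalg.norm(embedding_1)
e2 = embedding_2 / np.linalg.norm(embedding_2)
similarity = float(e1 @ e2)

print(similarity)  # identical vectors -> 1.0
```

Because each text is embedded on its own and the interaction happens only at this final dot product, the pipeline is a bi-encoder by construction.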

Fix

The term "cross-encoder" has been corrected to "bi-encoder" in the documentation to ensure consistency with the actual implementation.

@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Feb 9, 2025
Member

@shahules786 left a comment

Nice catch, thank you. We changed from using a cross-encoder to a bi-encoder but forgot to update the docs!

@shahules786 shahules786 merged commit dcfd58b into explodinggradients:main Feb 14, 2025
6 checks passed