-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal for vector db semantic convention #1231
base: main
Are you sure you want to change the base?
Conversation
|
Thanks for creating this PR. A few additional attributes which we instrument today with our SDK that we have found useful are the following:
Thoughts on the ones listed above? cc @lmolkova |
@karthikscale3 regarding the attributes that you proposed, some already exists:
The
Moreover, I found very interesting the proposal of OpenLLMetry project especially the part regarding the attributes for vector db, here:
|
@lmolkova I applied all the feedbacks, thanks for the review. @karthikscale3 I added the Summary of the changes:
|
Yea that sounds good! And yes, my intention was to reuse the existing ones. Wasn't sure if we needed them redefined for the sake of vector dbs or not. But sounds like its unnecessary. |
Thank you! From my side, everything looks good. We discussed this PR in today's working group call and @nirga wanted to take a deeper look at it once again. |
I fixed the merge issues. Thanks @trask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that's a great start! I wonder if we want to add specific spans that use these attributes in this PR as well?
@nirga can you give me an example of specific span? FYI, I'm going offline and I'll come back August 4 for further discussion. |
Sorry, nvm I think this is already covered as part of the DB semconv |
docs/attributes-registry/db.md
Outdated
@@ -199,6 +200,28 @@ This group defines attributes for Elasticsearch. | |||
|
|||
**[8]:** Many Elasticsearch url paths allow dynamic values. These SHOULD be recorded in span attributes in the format `db.elasticsearch.path_parts.<key>`, where `<key>` is the url path part name. The implementation SHOULD reference the [elasticsearch schema](https://raw.githubusercontent.com/elastic/elasticsearch-specification/main/output/schema/schema.json) in order to map the path part values to their names. | |||
|
|||
## Db Vector Attributes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Vector Database Attributes
@ezimuel looks like you missed some of the changes you marked as resolved:
|
model/registry/db.yaml
Outdated
brief: > | ||
The model used for the embedding. | ||
examples: 'text-embedding-3-small' | ||
- id: query.top_k |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should come up with a more common attribute not specific to vector dbs.
Many databases allow to limit number of returned rows:
- JDBC has Statement.setMaxRows,
- Mongo allows to set a limit
Suggesting db.query.max_returned_items
. The actual returned count could be even better - db.query.item_count
could mean items inserted or returned depending on the operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lmolkova I see the similarity here but I think the db.vector.query.top-k
is more specific from a semantic point of view and more related to vectors, since it specifies the top k results in order, starting from the most similar. In semantic search we have this similarity value that is always present in any result that we don't have in standard database. The limit
parameter of SQL returns the first k results but not in order, it depends on how you build the query (e.g. using ORDER BY
).
I personally think we should keep db.vector.query.top-k
and potentially add a db.query.limit
(or db.query.max_returned_items
as you suggested) in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of the few dbs I checked, they use limit
in vector search
- pgvector uses traditional
limit
- same with cosmos and other sql databases - mongo uses imit
- qdrant uses limit
So we're saying that DB instrumentations will need to detect if query is related to vector search or not and depending on this populate top-k or limit. That's difficult or impossible, but most importantly inconsistent and depends on instrumentation capabilities.
I.e. instrumentations that don't have vector-db specifics and those that do will use different attributes for the same thing.
So, I'd still prefer db.query.limit
or something similar (and it should be under the same condition as db.query.text
- we cannot require instrumentations to do query parsing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lmolkova do you agree that top-k
and limit
are two different concepts, based on my previous comment? If they are I think we cannot use a single attribute (e.g. db.query.limit
) to manage both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lmolkova just a reminder for this, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ezimuel I see that databases use both terms to describe the same thing (see my comment above).
Let's say you have a postgres query like SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
- the general-purpose DB instrumentation can report limit
. If you make it understand vector search syntax, it may be able to use top_k
instead, but that's would be inconsistent and unfamiliar for those who use vector search in postgres.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lmolkova I see your point but I think top-k
has a different meaning from limit
. If you are using a relation database as vector db, limit
is fine since you are building an SQL statement and you specify the order. But, if you are using a native vector database (e.g. Qdrant), the top-k
is more relevant since top
implies the order, using a similarity metric.
I think we should add both:
db.query.limit
db.vector.query.top-k
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lmolkova I found that the SELECT LIMIT
is not part of the SQL standard. A statement compliant with SQL standard is FETCH FIRST
. Moreover, I discovered that we already have db.operation.parameter,<key>
in the semantic convention. This means, we can have db.operation.parameter.fetch-first
. I don't think we need to add a limit
or fetch-first
in db.query
. What do you think?
At the same time, I believe we should include db.vector.query.top-k
. As noted earlier, it differs in meaning from fetch-first
and, additionally, this parameter is much easier to retrieve in a vector database because it's not part of any statement.
We need to reference new attributes in the database spans conventions (see https://github.com/open-telemetry/semantic-conventions/blob/main/model/trace/database.yaml), specifically on the conventions for the databases we have there which support vector search. We should describe how new attributes apply to them. |
@ezimuel I'm sorry I did not reply earlier. I'm swamped with some work at the moment and, unfortunately, it might take me some time to reply. If you look into https://github.com/open-telemetry/semantic-conventions/tree/main/docs/database you'd see that we have some docs for individual database systems - we reference attributes from the registry and we explain how these attributes apply to this system (or don't apply). This is powered by the yaml in https://github.com/open-telemetry/semantic-conventions/tree/main/model/database. Please look at the existing database conventions and update them to include new attributes you're adding. |
@lmolkova thanks for the information and sorry also on my side for the late reply, very busy period. |
@lmolkova and @AlexanderWert I finally found some time to work on this PR. I’ve added the spans definition and updated the documentation. I hope everything looks good. Looking forward to your review. Thanks! |
@@ -764,3 +764,81 @@ groups: | |||
- ref: db.cosmosdb.regions_contacted | |||
requirement_level: | |||
conditionally_required: If available. | |||
|
|||
- id: span.db.vector.client |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
postgres, cosmos, mongodb, elasticsearch and others are vector databases too and we need to update their conventions to say how to record vector search capabilities for them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, but how about doing it in a follow up issue / PR, as it is additive information?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should introduce span.db.vector.client
span - it's not clear what/when/how will report it.
Also, referencing attributes in the specific conventions is a good test if these attributes are applicable and properly named/described to fit to specific db implementations.
E.g. db.vector.query.top_k
debate will become obvious on postgres - you can't have an attribute like this there and would have to call it db.query.limit
or similar.
It's ok to limit the scope to several DBs and not cover all possible details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should introduce
span.db.vector.client
span
Sorry, I don't get that. We also have a generic span.db.client
(which then in addition has technology specific overrides). Wouldn't it be the same thing here, so having a generic description for vector db client spans + in addition having technology-specific overrides for the DBs listed above.
it's not clear what/when/how will report it
agree, in the readme above we need more explicit / clearer guidance on when to use span.db.vector.client
vs. the more general span.db.client
, but we should do it in a technology-independent way first.
Also, referencing attributes in the specific conventions is a good test if these attributes are applicable and properly named/described to fit to specific db implementations.
...
It's ok to limit the scope to several DBs and not cover all possible details.
Makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @AlexanderWert . I think we should have a generic span.db.vector.client
for vector databases that do not have technology-specific conventions, like the generic sql clients.
@lmolkova I see the issue with db.vector.query.top_k
and db.query.limit
, I think I can remove the top-k
even if they are semantically different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't get that. We also have a generic span.db.client (which then in addition has technology specific overrides). Wouldn't it be the same thing here, so having a generic description for vector db client spans + in addition having technology-specific overrides for the DBs listed above.
The problem is that none of the databases we have in the semconv should be considered a vector databases then. The span.db.vector.client
is an abstract thing nothing is going to use. MongoDB, CosmosDB, elastic, postres - all of them are used as vector databases - will they extend span.db.vector.client
? Will generic instrumentations for them cover vector-related operations?
Also we're trying to introduce attributes without attempting to apply them to existing databases - this is the major blocker I see in this PR.
Co-authored-by: Alexander Wert <AlexanderWert@users.noreply.github.com>
Co-authored-by: Alexander Wert <AlexanderWert@users.noreply.github.com>
Co-authored-by: Alexander Wert <AlexanderWert@users.noreply.github.com>
Co-authored-by: Alexander Wert <AlexanderWert@users.noreply.github.com>
Co-authored-by: Alexander Wert <AlexanderWert@users.noreply.github.com>
Thanks @AlexanderWert for the feedback. |
|
||
| Value | Description | Stability | | ||
|---|---|---| | ||
| `cosine` | The cosine metric. | ![Experimental](https://img.shields.io/badge/-experimental-blue) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
- There are cosine similarity and cosine distance. Better to distinguish the two in the convention.
- There are a few common ones that are not listed here: Squared Euclidean and hamming.
|
||
**[11] `db.elasticsearch.path_parts`:** Many Elasticsearch url paths allow dynamic values. These SHOULD be recorded in span attributes in the format `db.elasticsearch.path_parts.<key>`, where `<key>` is the url path part name. The implementation SHOULD reference the [elasticsearch schema](https://raw.githubusercontent.com/elastic/elasticsearch-specification/main/output/schema/schema.json) in order to map the path part values to their names. | ||
`db.search.similarity_metric` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The similarity metric doesn't need to be defined per search request. It can also be defined when the collection is created.
|
||
| Attribute | Type | Description | Examples | Stability | | ||
|---|---|---|---|---| | ||
| <a id="db-vector-dimension-count" href="#db-vector-dimension-count">`db.vector.dimension_count`</a> | int | The dimension of the vector. | `3` | ![Experimental](https://img.shields.io/badge/-experimental-blue) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: db-vector-dimension-count
-> db-vector-dimension
.
| Attribute | Type | Description | Examples | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability | | ||
|---|---|---|---|---|---| | ||
| [`db.operation.name`](/docs/attributes-registry/db.md) | string | The operation to be performed on the vector database (e.g. build an index/collection) [1] | `build`; `insert`; `search`; `delete` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) | | ||
| [`http.request.method`](/docs/attributes-registry/http.md) | string | HTTP request method. [2] | `GET`; `POST`; `HEAD` | `Required` | ![Stable](https://img.shields.io/badge/-stable-lightgreen) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
http.request.method
should not be used on generic database span
This is a proposal for vector db semantic convention (see #936). I tried to expand the
db
semantic convention adding somedb.vector
attributes. I tried to focus on the basic needs of a general purpose vector database.I proposed the following experimental attributes (updated with the feedbacks in this PR):
The operations performed in a vector db, such as
insert
,update
,search
anddelete
can be performed using the existing db.operation.name attribute.Regarding the similarity search we can use the db.query attributes, such as
db.query.parameter.<key>
.