-
Notifications
You must be signed in to change notification settings - Fork 140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Validate zero vector when using cosine metric #1501
Validate zero vector when using cosine metric #1501
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1501 +/- ##
=========================================
Coverage 85.16% 85.16%
- Complexity 1291 1300 +9
=========================================
Files 169 171 +2
Lines 5258 5293 +35
Branches 498 505 +7
=========================================
+ Hits 4478 4508 +30
- Misses 570 574 +4
- Partials 210 211 +1 ☔ View full report in Codecov by Sentry. |
Thanks @bugmakerrrrrr could you update the Changelog as well? https://github.com/opensearch-project/k-NN/blob/main/CHANGELOG.md |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good to me. Thanks for this! Will wait to approve on changelog verification
*/ | ||
public static boolean isZeroVector(byte[] vector) { | ||
for (byte e : vector) { | ||
if (e != (byte) 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
qq: Do we need 0 cast?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not necessary, it's just my habit. I can remove it.
e4abb8e
to
9b86b35
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thanks for the change!
@bugmakerrrrrr please add the successful indexing and query I/O either in description or reply to this message to provide information what is the experience now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thank you
new KNNVectorFieldType(buildFullName(context), metaValue, -1, knnMethodContext, modelIdAsString), | ||
new KNNVectorFieldType(buildFullName(context), metaValue, -1, modelIdAsString), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we are making this change? and is this required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
knnMethodContext
is always null for ModelFieldMapper
, I believe that this change will make it more clear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend keeping it as is. This will ensure that changes are minimal for this PR.
import static org.opensearch.knn.index.mapper.KNNVectorFieldMapperUtil.validateByteVector; | ||
import static org.opensearch.knn.index.mapper.KNNVectorFieldMapperUtil.validateByteVectorValue; | ||
import static org.opensearch.knn.index.mapper.KNNVectorFieldMapperUtil.validateFloatVector; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems these utility class should be renamed or these functions which are shared between mapper and Query should be put in KNNValidationUtil to ensure that these functions can be shared between Query and Mapper classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sense. I can pull these out into a separate KNNValidationUtil
class
public static void validateByteVector(byte[] vector, SpaceType spaceType) { | ||
if (spaceType == SpaceType.COSINESIMIL && isZeroVector(vector)) { | ||
throw new IllegalArgumentException( | ||
String.format(Locale.ROOT, "zero vector is not supported when space type is [%s]", spaceType.getValue()) | ||
); | ||
} | ||
} | ||
|
||
/** | ||
* Validate if the given float vector is supported by the given space type | ||
* | ||
* @param vector the given vector | ||
* @param spaceType the given space type | ||
*/ | ||
public static void validateFloatVector(float[] vector, SpaceType spaceType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we make these parts of SpaceType enum class? Also, does these validations true for all engines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the zero vector validation is same for all engines, but maybe not true for other validations introduced in the future. For example, the dot product
metric in Lucene requires all vectors must be normalized, but Faiss doesn't, so I tend to keep these part in this class, any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if its true for all engines that it makes more sense to add as part of enum class.
package org.opensearch.knn.common; | ||
|
||
public class KNNVectorUtil { | ||
private KNNVectorUtil() {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use @NoArgsConstructor(access = Access.PRIVATE)
public static boolean isZeroVector(byte[] vector) { | ||
for (byte e : vector) { | ||
if (e != 0) { | ||
return false; | ||
} | ||
} | ||
return true; | ||
} | ||
|
||
/** | ||
* Check if all the elements of a given vector are zero | ||
* | ||
* @param vector the vector | ||
* @return true if yes; otherwise false | ||
*/ | ||
public static boolean isZeroVector(float[] vector) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets validate that these arrays are no null
|
||
if (fieldDimension == -1) { | ||
// If dimension is not set, the field uses a model and the information needs to be retrieved from there | ||
assert spaceType == null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Putting assert here is dangerous as this will lead to exception messages not understood by users. Lets make sure that we are making sure proper exceptions are thrown here.
-
Due to this change the comment location have been moved. Lets ensure the comments is at right position.
@@ -308,6 +314,9 @@ protected Query doToQuery(QueryShardContext context) { | |||
validateByteVectorValue(vector[i]); | |||
byteVector[i] = (byte) vector[i]; | |||
} | |||
validateByteVector(byteVector, spaceType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do this validation while reading the vectors in above loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we cannot reuse the validation function or short circuit like VectorUtil.isZeroVector
if we implement it in the loop. I prefer not to do so
@bugmakerrrrrr can we add a BWC test where we try to add 0 vectors in older version and then zero vectors in upgraded version. Same for the query too. |
why do we need a BWC test? I think this a breaking change |
if this is a breaking change, then what are the steps that users needs to take so that they can safely upgrade? cc: @jmazanec15 |
If the user upgrades the cluster, any request containing a zero vector will be rejected and fail when using cosine metric. And during the upgrade process, requests containing zero vectors will succeed on the old nodes but will be rejected on the upgraded nodes. If the user uses non-zero vectors under cosine metric, there will be no impact. However, if the user uses zero vectors, it may be considered as an abuse or misconfiguration, and request failure may be acceptable. |
I agree with @bugmakerrrrrr . From #1461 (comment), the error is:
The In terms of BWC, if a shard containing a 0 vector is moved to a new node, it wont fail to migrate - we dont check. It will only fail if a user attempts to reindex. That being said, I am not concerned about breaking BWC. |
So what we are suggesting here is, if a user has add a zero vector already in the index after upgrade we are okay to break the customer search query as zero vector has potential to go with divide by 0 exceptions. |
@bugmakerrrrrr Code looks good to me. Please resolve the conflicts in Changelog.md file. |
We would fail the query only if query vector is 0 (not if index vectors are 0). If it is 0, it will have meaningless results anyway. |
Makes sense. @jmazanec15 we should have a documentation issue also created for this. We need to capture this detail that providing zero vector for Cosine space type will start throwing exception going forward. |
Signed-off-by: panguixin <panguixin@bytedance.com>
Signed-off-by: panguixin <panguixin@bytedance.com>
Signed-off-by: panguixin <panguixin@bytedance.com>
Signed-off-by: panguixin <panguixin@bytedance.com>
ff7137f
to
7635cc6
Compare
@navneet1v added doc issue: opensearch-project/documentation-website#6678. |
The backport to
To backport manually, run these commands in your terminal: # Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-1501-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 b7bdda4f7d05b3e3e4248020ce01b2c888359a7e
# Push it to GitHub
git push --set-upstream origin backport/backport-1501-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x Then, create a pull request where the |
Ensure zero vector is not used when using functionality with cosine similarity metric. Signed-off-by: panguixin <panguixin@bytedance.com> (cherry picked from commit b7bdda4)
Ensure zero vector is not used when using functionality with cosine similarity metric. Signed-off-by: panguixin <panguixin@bytedance.com> (cherry picked from commit b7bdda4) Signed-off-by: John Mazanec <jmazane@amazon.com>
Description
Validate the vector passed by the user during querying or indexing, if the vector is zero vector and the space type of index is cosine, an exception will be thrown.
Examples
Issues Resolved
#1461
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.