Unique check #820

Conversation
Codecov Report

```
@@           Coverage Diff            @@
##             main     #820    +/-   ##
==========================================
+ Coverage   93.48%   93.51%   +0.02%
==========================================
  Files         102      102
  Lines       14606    14689     +83
  Branches       25       25
==========================================
+ Hits        13655    13736     +81
- Misses        945      947      +2
  Partials        6        6
```

... and 1 file with indirect coverage changes
CodSpeed Performance Report

Merging #820 will not alter performance.
src/input/return_enums.rs (Outdated)

```rust
let mut set = AHashSet::with_capacity($data.len());
for $item in $data.iter() {
    set.insert($item_py.hash()?);
}
set.len() != $data.len()
```
This is not safe. You can't just compare the hash.
Hmm, as I mentioned in the PR comment, I think taking Python's notion of uniqueness, which just compares hashes, would be acceptable behavior. Otherwise, a hypothetical validation of `conset(unique=True)` over the value `[1, True]` would pass but would return `{1}`, which just seems weird.
That said, I could take type into consideration, which would result in the above example passing, but that would be weird IMO.
What do you mean by "Python's notion of uniqueness, which just compares hashes"? Two items are not considered the same just because they have the same hash; if their hashes do match, they are also compared for equality.
Code example demonstrating this:

```python
class A:
    def __hash__(self):
        return 0

class B:
    def __hash__(self):
        return 0

my_dict = {A(): 1, B(): 2, 0: 3}
for k, v in my_dict.items():
    print(hash(k), k, v)
"""
0 <__main__.A object at 0x1052a1d90> 1
0 <__main__.B object at 0x1052a1dd0> 2
0 0 3
"""
```
Ah, I see what you mean, yeah, my bad. I'll change the code to reflect that.
Also, even having the same hash and the same type does not guarantee equality:

```python
...

my_dict = {A(): 1, B(): 2, 0: 3, A(): 4}
for k, v in my_dict.items():
    print(hash(k), k, v)
"""
0 <__main__.A object at 0x1052a1d90> 1
0 <__main__.B object at 0x1052a1dd0> 2
0 0 3
0 <__main__.A object at 0x100945e10> 4
"""
```
Yes, I know, `__eq__` can also get overridden anyway. I thought the 'this is not safe' comment was mainly related to Python treating a few different types as equal (`True == 1`), but I get it now. I already have an idea of how I can fix this; I'll try to work on it later today.
Btw, as I normally think of "uniqueness" in Python, `1` and `True` are equivalent, even if it seems a little counterintuitive.

```python
>>> {1, True}
{1}
```

I would wager the only way this can be implemented correctly is by using Python hashing + equality.
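A minimal Python sketch of that hashing + equality check (illustrative only; `all_unique` is a hypothetical helper, not part of pydantic-core):

```python
from collections import defaultdict
from typing import Iterable

def all_unique(items: Iterable[object]) -> bool:
    buckets: dict[int, list[object]] = defaultdict(list)
    for item in items:
        h = hash(item)  # raises TypeError for unhashable items
        # Only items whose hashes collide are compared with ==.
        if any(item == seen for seen in buckets[h]):
            return False
        buckets[h].append(item)
    return True

assert all_unique([1, 2, 3])
assert not all_unique([1, True])  # hash(True) == hash(1) and 1 == True
```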
Sorry for initially sending the PR without considering ... This will still consider ...
Something that I am not currently doing, but could be implemented, is to aggregate all non-hashable values under a single hash (say, `-1`), so they get compared against all other non-hashable values.
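A sketch of that fallback (again hypothetical, not what the PR implements):

```python
from collections import defaultdict
from typing import Iterable

def all_unique_with_fallback(items: Iterable[object]) -> bool:
    buckets: dict[int, list[object]] = defaultdict(list)
    for item in items:
        try:
            h = hash(item)
        except TypeError:
            h = -1  # every non-hashable value shares this bucket
        # Items in the same bucket are compared pairwise with ==.
        if any(item == seen for seen in buckets[h]):
            return False
        buckets[h].append(item)
    return True

# Lists are unhashable, so they land in the shared bucket and are
# still caught as duplicates by the == comparison.
assert not all_unique_with_fallback([[1], [2], [1]])
assert all_unique_with_fallback([[1], [2], [3]])
```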
Is there a huge advantage over doing something like ... ?
As per #296, the idea of:

```python
UniqueList = Annotated[List[T], AfterValidator(lambda x: assert len(set(x)) == len(x))]
```

The main difference from doing this in Rust would be efficiency, given the Rust implementation avoids building a Python set.
You're going to have to build a set-like thing somewhere, and call Python hash functions, which I'm guessing is the slowest part. Given the niche use case and the complexity of this implementation, I don't think it's worth adding this to pydantic-core, sorry.
This is not really true? The whole idea of the algorithm is to do this check without building a Python set. That said, I do agree it's a bit niche. Can the related ticket get closed, then?
@adriangb is there any other way to: ...

In my opinion, ordering is quite an important thing in many cases, so you can't just replace `list` with `set`.
The solution above using an `AfterValidator` to check for uniqueness can be combined with a ...
Try this:

```python
from typing import Annotated, Hashable, List, TypeVar

from pydantic_core import PydanticCustomError

from pydantic import AfterValidator, Field, ValidationError
from pydantic.type_adapter import TypeAdapter

T = TypeVar('T', bound=Hashable)


def _validate_unique_list(v: list[T]) -> list[T]:
    if len(v) != len(set(v)):
        raise PydanticCustomError('unique_list', 'List must be unique')
    return v


UniqueList = Annotated[List[T], AfterValidator(_validate_unique_list), Field(json_schema_extra={'uniqueItems': True})]

ta = TypeAdapter(UniqueList[int])

assert ta.validate_python([1, 2, 3]) == [1, 2, 3]

try:
    ta.validate_python([1, 2, 2])
except ValidationError as e:
    assert e.errors() == [{'type': 'unique_list', 'loc': (), 'msg': 'List must be unique', 'input': [1, 2, 2]}]
else:
    assert False, 'should have raised'

assert ta.json_schema() == {'items': {'type': 'integer'}, 'type': 'array', 'uniqueItems': True}
```
I don't know if this is the right place to ask this question. I looked through many PRs about this unique property, both in v1 and v2, but all the solutions depend on the equality of the objects (or their hashes) inside the list. For example:

```python
from typing import List

from pydantic import BaseModel, EmailStr, Field


class User(BaseModel):
    email: EmailStr
    username: str
    name: str


class Form(BaseModel):
    users: List[User] = Field(unique_field=['email', 'username'])
```

What is the cleanest way to implement such a Form model, where User.email and User.username should be unique across the list?
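There is no built-in `unique_field` argument (the `Field(unique_field=...)` above is the asker's hypothetical API). One possible approach, sketched here with an `AfterValidator` and a hypothetical `_unique_fields` helper:

```python
from typing import Annotated, List

from pydantic import AfterValidator, BaseModel, ValidationError


class User(BaseModel):
    email: str  # EmailStr in the example above; plain str keeps this self-contained
    username: str
    name: str


def _unique_fields(users: List[User]) -> List[User]:
    # Enforce per-field uniqueness rather than whole-object equality.
    for field in ('email', 'username'):
        values = [getattr(u, field) for u in users]
        if len(values) != len(set(values)):
            raise ValueError(f'users must have unique {field!r} values')
    return users


class Form(BaseModel):
    users: Annotated[List[User], AfterValidator(_unique_fields)]


try:
    Form(users=[
        {'email': 'a@x.com', 'username': 'a', 'name': 'A'},
        {'email': 'a@x.com', 'username': 'b', 'name': 'B'},  # duplicate email
    ])
except ValidationError as e:
    assert 'unique' in str(e)
```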
Change Summary

Implements 'unique' validation for `set`, `list`, `tuple-variable` and `tuple-positional`, so something along the lines of `unique_items` in v1 can be implemented again in v2.

Initially I planned to implement this mainly using an `AHashSet`, only resorting to sorting the values for small lists, but it turned out that the sorting ended up more efficient in every single case. Here's a repo where I benchmarked both solutions against a few cases: https://github.com/rockisch/unique-check-bench
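In Python terms, the sorting-based check looks roughly like this (a sketch only; the actual implementation is in Rust, and `has_duplicates` is a hypothetical name):

```python
from itertools import groupby


def has_duplicates(items: list) -> bool:
    # Sort by hash so items with equal hashes form contiguous runs.
    # sorted() only compares the integer keys, so the items themselves
    # never need to be orderable. Real code would cache the hashes.
    by_hash = sorted(items, key=hash)
    for _, run in groupby(by_hash, key=hash):
        run = list(run)
        # Within a run, equal hashes are only *candidate* duplicates;
        # == makes the final call.
        for i, a in enumerate(run):
            if any(a == b for b in run[i + 1:]):
                return True
    return False


assert has_duplicates([1, True])       # hash collision and 1 == True
assert not has_duplicates(['a', 'b'])
```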
The implementation aggregates values over `__hash__`, and then compares each of them with `__eq__`. This should be fine, given the description of `__hash__` in Python's documentation [1].

Something that I am not currently doing, but could be implemented, is to aggregate all non-hashable values under a single hash (say, `-1`) in the algorithm, so they get compared against all other non-hashable types. This should have little impact on iterators over hashable types (`list[str]`), but would allow iterators over `BaseClass` subclasses to be unique-checked, although in a much less efficient manner. I feel like asking the user to implement `__hash__` themselves is probably the better approach, but it is a possibility.

This is kind of my first contribution to a 'big' open-source project, so let me know if there's anything I missed. Also, I'm not sure if I should already ask for reviewers or wait for someone to take a look at this first.
Related issue number
fix #296
Checklist

- Pydantic tests pass with this pydantic-core (except for expected changes)