Unique check #820

Closed · wants to merge 3 commits into main from rockisch:unique-check

Conversation

@rockisch commented Jul 25, 2023

Change Summary

Implements 'unique' validation for set, list, tuple-variable, and tuple-positional, so that something along the lines of unique_items in v1 can be implemented again in v2.
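
For illustration, usage might look something like this (a sketch only; whether the option is spelled unique=True on list_schema is an assumption, not confirmed here):

from pydantic_core import SchemaValidator, core_schema

# Hypothetical flag name: assumes this PR exposes `unique` on list_schema.
v = SchemaValidator(core_schema.list_schema(core_schema.int_schema(), unique=True))
v.validate_python([1, 2, 3])  # passes
v.validate_python([1, 2, 2])  # would raise a ValidationError for the duplicate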

Initially I planned to implement this mainly using an AHashSet, only resorting to sorting the values for small lists, but sorting turned out to be more efficient in every single case. Here's a repo where I benchmarked both solutions against a few cases:
https://github.com/rockisch/unique-check-bench

The implementation aggregates values by __hash__, and then compares each of them with __eq__. This should be fine, given the description of __hash__ in Python's documentation [1]:

The only required property is that objects which compare equal have the same hash value
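
In Python terms, the strategy is roughly the following (a simplified sketch; the real code is Rust operating on PyObject hashes):

from itertools import groupby

def is_unique(items):
    # Sort by hash so that any candidate duplicates land in the same run:
    # objects that compare equal are required to have equal hashes.
    keyed = sorted(((hash(x), x) for x in items), key=lambda p: p[0])
    for _, run in groupby(keyed, key=lambda p: p[0]):
        group = [x for _, x in run]
        # Within an equal-hash run, fall back to __eq__, since unequal
        # objects may legally share a hash.
        for i, a in enumerate(group):
            if any(a == b for b in group[i + 1:]):
                return False
    return True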

Something that I am not currently doing, but could be implemented, is to aggregate all non-hashable values under a single hash (say, -1) in the algorithm, so they get compared against all other non-hashable values. This should have little impact on iterators over hashable types (list[str]), but would allow iterators over BaseClass subclasses to be unique-checked, although in a much less efficient manner. I feel like asking the user to implement __hash__ themselves is probably the better approach, but it is a possibility.
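
That fallback could look roughly like this (a sketch; the -1 sentinel is the value suggested above):

def bucket_key(x):
    # Hashable values get their real hash; unhashable values all share one
    # sentinel bucket, so each gets compared pairwise via __eq__.
    try:
        return hash(x)
    except TypeError:
        return -1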

This is kind of my first contribution to a 'big' open-source project, so let me know if there's anything I missed. Also, I'm not sure if I should already ask for reviewers or wait for someone to take a look at this first.

Related issue number

fix #296

Checklist

  • Unit tests for the changes exist
  • Documentation reflects the changes where applicable
  • Pydantic tests pass with this pydantic-core (except for expected changes)
  • My PR is ready to review, please add a comment including the phrase "please review" to assign reviewers

@codecov bot commented Jul 25, 2023

Codecov Report

Merging #820 (118c5c3) into main (f5ef7af) will increase coverage by 0.02%.
Report is 4 commits behind head on main.
The diff coverage is 98.37%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #820      +/-   ##
==========================================
+ Coverage   93.48%   93.51%   +0.02%     
==========================================
  Files         102      102              
  Lines       14606    14689      +83     
  Branches       25       25              
==========================================
+ Hits        13655    13736      +81     
- Misses        945      947       +2     
  Partials        6        6              
Files Changed                          Coverage           Δ
src/input/mod.rs                       100.00% <ø>        (ø)
src/errors/types.rs                    84.14%  <75.00%>   (-0.10%) ⬇️
src/input/return_enums.rs              86.34%  <98.96%>   (+1.26%) ⬆️
python/pydantic_core/core_schema.py    96.72%  <100.00%>  (+0.02%) ⬆️
src/validators/frozenset.rs            100.00% <100.00%>  (ø)
src/validators/list.rs                 100.00% <100.00%>  (ø)
src/validators/set.rs                  100.00% <100.00%>  (ø)
src/validators/tuple.rs                94.95%  <100.00%>  (+0.14%) ⬆️

... and 1 file with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@codspeed-hq bot commented Jul 25, 2023

CodSpeed Performance Report

Merging #820 will not alter performance

Comparing rockisch:unique-check (118c5c3) with main (f5ef7af)

Summary

✅ 135 untouched benchmarks

Comment on lines 164 to 168
let mut set = AHashSet::with_capacity($data.len());
for $item in $data.iter() {
    set.insert($item_py.hash()?);
}
set.len() != $data.len()
@adriangb (Member) commented Jul 25, 2023

This is not safe. You can't just compare the hash.

@rockisch (Author) replied:

Hmm, as I mentioned in the PR comment, I think taking Python's notion of uniqueness - which just compares hashes - would be an OK behavior. Otherwise, a hypothetical validation of conset(unique=True) over the value [1, True] would pass, but would return {1}, which just seems weird.

@rockisch (Author) commented Jul 25, 2023

That said, I could take type into consideration, which would result in the above example passing, but that would be weird IMO.

A Collaborator replied:

What do you mean by "Python's notion of uniqueness - which just compares hashes"? Two items are not considered the same just because they have the same hash — they are also checked for equality if they do have the same hash.

The Collaborator added:

Code example demonstrating:

class A:
    def __hash__(self):
        return 0


class B:
    def __hash__(self):
        return 0


my_dict = {A(): 1, B(): 2, 0: 3}

for k, v in my_dict.items():
    print(hash(k), k, v)

"""
0 <__main__.A object at 0x1052a1d90> 1
0 <__main__.B object at 0x1052a1dd0> 2
0 0 3
"""

@rockisch (Author) replied:

Ah, I see what you mean, yeah, my bad. I'll change the code to reflect that.

The Collaborator added:

Also, even having the same hash and the same type does not guarantee equality:

...
my_dict = {A(): 1, B(): 2, 0: 3, A(): 4}

for k, v in my_dict.items():
    print(hash(k), k, v)

"""
0 <__main__.A object at 0x1052a1d90> 1
0 <__main__.B object at 0x1052a1dd0> 2
0 0 3
0 <__main__.A object at 0x100945e10> 4
"""

@rockisch (Author) replied:

Yes, I know, __eq__ can also be overridden anyway. I thought the 'this is not safe' comment was mainly related to Python treating a few different types as equal (True == 1), but I get it now. I already have an idea of how I can fix this; I'll try to work on it later today.

A Contributor commented:

Btw, as far as I normally think of "uniqueness" in Python, 1 and True are equivalent, even if it seems a little counterintuitive.

>>> {1, True}
{1}

I would wager the only way this can be implemented correctly is by using Python hashing + equality.

@rockisch (Author) commented:

Sorry for initially sending the PR without considering __eq__; it totally escaped my mind. It should now be implemented properly in the last commit.

This will still consider 1 and True the same (given 1 == True), but now we do check __eq__ before saying a list is non-unique. That said, __hash__ is still used for optimization. This should be fine, given the description of __hash__ in Python's documentation [1]:

The only required property is that objects which compare equal have the same hash value

Something that I am not currently doing, but could be implemented, is to aggregate all non-hashable values under a single hash (say, -1) in the algorithm, so they get compared against all other non-hashable values. This should have little impact on iterators over hashable types (list[str]), but would allow iterators over BaseClass subclasses to be unique-checked, although in a much less efficient manner. I feel like asking the user to implement __hash__ themselves is probably the better approach, but it is a possibility.

@adriangb (Member) commented:

Is there a huge advantage over doing something like UniqueList = Annotated[List[T], AfterValidator(lambda x: list(set(x)))]?

@rockisch (Author) commented Jul 28, 2023

Is there a huge advantage over doing something like UniqueList = Annotated[List[T], AfterValidator(lambda x: list(set(x)))]?

As per #296, the idea of unique is to reject non-unique lists, meaning it would be more akin to:

def _assert_unique(x): assert len(set(x)) == len(x); return x  # stand-in: assert is not valid inside a lambda
UniqueList = Annotated[List[T], AfterValidator(_assert_unique)]

The main advantage of doing this in Rust would be efficiency, given set has some extra overhead. I could run a few benchmarks to compare doing this in Rust vs in Python.
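
On the Python side, such a benchmark could be as simple as the following (a sketch; the input shape is illustrative, and comparing against Rust would require a build of this PR):

import timeit

data = list(range(10_000))  # all-unique input; a real benchmark should also vary duplicate ratios

def unique_via_set(x):
    return len(set(x)) == len(x)

print(timeit.timeit(lambda: unique_via_set(data), number=1_000))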

@adriangb (Member) commented:

You're going to have to build a set-like thing somewhere. And call Python hash functions, which I'm guessing is the slowest part. Given the niche use case and the complexity of this implementation, I don't think it's worth adding this to pydantic-core, sorry.

@adriangb closed this Jul 29, 2023
@rockisch (Author) commented Jul 29, 2023

You're going to have to build a set-like thing somewhere

That's not really true: the whole idea of the algorithm is to do this check without building a Python set.

That said, I do agree it's a bit niche. Can the related ticket (#296) get closed then?

@grabov commented Aug 8, 2023

@adriangb is there any other way to:

  • Keep the order of items as they were provided
  • Mark this field as unique=True in the schema
  • Keep unique validation logic

In my opinion, ordering is quite an important thing in many cases, so you can't just replace list[...] with set[...] in every case. It would be better to let the developer choose "fast" or "with ordering" for a particular case.

@adriangb (Member) commented Aug 8, 2023

The solution above using an AfterValidator to check for uniqueness can be combined with a WithJsonSchema annotation to set unique=True in the schema.

@adriangb (Member) commented Aug 9, 2023

Try this:

from typing import Annotated, Hashable, List, TypeVar

from pydantic_core import PydanticCustomError

from pydantic import AfterValidator, Field, ValidationError
from pydantic.type_adapter import TypeAdapter

T = TypeVar('T', bound=Hashable)

def _validate_unique_list(v: list[T]) -> list[T]:
    if len(v) != len(set(v)):
        raise PydanticCustomError('unique_list', 'List must be unique')
    return v

UniqueList = Annotated[List[T], AfterValidator(_validate_unique_list), Field(json_schema_extra={'uniqueItems': True})]


ta = TypeAdapter(UniqueList[int])
assert ta.validate_python([1, 2, 3]) == [1, 2, 3]
try:
    ta.validate_python([1, 2, 2])
except ValidationError as e:
    assert e.errors() == [{'type': 'unique_list', 'loc': (), 'msg': 'List must be unique', 'input': [1, 2, 2]}]
else:
    assert False, 'should have raised'

assert ta.json_schema() == {'items': {'type': 'integer'}, 'type': 'array', 'uniqueItems': True}

@caniko commented Nov 15, 2023

Should UniqueList, UniqueTuple, UniqueDeque, etc. be in Pydantic?

@ArazHeydarov commented:

I don't know if this is the right place to ask this question. I looked through many PRs about this unique property, both in v1 and v2, but all the solutions depend on the equality of the objects (or their hashes) inside the list. I wonder how we can enforce uniqueness of the objects inside the list based on only one or more fields.

For example:

from typing import List
from pydantic import BaseModel, EmailStr, Field

class User(BaseModel):
    email: EmailStr
    username: str
    name: str


class Form(BaseModel):
    users: List[User] = Field(unique_field=['email', 'username'])

What is the cleanest way to implement such a Form model, where User.email and User.username should be unique across the list?
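
Extending the AfterValidator pattern from earlier in the thread, one possible answer looks like this (a sketch; _validate_unique_fields and the 'unique_list' error type are illustrative names, not an existing API):

from typing import Annotated, List

from pydantic import AfterValidator, BaseModel, EmailStr  # EmailStr needs the email-validator extra
from pydantic_core import PydanticCustomError


class User(BaseModel):
    email: EmailStr
    username: str
    name: str


def _validate_unique_fields(users: List[User]) -> List[User]:
    # Reject the list if any email or username value repeats.
    for field in ('email', 'username'):
        values = [getattr(u, field) for u in users]
        if len(values) != len(set(values)):
            raise PydanticCustomError('unique_list', f'User.{field} must be unique')
    return users


class Form(BaseModel):
    users: Annotated[List[User], AfterValidator(_validate_unique_fields)]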

@rlthompson-godaddy commented:

Interestingly, the recipe from #820 (comment) causes the ValidationError message to change when you use something like:

Annotated[UniqueList[Path] | None, Field(min_length=1)]
...
Value should have at least 1 item after validation, not 0 [type=too_short, input_value=[], input_type=list]

vs:

Annotated[list[Path] | None, Field(min_length=1)]
...
List should have at least 1 item after validation, not 0 [type=too_short, input_value=[], input_type=list]


Successfully merging this pull request may close these issues.

Add unique constraint to list? (#296)
8 participants