Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORM: Use pydantic to specify a schema for each ORM entity #6255

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Jan 18, 2024

No description provided.

@sphuber sphuber force-pushed the feature/orm-pydantic-models branch 3 times, most recently from 3848afd to c253259 Compare January 22, 2024 12:53
@sphuber sphuber force-pushed the feature/orm-pydantic-models branch 2 times, most recently from 49379d9 to d1659b4 Compare March 12, 2024 07:00
@sphuber sphuber force-pushed the feature/orm-pydantic-models branch 3 times, most recently from a7a6065 to 069bd1f Compare March 20, 2024 16:45
@sphuber sphuber force-pushed the feature/orm-pydantic-models branch 2 times, most recently from 655fcf0 to d778da8 Compare March 25, 2024 12:35
@sphuber sphuber force-pushed the feature/orm-pydantic-models branch from 07f3001 to 054c43c Compare March 27, 2024 21:55
@sphuber sphuber force-pushed the feature/orm-pydantic-models branch from 054c43c to 1c65856 Compare April 19, 2024 12:15
@sphuber sphuber force-pushed the feature/orm-pydantic-models branch from 7c7ef64 to 1e7f536 Compare April 19, 2024 12:58
@danielhollas
Copy link
Collaborator

@sphuber just wondering, have you checked how this affects performance?
Does it influences import aiida.orm at all?

mikibonacci added a commit to mikibonacci/aiida-atomistic that referenced this pull request May 8, 2024
This is implemented following the pydantic PR for aiida-core: aiidateam/aiida-core#6255
def serialize_computer(self, computer: Computer, _info):
return computer.label

# @model_validator(mode='after') # type: ignore[misc]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like some code that was forgotten

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking out this PR @agoscinski . For some background info: I commented this out on purpose for now. Ideally, this check should be there because when storing the code, this condition should be satisfied. Unfortunately, however, this now also prevents just instantiating this model that validates the check, even if we don't plan on storing it. This is problematic for the export behavior, because that would rely on instantiating a Model that violates this uniqueness rules, even though in this case it doesn't apply.

The old CLI verdi code create core.code.installed would also rely on something like this check to immediately return an error as soon as label was specified in interactive mode and that label already existed for the same computer. It would then reprompt the user asking for a different label. I haven't been able to reimplement this feature in this PR so was keeping the comment as a reminder.

Safe to say that this PR is not completely done yet 😅

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still "in progress"?

@@ -132,6 +132,17 @@ requires-python = '>=3.9'
'process.workflow.workchain' = 'aiida.orm.nodes.process.workflow.workchain:WorkChainNode'
'process.workflow.workfunction' = 'aiida.orm.nodes.process.workflow.workfunction:WorkFunctionNode'

[project.entry-points.'aiida.orm']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really necessary to make orm pluggable, what is the use case of it?

@sphuber sphuber force-pushed the feature/orm-pydantic-models branch from a7c6c13 to 9816200 Compare July 8, 2024 07:27
Copy link

codecov bot commented Jul 8, 2024

Codecov Report

Attention: Patch coverage is 94.48276% with 24 lines in your changes missing coverage. Please review.

Project coverage is 77.97%. Comparing base (ef60b66) to head (d4c4d94).
Report is 110 commits behind head on main.

Files Patch % Lines
src/aiida/orm/entities.py 89.10% 6 Missing ⚠️
src/aiida/orm/comments.py 70.59% 5 Missing ⚠️
src/aiida/orm/authinfos.py 86.96% 3 Missing ⚠️
src/aiida/orm/fields.py 81.25% 3 Missing ⚠️
src/aiida/common/pydantic.py 80.00% 2 Missing ⚠️
src/aiida/orm/nodes/node.py 93.11% 2 Missing ⚠️
src/aiida/cmdline/groups/dynamic.py 87.50% 1 Missing ⚠️
src/aiida/orm/groups.py 94.12% 1 Missing ⚠️
src/aiida/orm/nodes/data/singlefile.py 93.34% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6255      +/-   ##
==========================================
+ Coverage   77.51%   77.97%   +0.47%     
==========================================
  Files         560      566       +6     
  Lines       41444    42279     +835     
==========================================
+ Hits        32120    32964     +844     
+ Misses       9324     9315       -9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@khsrali
Copy link
Contributor

khsrali commented Jul 11, 2024

It would be nice to have some benchmarks,
Just to see if these changes would slow down AiiDA and if, how much. So to evaluate pros and cons.

agoscinski pushed a commit to aiidateam/aiida-atomistic that referenced this pull request Jul 17, 2024
This is implemented following the pydantic PR for aiida-core: aiidateam/aiida-core#6255
mikibonacci added a commit to aiidateam/aiida-atomistic that referenced this pull request Jul 18, 2024
* Introducing the kind checks.
In principle, we can also not define the kinds, in that case it is up to the plugin
to use the get_kinds() method of the StructureData to define kinds.

* Starting to migrate to mkdocs

* StructureData developer guide,
missing: PropertyCollector and properties.

* First profiling for the atomistic vs orm StructureData generation.
Need more profiling for low-atoms systems, where it seems orm is better

* Comment on profiling

* Additional notes on profilings

* Adding a last scaling test, to be commented and inserted in the github issue

* Adding the automatic kind generation.
Missing: deactivate this automatism.

* Adding the `allow_kinds` input in the StructureData init.
used to deactivate the automatic kind generation.

* Adding the `to_legacy_structuredata` method in atomistc.StructureData

This is provided in temporary way in order to make easier to
migrate the plugin gradually to this new Data.

In a first step, we should use this method anytime the plugin uses
the StructureData. Then add the properties setting.

* - Adding the developer docs
- Adding some bugfixing for the empty StructureData init.

* Added user guide with examples

* Fixing #16

* This solves issue #10

This is implemented following the pydantic PR for aiida-core: aiidateam/aiida-core#6255

* New implementation, following the pydantic PR.

adding the `charges` properties, in the same way as we have the cell and
pbc: hard-coding the getter and setter methods.

For now, no data validation is there.

* Added the support to read charges also from ASE Atoms object.
Added the map_kinds method, which return the mapping for the kinds as
contained in the sites attribute.

* Reverting back to non-pydantic implementation (Just removing the Model
attribute from the StructureData class, nothing else changed)

Added also the charges in the get_ase method.

* Attaching the properties to the Sites.

- code refactored: now `StructureData` and `Site` classes are in separated files, as well as the utils; `Kind` class is removed
- `Site` class now contains the properties: `symbol`, `position`, `mass`, `charge`, `magnetization`, `kind_name`; `kind_name` should be removed. The properties are defined via the `property` decorator
- `from_ase`, `to_ase` methods are added
- `from_pymatgen`, `to_pymatgen` methods are added (**TOBE fixed to add also the magnetization**)
- `from_file`, `to_file` methods are added. They rely on ASE `io.read` and `io.write` functions.
- the `orm.StructureData` (the old one) now has a method `to_atomistic`
- the `utils.py` file contains also the `to_kinds` and `get_kinds` functions, to automatically generate kinds. The user should only use the `get_kinds` function. **TOBE refined: kind names, and remove the other kinds methods from the StructureData class.**
- the `StructureData` constructor now works in this way: we provide the pbc, cell. Then we should use the `append_atom` method afterwards. **TOBE discussed, I guess we cann add lists contaning site-wise value for properties**
- slicing of structure data: defined the __getitem__ method, to obtain a sliced structuredata instance from the initial one.
- TOBE added: `to_supercell` method.

* adding the possiblity to provide lists in the constructor BUT we are
gonna change this by providing the possibility to provide list of sites
(each of them is a dictionary), in the same format as they are
represented in the database

* Moving the get_kinds and _to_kinds (now a private method) into the
StructureData class.

* Removed kind related methods which now are no more needed in the
StructureData

* Utilities for cell and sites update

* New StructureData and StructureDataMutable

both inheriting from StructureDataCore.
Added properties:
- charge
- magmom

Relevant added methods:

- from/to ASE/Pymatgen
- to_legacy
- to_structuredata
- to_mutable_structuredata

Added single page for API tutorial.

Added tests for basic functionalities of both Structure Classes.

* Allowing also magmom to be provided as floats: they
are then stored as [magmom,0,0]

* Improving mutability and minor changes.

Mutability is improved by removing all the setter methods, both in
StructureDataCore and Site classes. Moreover, lists and tuples are
returned as np.array with flags.writeable=False, meaning that we cannot
even modify the internal arrays/lists.

Other minor changes:

- Now in StructureDataCore we have to define the pbc, cell and sites
keywords to initialise it (and so the other subclasses), this is
helpful to then document the inputs.
- `get_kinds` routine fixed to work also with magmoms (arrays)
- docu and tests updated with respect to these changes.

* Adding type hinting and minor fixes

removing mutable attribute in Site class
adding a get_global_properties method in StructureDataCore,
not used now but in the future to have a list of the properties in the
database (not only attached to the single sites), to make easy to query
wrt a property.

* Improving flexibility, automatic get_kinds and pymatgen fixings.

Providing full flexibility in the StructureDataMutable while preserving same data structure of StructureData.

- adding the parent and index when we initialise the Site class (i.e. when we access structure.sites). This is helpful to be able to change directly properties in the StructureDataMutable class.
- in the same direction, we implemented the ObservedArray class which basically allows to trigger the setter methods of arrays/lists even if we modify only one element (for example: structure.sites[0].position[0] = 5). In this way, changes are written into the structure._data, which represents the real data stored in the instance.

`get_kinds` method now is stable.

Fixing pymatgen problems with oxi_state and magmom.

* Adding in the tutorial markdown also the get_kinds method, and minimal
update of the pytests.
@sphuber sphuber force-pushed the feature/orm-pydantic-models branch from 9816200 to d4c4d94 Compare August 9, 2024 07:56
prefixes = ('aiida.orm.nodes.',)
prefixes = ('aiida.orm.nodes.', 'aiida.orm.core.')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is related to the changes of adding the aiida.orm entry point group and making it possible to load the ORM classes through an entry point. But I will have to go back to the PR and actually test to see if this is still necessary

Comment on lines +193 to +200
@property
def sealed(self) -> bool:
return self.base.attributes.get(self.SEALED_KEY, False) # type: ignore[attr-defined]

@property
def is_sealed(self) -> bool:
"""Returns whether the node is sealed, i.e. whether the sealed attribute has been set to True."""
return self.base.attributes.get(self.SEALED_KEY, False) # type: ignore[attr-defined]
return self.sealed
Copy link
Member

@edan-bainglass edan-bainglass Aug 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping both for backwards compatibility, or is there a different reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, keeping it for backwards compatibility. In the model, I define the field as sealed so the (de)serialization code expects a sealed property, which is why I add it. I could have also chosen to not add the property and just name the field is_sealed instead. Or, we could keep sealed as the field and then define the model_to_orm attribute as model_to_orm: lambda model: model.is_sealed.

Comment on lines 184 to +187
SEALED_KEY = 'sealed'

__qb_fields__ = [
add_field(
SEALED_KEY,
dtype=bool,
doc='Whether the node is sealed',
),
]
class Model(pydantic.BaseModel):
sealed: bool = MetadataField(description='Whether the node is sealed')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the SEALED_KEY still in use?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is now used in the new sealed property

@@ -375,7 +377,7 @@ def set_file(self, file, filename=None):
from aiida.common.exceptions import ParsingError
from aiida.common.files import md5_file, md5_from_filelike

parsed_data = parse_upf(file)
parsed_data = parse_upf(file, check_filename=self.CHECK_FILENAME)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

@sphuber
Copy link
Contributor Author

sphuber commented Aug 21, 2024

Thanks for the review @edan-bainglass . I see that most your comments are questions. This is fair enough the PR doesn't provide any description of the changes and the reasoning behind it. Since most of your questions all refer to the same concept and so would be answered by the same answer, instead of going through them one by one, I will answer it in general here.

The whole point of this PR is to provide a way that any ORM class can automatically be serialized to JSON and the same instance can be recovered from that serialized JSON payload. This is implemented in the aiida.orm.entities.Entity class, which is the base class for all ORM classes. It provides the serialize and from_serialized methods for this purpose.

Instead of manually implementing JSON (de)serializing for all classes, we use pydantic. Each ORM class has to define a pydantic.BaseModel that captures their "schema". This is the part that replaces the QbFields that you added. The associated functionality is kept but the logic now parses the pydantic model instead of the custom QbField syntax.

For all of this to work, we need to be able to clean map an instance of a pydantic.BaseModel to an instance of the associated ORM class and vice versa. This essentially means that the latter's constructor needs to accept all the fields defined by the model and it needs to provide properties to retrieve the value for each field from an ORM instance. This is necessary because the implementation of from_model and to_model in the Entity class depends on it. I could have chosen to just implement this manually on each class and use the existing properties or go directly through the attributes (in the case of Node). But I want it to be automatic, because then plugin developers don't have to worry about it but they get it out of the box (since their plugin will have Entity as base class, and there it is automated). This is why you see all these changes that add arguments/logic to the constructor and add properties to retrieve field values.

I have been thinking whether those changes should be in separate commits. If we take the example of the AuthInfo, one of its fields is metadata but the constructor currently doesn't allow setting it. Same goes for extras of the Node. But why would that not be allowed? It is arbitrary to allow some fields to be defined, but require others to be set after the instance is created through some other setters. If you think this would help, I guess I could still refactor the PR to factor out those changes into an initial commit.

The changes to the entry point handling, and adding aiida.orm as a entry point group, essentially is to make it possible to actually fully serialize all ORM classes. Certain ORM classes already have a string identifier in the form of an entry point, namely the Data class and all its subclasses, i.e. data plugins. They have an entry point in aiida.data. But a class like Computer does not. If you need to communicate to what class a specific JSON serialized instance belongs, you could just use 'Computer' as a string, but it would be nicer and more consistent if it would be communicated through an entry point like other classes, e.g. aiida.orm:core.computer. This again will make implementing a REST API that can (de)serialize ORM instances out of the box trivial.

@edan-bainglass
Copy link
Member

edan-bainglass commented Aug 22, 2024

@sphuber first, thank you so much for the detailed explanation of the PR. The scope/depth of change + uncharacteristic lack of explanation was a recipe for confusion. Note that at the time of you writing the above, I was only "half"-way through my review, as I had started it quite late in the day. I'm wrapping up the review at the moment, but with the above clarifications, the PR is now mostly crystal clear! And I think you can skip PR refactoring. It all makes sense now as a whole.

Thanks for implementing this! It is pertinent to my current use case (query builder web app), which requires JSON (de)serialization of nodes via REST API calls. I believe you had mentioned before when we worked on the fields PR that this would precisely be where these changes would make an impact. Looking forward to this dropping!

Copy link
Member

@edan-bainglass edan-bainglass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sphuber thanks again for implementing this. As mentioned in my other comment, it is mostly clear. Left a few comments after your post that I think would still benefit from clarification. Other than that, LGTM!

:param model_to_orm: Optional callable to convert the value of a field from a model instance to an ORM instance.
:param exclude_to_orm: When set to ``True``, this field value will not be passed to the ORM entity constructor
through ``Entity.from_model``.
:param exclude_to_orm: When set to ``True``, this field value will not be exposed on the CLI command that is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exclude_from_cli

Comment on lines +80 to +81
node: int = MetadataField(
description='Node PK that the comment is attached to',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

node as an int type might be confusing, no? One might expect a Node type. Was node_pk, so I guess you had a reason to go with just node?

Comment on lines +60 to +61
user: int = MetadataField(
description='The PK of the user',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repeating a comment from comments.py, but again, user here is the PK, but in the __init__, user comes in as User and when PK is needed -> user.pk. What was your reasoning for dropping _pk in the models?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another generic "problem" that arises due to the discrepancies in field types when comparing an ORM instance and its equivalent pydantic model. The Python ORM deals with instances of the ORM, i.e. an AuthInfo takes a User instance, even though in the database layer the link is represented by the pk of the User instance. Similarly, when serializing an AuthInfo to JSON, the User field also needs to be serialized, which we do by returning its pk as an integer.

From the ORM's perspective, it makes more sense to work with ORM instances, i.e. it is easier for the user (I would argue) for the constructor to request a User instance for the user argument as opposed to its pk. Same goes for other ORM classes like Computer etc. But the pydantic model is outward facing where the User class makes no sense. So the pydantic model instead defines the user field as its serialized data type, i.e. an int. When a user will create a new AuthInfo over a REST API, they will be specifying its user by its pk and not through a User instance.

This discrepancy between the types of the ORM class constructor and its pydantic model will pop up in various places. That is why these orm_class and orm_to_model attributes are there to allow the (de)serialization code in Entity to convert between the two. We could choose to make it more consistent by changing the constructors to not take ORM class instances but always work with their pks. However this carries the downsides:

  • it makes the ORM less intuitive to work with (IMHO)
  • it is backwards incompatible

For the last point, it is perhaps possible to provide a deprecation pathway but I think this will still introduce a lot of hassle for users. That is why I opted for the current solution and keep the slight discrepancy between pydantic model and the ORM class constructor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally understand, though I was more commenting on the model key names - user instead of user_pk - and not on the types - int vs. User, the latter I understand from your comment 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was more commenting on the model key names - user instead of user_pk - and not on the types - int vs. User

I see. The reason I didn't name it user_pk is because the serializer on the pydantic model will already convert it to an actual user. So if you do pydantic_comment = Comment.Model(user=1) then assert isinstance(pydantic_comment.user, User) will pass. In other words, the user field on a pydantic model instance will return a User instance and not the integer, but an integer is required to construct the model instance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks for clarifying 🙂

@@ -88,29 +88,25 @@ def get_command(self, ctx: click.Context, cmd_name: str) -> click.Command | None
command = super().get_command(ctx, cmd_name)
return command

def call_command(self, ctx, cls, **kwargs):
def call_command(self, ctx, cls, non_interactive, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you briefly explain non_interactive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a generic option that is used in all interactive verdi commands. It is the -n/--non-interactive flag, and when specified it forces the command to never prompt, which is necessary for use in non-interactive scripts for example.

@sphuber
Copy link
Contributor Author

sphuber commented Aug 22, 2024

@edan-bainglass one reason why I hadn't (yet) written an extensive documentation in the commits here, is because I actually put this in an AEP. There you will find an extensive description of the goal, design, implementation and considerations. I also provide examples of how these changes make it then trivial to write a REST API around it. I provide a working example with FastAPI.

As I mentioned in a comment elsewhere, I will quickly revisit this and see if the changes to the plugin loading system are still required for the REST API use case. If not, I will revert those. I would encourage yourself to also start testing this PR out with your envisioned use case to see what works and what doesn't and what might be missing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants