-
Notifications
You must be signed in to change notification settings - Fork 224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parse complex chemical formula strings #3954
Conversation
The "parse_formula" function can now parse more complex formulas, including elements with partial occupancies and groups of elements enclosed in parentheses. Spaces separators are no more needed. Elements are counted and a dictionary is returned. This can be useful to parse the formulas included in cif files, or those generated by pymatgen. e.g. C[NH2]3NO3 --> {'C': 1, 'N': 4, 'H': 6, 'O': 3}
Codecov Report
@@ Coverage Diff @@
## develop #3954 +/- ##
===========================================
+ Coverage 78.45% 78.45% +0.01%
===========================================
Files 461 461
Lines 34079 34094 +15
===========================================
+ Hits 26733 26746 +13
- Misses 7346 7348 +2
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot Loris! What I would like to ask is for you to add some unit tests, to make sure that the implementation works as expected and will continue to do so in the future. To help you along your way, I have made a start. Just create the file tests/orm/data/test_cif.py
and put the following content in there:
import pytest
from aiida.orm.nodes.data.cif import parse_formula
def test_parse_formula():
"""Test the `parse_formula` utility function."""
assert parse_formula('C H') == {'C': 1, 'H': 1}
assert parse_formula('C5 H1') == {'C': 5, 'H': 1}
assert parse_formula('Ca5 Ho') == {'Ca': 5, 'Ho': 1}
assert parse_formula('H0.5 O') == {'H': 0.5, 'O': 1}
assert parse_formula('C0 O0') == {'C': 0, 'O': 0}
assert parse_formula('C1 H1 ') == {'C': 1, 'H': 1}
assert parse_formula(' C1 H1') == {'C': 1, 'H': 1}
assert parse_formula('C[NH2]3NO3') == {'C': 1, 'N': 4, 'H': 6, 'O': 3}
with pytest.raises(ValueError):
parse_formula('H0.5.2 O')
This should give you an idea how to write tests. You should add some more with specific cases that you think this implementation should cover.
Most of these tests already existed in another file, tests/test_dataclasses.py
line 578 to 594. These were in the old style, but I have now translated them and added one example that you described in the docstring. You can delete those lines now and add some more tests in the file you will create.
After having created the file, you can run the tests by executing pytest tests/orm/data/test_cif.py
.
If you run into trouble let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great with a little love to this part of the code! Thanks @lorisercole!
I have some minor beauty-suggestions, not really related to the content and logic as a whole, but still relevant.
I haven't tested your regular expressions extensively, however, maybe you could somewhat do this yourself, i.e., create a pytest test for this function that goes through a list of varying formulas and check that what the function returns is also what you would expect.
Both for valid and invalid formulas.
Apply suggestions by @CasperWA
OK, thanks guys. In the commit e0c6704ee35d7820d8084a10fe807f964b514f0b I also edited the function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the improvements @lorisercole . There are a few minor problems in the tests still, but mostly I am wondering if we should deviate from the CIF standard as this PR does. @giovannipizzi what do you think? Is it okay to support tags and tag formats that are invalid according to the CIF spec? See my comments in the PR for details
Test the chemical formula parser against many cumbersome cases.
It is possible to specify 'custom_tags' that will try to be read by the 'CifData.get_formulae' method.
e0c6704
to
e01e66f
Compare
Thanks. I updated the branch following your suggestions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the improvements @lorisercole !
Is there somewhere a specification of the "more complex" formula format we are supporting here? |
More (weird) examples in the tests: aiida-core/tests/orm/data/test_cif.py Lines 17 to 54 in 862fa20
|
Thanks, so I see it includes square brackets, curly braces as well as round braces, with and without spaces; integers, floating point numbers etc. I'm just mentioning this because it can be very useful to tell people that this field is guaranteed to support formulas of format X. Ideally X is fully specified in itself; a second-best option would be X = specification A + features B,C |
Yes it supports round, square, and curly brackets.
The tests contain intentionally more general/complex cases than one should ever find (this would even allow one to parse formulas written "by hand" in weird ways, for whatever reason).
|
The "parse_formula" function can now parse more complex formulas, including elements with partial occupancies and groups of elements enclosed in parentheses.
Spaces separators are no more needed.
Elements are counted and a dictionary is returned.
This can be useful to parse the formulas included in cif files (for example to debug a cif parser), or those generated by pymatgen.
e.g. C[NH2]3NO3 --> {'C': 1, 'N': 4, 'H': 6, 'O': 3}