Parse complex chemical formula strings #3954

lorisercole · 2020-04-21T18:23:24Z

The "parse_formula" function can now parse more complex formulas, including elements with partial occupancies and groups of elements enclosed in parentheses.
Spaces separators are no more needed.
Elements are counted and a dictionary is returned.
This can be useful to parse the formulas included in cif files (for example to debug a cif parser), or those generated by pymatgen.

e.g. C[NH2]3NO3 --> {'C': 1, 'N': 4, 'H': 6, 'O': 3}

The "parse_formula" function can now parse more complex formulas, including elements with partial occupancies and groups of elements enclosed in parentheses. Spaces separators are no more needed. Elements are counted and a dictionary is returned. This can be useful to parse the formulas included in cif files, or those generated by pymatgen. e.g. C[NH2]3NO3 --> {'C': 1, 'N': 4, 'H': 6, 'O': 3}

codecov · 2020-04-21T18:36:52Z

Codecov Report

Merging #3954 into develop will increase coverage by 0.01%.
The diff coverage is 87.88%.

@@             Coverage Diff             @@
##           develop    #3954      +/-   ##
===========================================
+ Coverage    78.45%   78.45%   +0.01%     
===========================================
  Files          461      461              
  Lines        34079    34094      +15     
===========================================
+ Hits         26733    26746      +13     
- Misses        7346     7348       +2

Flag	Coverage Δ
#django	`70.50% <87.88%> (+0.01%)`	⬆️
#sqlalchemy	`71.35% <87.88%> (+0.01%)`	⬆️

Impacted Files	Coverage Δ
aiida/orm/nodes/data/cif.py	`86.58% <87.88%> (-0.54%)`	⬇️
aiida/engine/daemon/client.py	`73.72% <0.00%> (+1.15%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a038876...11bf5c8. Read the comment docs.

sphuber

Thanks a lot Loris! What I would like to ask is for you to add some unit tests, to make sure that the implementation works as expected and will continue to do so in the future. To help you along your way, I have made a start. Just create the file tests/orm/data/test_cif.py and put the following content in there:

import pytest

from aiida.orm.nodes.data.cif import parse_formula


def test_parse_formula():
    """Test the `parse_formula` utility function."""
    assert parse_formula('C H') == {'C': 1, 'H': 1}
    assert parse_formula('C5 H1') == {'C': 5, 'H': 1}
    assert parse_formula('Ca5 Ho') == {'Ca': 5, 'Ho': 1}
    assert parse_formula('H0.5 O') == {'H': 0.5, 'O': 1}
    assert parse_formula('C0 O0') == {'C': 0, 'O': 0}
    assert parse_formula('C1 H1 ') == {'C': 1, 'H': 1}
    assert parse_formula(' C1 H1') == {'C': 1, 'H': 1}
    assert parse_formula('C[NH2]3NO3') == {'C': 1, 'N': 4, 'H': 6, 'O': 3}

    with pytest.raises(ValueError):
        parse_formula('H0.5.2 O')

This should give you an idea how to write tests. You should add some more with specific cases that you think this implementation should cover.

Most of these tests already existed in another file, tests/test_dataclasses.py line 578 to 594. These were in the old style, but I have now translated them and added one example that you described in the docstring. You can delete those lines now and add some more tests in the file you will create.

After having created the file, you can run the tests by executing pytest tests/orm/data/test_cif.py.

If you run into trouble let me know.

CasperWA

Great with a little love to this part of the code! Thanks @lorisercole!

I have some minor beauty-suggestions, not really related to the content and logic as a whole, but still relevant.

I haven't tested your regular expressions extensively, however, maybe you could somewhat do this yourself, i.e., create a pytest test for this function that goes through a list of varying formulas and check that what the function returns is also what you would expect.
Both for valid and invalid formulas.

aiida/orm/nodes/data/cif.py

@CasperWA

Apply suggestions by @CasperWA

lorisercole · 2020-04-23T10:18:02Z

OK, thanks guys.
I implemented the suggestions by @CasperWA and added some tests.
The parser should work for the chemical formulas that can be read from COD, ICSD, MPDS cif files, as well as those generated by the many modes of StructureData.get_formula, by pymatgen, and ase.

In the commit e0c6704ee35d7820d8084a10fe807f964b514f0b I also edited the function CifData.get_formulae to be able to read the chemical formulas of MPDS cif files, which use a different tag (only available for v1.0.5+ files, recently updated).

sphuber

Thanks for the improvements @lorisercole . There are a few minor problems in the tests still, but mostly I am wondering if we should deviate from the CIF standard as this PR does. @giovannipizzi what do you think? Is it okay to support tags and tag formats that are invalid according to the CIF spec? See my comments in the PR for details

tests/orm/data/test_cif.py

aiida/orm/nodes/data/cif.py

Test the chemical formula parser against many cumbersome cases.

It is possible to specify 'custom_tags' that will try to be read by the 'CifData.get_formulae' method.

lorisercole · 2020-05-06T09:47:48Z

Thanks. I updated the branch following your suggestions.

sphuber

Thanks a lot for the improvements @lorisercole !

Changes addressed

ltalirz · 2020-07-01T14:39:26Z

Is there somewhere a specification of the "more complex" formula format we are supporting here?

lorisercole · 2020-07-01T14:52:50Z

@ltalirz see 9ead2b1:

The "parse_formula" function can now parse more complex formulas,
including elements with partial occupancies and groups of elements enclosed
in parentheses. Spaces separators are no more needed.
Elements are counted and a dictionary is returned.
This can be useful to parse the formulas included in cif files, or those
generated by pymatgen.

e.g.  C[NH2]3NO3  -->  {'C': 1, 'N': 4, 'H': 6, 'O': 3}

More (weird) examples in the tests:

aiida-core/tests/orm/data/test_cif.py

Lines 17 to 54 in 862fa20

    
           def test_parse_formula(): 
        
               """Test the `parse_formula` utility function.""" 
        
               assert parse_formula('C H') == {'C': 1, 'H': 1} 
        
               assert parse_formula('C5 H1') == {'C': 5, 'H': 1} 
        
               assert parse_formula('Ca5 Ho') == {'Ca': 5, 'Ho': 1} 
        
               assert parse_formula('H0.5 O') == {'H': 0.5, 'O': 1} 
        
               assert parse_formula('C0 O0') == {'C': 0, 'O': 0} 
        
               assert parse_formula('C1 H1 ') == {'C': 1, 'H': 1} 
        
               assert parse_formula(' C1 H1') == {'C': 1, 'H': 1} 
        
               assert parse_formula('CaHClO') == {'Ca': 1, 'H': 1, 'Cl': 1, 'O': 1} 
        
               assert parse_formula('C70 H108 Al4 La4 N4 O10') == {'C': 70, 'H': 108, 'Al': 4, 'La': 4, 'N': 4, 'O': 10} 
        
               assert parse_formula('C70H108Al4Li4N4O10') == {'C': 70, 'H': 108, 'Al': 4, 'Li': 4, 'N': 4, 'O': 10} 
        
               assert parse_formula('C36 H59LiN2 O3 Si') == {'C': 36, 'H': 59, 'Li': 1, 'N': 2, 'O': 3, 'Si': 1} 
        
               assert parse_formula('C63.5H83.5Li2N2O3.25P2') == {'C': 63.5, 'H': 83.5, 'Li': 2, 'N': 2, 'O': 3.25, 'P': 2} 
        
               assert parse_formula('Fe Li0.667 O4 P1') == {'Fe': 1, 'Li': 0.667, 'O': 4, 'P': 1} 
        
               assert parse_formula('Fe2.05Ni0.05O4 Zn0.9') == {'Fe': 2.05, 'Ni': 0.05, 'O': 4, 'Zn': 0.9} 
        
               assert parse_formula('Li3O6(Al0.23Li0.77)2(Li0.07Te0.93)') == {'Li': 4.61, 'O': 6, 'Al': 0.46, 'Te': 0.93} 
        
               assert parse_formula('Li2{Cr0.05Li0.95X0.00}{Cr0.24Li0.76}2{Li0.02Te0.98}O6') == { 
        
                   'Li': 4.49, 
        
                   'Cr': 0.53, 
        
                   'X': 0, 
        
                   'Te': 0.98, 
        
                   'O': 6 
        
               } 
        
               assert parse_formula('C[NH2]3NO3') == {'C': 1, 'N': 4, 'H': 6, 'O': 3} 
        
               assert parse_formula('H80 C104{C0.50 X0.50}8N8 Cl4(Cl0.50X0.50)8.0O8') == { 
        
                   'H': 80, 
        
                   'C': 108.0, 
        
                   'X': 8.0, 
        
                   'N': 8, 
        
                   'Cl': 8.0, 
        
                   'O': 8 
        
               } 
        
               assert parse_formula('Na1.28[NH]0.28{N H2}0.72') == {'Na': 1.28, 'N': 1.0, 'H': 1.72} 
        
               for test_formula in ('H0.5.2 O', 'Fe2.05Ni0.05.4', 'Na1.28[NH]0.28.3{NH2}0.72'): 
        
                   with pytest.raises(ValueError): 
        
                       parse_formula(test_formula)

ltalirz · 2020-07-01T15:00:48Z

Thanks, so I see it includes square brackets, curly braces as well as round braces, with and without spaces; integers, floating point numbers etc.
Is this following a standard specified somewhere or is this something you "invented"?

I'm just mentioning this because it can be very useful to tell people that this field is guaranteed to support formulas of format X. Ideally X is fully specified in itself; a second-best option would be X = specification A + features B,C

lorisercole · 2020-07-02T10:26:24Z

Yes it supports round, square, and curly brackets.

round brackets are used in the chemical formula tags of some CIF files, such as from ICSD
square brackets are found in the chemical formula tags of some CIF files, such as from MPDS
curly brackets are used in the formulas returned by StructureData.get_formula()
Spaces are sometimes used, sometimes not.

The tests contain intentionally more general/complex cases than one should ever find (this would even allow one to parse formulas written "by hand" in weird ways, for whatever reason).
The parser counts all the elements and returns a dictionary of them. If an element is present in several bracketed groups, the sum will be returned, e.g.

parse_formula('Na1.28[NH]0.28{N H2}0.72')  -->  {'Na': 1.28, 'N': 1.0, 'H': 1.72}

lorisercole changed the title ~~Parse complex chemical formulas strings~~ Parse complex chemical formula strings Apr 21, 2020

sphuber self-requested a review April 21, 2020 21:37

sphuber requested changes Apr 21, 2020

View reviewed changes

CasperWA previously requested changes Apr 22, 2020

View reviewed changes

Minor beauty edits to cif.parse_formula

3357d5f

Apply suggestions by @CasperWA

lorisercole requested review from sphuber and CasperWA April 23, 2020 10:18

sphuber requested changes Apr 27, 2020

View reviewed changes

tests/orm/data/test_cif.py Outdated Show resolved Hide resolved

tests/orm/data/test_cif.py Outdated Show resolved Hide resolved

aiida/orm/nodes/data/cif.py Show resolved Hide resolved

aiida/orm/nodes/data/cif.py Outdated Show resolved Hide resolved

sphuber requested changes Apr 28, 2020

View reviewed changes

aiida/orm/nodes/data/cif.py Show resolved Hide resolved

aiida/orm/nodes/data/cif.py Outdated Show resolved Hide resolved

lorisercole added 2 commits April 28, 2020 10:42

Add cif.parse_formula test

9da8f68

Test the chemical formula parser against many cumbersome cases.

Add custom formula tags to CifData.get_formulae

e01e66f

It is possible to specify 'custom_tags' that will try to be read by the 'CifData.get_formulae' method.

lorisercole force-pushed the parse_formula branch from e0c6704 to e01e66f Compare April 28, 2020 09:18

lorisercole requested a review from sphuber April 28, 2020 09:22

Merge branch 'develop' into parse_formula

11bf5c8

sphuber approved these changes May 7, 2020

View reviewed changes

sphuber merged commit 6220e85 into aiidateam:develop May 7, 2020

sphuber added this to the v1.3.0 milestone May 20, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parse complex chemical formula strings #3954

Parse complex chemical formula strings #3954

lorisercole commented Apr 21, 2020

codecov bot commented Apr 21, 2020 •

edited

Loading

sphuber left a comment

CasperWA left a comment

lorisercole commented Apr 23, 2020

sphuber left a comment

lorisercole commented May 6, 2020

sphuber left a comment

ltalirz commented Jul 1, 2020

lorisercole commented Jul 1, 2020

ltalirz commented Jul 1, 2020 •

edited

Loading

lorisercole commented Jul 2, 2020

Parse complex chemical formula strings #3954

Parse complex chemical formula strings #3954

Conversation

lorisercole commented Apr 21, 2020

codecov bot commented Apr 21, 2020 • edited Loading

Codecov Report

sphuber left a comment

Choose a reason for hiding this comment

CasperWA left a comment

Choose a reason for hiding this comment

lorisercole commented Apr 23, 2020

sphuber left a comment

Choose a reason for hiding this comment

lorisercole commented May 6, 2020

sphuber left a comment

Choose a reason for hiding this comment

ltalirz commented Jul 1, 2020

lorisercole commented Jul 1, 2020

ltalirz commented Jul 1, 2020 • edited Loading

lorisercole commented Jul 2, 2020

codecov bot commented Apr 21, 2020 •

edited

Loading

ltalirz commented Jul 1, 2020 •

edited

Loading