I recommend using the `psmiles` Python package, which integrates canonicalization and other tools for working with PSMILES.
PSMILES (Polymer SMILES) is a chemical language for representing polymer structures. A PSMILES string contains two star symbols (`[*]` or `*`) that mark the two endpoints of the polymer repeat unit; otherwise it follows the Daylight SMILES syntax as defined by OpenSMILES. Developed as part of arXiv.
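As a small illustration of the two-endpoint rule, the sketch below normalizes bare `*` to `[*]` and checks that a string carries exactly two endpoint markers. The helper names `normalize_stars` and `has_two_endpoints` are hypothetical and are not part of any package mentioned here; the check is purely textual and does not validate the rest of the SMILES syntax.

```python
def normalize_stars(psmiles: str) -> str:
    """Rewrite every endpoint marker as '[*]' (hypothetical helper)."""
    # Collapse '[*]' to '*' first so bare and bracketed stars are treated alike,
    # then expand every star back to the bracketed form.
    return psmiles.replace("[*]", "*").replace("*", "[*]")


def has_two_endpoints(psmiles: str) -> bool:
    """Return True if the string contains exactly two star symbols."""
    # Each '[*]' contains exactly one '*', so counting '*' covers both spellings.
    return psmiles.count("*") == 2


if __name__ == "__main__":
    for s in ["*CC*", "[*]CC([*])C", "CCO"]:
        print(s, normalize_stars(s), has_two_endpoints(s))
```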
The raw PSMILES syntax is ambiguous and non-unique; i.e., the same polymer may be written using many PSMILES strings:
Polyethylene | Polyethylene oxide | Polypropylene |
---|---|---|
`[*]C[*]` | `[*]CCO[*]` | `[*]CC([*])C` |
`[*]CC[*]` | `[*]COC[*]` | `[*]CC(CC([*])C)C` |
`[*]CCC[*]` | `[*]OCC[*]` | `CC([*])C[*]` |
The canonicalization routine of the PSMILES package finds a canonical version of a PSMILES string by

- Finding the shortest representation of the PSMILES string: `[*]CCOCCO[*]` -> `[*]CCO[*]`
- Making the PSMILES string cyclic: `[*]CCO[*]` -> `C1 CCO C1`
- Applying the canonicalization routine as implemented in RDKit: `C1 CCO C1` -> `C1 COC C1`
- Breaking the cyclic bond: `C1 COC C1` -> `[*]COC[*]`
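To make the four steps concrete, here is a toy, dependency-free sketch. It is not the package's implementation and does not use RDKit: it assumes an unbranched chain of single-letter, uppercase atoms joined by single bonds, with a star at each end. Under that assumption, "making the string cyclic" amounts to treating the repeat unit as a circular string, and the canonical form is chosen here as the lexicographically smallest rotation, in either reading direction. That choice is arbitrary, so the result can differ from RDKit's (for example, this sketch returns `[*]CCO[*]` where the steps above arrive at `[*]COC[*]`). All function names are hypothetical.

```python
def strip_endpoints(psmiles: str) -> str:
    """Return the backbone between the two terminal stars (toy, linear chains only)."""
    s = psmiles.replace("[*]", "*")
    if s.count("*") != 2 or not (s.startswith("*") and s.endswith("*")):
        raise ValueError("expected a star at each end and nowhere else")
    backbone = s[1:-1]
    # Reject branches, bond symbols, ring digits, aromatic and two-letter atoms.
    if not (backbone.isalpha() and backbone.isupper()):
        raise ValueError("toy sketch handles only chains of single-letter atoms")
    return backbone


def shortest_unit(backbone: str) -> str:
    """Step 1: find the shortest substring whose repetition gives the backbone."""
    n = len(backbone)
    for p in range(1, n + 1):
        if n % p == 0 and backbone[:p] * (n // p) == backbone:
            return backbone[:p]
    return backbone


def canonical_rotation(unit: str) -> str:
    """Steps 2-4: treat the unit as a ring and pick one fixed place to cut it."""
    candidates = []
    for direction in (unit, unit[::-1]):  # the chain can be read in either direction
        for i in range(len(direction)):
            candidates.append(direction[i:] + direction[:i])
    return min(candidates)  # arbitrary but deterministic choice


def canonicalize_linear(psmiles: str) -> str:
    """Toy canonicalization for unbranched, single-bonded repeat units."""
    unit = shortest_unit(strip_endpoints(psmiles))
    return "[*]" + canonical_rotation(unit) + "[*]"


if __name__ == "__main__":
    for s in ["[*]CCOCCO[*]", "[*]COC[*]", "[*]OCC[*]", "[*]CCC[*]"]:
        print(s, "->", canonicalize_linear(s))
```

With this sketch, the three polyethylene oxide spellings from the table collapse to one string, as do the three polyethylene spellings. Branched inputs such as the polypropylene entries raise `ValueError`; handling them requires a real molecular graph, which is where RDKit comes in.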
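Install with pip:

    pip install git+https://github.com/Ramprasad-Group/canonicalize_psmiles.git

Example usage (see also `test.ipynb`):

    from canonicalize_psmiles.canonicalize import canonicalize

    # A PSMILES string with its two endpoint stars
    smiles = "[*]NC(C)CC([*])=O"
    print(smiles)
    # Print the canonicalized PSMILES string
    print(canonicalize(smiles))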