MolForge

License: CC BY-SA 4.0

Reconstruction of lossless molecular representations.

SMILES is the dominant molecular representation in AI-based chemical applications, but it has innate limitations rooted in its internal structure. Here, we exploit the idea that a set of structural fingerprints can serve as an efficient alternative to unique molecular representations. To this end, we trained neural-machine-translation models that translate various structural fingerprints into conventional text-based molecular representations, i.e., SMILES and SELFIES. The assessment of their conversion efficiency showed that our models reconstruct molecules with a high level of accuracy. Our approach therefore brings structural fingerprints into play as strong representational tools in chemical natural language processing by restoring the connectivity information that is lost during fingerprint generation. This study addresses the major limitation of structural fingerprints that precludes their implementation in NLP models. Our findings should facilitate the development of text- or fingerprint-based cheminformatic models for generative and translational tasks.


Code usage

Requirements

The source code has been tested on Linux. After cloning the repository, we recommend creating a new conda environment and installing the package locally. The required packages are listed in environment.yml and should be installed before use:

conda env create --name MolForge_env --file=environment.yml
conda activate MolForge_env
pip install .

Prediction & Demo

First, the checkpoint files should be downloaded and extracted.

Then run the command below to perform inference with a trained model:

predict --fp=<fingerprint> --model_type=<smiles|selfies> --input=<on-bit indices>
  • --fp: The name of the fingerprint, e.g. 'ECFP4'.
  • --model_type: Target molecular representation, either 'smiles' or 'selfies'.
  • --input: The input fingerprint for prediction, given as a space-separated list of on-bit indices (see the sketch after this list).
  • --checkpoint: Checkpoint file for the given model. If None, the downloaded checkpoints are used.
  • --decode: Decoding algorithm, either 'greedy' or 'beam' (default: 'greedy').
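The on-bit list passed to --input can be generated with RDKit. Below is a minimal sketch, assuming that ECFP4 here corresponds to an RDKit Morgan fingerprint of radius 2 folded to 2048 bits (consistent with the reported src vocab size of 2052, i.e. 2048 bits plus special tokens); the exact settings used for training may differ.

from rdkit import Chem
from rdkit.Chem import AllChem

# Example molecule: the detokenized SMILES from the demo output further below.
mol = Chem.MolFromSmiles("CCOC1=C(C=C(C=C1)C(C(C)(C)C)N)OCC")

# Assumption: ECFP4 == Morgan fingerprint, radius 2, folded to 2048 bits.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

# Space-separated on-bit indices, the format expected by --input.
print(" ".join(str(bit) for bit in fp.GetOnBits()))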

Example prediction:

predict --fp='ECFP4' --model_type='smiles' --input='1 80 94 114 237 241 255 294 392 411 425 695 743 747 786 875 1057 1171 1238 1365 1380 1452 1544 1750 1773 1853 1873 1970'

and its sample output:

Here we go..
          fp : ECFP4
  model_type : smiles
       input : 1 80 94 114 237 241 255 294 392 411 425 695 743 747 786 875 1057 1171 1238 1365 1380 1452 1544 1750 1773 1853 1873 1970
  input_file : None
  checkpoint : saved_models/ECFP4_smiles_checkpoint.pth
      decode : greedy
src_vocab_size : 2052
trg_vocab_size : 109
 src_seq_len : 104
 trg_seq_len : 130
    root_dir : /home/tmp/MolForge
  fp_datadir : /home/tmp/MolForge/data/fingerprints/ECFP4
src_sp_prefix : /home/tmp/MolForge/data/sp/ECFP4_vocab_sp
trg_sp_prefix : /home/tmp/MolForge/data/sp/smiles_vocab_sp
        rank : cuda
      device : cuda

The size of src vocab is 2052 and that of trg vocab is 109.
Loading checkpoint... ECFP4 smiles
Preprocessing input sentence...
Encoding input sentence...
Greedy decoding selected.

Input: 1 80 94 114 237 241 255 294 392 411 425 695 743 747 786 875 1057 1171 1238 1365 1380 1452 1544 1750 1773 1853 1873 1970
Result: C C O C 1 = C ( C = C ( C = C 1 ) C ( C ( C ) ( C ) C ) N ) O C C
Inference finished! || Total inference time: 0mins 0secs
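The predicted SMILES is emitted as space-separated tokens. A minimal post-processing sketch with RDKit follows; joining the tokens and canonicalizing is our assumption of the intended post-processing, not a MolForge API:

from rdkit import Chem

result = "C C O C 1 = C ( C = C ( C = C 1 ) C ( C ( C ) ( C ) C ) N ) O C C"
smiles = result.replace(" ", "")  # drop the token separators
mol = Chem.MolFromSmiles(smiles)  # sanity-check that the prediction parses
print(Chem.MolToSmiles(mol))      # print the canonical SMILES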

Results

Each cell shows the Tanimoto exactness (%) of the selected fingerprint-to-SMILES transformation (rows), computed with the respective fingerprint encoding (columns). Consistent values across a row reflect robustness, while jumps represent the effect of selection bias. ECFP2* and ECFP4* denote explicit-bit versions.

|        | MACCS | Avalon | RDK4 | RDK4_L | HashAP | TT   | HashTT | ECFP0 | ECFP2 | ECFP4 | FCFP2 | FCFP4 | AEs  | ECFP2* | ECFP4* |
|--------|-------|--------|------|--------|--------|------|--------|-------|-------|-------|-------|-------|------|--------|--------|
| MACCS  | 77.4  | 33.3   | 38   | 39.8   | 32.2   | 33.2 | 33.2   | 52.2  | 34.7  | 32.5  | 48.6  | 33.5  | 34.7 | 37     | 33.3   |
| Avalon | 72.6  | 67.9   | 72.2 | 73.5   | 63.4   | 64.7 | 64.7   | 69.5  | 65.6  | 63.6  | 68.9  | 64.7  | 65.6 | 68.5   | 64.6   |
| RDK4   | 66.9  | 60     | 90.9 | 91.5   | 59.8   | 61.1 | 61.1   | 62.5  | 60.2  | 58.3  | 62.3  | 59.6  | 60.2 | 64.3   | 59.6   |
| RDK4_L | 52.6  | 46.7   | 64.7 | 88.8   | 46.7   | 47.7 | 47.7   | 49.1  | 46.9  | 45.5  | 48.8  | 46.5  | 46.9 | 49.3   | 46.2   |
| HashAP | 86.5  | 83.8   | 89.6 | 90.2   | 85.2   | 85.5 | 85.5   | 84.3  | 83.1  | 82.5  | 84    | 82.8  | 83.1 | 86.1   | 84.1   |
| TT     | 88.4  | 83.5   | 92.3 | 92.5   | 84.1   | 87.3 | 87.3   | 85.8  | 85.2  | 82.3  | 85.7  | 83.8  | 85.2 | 91.4   | 84.2   |
| HashTT | 86.2  | 81.4   | 90.2 | 90.5   | 82.1   | 85.3 | 85.5   | 83.9  | 83.3  | 80.4  | 83.8  | 81.8  | 83.3 | 89.2   | 82.2   |
| ECFP0  | 3.3   | 1.3    | 2.1  | 2.7    | 1.2    | 1.3  | 1.3    | 4     | 1.4   | 1.2   | 2.9   | 1.3   | 1.4  | 1.8    | 1.4    |
| ECFP2  | 86    | 75.8   | 83.1 | 83.1   | 73.6   | 76   | 76     | 84.7  | 82.7  | 74.4  | 84.5  | 76.5  | 82.7 | 96.2   | 76     |
| ECFP4  | 95.1  | 92.6   | 95.7 | 95.7   | 90.8   | 92.4 | 92.4   | 93.5  | 93.1  | 92.1  | 93.3  | 92.4  | 93.1 | 96.6   | 94.8   |
| FCFP2  | 25.6  | 16.3   | 20.1 | 21.6   | 15.5   | 16   | 16     | 28.6  | 16.9  | 15.7  | 38.7  | 20.4  | 16.9 | 19.6   | 16.1   |
| FCFP4  | 71.5  | 67.5   | 73.7 | 73.8   | 65.5   | 67.3 | 67.3   | 69.2  | 68.5  | 66.3  | 87.6  | 86.7  | 68.5 | 74.4   | 68.1   |
| AEs    | 86.7  | 76.2   | 83.5 | 83.6   | 74     | 76.3 | 76.3   | 85.3  | 83.5  | 74.7  | 85.2  | 76.8  | 83.5 | 97     | 76.5   |
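A number like this can be reproduced along the lines of the sketch below, assuming Tanimoto exactness means the percentage of test pairs whose predicted and reference fingerprints are identical (Tanimoto similarity of exactly 1.0); the metric definition and the ECFP4 settings here are assumptions, not taken from the MolForge code.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs

def tanimoto_exactness(predicted, reference):
    # Percentage of pairs whose ECFP4 bit vectors match exactly (Tanimoto == 1.0).
    exact = 0
    for pred, ref in zip(predicted, reference):
        mol_p, mol_r = Chem.MolFromSmiles(pred), Chem.MolFromSmiles(ref)
        if mol_p is None or mol_r is None:
            continue  # an invalid prediction counts as a miss
        fp_p = AllChem.GetMorganFingerprintAsBitVect(mol_p, 2, nBits=2048)
        fp_r = AllChem.GetMorganFingerprintAsBitVect(mol_r, 2, nBits=2048)
        exact += int(DataStructs.TanimotoSimilarity(fp_p, fp_r) == 1.0)
    return 100.0 * exact / len(reference)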

For more results, see the Main_Results.ipynb notebook.


Cite


License

CC BY-SA 4.0

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.