DeePFAS: Deep Learning-Enabled Rapid Annotation of PFAS: Enhancing Non-Targeted Screening through Spectral Encoding and Latent Space Analysis
This repository provides implementations and code examples for DeePFAS: Deep Learning-Enabled Rapid Annotation of PFAS: Enhancing Non-Targeted Screening through Spectral Encoding and Latent Space Analysis. DeePFAS projects raw MS/MS data into the latent space of chemical structures for PFAS (Per- and Polyfluoroalkyl Substances) identification, facilitating the inference of structurally similar compounds by comparing spectra to multiple candidate molecules within this latent chemical space.
Run the following commands to install DeePFAS:
git clone git@github.com:CMDM-Lab/DeePFAS.git
conda create -n DeePFAS python=3.10.0 --yes
conda activate DeePFAS
cd DeePFAS/DeePFAS
pip install -r requirements.txt
cd DeePFAS/DeePFAS
mkdir ae/ae_saved
mkdir DeePFAS/deepfas_saved
./download_models.sh
The wastewater sample was provided by Yi-Ju Chen. Please see the article Emerging Perfluorobutane Sulfonamido Derivatives as a New Trend of Surfactants Used in the Semiconductor Industry for details.
cd DeePFAS/DeePFAS
mkdir -p dataset
./download_wwtp3.sh
The PFAS standard mixtures were provided by Yi-Ju Chen. Please see the article Emerging Perfluorobutane Sulfonamido Derivatives as a New Trend of Surfactants Used in the Semiconductor Industry for details.
cd DeePFAS/DeePFAS
mkdir -p dataset
./download_std_150.sh
The NIST PFAS Database (version 1.1) is a public database and can be downloaded from https://data.nist.gov/od/id/mds2-2905 in SQLite format. The MGF-format file was extracted and converted by Heng Wang.
cd DeePFAS/DeePFAS
mkdir -p dataset
./download_nist_pfas.sh
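If you download the original SQLite file from NIST instead of the converted MGF file, it can help to look at its tables before attempting any conversion. The following is a minimal sketch that uses only the Python standard library; it makes no assumption about the database schema, and the function name `list_tables` is hypothetical, not part of DeePFAS.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    # Open read-only so the downloaded database cannot be modified by accident.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Table names are read from the database itself and quoted before use.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()


# Example call (replace the placeholder with the path of the downloaded file):
# for table, rows in list_tables("path/to/nist_pfas_database").items():
#     print(f"{table}: {rows} rows")
```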
A small molecule database, mol_dataset/mol_database.hdf5, includes approximately 50,000 molecules for rapid testing and PFAS annotation. A larger molecule database with chemical embeddings is available on Hugging Face.
Convert your MS/MS spectra to .mgf format and run the script test_deepfas.sh to quickly start PFAS annotation:
cd DeePFAS/DeePFAS
./test_deepfas.sh
A .mgf file can be written with the Python package pyteomics:
from pyteomics import mgf
import numpy as np

data = []
# peak lists: m_z[i] and intensity[i] describe the same peak
intensity = [0.1, 1.0, 0.3, 0.4]
m_z = [11.1, 23.23, 111.44, 55.2]
spectrum = {
    'params': {
        # identifier of the spectrum in the .mgf file (necessary)
        'title': 0,
        # MS level (necessary)
        'mslevel': 2,
        # precursor m/z (necessary)
        'pepmass': 562.957580566406,
        # adduct type (necessary)
        'precursor_type': '[M-H]-',
        # canonical SMILES (necessary only in eval mode, otherwise optional)
        'canonicalsmiles': 'O=C(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F',
        # collision energy (necessary)
        # absolute collision energy (ACE) format: 'collision_energy': 12
        # normalized collision energy (NCE) format: 'collision_energy': 'NCE=37.5%'
        'collision_energy': 'NCE=37.5%'
    },
    # m/z array (necessary)
    'm/z array': np.array(m_z),
    # intensity array (necessary)
    'intensity array': np.array(intensity)
}
data.append(spectrum)
# file_mode='w' overwrites an existing spectra.mgf instead of appending to it
mgf.write(data, 'spectra.mgf', file_mode='w', write_charges=False)
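Because several parameters are marked as necessary in the example above, it can be worth checking each spectrum dictionary before writing it. The sketch below uses only the standard library and is not part of DeePFAS: the names `REQUIRED_PARAMS`, `parse_collision_energy` and `check_spectrum` are hypothetical, and the accepted collision energy formats are inferred solely from the two formats shown in the comments above (a number for ACE, a string such as 'NCE=37.5%' for NCE).

```python
import re

# Parameters that the example above marks as necessary.
REQUIRED_PARAMS = ("title", "mslevel", "pepmass", "precursor_type", "collision_energy")

# Matches strings such as 'NCE=37.5%'.
NCE_PATTERN = re.compile(r"^NCE=(\d+(?:\.\d+)?)%$")


def parse_collision_energy(value):
    """Return ('NCE', percent) or ('ACE', energy); raise ValueError otherwise."""
    if isinstance(value, str):
        match = NCE_PATTERN.match(value.strip())
        if match:
            return ("NCE", float(match.group(1)))
    try:
        # Assumption: anything convertible to a number is an absolute energy.
        return ("ACE", float(value))
    except (TypeError, ValueError):
        raise ValueError(f"unrecognized collision energy: {value!r}") from None


def check_spectrum(spectrum):
    """Return a list of problems found in one spectrum dict (empty if none)."""
    problems = []
    params = spectrum.get("params", {})
    for key in REQUIRED_PARAMS:
        if key not in params:
            problems.append(f"missing parameter: {key}")
    if "collision_energy" in params:
        try:
            parse_collision_energy(params["collision_energy"])
        except ValueError as err:
            problems.append(str(err))
    mz = spectrum.get("m/z array")
    intensity = spectrum.get("intensity array")
    if mz is None or intensity is None:
        problems.append("missing m/z or intensity array")
    elif len(mz) != len(intensity):
        problems.append("m/z and intensity arrays differ in length")
    return problems
```

Calling `check_spectrum(spectrum)` on each entry before `mgf.write` and skipping or fixing entries with a non-empty result keeps malformed records out of the file. The check cannot detect swapped m/z and intensity arrays, so those still need to be reviewed by hand.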
The molecule library and its chemical embeddings are stored in .hdf5 format to save storage space. To generate the latent space for your own molecule library, set dataset_path in gen_latent_space_config.json to the path of your molecule file, then run:
cd DeePFAS/DeePFAS
python3 ae/gen_latent_space.py \
--deepfas_config_pth DeePFAS/config/deepfas_config.json \
--ae_config_pth ae/config/gen_latent_space_config.json \
--latent_space_out_pth customized_mol_database.hdf5 \
--chunk_size 100000 \
--compression_level 9
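The dataset_path entry can be edited by hand, or updated with a short script before running the command above. This is a minimal sketch using only the standard library; it assumes that `dataset_path` is a top-level key of the JSON config, and the helper name `set_dataset_path` is hypothetical.

```python
import json
from pathlib import Path


def set_dataset_path(config_path, mol_file):
    """Point `dataset_path` in a JSON config at `mol_file` and save the file."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Assumption: `dataset_path` sits at the top level of the config.
    config["dataset_path"] = str(mol_file)
    config_path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config


# Example call (the molecule file path is a placeholder):
# set_dataset_path("ae/config/gen_latent_space_config.json", "path/to/molecule_file")
```

All other keys in the config are kept, but the file is rewritten with four-space indentation, so its layout may differ from the original.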