CMDM-Lab/DeePFAS

DeePFAS: Deep Learning-Enabled Rapid Annotation of PFAS: Enhancing Non-Targeted Screening through Spectral Encoding and Latent Space Analysis

This repository provides implementations and code examples for DeePFAS: Deep Learning-Enabled Rapid Annotation of PFAS: Enhancing Non-Targeted Screening through Spectral Encoding and Latent Space Analysis. DeePFAS projects raw MS/MS data into the latent space of chemical structures for PFAS (Per- and Polyfluoroalkyl Substances) identification, facilitating the inference of structurally similar compounds by comparing spectra to multiple candidate molecules within this latent chemical space.

Getting started

Installation

Run the following commands to install DeePFAS:

git clone git@github.com:CMDM-Lab/DeePFAS.git

conda create -n DeePFAS python=3.10.0 --yes
conda activate DeePFAS
cd DeePFAS/DeePFAS

pip install -r requirements.txt

Quickstart

Download pretrained models

cd DeePFAS/DeePFAS
mkdir -p ae/ae_saved
mkdir -p DeePFAS/deepfas_saved
./download_models.sh

Download the mass spectra of a wastewater sample (WWTP3)

The wastewater sample was provided by Yi-Ju Chen. Please see the article Emerging Perfluorobutane Sulfonamido Derivatives as a New Trend of Surfactants Used in the Semiconductor Industry for details.

cd DeePFAS/DeePFAS
mkdir -p dataset
./download_wwtp3.sh

Download the mass spectra of PFAS standard mixtures (std_150)

The PFAS standard mixtures were provided by Yi-Ju Chen. Please see the article Emerging Perfluorobutane Sulfonamido Derivatives as a New Trend of Surfactants Used in the Semiconductor Industry for details.

cd DeePFAS/DeePFAS
mkdir -p dataset
./download_std_150.sh

Download the mass spectra of the NIST PFAS database in MGF (Mascot Generic Format)

The NIST PFAS Database (version 1.1) is a public database and can be downloaded from https://data.nist.gov/od/id/mds2-2905 in SQLite format. The MGF file was extracted and converted by Heng Wang.

cd DeePFAS/DeePFAS
mkdir -p dataset
./download_nist_pfas.sh

Download the PubChem molecule database with chemical embeddings generated by the autoencoder

A small molecule database, mol_dataset/mol_database.hdf5, includes approximately 50,000 molecules for rapid testing and PFAS annotation. A larger molecule database with chemical embeddings is available on Hugging Face.

PFAS annotation

Convert your MS/MS spectra to .mgf format, then run the script test_deepfas.sh to start PFAS annotation:

cd DeePFAS/DeePFAS
./test_deepfas.sh
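
As a rough illustration of the idea behind annotation (comparing an encoded spectrum against candidate molecules in the latent space), the toy sketch below ranks candidate embeddings by similarity to a query embedding. It is not the DeePFAS implementation: the function names `cosine_similarity` and `rank_candidates` are hypothetical, the vectors are made up, and cosine similarity is an assumption, since this README does not state which similarity measure the model uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query_embedding, candidates, top_k=3):
    """Rank candidate embeddings by similarity to the query, best first.

    candidates: dict mapping a molecule identifier to its embedding vector.
    Assumption: a higher cosine similarity means a structurally closer candidate.
    """
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in candidates.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional embeddings, for illustration only.
toy_candidates = {"mol_a": [1.0, 0.0], "mol_b": [0.0, 1.0], "mol_c": [1.0, 1.0]}
toy_ranking = rank_candidates([1.0, 0.0], toy_candidates)
```

In practice the query embedding would come from the spectrum encoder and the candidate embeddings from the .hdf5 molecule database; both are replaced by toy values here.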

Convert MS/MS spectra data to .mgf format

The .mgf file can be written with the Python package pyteomics:

from pyteomics import mgf
import numpy as np
data = []

intensity = [0.1, 1.0, 0.3, 0.4]
m_z = [11.1, 23.23, 111.44, 55.2]
spectrum = {
    'params': {
        # identifier of spectra in .mgf file (necessary)
        'title': 0,
        # ms level (necessary)
        'mslevel': 2,
        # precursor m/z (necessary)
        'pepmass': 562.957580566406,
        # adduct type (necessary)
        'precursor_type': '[M-H]-',
        # canonical SMILES (optional; required only in eval mode)
        'canonicalsmiles': 'O=C(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F',
        # collision energy (necessary)
        # absolute collision energy (ACE) format: 'collision_energy': 12
        # normalized collision energy (NCE) format: 'collision_energy': 'NCE=37.5%'
        'collision_energy': 'NCE=37.5%'
    },
    # m/z array (necessary)
    'm/z array': np.array(m_z),
    # intensity array (necessary)
    'intensity array': np.array(intensity)
}


data.append(spectrum)
mgf.write(data, 'spectra.mgf', file_mode='w', write_charges=False)
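
Before writing many spectra, it can help to check that each dictionary carries the fields marked as necessary above. The following is a minimal sketch using only the standard library. The helper names `validate_spectrum` and `parse_collision_energy` are hypothetical and not part of DeePFAS or pyteomics, and the regular expression for the NCE string is an assumption based on the single 'NCE=37.5%' example shown above.

```python
import re

# Fields the example above marks as necessary inside 'params'.
REQUIRED_PARAMS = ("title", "mslevel", "pepmass", "precursor_type", "collision_energy")

# Assumed pattern for the NCE string format, e.g. 'NCE=37.5%'.
_NCE_PATTERN = re.compile(r"^NCE=(\d+(?:\.\d+)?)%$")


def parse_collision_energy(value):
    """Return ('NCE', percent) or ('ACE', energy) for the two formats shown above."""
    if isinstance(value, str):
        match = _NCE_PATTERN.match(value.strip())
        if match:
            return ("NCE", float(match.group(1)))
    # Anything else is treated as an absolute collision energy;
    # float() raises ValueError or TypeError if the value is not numeric.
    return ("ACE", float(value))


def validate_spectrum(spectrum):
    """Return a list of problems found in a pyteomics-style spectrum dict."""
    problems = []
    params = spectrum.get("params", {})
    for key in REQUIRED_PARAMS:
        if key not in params:
            problems.append(f"missing param: {key}")
    if "collision_energy" in params:
        try:
            parse_collision_energy(params["collision_energy"])
        except (TypeError, ValueError):
            problems.append("unrecognized collision_energy format")
    mz = spectrum.get("m/z array")
    intensity = spectrum.get("intensity array")
    if mz is None or intensity is None:
        problems.append("missing m/z array or intensity array")
    elif len(mz) != len(intensity):
        problems.append("m/z and intensity arrays differ in length")
    return problems
```

An empty list from `validate_spectrum(spectrum)` means none of these basic checks failed; it does not guarantee that DeePFAS will accept the file.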

Generate customized molecule library

The molecule library and its chemical embeddings are stored in .hdf5 format to save storage space. Set dataset_path in gen_latent_space_config.json to the path of your molecule file, then run:

cd DeePFAS/DeePFAS
python3 ae/gen_latent_space.py \
 --deepfas_config_pth DeePFAS/config/deepfas_config.json \
 --ae_config_pth ae/config/gen_latent_space_config.json \
 --latent_space_out_pth customized_mol_database.hdf5 \
 --chunk_size 100000 \
 --compression_level 9
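
The dataset_path setting in gen_latent_space_config.json can also be updated from a script instead of by hand. The sketch below shows one way to do it with the standard library. The helper name `set_dataset_path` is hypothetical, and the code assumes dataset_path is a top-level key of the JSON file; if the real config nests it under another key, the assignment needs to be adjusted.

```python
import json
from pathlib import Path


def set_dataset_path(config_path, dataset_path):
    """Rewrite the 'dataset_path' entry of a JSON config file and return the new config."""
    config_file = Path(config_path)
    config = json.loads(config_file.read_text(encoding="utf-8"))
    # Assumption: 'dataset_path' is a top-level key in the config.
    config["dataset_path"] = str(dataset_path)
    config_file.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config
```

For example, `set_dataset_path("ae/config/gen_latent_space_config.json", "path/to/your_molecule_file")` would point the config at a custom molecule file (placeholder path) before running gen_latent_space.py.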

About

Encoding MS/MS spectra to chemical representation for identification of PFAS
