cschaerfe edited this page Jul 8, 2015 · 3 revisions

As a simple example, suppose you want to automatically generate the most appropriate type of QSAR model for a given training data set, use it to predict the binding free energy of other compounds to the same molecular target, and then select the molecules with the highest predicted binding affinity.

The example shown above (generated with the integration of CADDSuite into Galaxy) employs the following steps:

Both the training data set and the set of compounds whose activity is to be predicted should first be converted into 3D structures and checked for errors (Ligand3DGenerator and LigCheck).

  • Ligand3DGenerator is run first on each set of compounds. This tool protonates all molecules and generates 3D conformations for them. This is important because molecules obtained from many sources contain only 2D coordinates and lack hydrogens.
  • LigCheck then performs chemical sanity checks on the compounds. It checks for sensible bond lengths and validly assigned elements, and tests whether each 'molecule' in the input file contains only one actual molecule, i.e. that there are no unconnected atoms or fragments. Furthermore, each conformation (or, optionally, each topology) may appear only once within the given file. All molecules that pass these checks are written to the output file.
  • InputReader reads an sd-file containing the molecules of the training data set. This file must contain the experimentally determined binding free energy (or another kind of biological activity) of each compound in a property tag; on the command line, the name of this tag is specified with the parameter '-act'. InputReader then generates a set of descriptors for each molecule and saves the result to a data file.
  • AutoModel is then used to find the best QSAR model for this data set. To do so, it applies nested validation, including several feature selection and model/kernel parameter optimization steps, for each available model type. A model of the type that achieved the best nested prediction quality is then generated and saved to the specified output file. However, if the best nested prediction quality obtained is below a threshold (adjustable with '-min_quality'), an error is shown and no model is saved. This is an important feasibility check: it increases the likelihood that the generated QSAR model will yield good predictions, as long as the prediction data set is not too chemically dissimilar to the training data set (you can use the SimilarityAnalyzer tool to assess this).
  • MolPredictor then takes the model file generated by AutoModel and an sd-file containing the compounds whose binding free energy is to be predicted. It automatically generates all descriptors for this data set and then predicts the activity of each molecule. The output of MolPredictor is a molecule file containing the predicted values in a property tag named 'predicted_activity'.
  • DockResultMerger is finally used to select all compounds with a good predicted binding free energy, e.g. lower than -40 kJ/mol ('-score predicted_activity -max -40' on the command line).

On the command line, the equivalent of the above pipeline might look like this:
BALL/build/bin/TOOLS/Ligand3DGenerator -i input.sdf -o input_3D.sdf
BALL/build/bin/TOOLS/LigCheck -i input_3D.sdf -ri -o input_valid.sdf
BALL/build/bin/TOOLS/InputReader -i input_valid.sdf -act Activity -o training_set.dat
BALL/build/bin/TOOLS/AutoModel -i training_set.dat -o qsar_model.mod
BALL/build/bin/TOOLS/Ligand3DGenerator -i new_compounds.sdf -o new_compounds_3D.sdf
BALL/build/bin/TOOLS/LigCheck -i new_compounds_3D.sdf -ri -o new_compounds_valid.sdf
BALL/build/bin/TOOLS/MolPredictor -i new_compounds_valid.sdf -mod qsar_model.mod -o predictions_new_compounds.sdf
BALL/build/bin/TOOLS/DockResultMerger -i predictions_new_compounds.sdf -score predicted_activity -max -40 -o filtered_predictions.sdf
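
The commands above can also be collected in a small wrapper script, so that the pipeline stops at the first failing step. The following is only a minimal sketch and not part of CADDSuite: the variables TOOLS and DRY_RUN and the helpers run_step, prepare and qsar_pipeline are hypothetical names, and the script assumes that each tool returns a non-zero exit status on error (for example when AutoModel does not save a model). All tool options are copied from the commands above.

```shell
#!/bin/sh
# Sketch of a wrapper around the QSAR pipeline shown above.
# TOOLS, DRY_RUN, run_step, prepare and qsar_pipeline are illustrative
# names chosen for this example; they are not part of CADDSuite.
set -eu   # abort on the first failing command or unset variable

TOOLS="${TOOLS:-BALL/build/bin/TOOLS}"   # assumed location of the tool binaries
DRY_RUN="${DRY_RUN:-1}"                  # 1: only print the commands, 0: execute them

# Print a command and, unless DRY_RUN is 1, execute it.
run_step() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

# Generate 3D structures and run the sanity checks for one sd-file ($1).
prepare() {
    run_step "$TOOLS/Ligand3DGenerator" -i "$1" -o "${1%.sdf}_3D.sdf"
    run_step "$TOOLS/LigCheck" -i "${1%.sdf}_3D.sdf" -ri -o "${1%.sdf}_valid.sdf"
}

# $1: training set (sd-file), $2: compounds to be predicted (sd-file)
qsar_pipeline() {
    prepare "$1"
    run_step "$TOOLS/InputReader" -i "${1%.sdf}_valid.sdf" -act Activity -o training_set.dat
    run_step "$TOOLS/AutoModel" -i training_set.dat -o qsar_model.mod
    prepare "$2"
    run_step "$TOOLS/MolPredictor" -i "${2%.sdf}_valid.sdf" -mod qsar_model.mod -o "predictions_$2"
    run_step "$TOOLS/DockResultMerger" -i "predictions_$2" -score predicted_activity -max -40 -o filtered_predictions.sdf
}

qsar_pipeline input.sdf new_compounds.sdf
```

Run as is, the script only prints the eight commands; setting DRY_RUN=0 executes them. The sketch expects plain file names in the current directory, since the output names are derived from the input names.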

If you would like to build more customized pipelines, BALL and CADDSuite provide many other QSAR tools for this purpose. Just have a look at the list of tools.
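
When assembling your own pipeline, it can be handy to inspect a property tag such as 'predicted_activity' directly in an intermediate file. The sketch below is not a CADDSuite tool; list_property is a hypothetical helper. It assumes the usual SD-file layout: the molecule name on the first line of each record, data items introduced by a header line of the form "> <tag>" with the value on the following line, and records terminated by a line starting with "$$$$".

```shell
# list_property FILE TAG
# Print "name<TAB>value" for every molecule in FILE that carries the
# property tag TAG. Hypothetical helper, assuming standard SD-file layout.
list_property() {
    awk -v tag="$2" '
        BEGIN       { newrec = 1 }
        newrec      { name = $0; newrec = 0; next }        # first line of a record: molecule name
        /^\$\$\$\$/ { newrec = 1; next }                   # record separator
        grab        { print name "\t" $0; grab = 0; next } # line after the header: the value
        /^>/ && index($0, "<" tag ">") { grab = 1 }        # data header of the requested tag
    ' "$1"
}
```

For example, `list_property predictions_new_compounds.sdf predicted_activity` would list the predicted value of each compound before the final filtering step. Only the first value line of a data item is printed.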
