Reinforcement learning for improving components in fungal BGCs

bioinfoUQAM/RL-bgc-components

RL for BGC components

A reinforcement learning approach to support improving components in fungal candidate BGCs, based on Pfam protein domains and optional use of BGC functional annotations.

Requirements

Unix/Linux and Python 3.6+ are recommended. Library dependencies are listed in /src/requirements.txt and should be installed in the project virtualenv before starting. PySpark requires Java 8 or later (8 or 11 recommended) to be installed and JAVA_HOME to be set. On Windows, PySpark may require additional steps, such as obtaining the Hadoop native libraries for Windows and setting HADOOP_HOME.
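The requirements above can be checked before running anything. The following is a minimal sketch, not part of the project: the function name check_environment is hypothetical, and it only inspects the Python version and the JAVA_HOME and HADOOP_HOME variables mentioned above.

```python
import os
import sys


def check_environment(env=None):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python 3.6+ is recommended")
    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set (PySpark needs Java 8 or later)")
    elif not os.path.isdir(java_home):
        problems.append("JAVA_HOME does not point to a directory: " + java_home)
    # HADOOP_HOME is only relevant on Windows, where PySpark may need native libraries.
    if os.name == "nt" and not env.get("HADOOP_HOME"):
        problems.append("HADOOP_HOME is not set (may be needed by PySpark on Windows)")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

The check does not verify the Java version itself; it only confirms that JAVA_HOME points to an existing directory.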

How to start

Copy /src/config.init.DEFAULT to /src/config.init, then set home in the [default] section to the project root path.
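This setup step can also be scripted. The sketch below assumes that config.init is a standard INI file readable by Python's configparser, which this README does not confirm; the function name init_config is hypothetical. Note that rewriting the file through configparser drops any comments it contains, so editing the file by hand is the safer option if the comments matter.

```python
import configparser
import shutil
from pathlib import Path


def init_config(project_root):
    """Copy the default config if needed and point [default] home at the project root."""
    root = Path(project_root).resolve()
    template = root / "src" / "config.init.DEFAULT"
    target = root / "src" / "config.init"
    if not target.exists():
        shutil.copyfile(template, target)

    # interpolation=None keeps any '%' in values literal; optionxform=str keeps key case.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str
    parser.read(target)
    if not parser.has_section("default"):
        parser.add_section("default")
    parser.set("default", "home", str(root))
    with target.open("w") as handle:
        parser.write(handle)
    return target
```

Calling init_config(".") from the project root would leave config.init.DEFAULT untouched and update only the copy.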

Train - configure & run

In the [prediction] section of the config.init file, set at least the following parameters:

  • indicate the corpus location in source.path
  • set the learner parameters as desired: episodes, alpha, gamma, epsilon, penalty.threshold, and keepskip.threshold
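The names alpha, gamma, and epsilon match the usual hyperparameters of tabular Q-learning (learning rate, discount factor, and exploration rate), with episodes as the number of training passes. This README does not state which algorithm the learner implements, so the sketch below shows only the textbook update as an assumption, not the project's code. The state names and the "keep"/"skip" action names are hypothetical, borrowed loosely from the keepskip.threshold parameter name.

```python
import random


def choose_action(q_values, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise pick the best-valued action."""
    actions = sorted(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=q_values.get)


def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """One tabular Q-learning step: Q(s,a) += alpha * (r + gamma * max Q(s',.) - Q(s,a))."""
    best_next = max(q_table[next_state].values()) if next_state in q_table else 0.0
    td_target = reward + gamma * best_next
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]


if __name__ == "__main__":
    # Hypothetical two-state table with hypothetical actions.
    q_table = {
        "s0": {"keep": 0.0, "skip": 0.0},
        "s1": {"keep": 1.0, "skip": 0.0},
    }
    rng = random.Random(0)
    action = choose_action(q_table["s0"], epsilon=0.1, rng=rng)
    q_update(q_table, "s0", action, reward=1.0, next_state="s1", alpha=0.5, gamma=0.9)
    print(q_table["s0"])
```

In this reading, a larger alpha makes each update move further toward the new estimate, a larger gamma weights future reward more heavily, and a larger epsilon makes random exploration more frequent.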

To train the reinforcement learner, from the project virtualenv simply run:

(.env) user@foo:~RL-bgc-components/src$ python -m pipeprediction.RL

Training data can be obtained at this repository; place training files in /corpus/train, where sample training data files are also provided. Model and feature files are written to /corpus/metrics/models. Trained model files (based on the best-performing parameters and a balanced dataset) are also provided in /corpus/metrics/models.

Test - configure & run

Candidate BGC predictions to be optimized can be obtained using a BGC prediction tool, such as TOUCAN. The input candidate BGCs must be in a tab-separated values (TSV) file named *.IDs.test, containing the genes/regions in a candidate BGC and its predicted label. Place the input file in the /corpus/metrics folder, following the sample file /corpus/metrics/sample-candidateBGCs.IDs.test.
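Before running the evaluation, it can help to confirm that the input files are found and parse as TSV. The following is a minimal sketch with a hypothetical function name, read_candidates. It makes no assumption about what each column means, because the column layout is defined by the sample file and is not described here.

```python
import csv
from pathlib import Path


def read_candidates(metrics_dir):
    """Return {file name: list of rows}, each row a list of tab-separated fields."""
    tables = {}
    for path in sorted(Path(metrics_dir).glob("*.IDs.test")):
        with path.open(newline="") as handle:
            # Skip blank lines; keep every other row exactly as split on tabs.
            rows = [row for row in csv.reader(handle, delimiter="\t") if row]
        tables[path.name] = rows
    return tables


if __name__ == "__main__":
    for name, rows in read_candidates("corpus/metrics").items():
        print(name, len(rows), "rows")
```

An empty result means that no file in the folder matches the *.IDs.test naming pattern.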

In the [prediction] section of the config.init file:

  • set neighbor.weight, dry.islands, and average.action to True to enable the available functional annotation strategies

In the [eval] section of the config.init file, set the following parameters to point to the required inputs:

  • result.path: file with list of candidate BGC predictions
  • goldID.path: file with list of gold BGC clusters (for comparison, if available)
  • similarity.path: file with output from a BLAST all-vs-all for target genome (if available)
  • gene.length: file with list of amino acid length for each gene (or designated genome regions); see footnote 1
  • gene.map: path for all files of domains per gene (or designated genome regions); see footnote 2
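The [eval] inputs listed above can be checked before launching the evaluator. The sketch below is an illustration only: the function name missing_eval_inputs is hypothetical, it assumes config.init is a standard INI file readable by configparser, and it assumes the configured values are absolute paths or paths relative to the current working directory, none of which is confirmed by this README.

```python
import configparser
from pathlib import Path

# Parameter names taken from the list above.
EVAL_KEYS = ("result.path", "goldID.path", "similarity.path", "gene.length", "gene.map")


def missing_eval_inputs(config_file):
    """Return the [eval] keys that are unset or whose path does not exist."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case, e.g. goldID.path
    parser.read(config_file)
    section = parser["eval"] if parser.has_section("eval") else {}
    missing = []
    for key in EVAL_KEYS:
        value = section.get(key, "").strip()
        if not value or not Path(value).exists():
            missing.append(key)
    return missing


if __name__ == "__main__":
    print(missing_eval_inputs("config.init"))
```

Since goldID.path and similarity.path are described as "if available", their appearance in the returned list is not necessarily an error.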

Sample files are provided in /Databases and /corpus/metrics.

To use the reinforcement learner and evaluate candidate BGCs, from the project virtualenv simply run:

(.env) user@foo:~RL-bgc-components/src$ python -m eval.Evaluator

Result files are written to /corpus/metrics/.

Footnotes

  1. Amino acid sequence lengths for candidate BGC genes can be extracted from the FASTA sequence file(s), and listed in the same format as /Databases/sample-geneLength.

  2. Pfam protein domains per gene can be obtained using PfamScan, and listed in the same format as /Databases/sample-geneMap/*.domains (similar to FASTA).
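For footnote 1, sequence lengths can be computed from a protein FASTA file with a few lines of Python. This is a minimal sketch with a hypothetical function name, fasta_lengths. It takes the first word after ">" as the gene ID, which is an assumption, and the result still has to be written out in the layout of /Databases/sample-geneLength, which is not described here.

```python
def fasta_lengths(lines):
    """Map each FASTA record ID (first word after '>') to its sequence length."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            lengths[current] = 0
        elif current is not None:
            # Sequence lines may be wrapped, so lengths accumulate across lines.
            lengths[current] += len(line)
    return lengths


if __name__ == "__main__":
    example = [">gene1 hypothetical protein", "MKVLA", "TTG", ">gene2", "MS"]
    for gene_id, length in fasta_lengths(example).items():
        print(gene_id, length)
```

To process a real file, pass an open file handle, for example fasta_lengths(open(path)), since the function accepts any iterable of lines.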
