
Outline for paper

  • General problems in finding the right tools and how to combine them, motivating example
  • EDAM and ms-utils.org http://ms-utils.org/ provide sufficient information to create workflows (semi-)automatically
  • jABC/PROPHETS as workflow framework that can exploit this information
  • Case Study
    • Illustrate complexity - Details on how many combinations can be possible to get from input A to output B
    • Recipe to select most appropriate workflows for a specific problem (mostly a nice example how to do it)
  • Reasons why this makes the world a better place showing the great potential to extend the concept to all tools described by EDAM

Motivation

VS says ... "It would be great to show that there are actually quite a few workflows that can be generated from the available annotation of the tools. Do the workflow tools already have a feature to investigate this? If not, a simple network analysis should provide this information easily. And hopefully some nice figures as well."

AL says ... "The exploration of the possible workflows based on the annotations (without requiring the user to deal with the technicalities of the tool interfaces, such as input and output) is exactly what we aim to support with the PROPHETS plugin. The synthesis method it uses to compute the possible solutions is in fact in principle a network analysis, although a bit more sophisticated than "simple", for example allowing the user to express additional constraints that they want to see fulfilled in the proposed workflows. This is described in publications on the framework, but I will also do my best to explain how it works on our example, which is probably easier to follow. PROPHETS also has some features to visualise the “solution space”, that is, the computed possible solutions, as a graph structure, so that might be one of the kinds of figures you are hoping for."
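
To make this concrete, below is a deliberately naive sketch (plain Python, not the actual PROPHETS synthesis algorithm) of the basic idea: tools annotated only with their input and output data types form a network, every path from the available input to the desired output is a candidate workflow, and an extra user constraint ("must contain a validation step") filters the solutions. All tool and data type names here are invented placeholders, not bio.tools entries.

```python
# Naive illustration only: enumerate candidate workflows from annotations.
# Tool and data type names are placeholders, not real bio.tools entries.
TOOLS = {
    "ConverterA": {"in": "RawData",     "out": "OpenFormat",  "tags": {"conversion"}},
    "ConverterB": {"in": "RawData",     "out": "OpenFormat",  "tags": {"conversion"}},
    "Identifier": {"in": "OpenFormat",  "out": "PeptideList", "tags": {"identification"}},
    "Validator":  {"in": "PeptideList", "out": "PeptideList", "tags": {"validation"}},
    "RTModel":    {"in": "PeptideList", "out": "Result",      "tags": {"modelling"}},
}

def workflows(have, want, used=()):
    """Depth-first enumeration of tool chains turning `have` into `want`."""
    if have == want:
        yield list(used)
        return
    for name, tool in TOOLS.items():
        if name not in used and tool["in"] == have:
            yield from workflows(tool["out"], want, used + (name,))

solutions = list(workflows("RawData", "Result"))
# A simple user constraint: only keep workflows that include a validation step.
validated = [w for w in solutions
             if any("validation" in TOOLS[step]["tags"] for step in w)]
print(len(solutions), "candidate workflows;", len(validated), "satisfy the constraint")
```

On this toy annotation the script finds four candidate workflows, two of which contain the validation step; the point is only to show the kind of exploration that the synthesis performs over the real annotations, with a proper constraint language on top.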

Magnus says ... "As a very small contribution to the motivational part, demonstrating the need for controlled vocabularies in metadata descriptions, one of my PhD students just got a paper accepted in the Journal of Scientometrics. Among other things, the paper looked at the word usage of researchers active in the same field and with a long history of publishing together, one example being Gary Glish and Scott McLuckey (see attachment). These names may not mean anything to you, but they are well known in the field of mass spectrometry (both are former presidents of the American Society for Mass Spectrometry). But the analysis showed that even these two researchers have individual language preferences. For example, Glish likes to specify that an apparatus was a “quadrupole ion trap” whereas McLuckey refers to the same apparatus as only an “ion trap”. Of course, some domain knowledge is needed to tease this out from the graph. We have a workflow that performs this type of analysis for any two researchers (or corpora), so if you have some example from the field of bioinformatics, we could include this as well. Not that this is really novel in any way. I suspect people in the field of proteomics or mass spec data analysis use terms like “alignment”, “search”, “annotation”, “convolution” to mean more than one thing. At the very least, it is possible to use established text mining methods to show the need for metadata and data analysis CVs."

Choice of tools

Completing the Google spreadsheet (https://docs.google.com/spreadsheets/d/1DuYkelKKmgkFWHojbHE8wlJxFZhC2uZ8U8eUlgrAtNo/edit#gid=0) for a subset of selected tools for the test case sounds like a very good idea to me! I’m happy to follow your suggestions about which tools to take for that. From my side it would only be good to have tools with some command-line or otherwise programmatically accessible interface (so that they can be executed from the workflow), and to have a set of tools that allows for some variation in the possible compositions, so that we can demonstrate the use of automatically exploring them. All this seems to be fulfilled by the list of tools you have already proposed.

Jon says ... "I'd prefer to annotate all tools to a high standard, so we can close this chapter. Of course we can make the ones required for the paper / example the priority"

Magnus says ... "In principle I agree, but we could focus on providing rich annotations on tools that are easy to integrate in workflows, i.e. can be run on the command line or are available as a Web Service."

Point for paper : hiding complexity

The paragraph you wrote about which tools can use which input formats shows very nicely the technical knowledge that is required. The fact that this knowledge "disappears" into the annotations and the ontology, so that it is used by the workflow framework but hidden from the user, will make a good illustration for the paper...

biotoolsCompose on GitHub

We can’t put the source code of PROPHETS etc. there, but we can of course link to the download sites of the releases, and I see no problem at the moment with putting the code and examples there that we create for this case. (Technically, what we create is “only” a PROPHETS project anyway, which, like projects in Eclipse or other IDEs, contains everything that is needed for an application and can be stored and distributed independently.)

PROPHETS website (with download link): PROPHETS

Command-line tools for example workflow

Skipping the calibration step for now:

  • msconvert or compassXport for conversion from raw data to mzXML or mzML
  • X!Tandem, Comet (comes with TPP) or MS-GF+ for peptide identification
  • PeptideProphet for validation (works with X!Tandem and Comet at least, in theory also MS-GF+)
  • rt (3.0) or SSRCalc (the version in TPP) for retention time modelling
  • mzXMLplot or Pep3D (from TPP) for LC-MS data visualization

We can annotate the individual tools from the Trans-Proteomic Pipeline (TPP), as they can be run independently from other tools. As far as I know, most combinations of the above should work.

For X!Tandem the additional tandem2xml format converter is needed, as the X!Tandem XML output is neither pepXML nor mzIdentML. At the moment, rt only reads pepXML, so this is one "constraint". The peptide identification/validation tools should be able to write either mzIdentML or pepXML, and there is also a (lossy) converter between these two peptide list formats that should at least preserve what is needed for retention time modelling. Likewise, mzXMLplot only plots mzXML, whereas Pep3D should handle both mzML and mzXML.
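
These compatibility rules are exactly the kind of technical knowledge that the annotations can capture. As a rough, hedged sketch (the reads/writes sets are simplified from the paragraph above, and the name of the mzIdentML/pepXML converter is a placeholder since it is not named here), each tool can be given the formats it reads and writes, and a chain of tools is valid only when every tool writes a format the next one can read:

```python
# Simplified, non-authoritative format annotations based on the text above.
READS = {
    "X!Tandem":       {"mzML", "mzXML"},
    "tandem2xml":     {"tandem-xml"},
    "pep2mzid":       {"pepXML", "mzIdentML"},  # placeholder name for the lossy converter
    "PeptideProphet": {"pepXML"},
    "rt":             {"pepXML"},               # rt only reads pepXML
    "mzXMLplot":      {"mzXML"},                # mzXMLplot only plots mzXML
    "Pep3D":          {"mzML", "mzXML"},        # Pep3D handles both
}
WRITES = {
    "X!Tandem":       {"tandem-xml"},           # neither pepXML nor mzIdentML
    "tandem2xml":     {"pepXML"},
    "pep2mzid":       {"pepXML", "mzIdentML"},
    "PeptideProphet": {"pepXML"},
    "rt":             {"retention model"},
    "mzXMLplot":      {"plot"},
    "Pep3D":          {"plot"},
}

def compatible(chain):
    """True if every tool in the chain writes a format the next tool can read."""
    return all(WRITES[a] & READS[b] for a, b in zip(chain, chain[1:]))

print(compatible(["X!Tandem", "rt"]))                                  # False: tandem2xml is missing
print(compatible(["X!Tandem", "tandem2xml", "PeptideProphet", "rt"]))  # True
```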

Need for new tools

To start with a "low-threshold", I agree it makes sense to do a proof-of-concept in a subdomain which is already quite densely populated with software utilities. From our side we can also make it even denser, by adding format converters and tools for merging (composing) and splitting (decomposing) files, to allow more possibilities. We already have some of these tools developed for cloud computing.

More details on the use-case

I can imagine an analytical chemist who has developed a new chromatographic method for peptide separation and now wants to know how the chromatographic separation depends on the amino acid composition of the peptides. For this, one can do a standard proteomics measurement using liquid chromatography and tandem mass spectrometry, try to identify as many peptides as possible, validate these identifications, and then calculate amino acid indices [e.g. Amino acid index (hydropathy), http://edamontology.org/data_1506].

Very roughly, a generic workflow for this use case could look like this (here with all workflow input/output as EDAM Data and components with EDAM Operation on top and EDAM Format in the ports, just to show what I mean) (see attached image).

For most of these operations there is more than one choice already, and the common (vendor) raw data formats are being added to EDAM. As alternatives to mzML and mzIdentML there are mzXML and pepXML, i.e. at least two alternatives for each data format. The analytical chemist does not know anything about these formats; (s)he only has the raw mass spectrometry data from the peptides and wants the amino acid indices out.
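
Just to give a feeling for the numbers hidden from that user: a naive count over the alternatives mentioned above (two converters, three search engines, two retention time tools, and two alternative formats at the spectrum and identification stages) already gives dozens of nominal combinations, before compatibility constraints and extra converters are even considered. A back-of-the-envelope sketch, for illustration only and not derived from real bio.tools annotations:

```python
# Illustrative only: multiply out the alternatives listed above, ignoring
# compatibility constraints and the extra converters some chains require.
from math import prod

alternatives = {
    "conversion":               ["msconvert", "compassXport"],
    "peptide identification":   ["X!Tandem", "Comet", "MS-GF+"],
    "validation":               ["PeptideProphet"],
    "retention time modelling": ["rt", "SSRCalc"],
    "identification format":    ["pepXML", "mzIdentML"],
    "spectrum format":          ["mzML", "mzXML"],
}

print(prod(len(v) for v in alternatives.values()), "nominal combinations")
```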

bio.tools software model

Please take a look at how input/output is modelled in bio.tools: https://bio.tools/resources/biotools-1.4/docs/biotools-1.4.html#Link22

You'll see a tool can have more than one "function" (EDAM Operation), each of which can have more than one "input" and more than one "output". Each "input" or "output" must have exactly one "dataType" (EDAM Data) and can have 0 or more supported "dataFormat" (EDAM Format).
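
As a purely hypothetical illustration of that structure, one function of a conversion tool could be written down roughly as below; the operation, data type and format labels are placeholders to show the shape of the model, not the definitive EDAM terms or the actual bio.tools entry for msconvert.

```python
# Hypothetical annotation of one "function" following the bio.tools model:
# one EDAM Operation, with inputs/outputs that each have exactly one dataType
# (EDAM Data) and zero or more dataFormat entries (EDAM Format).
msconvert_annotation = {
    "name": "msconvert",
    "function": [
        {
            "operation": "Format conversion",
            "input":  [{"dataType": "Mass spectrometry data",
                        "dataFormat": ["vendor raw format"]}],
            "output": [{"dataType": "Mass spectrometry data",
                        "dataFormat": ["mzML", "mzXML"]}],
        }
    ],
}

# A workflow framework only needs these fields to know what the tool consumes and produces:
for function in msconvert_annotation["function"]:
    for port in function["output"]:
        print(port["dataType"], "as", " or ".join(port["dataFormat"]))
```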

For the moment we proceed with the simple annotations (esp. single inputs/outputs). Having a closer look at the bio.tools modelling definitely should go on the list of things to discuss further.

Earlier work

If you're looking for more information on our work in this area (and also a younger EDAM example), you'll find most of it in AL's thesis book, available at http://link.springer.com/book/10.1007%2F978-3-642-45389-2.

A workflow element description language for specifying wrappers that influenced Taverna is described in: Hajo N. Krabbenhöft, Steffen Möller, and Daniel Bayer. Integrating ARC Grid Middleware with Taverna Workflows. Bioinformatics, 24(9):1221-1222, 2008.
The current URL is http://taverna.nordugrid.org

OPPL-Galaxy, a Galaxy tool for enhancing ontology exploitation as part of bioinformatics workflows. J Biomed Semantics. 2013;4(1):2.