1. Intro to CortexJDK
CortexJDK is a Java class library for the fast and low-memory inspection, manipulation, and traversal of multi-sample de Bruijn genome graphs and connectivity information in the Cortex/McCortex formats (.ctx and .ctp.gz).
CortexJDK enables a developer to do many graph-related tasks, including:
- iterate over records in a Cortex graph
- access arbitrary records in Cortex graphs via binary search
- dynamically merge multiple lexicographically sorted graphs
- perform simple graph walks (e.g. extracting a contig, optionally using links to disambiguate junction choices)
- perform graph walks assisted by one or more reference sequences in a manner consistent with link information
- perform depth-first searches with custom stopping rules (useful for finding interesting graph motifs)
- align k-mers and contigs back to reference sequences
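To illustrate why sorted records permit random access, here is a minimal sketch of binary search over fixed-width records. It is not CortexJDK's implementation and does not reflect the actual .ctx layout: the class name RecordSearchSketch, the ASCII key encoding, and the record layout (key bytes followed by payload bytes) are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: binary search over fixed-width records sorted by key.
// Assumed layout: each record is `recordSize` bytes, and its first `keySize`
// bytes are an ASCII k-mer. Real Cortex files use a different binary encoding.
class RecordSearchSketch {
    // Returns the index of the record whose key equals `key`, or -1 if absent.
    static int find(ByteBuffer data, int recordSize, int keySize, String key) {
        byte[] probe = new byte[keySize];
        int lo = 0;
        int hi = data.limit() / recordSize - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            // Read only the key of the middle record, using absolute offsets.
            for (int i = 0; i < keySize; i++) {
                probe[i] = data.get(mid * recordSize + i);
            }
            int cmp = new String(probe, StandardCharsets.US_ASCII).compareTo(key);
            if (cmp == 0) return mid;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
    }
}
```

Because every record has the same width, the offset of record i is simply i times the record size, so a lookup touches only a logarithmic number of records rather than scanning the file.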
CortexJDK is heavily inspired by htsjdk and the GATK in its design; it can be used either as a class library or as a framework for writing new graph-based applications. Substantial effort has gone into making routine tasks as easy as possible for the developer, while the library takes care of the heavy lifting behind the scenes.
Below is the entire code listing (sans imports) for a console program that takes N lexicographically sorted graphs and outputs a merged graph. It is just 15 lines of code: 10 lines of boilerplate and 5 additional lines to do the actual work.
@Description(text="Dynamically merge N lexicographically sorted graphs into a single graph")
public class Join extends Module { // create a new console program called Join
    @Argument(fullName="graph", shortName="g", doc="Graph") // take as inputs an arbitrary number of graphs
    public ArrayList<CortexGraph> GRAPHS; // present them to the program as the list GRAPHS
    @Output // write output to the user-specified location
    public File out;
    @Override
    public void execute() {
        CortexCollection cc = new CortexCollection(GRAPHS); // create an on-the-fly graph merging object
        CortexGraphWriter cgw = new CortexGraphWriter(out); // open the output file for the joined graph
        cgw.setHeader(cc.getHeader()); // write the merged header to the output
        for (CortexRecord cr : cc) { cgw.addRecord(cr); } // for each merged record, write it to the output
        cgw.close(); // close the output file
    }
}
CortexJDK presents this new module, Join, as a console application with two command-line arguments and a program description. Run without any arguments, it generates the following output:
$ java -jar dist/cortexjdk.jar Join
I [2019-07-15 11:36 8258] CortexJDK 0.4-29aad; (repo) 2019-07-15 11:05:28 -0400; (build) 2019-07-15 11:36:49 -0400
I [2019-07-15 11:36 8258] java -Xmx4g -jar cortexjdk.jar Join
I [2019-07-15 11:36 8258]
Usage: java -jar cortexjdk.jar Join [arguments]
Dynamically merge N lexicographically sorted graphs into a single graph
-g,--graph <arg> Graph [can be specified more than once]
-h,--help Show this help message
-o,--output <arg> The output file [default: /dev/null]
The first argument, --graph (or -g), accepts one or more graph files (as determined by the argument's type, ArrayList<CortexGraph>). The second argument, --output (or -o), uses the default names given to any variable annotated as @Output rather than @Argument. This default helps enforce command-line consistency among the tools, although the names can also be set explicitly, as is often necessary for programs with multiple outputs.
The CortexCollection object merges N lexicographically sorted lists without loading everything into memory: it advances a file pointer on each list, dynamically merging records and presenting them to the developer as if they originated from a single merged graph. From there, merging the graphs is as simple as iterating over every record given by the CortexCollection and adding each one to the CortexGraphWriter object. For good housekeeping, we close() the graph writer before the program exits.
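The streaming merge described above can be sketched independently of CortexJDK. The following is only an illustration of the general k-way merge technique, not CortexCollection's actual implementation: it assumes each input is an iterator over lexicographically sorted k-mer strings, and the names SortedMergeSketch and Head are hypothetical. For brevity it collects the output into a list, whereas a truly low-memory version would emit each merged record lazily.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: k-way merge over sorted k-mer streams.
class SortedMergeSketch {
    // One pending k-mer together with the index of the input it came from.
    private static final class Head {
        final String kmer;
        final int source;
        Head(String kmer, int source) { this.kmer = kmer; this.source = source; }
    }

    // Each output row is a k-mer followed by one presence flag per input.
    static List<String> merge(List<Iterator<String>> inputs) {
        PriorityQueue<Head> heap = new PriorityQueue<>((a, b) -> a.kmer.compareTo(b.kmer));
        // Prime the heap with the first k-mer of every non-empty input.
        for (int i = 0; i < inputs.size(); i++) {
            if (inputs.get(i).hasNext()) heap.add(new Head(inputs.get(i).next(), i));
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            String kmer = heap.peek().kmer;
            boolean[] present = new boolean[inputs.size()];
            // Pop every input currently positioned at this k-mer and advance it.
            while (!heap.isEmpty() && heap.peek().kmer.equals(kmer)) {
                Head h = heap.poll();
                present[h.source] = true;
                Iterator<String> it = inputs.get(h.source);
                if (it.hasNext()) heap.add(new Head(it.next(), h.source));
            }
            merged.add(kmer + " " + Arrays.toString(present));
        }
        return merged;
    }
}
```

At any moment the heap holds at most one pending k-mer per input, which is why memory use depends on the number of inputs rather than on their size.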
The following commands build CortexJDK and run this application:
$ cd <path to CortexJDK repository>
$ ant
$ java -jar dist/cortexjdk.jar Join -g graph1.ctx -g graph2.ctx -o graph_joined.ctx
CortexJDK makes traversals on the graphs straightforward as well. The following short program extracts a contig seeded by a single k-mer from all colors of a graph.
@Description(text="Extract a contig seeded by a source k-mer from all graph colors")
public class Contig extends Module {
    @Argument(fullName="graph", shortName="g", doc="Graph")
    public CortexGraph GRAPH; // take a graph as input and present it as GRAPH
    @Argument(fullName="source", shortName="s", doc="Starting (source) k-mer")
    public String SOURCE; // take a k-mer as input and present it as SOURCE
    @Output
    public PrintStream out; // write output (by default, to stdout)
    @Override
    public void execute() {
        for (int c = 0; c < GRAPH.getNumColors(); c++) { // iterate over graph colors
            TraversalEngine e = new TraversalEngineFactory() // configure a traversal engine object
                .traversalColors(c) // - for color 'c'
                .traversalDirection(BOTH) // - traverse both directions
                .combinationOperator(OR) // - combine if either direction is successful
                .stoppingRule(ContigStopper.class) // - use the rules found here to stop the traversal
                .graph(GRAPH) // - conduct all traversals on GRAPH
                .make(); // - make the final engine
            // use the engine to walk the graph, returning a list of nodes that can be converted to a contig
            String contig = TraversalUtils.toContig(e.walk(SOURCE));
            // output the contig to the console
            out.println(String.format("source: %s color: %d contig: %s", SOURCE, c, contig));
        }
    }
}
Here, we iterate over all colors in the graph and, for each one, instantiate a TraversalEngine object, using the TraversalEngineFactory for easy configuration. We then walk the graph in both directions starting at the SOURCE k-mer and stopping when the criteria set by ContigStopper are satisfied. We convert the walk to a contig and write it to the output (stdout by default).
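To make the idea of a bidirectional walk with a stopping rule concrete, here is a minimal sketch in plain Java that does not use CortexJDK. It is a deliberately simplified assumption of how such a walk could work, not the behavior of TraversalEngine or ContigStopper: the graph is a plain set of k-mer strings, reverse complements, colors, and links are ignored, and the walk stops at dead ends, junctions, or previously visited k-mers. The class name ContigWalkSketch is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: extract a contig by walking a k-mer set in both directions.
class ContigWalkSketch {
    private static final char[] BASES = {'A', 'C', 'G', 'T'};

    // Extend from `seed` in one direction and return only the bases that were added.
    private static String extend(Set<String> kmers, String seed, Set<String> seen, boolean forward) {
        int k = seed.length();
        StringBuilder added = new StringBuilder();
        String cur = seed;
        while (true) {
            String next = null;
            int choices = 0;
            for (char b : BASES) {
                // Candidate neighbor: shift the current k-mer by one base.
                String cand = forward ? cur.substring(1) + b : b + cur.substring(0, k - 1);
                if (kmers.contains(cand)) { next = cand; choices++; }
            }
            // Stopping rule: dead end, junction, or an already visited k-mer (cycle).
            if (choices != 1 || !seen.add(next)) break;
            added.append(forward ? next.charAt(k - 1) : next.charAt(0));
            cur = next;
        }
        // Backward extensions were collected right-to-left, so flip them.
        return forward ? added.toString() : added.reverse().toString();
    }

    // Walk backward and forward from the seed and join the pieces into one contig.
    static String walk(Set<String> kmers, String seed) {
        Set<String> seen = new HashSet<>();
        seen.add(seed);
        String left = extend(kmers, seed, seen, false);
        String right = extend(kmers, seed, seen, true);
        return left + seed + right;
    }
}
```

In this sketch the stopping rule is hard-coded inside extend; passing it in as a separate object, as the stoppingRule(...) option in the listing above suggests, would let the same walking loop serve different kinds of searches.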