1. Intro to CortexJDK

Kiran V Garimella edited this page Jul 18, 2019 · 1 revision

What is CortexJDK?

CortexJDK is a Java class library for the fast and low-memory inspection, manipulation, and traversal of multi-sample de Bruijn genome graphs and connectivity information in the Cortex/McCortex formats (.ctx and .ctp.gz).

CortexJDK enables a developer to do many graph-related tasks, including:

  • iterate over records in a Cortex graph
  • access arbitrary records in Cortex graphs via binary search
  • dynamically merge multiple lexicographically sorted graphs
  • perform simple graph walks (i.e. extracting a contig, optionally using links to disambiguate junction choices)
  • perform graph walks assisted by one or more reference sequences in a manner consistent with link information
  • perform depth-first searches with custom stopping rules (useful for finding interesting graph motifs)
  • align k-mers and contigs back to reference sequences

Design

CortexJDK is heavily inspired by htsjdk and the GATK in its design; it can be used either as a class library or as a framework for writing new graph-based applications. Substantial effort has gone into making routine tasks as easy as possible for the developer while the library takes care of the heavy lifting behind the scenes.

Example 1 (Join), annotated and explained

Below is the entire code listing (sans imports) for a console program that takes N lexicographically sorted graphs and outputs a merged graph. It is just 15 lines of code: 10 lines of boilerplate and 5 additional lines to do the actual work.

@Description(text="Dynamically merge N lexicographically sorted graphs into a single graph")
public class Join extends Module {                           // create a new console program called Join
    @Argument(fullName="graph", shortName="g", doc="Graph")  // take as inputs an arbitrary number of graphs
    public ArrayList<CortexGraph> GRAPHS;                    // present them to the program as the list GRAPHS

    @Output                                                  // write output to the user-specified location
    public File out;

    @Override
    public void execute() {
        CortexCollection cc = new CortexCollection(GRAPHS);  // create an on-the-fly graph merging object

        CortexGraphWriter cgw = new CortexGraphWriter(out);  // open the output file for the joined graph
        cgw.setHeader(cc.getHeader());                       // write the merged header to the output

        for (CortexRecord cr : cc) { cgw.addRecord(cr); }    // for each merged record, write it to the output

        cgw.close();                                         // close the output file
    }
}

CortexJDK presents this new module, Join, as a console application with two command-line arguments and a program description. When run without any arguments, it generates the following output:

$ java -jar dist/cortexjdk.jar Join
I [2019-07-15 11:36 8258] CortexJDK 0.4-29aad; (repo) 2019-07-15 11:05:28 -0400; (build) 2019-07-15 11:36:49 -0400
I [2019-07-15 11:36 8258] java -Xmx4g -jar cortexjdk.jar Join
I [2019-07-15 11:36 8258]

Usage: java -jar cortexjdk.jar Join [arguments]
Dynamically merge N lexicographically sorted graphs into a single graph
 -g,--graph <arg>    Graph [can be specified more than once]
 -h,--help           Show this help message
 -o,--output <arg>   The output file [default: /dev/null]

The first argument, --graph (or -g), accepts one or more graph files (as determined by the argument's type, ArrayList<CortexGraph>). The second argument, --output (or -o), takes its names from the defaults given to any variable annotated as @Output rather than @Argument. These defaults help enforce command-line consistency among the tools, although the names can also be set explicitly, as is often necessary for programs with multiple outputs.
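To make the idea of annotation-driven arguments concrete, here is a minimal sketch of how a framework could discover annotated fields by reflection and derive usage lines from them. It is not CortexJDK's implementation: the annotations Arg and Out, the Demo class, and the usage() helper are all hypothetical stand-ins written for this illustration, and only the option names and help text are borrowed from the usage output shown above.

```java
import java.io.File;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical illustration only; none of these types belong to CortexJDK.
final class ArgumentScanSketch {
    // Stand-in for an input annotation: names must be given explicitly.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Arg {
        String fullName();
        String shortName();
        String doc();
    }

    // Stand-in for an output annotation: names fall back to shared defaults.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Out {
        String fullName() default "output";
        String shortName() default "o";
    }

    // A toy module with one repeatable input and one default-named output.
    static final class Demo {
        @Arg(fullName = "graph", shortName = "g", doc = "Graph")
        public ArrayList<String> GRAPHS;

        @Out
        public File out;
    }

    // Build one usage line per annotated field of the given class.
    static List<String> usage(Class<?> module) {
        List<String> lines = new ArrayList<>();
        for (Field f : module.getDeclaredFields()) {
            Arg a = f.getAnnotation(Arg.class);
            Out o = f.getAnnotation(Out.class);
            if (a != null) {
                // A collection-typed field means the option may be repeated.
                boolean repeatable = Collection.class.isAssignableFrom(f.getType());
                lines.add("-" + a.shortName() + ",--" + a.fullName() + " <arg> " + a.doc()
                        + (repeatable ? " [can be specified more than once]" : ""));
            } else if (o != null) {
                lines.add("-" + o.shortName() + ",--" + o.fullName() + " <arg> The output file");
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : usage(Demo.class)) {
            System.out.println(line);
        }
    }
}
```

In this sketch the field's declared type decides whether an option is repeatable, and the output annotation's default values supply the shared -o/--output names unless a module overrides them.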

The CortexCollection object merges N lexicographically sorted lists without loading everything into memory: it advances a file pointer on each list, dynamically merging records and presenting them to the developer as if they originated from a single merged graph. From there, merging the graphs is as simple as iterating over every record given by the CortexCollection and adding the record to the CortexGraphWriter object. For good housekeeping, we close() the graph writer before the program exits.
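The merging strategy described above is a k-way merge of sorted streams. The sketch below shows the general technique with a priority queue holding one cursor per input, so that only one pending record per input is in memory at a time. It is an illustration under simplifying assumptions, not CortexCollection's actual code: records are plain k-mer strings, identical k-mers are simply collapsed (a real graph merge would also have to combine the per-color information of matching records), and the class name KWayMergeSketch and its merge() method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Consumer;

// Hypothetical illustration only; not part of CortexJDK.
final class KWayMergeSketch {
    // One cursor per input: the record it currently points at, plus the rest of that input.
    private static final class Cursor {
        String head;
        final Iterator<String> rest;

        Cursor(String head, Iterator<String> rest) {
            this.head = head;
            this.rest = rest;
        }
    }

    // Stream the sorted union of the inputs to 'sink', emitting each distinct record once.
    static void merge(List<Iterator<String>> inputs, Consumer<String> sink) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        for (Iterator<String> it : inputs) {
            if (it.hasNext()) {
                heap.add(new Cursor(it.next(), it));
            }
        }

        String last = null;
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();          // smallest pending record across all inputs
            if (!c.head.equals(last)) {      // the same k-mer seen in several inputs is emitted once
                sink.accept(c.head);
                last = c.head;
            }
            if (c.rest.hasNext()) {          // advance only the input we just consumed from
                c.head = c.rest.next();
                heap.add(c);
            }
        }
    }

    public static void main(String[] args) {
        List<Iterator<String>> inputs = new ArrayList<>();
        inputs.add(Arrays.asList("AAC", "ACG", "CGT").iterator());
        inputs.add(Arrays.asList("AAC", "AGT", "TTA").iterator());

        List<String> merged = new ArrayList<>();
        merge(inputs, merged::add);
        System.out.println(merged); // prints [AAC, ACG, AGT, CGT, TTA]
    }
}
```

Because each step touches only the heap and one input, memory use grows with the number of inputs rather than with their sizes, which is why the inputs must already be sorted.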

The following command line runs this application:

$ cd <path to CortexJDK repository>
$ ant
$ java -jar dist/cortexjdk.jar Join -g graph1.ctx -g graph2.ctx -o graph_joined.ctx

Example 2 (Contig), annotated and explained

CortexJDK makes traversals on the graphs straightforward as well. The following short program extracts a contig seeded by a single k-mer from all colors of a graph.

@Description(text="Extract a contig seeded by a source k-mer from all graph colors")
public class Contig extends Module {
    @Argument(fullName="graph", shortName="g", doc="Graph")
    public CortexGraph GRAPH;                                // take a graph as input and present it as GRAPH

    @Argument(fullName="source", shortName="s", doc="Starting (source) k-mer")
    public String SOURCE;                                    // take a k-mer as input and present it as SOURCE

    @Output                                                  
    public PrintStream out;                                  // write output (by default, to stdout)

    @Override
    public void execute() {
        for (int c = 0; c < GRAPH.getNumColors(); c++) {     // iterate over graph colors
            TraversalEngine e = new TraversalEngineFactory() // configure a traversal engine object
                    .traversalColors(c)                      // - for color 'c'
                    .traversalDirection(BOTH)                // - traverse both directions
                    .combinationOperator(OR)                 // - combine if either direction is successful
                    .stoppingRule(ContigStopper.class)       // - use the rules found here to stop the traversal
                    .graph(GRAPH)                            // - conduct all traversals on GRAPH
                    .make();                                 // - make the final engine

            // use the engine to walk the graph, returning a list of nodes that can be converted to a contig
            String contig = TraversalUtils.toContig(e.walk(SOURCE));

            // output the contig to the console
            out.println(String.format("source: %s color: %d contig: %s", SOURCE, c, contig));
        }
    }
}

Here, we iterate over all colors in the graph and, for each one, use the TraversalEngineFactory to configure and instantiate a TraversalEngine. We then walk the graph in both directions starting at the SOURCE k-mer, stopping when the criteria set by ContigStopper are satisfied. Finally, we convert the walk to a contig and write it to the output (stdout by default).
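For readers new to de Bruijn graph traversal, the following toy sketch shows what a bidirectional walk from a seed k-mer involves. It deliberately avoids the CortexJDK API: the graph is just a set of k-mer strings on one strand, there are no colors, links, or reverse complements, and the stopping rule (halt at any junction, dead end, or already-visited k-mer) is an assumption chosen for simplicity rather than a description of ContigStopper. The class ContigWalkSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of CortexJDK.
final class ContigWalkSketch {
    private static final char[] BASES = {'A', 'C', 'G', 'T'};

    // k-mers in the graph that overlap 'kmer' by k-1 bases on its right side.
    static List<String> next(Set<String> kmers, String kmer) {
        List<String> out = new ArrayList<>();
        for (char b : BASES) {
            String cand = kmer.substring(1) + b;
            if (kmers.contains(cand)) {
                out.add(cand);
            }
        }
        return out;
    }

    // k-mers in the graph that overlap 'kmer' by k-1 bases on its left side.
    static List<String> prev(Set<String> kmers, String kmer) {
        List<String> in = new ArrayList<>();
        for (char b : BASES) {
            String cand = b + kmer.substring(0, kmer.length() - 1);
            if (kmers.contains(cand)) {
                in.add(cand);
            }
        }
        return in;
    }

    // Extend the seed in both directions until a junction, dead end, or cycle is reached.
    static String walk(Set<String> kmers, String seed) {
        if (!kmers.contains(seed)) {
            return "";
        }
        StringBuilder contig = new StringBuilder(seed);
        Set<String> seen = new HashSet<>();
        seen.add(seed);

        String cur = seed;
        while (true) {                                   // extend to the right
            List<String> out = next(kmers, cur);
            if (out.size() != 1) break;                  // dead end or outgoing junction
            String n = out.get(0);
            if (prev(kmers, n).size() != 1 || !seen.add(n)) break; // incoming junction or cycle
            contig.append(n.charAt(n.length() - 1));
            cur = n;
        }

        cur = seed;
        while (true) {                                   // extend to the left
            List<String> in = prev(kmers, cur);
            if (in.size() != 1) break;
            String p = in.get(0);
            if (next(kmers, p).size() != 1 || !seen.add(p)) break;
            contig.insert(0, p.charAt(0));
            cur = p;
        }
        return contig.toString();
    }

    public static void main(String[] args) {
        Set<String> kmers = new HashSet<>(Arrays.asList("AAC", "ACG", "CGT", "GTG"));
        System.out.println(walk(kmers, "CGT")); // prints AACGTG
    }
}
```

Adding a competing k-mer such as GTC to the set creates a junction after CGT, and the same walk then stops at AACGT. Resolving that kind of ambiguity is the sort of choice the link information mentioned earlier is meant to help with.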