Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Various tools for assembling directly from the graph, including using labels to track paths. #1412

Merged
merged 97 commits into from
Oct 31, 2016

Conversation

camillescott
Copy link
Member

@camillescott camillescott commented Jul 25, 2016

(Forked from #1377, which is now closed.)

See especially sandbox/link-compact-dbg.py, end of tests/test_nodegraph.py, and sandbox/assemble-on-the-go.py

You need the fix/reverse_complement latest branch of screed: pip install git+https://github.com/dib-lab/screed.git@fix/reverse_complement

Note: incorporates linear-path contig assembly (#1372).

Camille's train of thought:

  • Move assemble-related stuff into a new Assembler class.

    Reasoning: Fits in line with overarching goal of stripping stuff out of Hashtable, such as was done with the Traverser class. Hashtable is already large and unwieldy.

  • Fix the segfaut in labelled assembly.

    Reasoning: It doesn't work! More precisely, the labelled assembly function appears to get stuck when confronted with cycles of a certain variety.

  • Subclass Assembler for LabeledAssembler

    Reasoning: See reasoning for Assembler class.

  • Add an _assemble_left function.

    Reasoning: Obvious extension of having an assemble right function.

  • Let assemble_linear_path take a std::function for filtering.

Changes that Might be Made

  • Use Traverser for Assembler

    Unfortunately, after trying to implement this myself, it's clear why @ctb didn't use it for this in the first place -- it's very geared toward agnostically walking around integer k-mers, and difficult / inefficient to extend to string traversal. This should probably change in the future, but not before tip detection is in. Update: this is now working.

Obligatory checklist (necessary for a PR to a PR?)

  • Is it mergeable?
  • make test Did it pass the tests?
  • make clean diff-cover If it introduces new functionality in
    scripts/ is it tested?
  • make format diff_pylint_report cppcheck doc pydocstyle Is it well
    formatted?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Is it documented in the ChangeLog?
    http://en.wikipedia.org/wiki/Changelog#Format
  • Was a spellchecker run on the source code and documentation after
    changes were made?
  • Do the changes respect streaming IO? (Are they
    tested for streaming IO?)
  • Is the Copyright year up to date?

ctb and others added 30 commits June 9, 2016 08:03
@camillescott
Copy link
Member Author

@ctb i'm sure the compact dbg stuff could use these fixtures, but that's out of scope of this PR

@camillescott
Copy link
Member Author

TODO for me: add a test for repeat sequences.

In particular, to cover a bug discovered in the NonLoopingAT, where the cursor was only added the seen-set when next_symbol was called. When neighbors were found with the neighbors function, the HDN junction was not added, allowing a "leak".

@betatim
Copy link
Member

betatim commented Oct 11, 2016

To help getting this reviewed and merged, can we take parts of this PR (starting from the lowest layer working upwards), split it out to a separate PR, review and merge it? Github's tools for seeing what we discussed already and what not etc are a bit crappy. It would also structure things a bit in terms of what is done, what is still in flux, etc. I would like that but obviously up to @camillescott.

(There must be some law of software dev somewhere stating that quality of code review declines exponentially with size of code to be reviewed)

@ctb
Copy link
Member

ctb commented Oct 11, 2016

I'm a bit confused here -- @camillescott is this PR fully baked at this point, other than changes requested under review? If so, @betatim I didn't think it was that big & obnoxious but I can try to pull out bits and pieces. If it's still being updated by Camille tho I am not that thrilled about that approach.

@camillescott
Copy link
Member Author

@ctb i'd call it baked and ready to be frosted -- further work would be best served by having this merged.

I see little point in pulling it apart into smaller pieces -- it's all quite tightly coupled, and I think, well tested. Perhaps the refactoring of the Traverser class and its use elsewhere in the code, along with the introduction of alphabets, could be separated, but I'm -1 on being burdened by more logistics when I've got a clock ticking for getting actual research done. @ctb will have to make the final call there.

@ctb
Copy link
Member

ctb commented Oct 13, 2016

On Thu, Oct 13, 2016 at 10:53:13AM -0700, Camille Scott wrote:

@ctb i'd call it baked and ready to be frosted -- further work would be best served by having this merged.

lol

I see little point in pulling it apart into smaller pieces -- it's all quite tightly coupled, and I think, well tested. Perhaps the refactoring of the Traverser class and its use elsewhere in the code, along with the introduction of alphabets, could be separated, but I'm -1 on being burdened by more logistics when I've got a clock ticking for getting actual research done. @ctb will have to make the final call there.

Right, I would be pulling it apart as part of the code review :).

@camillescott
Copy link
Member Author

In that case, carry on :)

@ctb
Copy link
Member

ctb commented Oct 13, 2016 via email

@betatim
Copy link
Member

betatim commented Oct 14, 2016

If its done then that is fine, I just wasn't sure if there was more to be added in which case we could have started merging stuff.




/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete me?

@betatim
Copy link
Member

betatim commented Oct 14, 2016

I've read the c++ a few times now and we could probably discuss for ages on rearranging things and why stuff is done the way it is but given that the tests pass and that there are quite a few comments in the c++ part (yay!) we could merge this and if someone feels strongly enough about it move things around later.

For the tests, what do you think about my comment about using a separate RNG instance everywhere? Mainly tedious leg work to add it :-/

I've not looked at the sandbox/ scripts.

@ctb
Copy link
Member

ctb commented Oct 30, 2016

I am +1 on merging it. @betatim can you extract your RNG comment into an issue?

@ctb
Copy link
Member

ctb commented Oct 30, 2016

I guess since I made the last commit, I can't review. So, @betatim or @camillescott, go ahead and review & merge.

@betatim
Copy link
Member

betatim commented Oct 31, 2016

One more merge of master into this branch, then a second commit that only updates the formatting according to make format.

Copy link
Member

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - thanks!

@ctb ctb merged commit ab262a6 into master Oct 31, 2016
@ctb ctb deleted the feature/review/pathlink branch October 31, 2016 13:00
@betatim
Copy link
Member

betatim commented Oct 31, 2016

🍻 good work everyone! Thanks for the patience with my complaining about code volume @camillescott .

@camillescott camillescott restored the feature/review/pathlink branch November 1, 2016 21:52
@standage standage deleted the feature/review/pathlink branch November 16, 2016 05:52
Dmarch28 pushed a commit to Dmarch28/khmer that referenced this pull request Mar 10, 2021
… labels to track paths. (dib-lab#1412)

* basic compilation works

* track labels in collapsed linear paths

* cleanup & commenting

* add assemble_labeled_path

* compile, basic fns & API

* changed fn signature of assemble_labeled_right

* seems to be ...working?

* made left + right assembly work

* a first attempt at streaming assembly

* basic stuff is working

* fix print function py2 issue

* track visited & avoid infinite loops

* add extract-unassembled-reads*py

* fix typo

* fixed bug with 'temporary' use of std:string

* update extract-unassembled-reads* sandbox scripts for python3

* fix @ljcohen short sequence issue?

* Make build_kmer const method

* First pass assembler class

* Add an assemble_left function to match assemble_right

* TEMPORARILY add github target for screed bugfix to pass tests

* TEMPORARILY add github target to all target for testing

* Add TEMP screed install to install deps, roll back other stuff

* Fix pep8 for jenkins

* Add node filtering functions

* Add more debugging output

* Turn on debugging

* Update tag and label getters to take a reference to Tag and Label sets

* Create a global collection of symbol alphabets

* Add symbols.hh and cc to setup.py

* Add tracking info assembly script; add specialized assembler traverser

* Expose reverse complement function to Python land

* Add import for reverse_complement

* Add string representation for Kmer

* Modify AssemblerTraverser to only emit found bases; update Assembler accordingly, rename to LinearAssembler

* Check for orientation of seed kmer

* Remove some debugging output

* Add a template parameter for the direction of the AssemblerTraverser.

* Fix direction parameters

* Fix syntax error from bad patch

* Replace function redirection with template specialization

* First pass at LabeledLinearAssembler

* Refactor labeled assembly into LabeledLinearAssembler, right assembly working

* Fill out direction templating for Labeled assembly, all tests passing

* explicitly add assembler functions to khmer namespace instead of a using clause, to allow comp with gcc

* Reorganize AssemblerTraverser into traversal

* Try declaring template specs

* Remove const qualifier on SeenSet * visited to allow call to insert

* Add new headers to liboxli Makefile

* Make Pep8 compliant, update _equals_rc to use new khmer rc method

* Remove -Wmaybe-uninitialized compiler warning at traversal.cc 225

* Update ChangeLog

* Cleanup some commented code and add some documentation to LabeledLinearAssembler

* Do autoformat

* Refactor Traverser into two new classes, NodeGatherer and NodeCursor, which follow the direction-templated model used for the AssemblerTraverser classes.

* LabeledLinearAssembler renamed to NaiveLabeledAssembler; remove redudant contig production and add basic tip detection

* Change alphabets to string and iterate with auto for loop; rename symbols to alphabets for clarity

* Add an additional revcomp check and mutate to be sure that all contigs have proper interleaving

* Remove unused set_cursor; allow an offset for join_contigs

* Rename labeled assembler, enable proper branching along with simple tip detection

* Revert setup.py compile arg change

* Add updated doxygen comments and move a couple function to their base classes

* Remove sneaky alphabet from Makefile

* Move assembly tests to their own file and add some utilities (with tests) for assembly testing

* Rename some tests, cluster with Classes, add clear comments with graph structures

* remove interrupted test, as it is not necessary and covered by test_right_of_branch_outwards_to_ends

* Rewrite assembly tests with fixtures, parameterize for sequence length and content, branch location, and graph type

* Allow passing stop_bf to labeled assembler from Pythonland

* Add graph structure fixtures for forks and bubbles, and convert tests, plus new tests for bubbles

* Remove debug flags

* Add encoding to test_assembly.py

* pep8 compliance

* Fix some pep8 and a dumb error

* Convert SimpleLabeledAssembler to iterative impl

* Add a test for tandem repeats

* Change signature of AssemblerTraverser::next_symbol to virtual to allow proper dispatch to subclassed versions

* Formatting only changes via `make format`

* fix pep8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants