-
Notifications
You must be signed in to change notification settings - Fork 295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Various tools for assembling directly from the graph, including using labels to track paths. #1412
Conversation
@ctb i'm sure the compact dbg stuff could use these fixtures, but that's out of scope of this PR |
TODO for me: add a test for repeat sequences. In particular, to cover a bug discovered in the NonLoopingAT, where the cursor was only added the seen-set when |
…ow proper dispatch to subclassed versions
To help getting this reviewed and merged, can we take parts of this PR (starting from the lowest layer working upwards), split it out to a separate PR, review and merge it? Github's tools for seeing what we discussed already and what not etc are a bit crappy. It would also structure things a bit in terms of what is done, what is still in flux, etc. I would like that but obviously up to @camillescott. (There must be some law of software dev somewhere stating that quality of code review declines exponentially with size of code to be reviewed) |
I'm a bit confused here -- @camillescott is this PR fully baked at this point, other than changes requested under review? If so, @betatim I didn't think it was that big & obnoxious but I can try to pull out bits and pieces. If it's still being updated by Camille tho I am not that thrilled about that approach. |
@ctb i'd call it baked and ready to be frosted -- further work would be best served by having this merged. I see little point in pulling it apart into smaller pieces -- it's all quite tightly coupled, and I think, well tested. Perhaps the refactoring of the |
On Thu, Oct 13, 2016 at 10:53:13AM -0700, Camille Scott wrote:
lol
Right, I would be pulling it apart as part of the code review :). |
In that case, carry on :) |
yes ma'am! :)
|
If its done then that is fine, I just wasn't sure if there was more to be added in which case we could have started merging stuff. |
|
||
|
||
|
||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete me?
I've read the c++ a few times now and we could probably discuss for ages on rearranging things and why stuff is done the way it is but given that the tests pass and that there are quite a few comments in the c++ part (yay!) we could merge this and if someone feels strongly enough about it move things around later. For the tests, what do you think about my comment about using a separate RNG instance everywhere? Mainly tedious leg work to add it :-/ I've not looked at the |
I am +1 on merging it. @betatim can you extract your RNG comment into an issue? |
I guess since I made the last commit, I can't review. So, @betatim or @camillescott, go ahead and review & merge. |
One more merge of master into this branch, then a second commit that only updates the formatting according to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good - thanks!
🍻 good work everyone! Thanks for the patience with my complaining about code volume @camillescott . |
… labels to track paths. (dib-lab#1412) * basic compilation works * track labels in collapsed linear paths * cleanup & commenting * add assemble_labeled_path * compile, basic fns & API * changed fn signature of assemble_labeled_right * seems to be ...working? * made left + right assembly work * a first attempt at streaming assembly * basic stuff is working * fix print function py2 issue * track visited & avoid infinite loops * add extract-unassembled-reads*py * fix typo * fixed bug with 'temporary' use of std:string * update extract-unassembled-reads* sandbox scripts for python3 * fix @ljcohen short sequence issue? * Make build_kmer const method * First pass assembler class * Add an assemble_left function to match assemble_right * TEMPORARILY add github target for screed bugfix to pass tests * TEMPORARILY add github target to all target for testing * Add TEMP screed install to install deps, roll back other stuff * Fix pep8 for jenkins * Add node filtering functions * Add more debugging output * Turn on debugging * Update tag and label getters to take a reference to Tag and Label sets * Create a global collection of symbol alphabets * Add symbols.hh and cc to setup.py * Add tracking info assembly script; add specialized assembler traverser * Expose reverse complement function to Python land * Add import for reverse_complement * Add string representation for Kmer * Modify AssemblerTraverser to only emit found bases; update Assembler accordingly, rename to LinearAssembler * Check for orientation of seed kmer * Remove some debugging output * Add a template parameter for the direction of the AssemblerTraverser. * Fix direction parameters * Fix syntax error from bad patch * Replace function redirection with template specialization * First pass at LabeledLinearAssembler * Refactor labeled assembly into LabeledLinearAssembler, right assembly working * Fill out direction templating for Labeled assembly, all tests passing * explicitly add assembler functions to khmer namespace instead of a using clause, to allow comp with gcc * Reorganize AssemblerTraverser into traversal * Try declaring template specs * Remove const qualifier on SeenSet * visited to allow call to insert * Add new headers to liboxli Makefile * Make Pep8 compliant, update _equals_rc to use new khmer rc method * Remove -Wmaybe-uninitialized compiler warning at traversal.cc 225 * Update ChangeLog * Cleanup some commented code and add some documentation to LabeledLinearAssembler * Do autoformat * Refactor Traverser into two new classes, NodeGatherer and NodeCursor, which follow the direction-templated model used for the AssemblerTraverser classes. * LabeledLinearAssembler renamed to NaiveLabeledAssembler; remove redudant contig production and add basic tip detection * Change alphabets to string and iterate with auto for loop; rename symbols to alphabets for clarity * Add an additional revcomp check and mutate to be sure that all contigs have proper interleaving * Remove unused set_cursor; allow an offset for join_contigs * Rename labeled assembler, enable proper branching along with simple tip detection * Revert setup.py compile arg change * Add updated doxygen comments and move a couple function to their base classes * Remove sneaky alphabet from Makefile * Move assembly tests to their own file and add some utilities (with tests) for assembly testing * Rename some tests, cluster with Classes, add clear comments with graph structures * remove interrupted test, as it is not necessary and covered by test_right_of_branch_outwards_to_ends * Rewrite assembly tests with fixtures, parameterize for sequence length and content, branch location, and graph type * Allow passing stop_bf to labeled assembler from Pythonland * Add graph structure fixtures for forks and bubbles, and convert tests, plus new tests for bubbles * Remove debug flags * Add encoding to test_assembly.py * pep8 compliance * Fix some pep8 and a dumb error * Convert SimpleLabeledAssembler to iterative impl * Add a test for tandem repeats * Change signature of AssemblerTraverser::next_symbol to virtual to allow proper dispatch to subclassed versions * Formatting only changes via `make format` * fix pep8
(Forked from #1377, which is now closed.)
See especially sandbox/link-compact-dbg.py, end of tests/test_nodegraph.py, and sandbox/assemble-on-the-go.py
You need the fix/reverse_complement latest branch of screed: pip install git+https://github.com/dib-lab/screed.git@fix/reverse_complement
Note: incorporates linear-path contig assembly (#1372).
Camille's train of thought:
Move assemble-related stuff into a new
Assembler
class.Reasoning: Fits in line with overarching goal of stripping stuff out of
Hashtable
, such as was done with theTraverser
class.Hashtable
is already large and unwieldy.Fix the segfaut in labelled assembly.
Reasoning: It doesn't work! More precisely, the labelled assembly function appears to get stuck when confronted with cycles of a certain variety.
Subclass
Assembler
forLabeledAssembler
Reasoning: See reasoning for
Assembler
class.Add an
_assemble_left
function.Reasoning: Obvious extension of having an assemble right function.
Let
assemble_linear_path
take astd::function
for filtering.Changes that Might be Made
Use
Traverser
forAssembler
Unfortunately, after trying to implement this myself, it's clear why @ctb didn't use it for this in the first place -- it's very geared toward agnostically walking around integer k-mers, and difficult / inefficient to extend to string traversal. This should probably change in the future, but not before tip detection is in. Update: this is now working.
Obligatory checklist (necessary for a PR to a PR?)
make test
Did it pass the tests?make clean diff-cover
If it introduces new functionality inscripts/
is it tested?make format diff_pylint_report cppcheck doc pydocstyle
Is it wellformatted?
without a major version increment. Changing file formats also requires a
major version number increment.
ChangeLog
?http://en.wikipedia.org/wiki/Changelog#Format
changes were made?
tested for streaming IO?)