Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to quickly pick the path(s), when multiple circular genome structures produced, and when SC and IR share short repeats #86

Open
Kinggerm opened this issue May 23, 2021 · 1 comment

Comments

@Kinggerm
Copy link
Owner

Kinggerm commented May 23, 2021

Background

Small repeats in the assembly graph can hamper the genome assembly. However, in some cases when we have prior knowledge of the target, e.g. the plastome structure, we may have the resolution without additional information. Here's one example.

assembly_graph 1

Supposing we achieved an plastome-sufficient assembly graph like above, we may get 8 paths using GetOrganelle with log info:

2021-05-22 23:22:22,849 - INFO: Vertex_110274 #copy = 2
2021-05-22 23:22:22,849 - INFO: Vertex_111266 #copy = 1
2021-05-22 23:22:22,850 - INFO: Vertex_111440 #copy = 3
2021-05-22 23:22:22,850 - INFO: Vertex_111452 #copy = 1
2021-05-22 23:22:22,850 - INFO: Vertex_111536 #copy = 1
2021-05-22 23:22:22,850 - INFO: Vertex_111548 #copy = 3
2021-05-22 23:22:22,850 - INFO: Vertex_111596_111392_111590 #copy = 2
2021-05-22 23:22:22,850 - INFO: Vertex_111598 #copy = 1
2021-05-22 23:22:22,850 - INFO: Vertex_111600 #copy = 2
2021-05-22 23:22:22,850 - INFO: Average embplant_pt kmer-coverage = 107.4
2021-05-22 23:22:22,850 - INFO: Average embplant_pt base-coverage = 355.5
2021-05-22 23:22:22,850 - INFO: Writing output ...
2021-05-22 23:22:22,967 - WARNING: Multiple circular genome structures produced!
2021-05-22 23:22:22,967 - WARNING: Please check the existence of those isomers by using reads mapping (library information) or longer reads.
2021-05-22 23:22:24,607 - INFO: Detecting large repeats (>1000 bp) in PATH1 with DRs detected, Total:LSC:SSC:Repeat(bp) = 165435:109819:4738:25439
2021-05-22 23:22:24,607 - INFO: Writing PATH1 of complete embplant_pt to issue86-output/embplant_pt.K105.complete.graph1.1.path_sequence.fasta
2021-05-22 23:22:24,608 - INFO: Writing PATH2 of complete embplant_pt to issue86-output/embplant_pt.K105.complete.graph1.2.path_sequence.fasta
2021-05-22 23:22:24,610 - INFO: Writing PATH3 of complete embplant_pt to issue86-output/embplant_pt.K105.complete.graph1.3.path_sequence.fasta
2021-05-22 23:22:24,611 - INFO: Writing PATH4 of complete embplant_pt to issue86-output/embplant_pt.K105.complete.graph1.4.path_sequence.fasta
2021-05-22 23:22:24,613 - INFO: Writing PATH5 of complete embplant_pt to issue86-output/embplant_pt.K105.complete.graph1.5.path_sequence.fasta
2021-05-22 23:22:24,614 - INFO: Writing PATH6 of complete embplant_pt to issue86-output/embplant_pt.K105.complete.graph1.6.path_sequence.fasta
2021-05-22 23:22:24,616 - INFO: Writing PATH7 of complete embplant_pt to issue86-output/embplant_pt.K105.complete.graph1.7.path_sequence.fasta
2021-05-22 23:22:24,617 - INFO: Writing PATH8 of complete embplant_pt to issue86-output/embplant_pt.K105.complete.graph1.8.path_sequence.fasta

We use red to denote those contigs with higher coverage, which are continuous in the graph and likely to be the IR regions (check the node of 11600 and 11536). So this sample is definitely not a DR plastome, which is rare and currently only discovered in Selaginella species, and PATH1 should thus be excluded. You may further confirm this by loading the *.CSV file into Bandage, which can find that genes in red contigs are likely to be in the IRs.
assembly_graph 2

Now we have a relative clear understand of this plastome assembly graph, that the LSC region and the IR regions shared two short repeats (contig 111440 and 111548, set in orange).
assembly_graph 3

Solution 1 (laborious)

We can manually duplicate contig 111440 and 111548,
assembly_graph 4

prune the connections,
assembly_graph 5

making the graph like a typical plastome assembly graph. Then we may export the edited assembly graph and use get_organelle_from_assembly.py to generate the final two paths.

Solution 2 (command line)

However, manually adjusting assembly graph in Bandage is laborious. We could use another approach, the script plastome_arch_info.py of GetOrganelle by type in

plastome_arch_info.py issue86-output/*.fasta

then we will get

    file_name	sequence_name	total_length	LSC_length	SSC_length	IR/DR_length	arch_Notes	GC_content
issue86-output/embplant_pt.K105.complete.graph1.1.path_sequence.fasta	111536+,111600+,111548-,111266+,111440-,111452+,110274-,111440-,111596_111392_111590-,111548+,111598-,110274-,111440-,111596_111392_111590-,111548+,111600-(circular)	165435	109819	4738	25439	DRs detected	0.3667
issue86-output/embplant_pt.K105.complete.graph1.2.path_sequence.fasta	111536+,111600+,111548-,111596_111392_111590+,111440+,110274+,111598+,111548-,111596_111392_111590+,111440+,110274+,111452-,111440+,111266-,111548+,111600-(circular)	165435	109819	4738	25439	DRs detected	0.3667
issue86-output/embplant_pt.K105.complete.graph1.3.path_sequence.fasta	111598+,111548-,111266+,111440-,111452+,110274-,111440-,111596_111392_111590-,111548+,111600-,111536+,111600+,111548-,111596_111392_111590+,111440+,110274+(circular)	165435	83689	8094	36826	IRs detected	0.3667
issue86-output/embplant_pt.K105.complete.graph1.4.path_sequence.fasta	111598+,111548-,111266+,111440-,111452+,110274-,111440-,111596_111392_111590-,111548+,111600-,111536-,111600+,111548-,111596_111392_111590+,111440+,110274+(circular)	165435	83689	8094	36826	IRs detected	0.3667
issue86-output/embplant_pt.K105.complete.graph1.5.path_sequence.fasta	111536+,111600+,111548-,111266+,111440-,111596_111392_111590-,111548+,111598-,110274-,111440-,111452+,110274-,111440-,111596_111392_111590-,111548+,111600-(circular)	165435	77542	41189	23352	DRs detected	0.3667
issue86-output/embplant_pt.K105.complete.graph1.6.path_sequence.fasta	111536+,111600+,111548-,111596_111392_111590+,111440+,110274+,111452-,111440+,110274+,111598+,111548-,111596_111392_111590+,111440+,111266-,111548+,111600-(circular)	165435	77542	41189	23352	DRs detected	0.3667
issue86-output/embplant_pt.K105.complete.graph1.7.path_sequence.fasta	111536+,111600+,111548-,111596_111392_111590+,111440+,110274+,111452-,111440+,110274+,111598+,111548-,111266+,111440-,111596_111392_111590-,111548+,111600-(circular)	165435	87863	8094	34739	IRs detected	0.3667
issue86-output/embplant_pt.K105.complete.graph1.8.path_sequence.fasta	111536+,111600+,111548-,111596_111392_111590+,111440+,111266-,111548+,111598-,110274-,111440-,111452+,110274-,111440-,111596_111392_111590-,111548+,111600-(circular)	165435	87863	8094	34739	IRs detected	0.3667

where we can find that PATH3 and PATH4 has the longest IR size, which should be our final plastome result.
One may use following commands to quickly find the target file(s):

plastome_arch_info.py issue86-output/*.fasta -o plastome_arch.list
largest_size=`cat plastome_arch.list | sed -e 1d |awk 'NR==1{max=$6;next}{max=max>$6?max:$6}END{print max}'`
cat plastome_arch.list|awk '($6==size){print $1}' size=$largest_size

which can get you:

issue86-output/embplant_pt.K105.complete.graph1.3.path_sequence.fasta
issue86-output/embplant_pt.K105.complete.graph1.4.path_sequence.fasta

You may add a for-loop and a cp to quickly pick the target path(s) out.

@elena-pt
Copy link

elena-pt commented Jul 17, 2024

Good morning,

Could I please clarify with you what to do if 4 sequences with the longest IR (of the same length) are identified?
Also I am interested in question, how to quickly and automatically understand which of the 2 cases described here is implemented: the presence of symmetric and asymmetric configurations or the coincidence of short repeats in SC and IR.

Additionally, I am confused by the abundance of small "tails" along the entire length of the graph.

Thank you in advance!

Best regards,
Elena
image

plastome_arch.list.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants