how to change busco databases? #10

Open
jungleblack007 opened this issue Apr 17, 2023 · 4 comments

Comments

@jungleblack007

For example, I want to use agaricales_odb10 as the reference database to pick single-copy orthologs. How can I change fungi_odb10 to agaricales_odb10?

@jungleblack007
Author

I have another question: how can I calculate GSI values and label the branches with them?

@endixk
Member

endixk commented Apr 18, 2023

To use the agaricales_odb10 database, you will have to download the ODB profiles and process them into a form that the UFCG pipeline can accept.

For this, please run the following commands on your system (this may take a while):

# Download and unzip the agaricales_odb10 database
wget -q "https://busco-data.ezlab.org/v4/data/lineages/agaricales_odb10.2020-08-05.tar.gz"
tar xzf agaricales_odb10.2020-08-05.tar.gz
gzip -d agaricales_odb10/refseq_db.faa.gz

# Prepare model and sequence databases for the UFCG pipeline
cd agaricales_odb10/
ls prfl/ | cut -d. -f1 > gene_list
sed -z 's/\n/,/g;s/,$/\n/' gene_list > gene_set
mkdir -p model/pro/ seq/pro/
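# For each gene: copy its profile as the model, then extract its reference sequences
# (each matching header plus the one sequence line that follows it)
while read -r I; do
  cp "prfl/$I.prfl" "model/pro/$I.hmm"
  grep -PA1 --no-group-separator "^>$I" refseq_db.faa > "seq/pro/$I.fa"
done < gene_list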
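
Before running the pipeline, it may be worth confirming that the preparation steps above produced a model file and a non-empty sequence file for every gene in gene_list. The following is only a minimal sketch: check_ufcg_db is a hypothetical helper name, not part of UFCG, and it assumes the directory layout created by the commands above (gene_list, model/pro/, seq/pro/).

```shell
# Hypothetical helper (not a UFCG command): check a prepared database directory.
# Assumes the layout built above: gene_list, model/pro/<gene>.hmm, seq/pro/<gene>.fa
check_ufcg_db() {
  db_dir="${1:-.}"
  problems=0
  while read -r gene; do
    # -s is true only if the file exists and is not empty
    if [ ! -s "$db_dir/model/pro/$gene.hmm" ]; then
      echo "missing or empty model: $gene"
      problems=$((problems + 1))
    fi
    if [ ! -s "$db_dir/seq/pro/$gene.fa" ]; then
      echo "missing or empty sequences: $gene"
      problems=$((problems + 1))
    fi
  done < "$db_dir/gene_list"
  echo "problems: $problems"
  # Return success only when nothing was flagged
  [ "$problems" -eq 0 ]
}

# Usage, from the directory that contains agaricales_odb10/:
#   check_ufcg_db agaricales_odb10
```

An empty seq/pro file would mean the grep step found no header starting with that gene ID in refseq_db.faa, which is worth checking before profiling.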

After running the commands above, the following command (run from inside agaricales_odb10/, since the paths are relative) will extract the agaricales_odb10 gene set from your sequence(s):

ufcg profile --modelpath model/ --seqpath seq/ -s $(cat gene_set) -i /path/to/input -o /path/to/output <options> 
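
Note that the -s argument above relies on gene_set holding a single comma-separated line. The sed -z option used to build it is a GNU extension, so on systems without GNU sed (macOS, for example) it may not work. As a sketch of an alternative, paste can join the lines instead; make_gene_set is a hypothetical helper name used here only for illustration.

```shell
# Hypothetical helper (not a UFCG command): join the lines of a gene list
# into one comma-separated line, without relying on GNU sed's -z option.
make_gene_set() {
  # -s joins all lines of the file into one; -d, uses a comma as the separator
  paste -sd, "$1"
}

# Usage, inside agaricales_odb10/:
#   make_gene_set gene_list > gene_set
```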

@endixk
Member

endixk commented Apr 18, 2023

For the second question, the output of the ufcg tree module includes a Newick file named concatenated_gsi_[N].nwk, which is the tree labeled with GSIs that you are looking for. [N] is the total number of genes that were considered when calculating the indices.
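
To get a quick look at the labels without opening a tree viewer, something like the following sketch could list them. It assumes the GSI values are stored as ordinary Newick internal node labels, that is, the text directly after each closing parenthesis; that placement is an assumption about the file, not something confirmed here. list_node_labels is a hypothetical helper name, and it does not handle quoted labels.

```shell
# Hypothetical helper: print the internal node labels of a Newick file, one per line.
# Assumes labels follow the usual Newick placement, directly after a ')'.
list_node_labels() {
  # Match ')' followed by one or more characters that are not Newick delimiters,
  # then strip the leading ')'
  grep -o ')[^:,;()]\{1,\}' "$1" | cut -c2-
}

# Usage (the path is a placeholder; [N] depends on your run):
#   list_node_labels /path/to/output/concatenated_gsi_*.nwk
```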

@jungleblack007
Author

wow, thank you for your detailed answer, it's so great! I am trying now.
