diff --git a/README.md b/README.md
index 79faac3..94e65e3 100644
--- a/README.md
+++ b/README.md
@@ -282,28 +282,26 @@ python -m train wandb=null experiment=hg38/chromatin_profile dataset.ref_genome_
 ### Species Classification
 
-You'll need to download fasta files for each species that you want to use (just the .zips, the dataloader wil unzip automatically).
+You'll need to download fasta files for each species that you want to use (just the .zips, the dataloader will unzip automatically). You can download them using the following commands:
-
-Sample species run:
 ```
-python -m train wandb=null experiment=hg38/species dataset.species=[human,mouse,hippo,pig,lemur] train.global_batch_size=256 optimizer.lr=6e-5 trainer.devices=1 dataset.batch_size=1 dataset.max_length=1024 dataset.species_dir=/path/to/data/species/ model.layer.l_max=1026 model.d_model=128 model.n_layer=2 trainer.max_epochs=150 decoder.mode=last train.pretrained_model_path=null train.pretrained_model_state_hook=null
+# Human
+wget -P human/ -r -nH --cut-dirs=12 --no-parent ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/914/755/GCA_009914755.4_T2T-CHM13v2.0/GCA_009914755.4_T2T-CHM13v2.0_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
+# Lemur
+wget -P lemur/ -r -nH --cut-dirs=11 --no-parent ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_mammalian/Lemur_catta/latest_assembly_versions/GCA_020740605.1_mLemCat1.pri/GCA_020740605.1_mLemCat1.pri_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
+# House mouse
+wget -P mouse/ -r -nH --cut-dirs=11 --no-parent ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_mammalian/Mus_musculus/latest_assembly_versions/GCA_921998355.2_A_J_v3/GCA_921998355.2_A_J_v3_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
+# Pig
+wget -P pig/ -r -nH --cut-dirs=11 --no-parent ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_mammalian/Sus_scrofa/latest_assembly_versions/GCA_002844635.1_USMARCv1.0/GCA_002844635.1_USMARCv1.0_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
+# Hippo
+wget -P hippo/ -r -nH --cut-dirs=11 --no-parent ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_mammalian/Hippopotamus_amphibius/latest_assembly_versions/GCA_023065835.1_ASM2306583v1/GCA_023065835.1_ASM2306583v1_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
 ```
-Let's break some of these args down:
-- `experiment=hg38/species` # main config for this experiment
-- `dataset.species` # list of species you want (and already downloaded their .fasta files)
-- `decoder.mode=last` # using the last token to classify (instead of default pooling)
-- `train.pretrained_model_path` # if using a pretrained model, point to it, if not, set to null
-- `train.pretrained_model_state_hook=null` # if using a pretrained model, this will load the weights properly (and not head). if not, set to null
-
-
 Your folder struture should look like this:
 
 ```
 data
 |-- species/
-    |-- README.txt: Contains links to where the `.fna` files were downloaded from.
     |-- chimpanzee/
         |-- chr1.fna
         |-- chr2.fna
@@ -329,6 +327,19 @@ data
 ```
 
+Sample species run:
+```
+python -m train wandb=null experiment=hg38/species dataset.species=[human,mouse,hippo,pig,lemur] train.global_batch_size=256 optimizer.lr=6e-5 trainer.devices=1 dataset.batch_size=1 dataset.max_length=1024 dataset.species_dir=/path/to/data/species/ model.layer.l_max=1026 model.d_model=128 model.n_layer=2 trainer.max_epochs=150 decoder.mode=last train.pretrained_model_path=null train.pretrained_model_state_hook=null
+```
+
+Let's break some of these args down:
+- `experiment=hg38/species` # main config for this experiment
+- `dataset.species` # list of species you want (and already downloaded their .fasta files)
+- `decoder.mode=last` # using the last token to classify (instead of default pooling)
+- `train.pretrained_model_path` # if using a pretrained model, point to it, if not, set to null
+- `train.pretrained_model_state_hook=null` # if using a pretrained model, this will load the weights properly (and not head). if not, set to null
+
+
 # More advanced stuff below
@@ -527,7 +538,7 @@ In the sample config, see the
 Things to note:
 
-Train dataset will change during training, but the test set will always be fixed. The test len/batch size is set the normal way in your command launch, ie, `dataset.batch_size`, `dataset.
+Train dataset will change during training, but the test set will always be fixed. The test len/batch size is set the normal way in your command launch, i.e., `dataset.batch_size`, `dataset`.
 
 ## Citation
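The diff above fixes a README that assumes a fixed on-disk layout (`data/species/<name>/chr*.fna`) before a species run is launched. As a sanity check for that layout, something like the following could be run first; `check_species_dir` is a hypothetical helper written for this note, not code from the repo:

```python
from pathlib import Path

def check_species_dir(species_dir, species):
    """Return the species whose directories contain no unpacked .fna files,
    following the data/species/<name>/chr*.fna layout from the README."""
    missing = []
    for name in species:
        # glob returns an empty list if the directory is absent or empty
        if not list(Path(species_dir, name).glob("*.fna")):
            missing.append(name)
    return missing
```

An empty return value means every species you plan to pass via `dataset.species` has at least one `.fna` file in place.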
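The long `python -m train ...` invocation that this diff relocates is just a sequence of Hydra-style `key=value` overrides, so when scripting sweeps it can be convenient to build the argv programmatically. A minimal sketch, using a subset of the overrides from the README's sample run (the dict itself is illustrative, not repo code):

```python
# Subset of the sample species-run overrides from the README.
overrides = {
    "wandb": "null",
    "experiment": "hg38/species",
    "dataset.species": "[human,mouse,hippo,pig,lemur]",
    "train.global_batch_size": "256",
    "optimizer.lr": "6e-5",
    "dataset.max_length": "1024",
    "decoder.mode": "last",
    "train.pretrained_model_path": "null",
    "train.pretrained_model_state_hook": "null",
}

# Flatten into the argv form that `python -m train` expects,
# e.g. for subprocess.run(cmd).
cmd = ["python", "-m", "train"] + [f"{k}={v}" for k, v in overrides.items()]
```

Varying one entry in the dict (say, `optimizer.lr`) per run is then a one-line change instead of editing a long shell command.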