Merge pull request #6 from Miking98/readme
README update with species
exnx authored Jul 18, 2023
2 parents b529334 + f2ba0d5 commit 04c64e5
Showing 1 changed file with 25 additions and 14 deletions.

### Species Classification

You'll need to download FASTA files for each species you want to use (just the .zip archives; the dataloader will unzip them automatically). You can download them with the following commands:


```
# Human
wget -P human/ -r -nH --cut-dirs=12 --no-parent ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/914/755/GCA_009914755.4_T2T-CHM13v2.0/GCA_009914755.4_T2T-CHM13v2.0_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
# Lemur
wget -P lemur/ -r -nH --cut-dirs=11 --no-parent ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_mammalian/Lemur_catta/latest_assembly_versions/GCA_020740605.1_mLemCat1.pri/GCA_020740605.1_mLemCat1.pri_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
# House mouse
wget -P mouse/ -r -nH --cut-dirs=11 --no-parent ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_mammalian/Mus_musculus/latest_assembly_versions/GCA_921998355.2_A_J_v3/GCA_921998355.2_A_J_v3_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
# Pig
wget -P pig/ -r -nH --cut-dirs=11 --no-parent ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_mammalian/Sus_scrofa/latest_assembly_versions/GCA_002844635.1_USMARCv1.0/GCA_002844635.1_USMARCv1.0_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
# Hippo
wget -P hippo/ -r -nH --cut-dirs=11 --no-parent ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_mammalian/Hippopotamus_amphibius/latest_assembly_versions/GCA_023065835.1_ASM2306583v1/GCA_023065835.1_ASM2306583v1_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
```
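The dataloader decompresses the downloaded archives on the fly, so no manual step is required. If you'd like to inspect the FASTA files yourself first, here is a minimal sketch (`gunzip_fna` is a hypothetical helper, not part of the repo; it assumes the per-species folders created by the wget commands above):

```python
import gzip
import shutil
from pathlib import Path

def gunzip_fna(folder):
    """Decompress every *.fna.gz under `folder` in place.

    Purely optional: the dataloader is said to unzip automatically,
    so this only helps if you want to read the FASTA files directly.
    """
    for gz in Path(folder).rglob("*.fna.gz"):
        out = gz.with_suffix("")  # chr1.fna.gz -> chr1.fna
        with gzip.open(gz, "rb") as src, open(out, "wb") as dst:
            shutil.copyfileobj(src, dst)
```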



Your folder structure should look like this:

```
data
|-- species/
    |-- README.txt: Contains links to where the `.fna` files were downloaded from.
    |-- chimpanzee/
        |-- chr1.fna
        |-- chr2.fna
        |-- ...
    |-- ...
```
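Before launching a run, it can help to sanity-check that this layout matches your `dataset.species` list. A minimal sketch (`check_species_dirs` is a hypothetical helper, not part of the repo):

```python
from pathlib import Path

def check_species_dirs(species_dir, species):
    """Return the species whose folders are missing or contain no FASTA files.

    `species` should match the dataset.species list in the training
    command; folder names are assumed to match the layout shown above.
    """
    missing = []
    for name in species:
        folder = Path(species_dir) / name
        fastas = list(folder.glob("*.fna*"))  # matches .fna and .fna.gz
        if not folder.is_dir() or not fastas:
            missing.append(name)
    return missing  # an empty list means everything is in place
```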


Sample species run:
```
python -m train wandb=null experiment=hg38/species dataset.species=[human,mouse,hippo,pig,lemur] train.global_batch_size=256 optimizer.lr=6e-5 trainer.devices=1 dataset.batch_size=1 dataset.max_length=1024 dataset.species_dir=/path/to/data/species/ model.layer.l_max=1026 model.d_model=128 model.n_layer=2 trainer.max_epochs=150 decoder.mode=last train.pretrained_model_path=null train.pretrained_model_state_hook=null
```

Let's break some of these args down:
- `experiment=hg38/species`  # main config for this experiment
- `dataset.species`  # list of species to use (their FASTA files must already be downloaded)
- `decoder.mode=last`  # classify using the last token instead of the default pooling
- `train.pretrained_model_path`  # path to a pretrained model if you're using one; otherwise set to null
- `train.pretrained_model_state_hook=null`  # when using a pretrained model, this hook loads the backbone weights properly (but not the head); if not using one, set to null
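To make `decoder.mode=last` concrete, here is a toy sketch of the two behaviours (`pool_hidden_states` is hypothetical, not the repo's decoder; real backbone outputs are tensors, plain lists are used here for simplicity):

```python
def pool_hidden_states(hidden, mode="last"):
    """Collapse per-token features for one sequence into a single vector.

    `hidden` is a list of per-token feature vectors.
    mode="last" mimics decoder.mode=last: classify from the final token.
    mode="pool" mean-pools over all tokens (the default this flag replaces).
    """
    if mode == "last":
        return hidden[-1]
    if mode == "pool":
        n = len(hidden)
        return [sum(tok[i] for tok in hidden) / n for i in range(len(hidden[0]))]
    raise ValueError(f"unknown mode: {mode}")
```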


# More advanced stuff below



Things to note:

The train dataset will change during training, but the test set is always fixed. The test length/batch size is set the normal way in your launch command, i.e., `dataset.batch_size`, `dataset`.
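One way to picture a train set that changes over the course of training is a staged schedule that switches sequence length and batch size at fixed epochs. This is a hypothetical sketch under assumed config shapes, not the repo's actual scheduling code:

```python
def stage_for_epoch(epoch, stages):
    """Pick the (start_epoch, seq_len, batch_size) stage active at `epoch`.

    `stages` is a list of (start_epoch, seq_len, batch_size) tuples
    sorted by start_epoch -- a hypothetical shape for illustration only.
    """
    current = stages[0]
    for start, seq_len, bs in stages:
        if epoch >= start:
            current = (start, seq_len, bs)
    return current
```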


## Citation
