Question about training a BERTax model for phylum to species taxonomy classification #10
Comments
Hi! Since we used the gene-model structure only at the beginning of our development, more work would be required to adapt the training process for the new task. But for the genomic model, probably not a lot has to be changed. Your data would need to be in the "fragment"-type of structure, which can be generated from multi-fastas (https://github.com/f-kretschmer/bertax_training#multi-fastas-with-taxids). You would have to concatenate your data into a single file and adapt the header of each sequence in the following way:
....
...
The first value of the header is simply the NCBI TaxID (https://www.ncbi.nlm.nih.gov/taxonomy), from which the classes/ranks for each species can be retrieved. The second is a running index, so that each sequence in the multi-fasta has a different header. This file can then be converted to the "fragments" format with https://github.com/f-kretschmer/bertax_training/blob/master/preprocessing/fasta2fragments.py. For training (fine-tuning), the CLI argument …
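A minimal sketch of the concatenation/header-rewriting step described above, assuming a `>TAXID INDEX` header layout (the exact format in the original example was not preserved in this thread, so treat it as an assumption) and a user-supplied mapping from file name to TaxID:

```python
from pathlib import Path

def concat_to_multifasta(fasta_dir, taxid_by_file, out_path):
    """Concatenate per-species FASTA files into one multi-fasta whose headers
    consist of the NCBI TaxID followed by a running index."""
    index = 0
    with open(out_path, "w") as out:
        for fasta in sorted(Path(fasta_dir).glob("*.fasta")):
            taxid = taxid_by_file[fasta.name]   # e.g. {"species_1.fasta": 9606, ...}
            seq_lines = []
            for line in fasta.read_text().splitlines():
                if line.startswith(">"):
                    if seq_lines:                # flush the previous record
                        out.write(f">{taxid} {index}\n" + "\n".join(seq_lines) + "\n")
                        index += 1
                        seq_lines = []
                else:
                    seq_lines.append(line.strip())
            if seq_lines:                        # flush the file's last record
                out.write(f">{taxid} {index}\n" + "\n".join(seq_lines) + "\n")
                index += 1
```

The resulting multi-fasta would then be passed to `fasta2fragments.py` as described above.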
Thanks for your suggestion! Here is what I have done:
However, I encountered another problem like this: … Do you have any suggestions about my steps above? Thank you very much!
I haven't seen this error before. Could you first try to see if this error also comes up if you change back the …
Following up on the previous problem: I find it doesn't matter whether I use np.array(x) or not in load_fragments(), because I use preprocessing.make_dataset.py to generate the train.tsv and test.tsv files, and the generated files are the same without np.array(x). Here is a screenshot: I then use train.tsv and test.tsv to train the model with the argument --use_defined_train_test_set, but the problem still exists. I have uploaded the pre-trained model and the two files here: https://drive.google.com/drive/folders/1TUSTrjlGbtYqVBcUmybAVXxLEcvG8duT?usp=sharing I would really appreciate it if you could help me out 🙏
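For reference, the "generated files are the same" check described above can be done directly on the two TSVs; the `*_noarray.tsv` names below are hypothetical copies made before re-running preprocessing.make_dataset.py:

```python
import filecmp
import hashlib

def sha256(path):
    """Hash a file so large TSVs can be compared without loading them fully."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

for a, b in [("train.tsv", "train_noarray.tsv"), ("test.tsv", "test_noarray.tsv")]:
    same = filecmp.cmp(a, b, shallow=False)
    print(f"{a} vs {b}: identical={same}, sha256 match={sha256(a) == sha256(b)}")
```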
Sorry to bother you again, but here is a strange thing. Here is my command to run bert_nc_finetune.py: And this is the output: I have uploaded train_small.fasta, train_small_fragments.json, and train_small_species_picked.txt here: Could you please help me check why? Is it because I changed the list of names? Thank you very much!
Just a heads-up and an apology that I haven't been able to look into it in detail yet. I can't see anything immediately wrong with your data or commands; the error might be related to TensorFlow internals and be caused by package version conflicts (keras-bert, which BERTax depends on, does not work with all versions of tensorflow or keras). I'll write back when I find something.
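As a side note, a small way to record the package versions in play when debugging this kind of conflict (nothing BERTax-specific is assumed here):

```python
from importlib.metadata import version, PackageNotFoundError

for pkg in ("tensorflow", "keras", "keras-bert"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```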
Good to hear that changing the tensorflow version solved the first issue!
Thanks for the information. It is true that the generator returns a list containing tokens and segments, and the segments are all 0s. However, I don't know why the predict() function doesn't unpack the list, so I do it manually. I finally get results, but they are awful: the phylum accuracy is around 0.02. I used DNA from 1075 species to pre-train and fine-tune the model, and for each species I chose 10 sequences that do not appear in training for testing, so there are 10750 test sequences. Here are three logs: I find that the final losses are larger than the initial ones. Do you think I should pre-train the model, or should I just fine-tune your pre-trained model? Thanks!
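A minimal sketch of what "unpacking the list manually" might look like, assuming a Keras model with two inputs (token ids and segment ids) and batches shaped as `([tokens, segments], y)`; none of the names below are taken from the BERTax code:

```python
import numpy as np

def predict_batches(model, batches):
    """Run predict() batch by batch, feeding tokens and segments as the
    model's two inputs instead of handing it the whole generator."""
    per_batch = []
    for x, _y in batches:
        tokens, segments = x                 # segments are all zeros here
        preds = model.predict([np.asarray(tokens), np.asarray(segments)], verbose=0)
        per_batch.append(preds)
    if isinstance(per_batch[0], list):       # multiple output heads (one per rank)
        return [np.concatenate([p[i] for p in per_batch])
                for i in range(len(per_batch[0]))]
    return np.concatenate(per_batch)

# usage, assuming `test_seq` is a finite keras Sequence of ([tokens, segments], y):
# preds = predict_batches(model, (test_seq[i] for i in range(len(test_seq))))
```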
Hi! I have read your paper about BERTax. It is wonderful and very inspiring. I'm interested in training a BERTax model for my own application: predicting the phylum, class, order, family, genus, and species of a DNA sequence. Since I need to predict six labels, I plan to add three more taxonomy layers after the original BERTax taxonomy layers (a rough sketch of what I have in mind is at the end of this post). Also, I need to use different training and testing datasets. Currently, my dataset looks like this:
species_1.fasta
species_2.fasta
...
species_n.fasta
I have read your instructions on how to prepare the data for training, and I think I should convert my data into this format:
Thank you very much in advance for any suggestions about my task!
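A rough Keras sketch of the "three more taxonomy layers" idea mentioned at the top of this post: one softmax head per rank attached to a shared feature tensor. All names and class counts here are hypothetical and not taken from bert_nc_finetune.py; `shared_features` stands for whatever pooled representation the existing BERTax heads branch off.

```python
from tensorflow.keras import layers, Model

def add_rank_heads(base_model, shared_features, n_classes_per_rank):
    """Attach one softmax classification head per taxonomic rank."""
    outputs = [
        layers.Dense(n, activation="softmax", name=f"{rank}_out")(shared_features)
        for rank, n in n_classes_per_rank.items()
    ]
    return Model(inputs=base_model.inputs, outputs=outputs)

# e.g. (class counts made up for illustration):
# model = add_rank_heads(bert_model, pooled_output,
#                        {"phylum": 40, "class": 110, "order": 300,
#                         "family": 800, "genus": 1000, "species": 1075})
```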