In a large dataset of automatically downloaded sequences, some sequence names can contain the "|" symbol.
I also append the class and train/test labels to each name automatically.
As a result, when I try to analyze the file, I get uninformative error messages such as:
- ValueError: could not convert string to float: 'P42577.2'
- ValueError: invalid literal for int() with base 10: '6LPD'
These are caused by malformed FASTA headers like:
- P42577.2_sp|P42577.2|FRIS_LYMST|0|training
- 6LPD_pdb|6LPD|F|1|training
A simple check when importing the file could warn the user about such headers.
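
To illustrate, here is a rough sketch of the kind of check I have in mind. It is not based on the tool's actual code: the function name `check_fasta_headers` is made up, and I am assuming that a valid header consists of exactly three "|"-separated fields (name, class label, train/test label). The expected field count is a parameter in case that assumption is wrong.

```python
import warnings


def check_fasta_headers(lines, expected_fields=3):
    """Warn about FASTA headers with an unexpected number of '|' fields.

    Hypothetical helper: expected_fields=3 assumes the layout
    name|class|train-or-test, which may not match the real parser.
    Returns a list of (line_number, header) for every offending header.
    """
    bad = []
    for line_no, line in enumerate(lines, start=1):
        if not line.startswith(">"):
            continue  # sequence line, not a header
        header = line[1:].strip()
        n_fields = len(header.split("|"))
        if n_fields != expected_fields:
            bad.append((line_no, header))
            warnings.warn(
                f"line {line_no}: header {header!r} has {n_fields} "
                f"'|'-separated fields, expected {expected_fields}"
            )
    return bad
```

Called on the lines of the file (for example `check_fasta_headers(open(path))`, where `path` is the FASTA file), this would flag both headers above, since each has five fields instead of three.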