Commit
io/metadata: Add `read_metadata_with_sequences`
Reads a metadata file with `read_table_to_dict` and updates each record dict with the corresponding sequence from a FASTA file. The FASTA file is read with `pyfastx.Fasta`, which builds an index of the file to allow random access to sequences by their sequence id.

To ensure that the sequences can be matched with their metadata, the FASTA headers must contain the matching sequence id. The headers may contain additional description parts after the id, but these are not used in the matching process.

`pyfastx` currently only works with plain and gzipped files, so it limits the input formats for FASTA files. I think this is acceptable for now instead of building our own indexing library. We can look into extending `pyfastx` to support xz-compressed files in the future.

Only complete records with both metadata and a sequence are processed; the rest are skipped.
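The matching behavior described above can be illustrated with a minimal, standard-library-only sketch. It is not the code from this commit: `csv.DictReader` stands in for `read_table_to_dict`, a plain in-memory dict stands in for the `pyfastx.Fasta` index, and the function names, the `strain` id column, and the `sequence` key are all hypothetical choices made for illustration.

```python
import csv
import io


def index_fasta(handle):
    """Map each sequence id to its sequence (in-memory stand-in for an on-disk index)."""
    chunks_by_id = {}
    seq_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Only the first whitespace-delimited token of the header is the id;
            # any description after it is ignored for matching.
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else None
            if seq_id is not None:
                chunks_by_id[seq_id] = []
        elif seq_id is not None:
            chunks_by_id[seq_id].append(line)
    return {key: "".join(chunks) for key, chunks in chunks_by_id.items()}


def read_metadata_with_sequences_sketch(metadata_handle, fasta_handle,
                                        id_column="strain", delimiter="\t"):
    """Yield metadata records updated with their sequence, skipping incomplete ones.

    Hypothetical signature; `id_column` and the "sequence" key are assumptions.
    """
    sequences = index_fasta(fasta_handle)
    for record in csv.DictReader(metadata_handle, delimiter=delimiter):
        seq_id = record.get(id_column)
        if seq_id not in sequences:
            # Metadata without a sequence is skipped. Sequences without
            # metadata are never looked up, so they are skipped as well.
            continue
        record["sequence"] = sequences[seq_id]
        yield record


# Usage: "B" has no sequence and "C" has no metadata, so only "A" is complete.
metadata = io.StringIO("strain\tcountry\nA\tUSA\nB\tPeru\n")
fasta = io.StringIO(">A some description\nACGT\nAC\n>C\nGGGG\n")
records = list(read_metadata_with_sequences_sketch(metadata, fasta))
print(records)
```

Unlike an on-disk index, the dict here loads every sequence into memory, so it only demonstrates the id matching and the skipping of incomplete records, not the random-access benefit of an indexed FASTA file.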
1 parent 30f9ce9, commit 1fbe479
Showing 3 changed files with 144 additions and 2 deletions.