seqspec enhancements (addition of library specification + read specification) (#33)

* added sequence spec to assay object. added read object to sequence spec object
* renamed assay_spec, library_spec, first implementation of seqspec index with read structure
* added changelog
* updated changelog
* updated seqspec image
* updated specification document and contribution document
* seqspec index defaults to indexing reads, select --region to index region. added more checks for seqspec check on reads. added read builder on website
* added assay builder to site
* added a lib protocol/kit and seq protocol/kit to assay. modified assay to be assay_id. changed read_name to name. changed publication_date to date. added sequencer specific read profile examples. added region examples. built read/region/assay web pages to display examples, along with style sheets and relevant javascript code to load examples. updated python classes and seqspec schema with relevant changes. added verification for assay date attribute.
* cleaning up repo, moving dev python notebook to correct location
* added barcode
* cleaned up reads, fixed tag issue with onlist in regions
* added truseq single index and novaseq
* Created using Colaboratory
* Created using Colaboratory
* seqspec print reads
* added get_seqspec by modality, renamed get_modality to get_libspec
* change lengths of random and onlist sequences
* update format for specs
* fixed min/max bug in seqspec check
* fixed element seqspec
* fixed element seqspec
* fixed element seqspec
* seqspec print joint libspec and seqspec
* Created using Colaboratory
* check that the primer id is in one of the atomic regions
* added libseq to print, i.e. jointly printing a sequence and library spec.
* added multiple checks to seqspec check, added libseq format for seqspec print
* fixed truseq naming convention, added libseq print
* fixed seqspec index and seqspec onlist to use the RegionCoordinate class
* remote onlist download with the kallisto multiple lists onlist format (#31)
* Support older versions of matplotlib

  the spines[["top", "bottom"...]] structure is a relatively recent update, this allows working with older versions of matplotlib
* Get the test of seqspec check working again.

  The refactoring of repositories to split out the example specification yaml files means we didn't have any local files to try validating. So I had to use the stub I had added for other tests, however it needed some updates to be compatible with the library spec version of the schema. Also I did some mocking to avoid needing to create test fastq and barcode files.
* Increase the number of Xs in the random region

  The validator now checks that the length of the sequence string is "X" * max_len characters.
* Update minimal Region tests and add minimal Read tests.
* Make some minimal tests for the seqspec print functions
* update print command to use the replacement assay_id attribute

  previously it was assay
* My test assay used custom_primer which didn't have a color. I randomly picked sea green.
* Implement downloading lists via urls

  Also to work with barcode lists hosted by the DACC transparently decompress gzip files. The old read_list function took a filename, but I changed it to take the onlist object so it would have access to the location attribute to know if it should be reading locally or remotely instead of just guessing if the filename string started with a scheme url.
* Only return the onlist filename if it is a local file

  Even if there's one list but it's remote we need to download it and put it into a local file.
* Add onlist argument to specify combined barcode list file format.
  Kallisto has a format where multiple barcode lists are in one file separated by whitespace. That's different from the more common cartesian product format where all the lists are crossed with each other. This adds the kallisto format as -f multi, and adds an argument for the current version -f product, but treats it as a default.
* Fix test for project_regions_to_coordinates
* Minimally test RegionCoordinate and project_regions_to_coordinates
* test run_onlist_region and run_onlist_read

  A new accessor function was added to get onlists for the new read objects in addition to the older by region type. I also added some type annotations to be more clear that join_onlists needs a list of Onlist objects to work. (Since we need the full information to know if we need to download files)
* added Diane's contributions to CHANGELOG and slightly modified validate_check_args so it doesn't return the errors object (which is errors related to the spec and not to the arguments supplied)
* fixed cli 'options' to list out options and print default. improved typing annotation for a function
* added -s seqtype to seqspec print to help disambiguate between printing sequence_spec objects, library_spec objects, or both
* -s is actually spectype not seqtype in seqspec print, modified the CLI so that it has the options 'sequence', 'library', 'libseq' (which is both sequence and library)
* fixed the naming convention for spectype library, sequence, libseq and format png, tree, html, info, sequence in the seqspec print command
* clarified some cli descriptions, added -s specobject to seqspec onlist for specifying where to search for the onlist
* Libspec local caching (#32)
* Search for local copies of barcode files first.

  This will look in the current directory on the current machine first, then it will search the local machine for the file at the full path portion of the filename, as interpreted by urlparse.
  Finally if the location is remote, it will assume that the file is accessible at that remote URL. Adding in this local first search is to make pipeline development easier since it allows other code to handle retrieving the barcode files for us.
* Read IGVF utils environment variables for remote authentication

  This could easily be fleshed out more with more places to look for authentication information
* The fake raw attribute really should be Bytes to match real requests

  The new read_remote logic has opening the different kinds of streams from the decompression wrapper, which meant all the streams needed to be opened in the same binary mode
* run_onlist_region needs the region_id not region_type
* Return the local path to an already downloaded remote barcode file

  Siddarth pointed out that my first implementation always copied the barcode over to onlist_joined.txt even if there was only one barcode file. This uses the local override to see if the barcode file is available and if it is and there's only one barcode file, it can return just that barcodes filename and not go through the copying function
* add to changelog

---------

Co-authored-by: Diane Trout <diane@caltech.edu>
Co-authored-by: Diane Trout <diane@ghic.org>
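
For readers unfamiliar with the two combined onlist formats named in the `-f multi` / `-f product` item, the difference can be sketched roughly as follows. This is an illustrative sketch only, not the code from this commit: the function names `join_product` and `join_multi` are hypothetical, and padding shorter lists with `-` in the multi format is an assumption about how unequal list lengths might be handled.

```python
from itertools import product, zip_longest


def join_product(lists):
    """Cartesian product format: every barcode in each list is crossed
    with every barcode in the other lists, one concatenated barcode per row."""
    return ["".join(combo) for combo in product(*lists)]


def join_multi(lists, fill="-"):
    """Kallisto-style multi format: the i-th entries of each list sit side by
    side on one row, separated by whitespace. Shorter lists are padded with
    `fill` (an assumption made for this sketch)."""
    return [" ".join(row) for row in zip_longest(*lists, fillvalue=fill)]


# Two small, made-up barcode lists of unequal length.
bc1 = ["AAA", "CCC"]
bc2 = ["GG", "TT", "AC"]

print(join_product([bc1, bc2]))
print(join_multi([bc1, bc2]))
```

The product format grows multiplicatively with the number of lists, while the multi format stays as long as the longest list, which is one plausible reason to prefer it when several large barcode lists are combined.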