seqspec enhancements (addition of library specification + read specification) (#33)

* added sequence spec to assay object. added read object to sequence spec object

* renamed assay_spec, library_spec, first implementation of seqspec index with read structure

* added changelog

* updated changelog

* updated seqspec image

* updated specification document and contribution document

* seqspec index defaults to indexing reads, select --region to index region. added more checks for seqspec check on reads. added read builder on website

* added assay builder to site

* added a lib protocol/kit and seq protocol/kit to assay. modified assay to be assay_id. changed read_name to name. changed publication_date to date. added sequencer specific read profile examples. added region examples. built read/region/assay web pages to display examples, along with style sheets and relevant javascript code to load examples. updated python classes and seqspec schema with relevant changes. added verification for assay date attribute.

* cleaning up repo, moving dev python notebook to correct location

* added barcode

* cleaned up reads, fixed tag issue with onlist in regions

* added truseq single index and novaseq

* Created using Colaboratory

* Created using Colaboratory

* seqspec print reads

* added get_seqspec by modality, renamed get_modality to get_libspec

* change lengths of random and onlist sequences

* update format for specs

* fixed min/max bug in seqspec check

* fixed element seqspec

* fixed element seqspec

* fixed element seqspec

* seqspec print joint libspec and seqspec

* Created using Colaboratory

* check that the primer id is in one of the atomic regions

* added libseq to print, i.e. jointly printing a sequence and library spec.

* added multiple checks to seqspec check, added libseq format for seqspec print

* fixed truseq naming convention, added libseq print

* fixed seqspec index and seqspec onlist to use the RegionCoordinate class

* remote onlist download with the kallisto multiple lists onlist format  (#31)

* Support older versions of matplotlib

The spines[["top", "bottom"...]] structure is a relatively recent
update; avoiding it allows working with older versions of matplotlib.

* Get the test of seqspec check working again.

The refactoring of repositories to split out the example specification
yaml files means we didn't have any local files to try validating.

So I had to use the stub I had added for other tests, however it
needed some updates to be compatible with the library spec version of
the schema.

Also I did some mocking to avoid needing to create test fastq and
barcode files.

* Increase the number of Xs in the random region

The validator now checks that the length of the sequence string is
"X" * max_len characters.
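
A minimal sketch of the check described above, assuming a random region's placeholder sequence must be exactly `max_len` copies of "X". The function name `random_sequence_errors` is hypothetical and is not taken from the seqspec code base.

```python
from typing import List


def random_sequence_errors(sequence: str, max_len: int) -> List[str]:
    """Return error messages for a random region's placeholder sequence."""
    errors = []
    # A random region is represented by a run of X characters only.
    if set(sequence) - {"X"}:
        errors.append("random sequence must consist only of 'X' characters")
    # The placeholder must be exactly max_len characters long.
    if len(sequence) != max_len:
        errors.append(
            f"sequence length {len(sequence)} does not equal max_len {max_len}"
        )
    return errors
```

An empty list means the region passes; collecting messages rather than raising lets a caller report every problem in one pass.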

* Update minimal Region tests and add minimal Read tests.

* Make some minimal tests for the seqspec print functions

* update print command to use the replacement assay_id attribute

previously it was assay

* My test assay used custom_primer which didn't have a color.

I randomly picked sea green.

* Implement downloading lists via urls

Also, to work with barcode lists hosted by the DACC, transparently
decompress gzip files.

The old read_list function took a filename, but I changed it to take
the onlist object so it would have access to the location attribute
to know if it should be reading locally or remotely instead of just
guessing if the filename string started with a scheme url.
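
The idea above can be sketched roughly as follows. This is an illustration under stated assumptions, not the function in this commit: `OnlistStub` is a hypothetical stand-in for the onlist object, the `location` values "local" and "remote" are assumed, and gzip content is detected by its two magic bytes.

```python
import gzip
from dataclasses import dataclass
from typing import List
from urllib.request import urlopen

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


@dataclass
class OnlistStub:
    """Hypothetical stand-in for the onlist object."""
    filename: str
    location: str  # assumed to be "local" or "remote"


def read_list(onlist: OnlistStub) -> List[str]:
    """Read a barcode list, using `location` to decide how to open it."""
    if onlist.location == "remote":
        # Remote lists are fetched over the network.
        with urlopen(onlist.filename) as response:
            raw = response.read()
    else:
        # Local lists are read in binary mode so gzip can be detected.
        with open(onlist.filename, "rb") as handle:
            raw = handle.read()
    # Decompress transparently when the payload is gzipped.
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    lines = raw.decode("utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

Passing the whole object means the function never has to infer from the filename string whether it is a URL.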

* Only return the onlist filename if it is a local file

Even if there's only one list, if it's remote we need to download it and
put it into a local file.

* Add onlist argument to specify combined barcode list file format.

Kallisto has a format where multiple barcode lists are in one file
separated by whitespace. That's different from the more common
Cartesian product format where all the lists are crossed with each
other.

This adds the kallisto format as -f multi, and adds -f product for
the existing behavior, which remains the default.
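
The difference between the two formats can be illustrated with a short sketch. The function names are hypothetical, and two details are assumptions rather than facts from this commit: that the product format concatenates the crossed barcodes, and that the multi format pads shorter lists with "-".

```python
import itertools
from typing import List


def join_product(onlists: List[List[str]]) -> List[str]:
    """Cross every list with every other list, concatenating the barcodes."""
    return ["".join(combo) for combo in itertools.product(*onlists)]


def join_multi(onlists: List[List[str]], fill: str = "-") -> List[str]:
    """Put list i in column i, one row per position, separated by spaces."""
    # zip_longest pads shorter lists so every row has one entry per list.
    rows = itertools.zip_longest(*onlists, fillvalue=fill)
    return [" ".join(row) for row in rows]
```

For two lists of lengths m and n, the product output has m * n lines while the multi output has max(m, n) lines, which is why the file format has to be chosen explicitly.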

* Fix test for project_regions_to_coordinates

* Minimally test RegionCoordinate and  project_regions_to_coordinates

* test run_onlist_region and run_onlist_read

A new accessor function was added to get onlists for the new read
objects in addition to the older by region type.

I also added some type annotations to be more clear that join_onlists
needs a list of Onlist objects to work. (Since we need the full
information to know if we need to download files)

* added Diane's contributions to CHANGELOG and slightly modified validate_check_args so it doesn't return the errors object (which holds errors related to the spec and not to the arguments supplied)

* fixed cli 'options' to list out options and print default. improved typing annotation for a function

* added -s seqtype to seqspec print to help disambiguate between printing sequence_spec objects, library_spec objects, or both

* -s is actually spectype not seqtype in seqspec print, modified the CLI so that it has the options 'sequence', 'library', 'libseq' (which is both sequence and library)

* fixed the naming convention for spectype library, sequence, libseq and format png, tree, html, info, sequence in the seqspec print command

* clarified some cli descriptions, added -s specobject to seqspec onlist for specifying where to search for the onlist

* Libspec local caching (#32)

* Search for local copies of barcode files first.

This will look in the current directory on the current machine first,
Then it will search the local machine for the file at the full path portion
of the filename, as interpreted by urlparse.
Finally if the location is remote, it will assume that the file is
accessible at that remote URL.
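
A rough sketch of the search order described above. The function names are hypothetical, and reading "the current directory" as "the file's base name in the working directory" is an assumption.

```python
import os
from typing import Optional
from urllib.parse import urlparse


def find_local_copy(filename: str) -> Optional[str]:
    """Return a local path for `filename` if one exists, else None."""
    path = urlparse(filename).path
    candidates = [
        os.path.basename(path),  # 1. a copy in the current directory
        path,                    # 2. the full path portion of the filename
    ]
    for candidate in candidates:
        if candidate and os.path.isfile(candidate):
            return candidate
    return None


def locate(filename: str, location: str) -> str:
    """Prefer a local copy; fall back to the remote URL when allowed."""
    local = find_local_copy(filename)
    if local is not None:
        return local
    if location == "remote":
        # 3. assume the file is accessible at the remote URL
        return filename
    raise FileNotFoundError(filename)
```

With this ordering, a pipeline that has already staged the barcode file next to the spec never triggers a download.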

Adding in this local first search is to make pipeline development
easier since it allows other code to handle retrieving the barcode
files for us.

* Read IGVF utils environment variables for remote authentication

This could easily be fleshed out more with more places to look for
authentication information

* The fake raw attribute really should be Bytes to match real requests

The new read_remote logic has opening the different kinds of streams
from the decompression wrapper, which meant all the streams needed to
be opened in the same binary mode

* run_onlist_region needs the region_id not region_type

* Return the local path to an already downloaded remote barcode file

Siddarth pointed out that my first implementation always
copied the barcode over to onlist_joined.txt even if there was only
one barcode file.

This uses the local override to see if the barcode file is available.
If it is, and there's only one barcode file, it can return just
that barcode's filename and not go through the copying function.

* add to changelog

---------

Co-authored-by: Diane Trout <diane@caltech.edu>
Co-authored-by: Diane Trout <diane@ghic.org>
3 people authored Apr 17, 2024
1 parent 470f45a commit 9aeb894
Showing 60 changed files with 3,205 additions and 2,243 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -9,7 +9,7 @@

Genomic library structure depends on both the assay and sequencer (and kit) used to generate and bind the assay-specific construct to the sequencing adapters to generate a sequencing library. Therefore, a `seqspec` is specific to both a genomics assay and sequencer.

A list of `seqspec` examples for multiple assays and sequencers can be found on [this website](https://igvf.github.io/seqspec/). Each `spec.yaml` describes the 5'->3' "Final library structure" for the assay and sequencer. Sequence specification files can be formatted with the `seqspec` command line tool.
A list of `seqspec` examples for multiple assays and sequencers can be found on [this website](https://igvf.github.io/seqspec/). Each `spec.yaml` describes the 5'->3' "Final library structure" for the assay and sequencer and can be extended to include sequencer-specific read annotations. Sequence specification files can be formatted with the `seqspec` command line tool.

<img alt="image" src="/docs/seqspec.png">

77 changes: 77 additions & 0 deletions docs/CHANGELOG.md
@@ -0,0 +1,77 @@
# Changelog

## [0.2.0] - 2024-02-XX

### Changed

- `seqspec index` uses primer and max length of supplied Read
- `assay_spec` renamed `library_spec`
- Reorganize specification document
- Move contribution guidelines from `SPECIFICATION.md` to `CONTRIBUTION.md`
- Move example `Region`s from `SPECIFICATION.md` to `seqspec/docs/regions`
- `seqspec index` defaults to indexing reads, `--region` indexes regions
- Change descriptors of attributes `assay_id`, `doi`
- `Assay` attribute `assay` changed to `assay_id`
- `Read` attribute `read_name` changed to `name`
- `Assay` attribute `publication_date` changed to `date`
- `Assay` attribute `sequencer` changed to `sequence_protocol`
- `Assay` function `get_modality` changed to `get_libspec`
- `Region` function `update_attr` uses the `max_len` to generate `random` and `onlist` sequence lengths instead of `min_len`
- `get_region_by_type` changed to `get_region_by_region_type` to disambiguate between `region_type` and `sequence_type`
- `seqspec onlist` (by default) searches for onlists in the `Region`s intersected by the `Read` passed to `-r`.
- Support older versions of matplotlib by handling the `spines[["top", "bottom"...]]` structure
- Increase the number of Xs in the random region to match `max_len` for validation
- Update `seqspec print` command to use the replacement `assay_id` attribute instead of `assay`
- Implement downloading onlists via URLs and transparently decompress gzip files
- Change `read_list` function to take the `onlist` object for handling local and remote files
- Add `onlist` argument to specify combined barcode list file format (kallisto's multi-file format and default cartesian product format)

### Added

- Add `sequence_spec` in the `Assay` object
- Add `Read` object in the `sequence_spec` object
- Add `sequence_spec` to the seqspec json schema
- Add `Read` object to specification document
- Add `Read` generator to website GUI
- Add pattern matching to `date` in `Assay` (expected date format: DAY MONTH YEAR, where day is one or two numbers, month is the full named month starting with a Capital letter and year is the full year)
- Add `library_kit` to `Assay` object (kit that adds seq adapters)
- Add `library_protocol` to `Assay` object (library that generates insert)
- Add `sequence_kit` to `Assay` object
- Add website to view example `seqspec` objects
- Add `get_seqspec` to `Assay`, which returns the sequence structure for a given modality
- Add multiple checks to `seqspec check`
- check read modalities exist in assay modalities
- check primer ids from seqspec are unique and exist as region ids in libspec
- check that the primer id exists as an atomic region (currently a strong assumption that may be relaxed in the future)
- check properties of multiple sequence types
- `fixed` and `regions` not null incompatible
- `joined` and `regions` null incompatible
- `random` and `regions` not null incompatible
- `random` must have `sequence` of all X's
- `onlist` and `onlist` property null incompatible
- check that the min len is less than or equal to the max len
- check that the length of the sequence is between min and max len
- Note: a strong assumption in `seqspec print` is that the sequence has a length equal to the `max_len` for visualization purposes
- Add `RegionCoordinate` object that maps `Region` min/max lengths to 0-indexed positions
- `seqspec onlist` searches for onlists in a `Region` based on `--region` flag
- Add type annotations for `join_onlists` to clarify it needs a list of `Onlist` objects
- Add minimal tests for `RegionCoordinate`, `project_regions_to_coordinates`, `run_onlist_region`, `run_onlist_read`, and seqspec print functions
- Add list of options to CLI for `-f FORMAT` within `seqspec onlist` and `seqspec print`
- Add `-s SEQTYPE` to `seqspec print` to disambiguate printing `sequence`, `library`, or `libseq` objects. TODO wrap `seqspec info` into `seqspec print -f info`.
- Add `-s SPECOBJECT` to `seqspec onlist`. Specify specific object `read`, `region`, or `region-type` for finding the `onlist`.
- Add fetching ability for seqspec onlist from remote with IGVF credentials (credit to @detrout)
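
As an illustration of the `RegionCoordinate` entry in the list above, here is a minimal sketch of laying regions end to end on 0-indexed positions. `Coordinate` and `project` are hypothetical, simplified names; using `max_len` for each region's extent and half-open intervals are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Coordinate:
    """Hypothetical, simplified stand-in for `RegionCoordinate`."""
    region_id: str
    start: int  # 0-indexed, inclusive
    stop: int   # 0-indexed, exclusive


def project(regions: List[Tuple[str, int]]) -> List[Coordinate]:
    """Lay regions end to end, giving each one `max_len` positions."""
    coords = []
    cursor = 0
    for region_id, max_len in regions:
        coords.append(Coordinate(region_id, cursor, cursor + max_len))
        cursor += max_len
    return coords
```

For example, regions of lengths 16, 12, and 98 would occupy positions 0 to 16, 16 to 28, and 28 to 126 under these assumptions.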

### Removed

TODO:

- Remove `lib_struct`
- Remove `parent_id`

### Fixed

- Sequencing overlapping pairs now supported
- `seqspec check` correctly handles sequence lengths longer than the stated min/max range
- Fix test for `project_regions_to_coordinates`
- Get the test of seqspec check working again by updating the schema for the refactored example specification YAML files and mocking fastq and barcode files
- Only return the onlist filename if it's a local file, downloading remote lists when needed
47 changes: 43 additions & 4 deletions docs/CONTRIBUTING.md
@@ -1,16 +1,25 @@
## Contributing
# Contributing

Thank you for wanting to add a spec or improve `seqspec`. If you have a bug that is related to `seqspec` please create an issue.
Thank you for wanting to add a spec or improve `seqspec`. If you have a bug that is related to `seqspec` please create an issue. This document outlines the process for suggesting improvements to the `seqspec` specification and the procedure for updating the specification.

### Issues
## Issues

The issue should contain

- the `seqspec` command that was run,
- the error message, and
- the `seqspec` and python version.

### Specs and code changes
## Improvements

To suggest improvements to the seqspec project please do the following:

- **Open an Issue**: For suggesting improvements, please open a new issue in the GitHub repository.
- **Describe Your Suggestion**: Clearly describe the problem and your proposed solution. Include examples and use cases where possible.
- **Engagement**: Encourage community feedback on the suggestion through comments.
- **Iterate**: Be open to iterating on your suggestion based on community feedback.

## Specs and code changes

If you'd like to add assay sequence specifications or make modifications to the `seqspec` tool please do the following:

@@ -75,3 +84,33 @@ git push origin cool-new-feature
5. Submit a pull request

If you are unfamiliar with pull requests, you can find more information on the [GitHub help page.](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests)

### Steps for Review

1. **Initial Review**: A maintainer will review the suggestion for completeness and relevance.
2. **Community Feedback**: A period for community feedback will follow.
3. **Final Review**: The maintainers will make a final review, considering all feedback.

### Decision Making

- Decisions will be made based on the specification's goals, community feedback, and overall impact on the `seqspec` ecosystem.

## Updating the Specification

### Approval and Merging

- Once approved, a maintainer will merge the changes into the specification.
- Major changes may require a more detailed review process or a community vote.

### Versioning and Change Log

- **Versioning**: Follow semantic versioning. Major changes result in a version bump.
- **Change Log**: Update the change log with a summary of the changes and contributors.

### Testing and Validation

- Ensure any changes are tested for compatibility and do not break existing functionality.

## Conclusion

We value your contributions and aim to make the process of improving the specification collaborative and transparent. For any questions, please contact the repository maintainers.