Sample Sheet Import of Datasets and Collections #4733

Closed
jmchilton opened this issue Oct 2, 2017 · 5 comments

@jmchilton
Member

jmchilton commented Oct 2, 2017

User Stories

This section describes user stories that progressively build up a new GUI component for creating collections from "sample sheet" inputs. This would be a two-step modal (avoiding the word wizard) for importing sheets of tabular data into collections of arbitrary complexity. It would let biologists directly use information potentially generated by cores, or build structured views of their data using tools such as Excel, which they are potentially most comfortable with.

User Story 1

  • User is presented with an interface to upload a single file or copy/paste in a CSV.
  • User uploads a spreadsheet with one column, "path", which is just the path relative to the FTP directory.
  • Galaxy backend processes this into a JSON format for consumption by the GUI.
  • GUI renders the tabular data and allows "rule creation for parsing it".
  • User can select column for path to file.
  • User clicks "Build" and the backend creates the requested collection.
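
A minimal sketch of the backend step in this story, assuming the pasted sheet is plain CSV with a header row. The function name `sheet_to_json` and the shape of the JSON payload are hypothetical and only illustrate the idea; they are not an existing Galaxy API.

```python
import csv
import io
import json


def sheet_to_json(text):
    """Parse pasted CSV text into a JSON string the GUI could render."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    # fieldnames is None for an empty sheet, so fall back to an empty list.
    return json.dumps({"columns": reader.fieldnames or [], "rows": rows})


pasted = "path\nrun1/sample_a.fastq\nrun1/sample_b.fastq\n"
payload = json.loads(sheet_to_json(pasted))
# The GUI could offer payload["columns"] as candidates for the "path" column.
print(payload["columns"], len(payload["rows"]))
```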

User Story 2

  • Same as above but file:// can be used for admins, and http:// https:// ftp:// can be used for all users.
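
One way the scheme restriction could be checked, as a sketch only. The function `is_source_allowed` and the two scheme sets are hypothetical names; a value with no scheme is assumed to be an FTP-relative path as in User Story 1, and any other scheme is rejected.

```python
from urllib.parse import urlparse

ADMIN_ONLY_SCHEMES = {"file"}
ALL_USER_SCHEMES = {"http", "https", "ftp"}


def is_source_allowed(source, user_is_admin):
    """Return True if this user may import from the given path or URI."""
    scheme = urlparse(source).scheme.lower()
    if scheme == "":
        # No scheme: treat as a path relative to the FTP directory.
        return True
    if scheme in ALL_USER_SCHEMES:
        return True
    if scheme in ADMIN_ONLY_SCHEMES:
        return user_is_admin
    return False
```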

User Story 3

  • Same as above but the user can specify two columns - one for path and one for identifier.

User Story 4

  • Same as above but user can specify 3 columns - an additional one for forward/reverse and build "list:paired"s in this case.
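
A rough sketch of how three columns could map onto a "list:paired" structure. The column names `path`, `identifier` and `direction` are assumptions for illustration, and the nested dict stands in for whatever the backend would actually build.

```python
import csv
import io


def build_list_paired(text):
    """Group rows by identifier, then by forward/reverse direction."""
    pairs = {}
    for row in csv.DictReader(io.StringIO(text)):
        pairs.setdefault(row["identifier"], {})[row["direction"]] = row["path"]
    return pairs


sheet = (
    "path,identifier,direction\n"
    "s1_f.fq,s1,forward\n"
    "s1_r.fq,s1,reverse\n"
)
print(build_list_paired(sheet))
```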

User Story 5

  • Same as above, but user can specify any number of list identifiers to build nested structures.
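
The same idea generalizes to any number of identifier columns. A sketch, assuming rows are already parsed into dicts and that the order of `identifier_columns` (a hypothetical parameter) gives the nesting order from outermost to innermost:

```python
def build_nested(rows, identifier_columns, path_column="path"):
    """Build a nested dict keyed by each identifier column in turn."""
    root = {}
    for row in rows:
        node = root
        # Walk (or create) one level per identifier except the last.
        for column in identifier_columns[:-1]:
            node = node.setdefault(row[column], {})
        # The innermost identifier maps directly to the dataset path.
        node[row[identifier_columns[-1]]] = row[path_column]
    return root


rows = [
    {"path": "a.fq", "condition": "treated", "replicate": "r1"},
    {"path": "b.fq", "condition": "treated", "replicate": "r2"},
    {"path": "c.fq", "condition": "control", "replicate": "r1"},
]
print(build_nested(rows, ["condition", "replicate"]))
```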

User Story 6

  • Same as above but the user submits a very large sheet and only the first N rows are returned and rendered in the GUI, so this can work at any scale. This will also ensure we are describing "rules" via the GUI and not working with the data directly.
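
A sketch of the preview step, assuming the sheet can be read as a stream so that only the header and the first N data rows are ever materialized. `preview_rows` and its `limit` parameter are hypothetical names.

```python
import csv
import io
from itertools import islice


def preview_rows(stream, limit=100):
    """Return the header and at most `limit` data rows without reading the rest."""
    reader = csv.reader(stream)
    header = next(reader, [])
    return header, list(islice(reader, limit))


big_sheet = "path\n" + "".join(f"file_{i}.fastq\n" for i in range(1000))
header, rows = preview_rows(io.StringIO(big_sheet), limit=5)
print(header, len(rows))
```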

User Story 7

  • Same as above but the user can select a column that splits the collection into separate collections. This enables, for instance, nested control versus nested condition collections.
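
Splitting on a column could look roughly like this. It is a sketch with assumed column names (`group`, `identifier`, `path`); each distinct value in the split column yields its own flat collection.

```python
from collections import defaultdict


def split_by_column(rows, split_column, identifier_column="identifier", path_column="path"):
    """Return one {identifier: path} mapping per distinct split value."""
    collections = defaultdict(dict)
    for row in rows:
        collections[row[split_column]][row[identifier_column]] = row[path_column]
    return dict(collections)


rows = [
    {"path": "a.fq", "identifier": "s1", "group": "control"},
    {"path": "b.fq", "identifier": "s2", "group": "condition"},
    {"path": "c.fq", "identifier": "s3", "group": "control"},
]
print(split_by_column(rows, "group"))
```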

User Story 8

  • Same as above but the user can specify a column to serve as a validator for the data - such as an md5sum or a sha1sum.
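
A sketch of the validation itself using Python's standard `hashlib`. The helper name `checksum_matches` is hypothetical, and the data is assumed to be already available as bytes; a real implementation would more likely hash the file in chunks.

```python
import hashlib


def checksum_matches(data, expected, algorithm="md5"):
    """Compare the digest of `data` with the value from the validator column."""
    digest = hashlib.new(algorithm, data).hexdigest()
    # Sheets often carry upper-case or padded hex strings, so normalize first.
    return digest == expected.strip().lower()
```

The `algorithm` argument would come from the rule the user assigns to the column, for example "md5" or "sha1".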

User Story 9

  • Same as above but the user can specify rules to apply to a column to generate a new "pseudo" column and assign a rule to that column. For instance a regex to parse "_f" versus "_r" for forward-reverse.
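
A sketch of the "_f" versus "_r" example as a derived column. The regex, the `direction` column name and the function name are all assumptions for illustration; the pattern only matches when "_f" or "_r" is followed by a dot or the end of the string.

```python
import re

# "_f" or "_r" directly before an extension or at the end of the name.
DIRECTION_PATTERN = re.compile(r"_(f|r)(?=\.|$)")
LABELS = {"f": "forward", "r": "reverse"}


def add_direction_column(rows, source_column="path"):
    """Return copies of the rows with a derived "direction" pseudo column."""
    result = []
    for row in rows:
        match = DIRECTION_PATTERN.search(row[source_column])
        new_row = dict(row)
        new_row["direction"] = LABELS[match.group(1)] if match else None
        result.append(new_row)
    return result


rows = [{"path": "sample1_f.fastq"}, {"path": "sample1_r.fastq"}]
print(add_direction_column(rows))
```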

User Story 10

  • Same as above but a column can be used to specify tags and annotations for the datasets.

Future Directions:

Record Dataset Collection Types

The way paired data is described above could be extended to be used with record collection types. I see the path forward as merging the record dataset collection commit from CWL, allowing tools to describe the collection types they consume, and allowing users to fetch these type descriptions during import here and apply rules to the columns and rows in some structured way. This would also be a way to consume certain metadata from the sheet - the record descriptions allow non-data parameters the way they do in CWL.

xref #3834
xref common-workflow-lab#71

Metadata

We need to come up with ways to think about user-supplied metadata in the context of collections and outside of records, I think. I say we get this practical piece done first and then start working toward that if it is a priority.

EtherCalc

There would be a couple of potential uses for a Supervisor setup that always ran an EtherCalc server beside Galaxy, with some permanent bridge connecting them. This could allow users to work with sample sheet data in a more "Excel-y" way before it is even imported. The GUI described here could then follow those imports and transformations.

Other Related Issues of Interest

@jmchilton
Member Author

So I guess my tentative plan is to make this my main focus of the 18.01 development cycle - get through as many of the use cases as I can and open a PR when I feel something useful is ready to go.

This could potentially be the most significant GUI development I've done on the project - is it worth doing all of this before we make the move to a reactive UX framework? Should I delay work on this until we have made a decision and try to work within that framework with this as one of our first use cases - or should I just model it off what is in upload now?

@bgruening
Member

For me this sounds like a new standalone app/component. Maybe you can use either Vue or React and create a single app that can be hooked up later into Galaxy?

@martenson
Member

Is this project part of any roadmap item?

@jmchilton
Member Author

@martenson Good question indeed!

To me this is half of

[ ] Deeply nested collections #4013

the other half is either #4707 or has to follow from what we learn implementing #4707 (and apply it to the workflow editor).

The roadmap, I'll admit, is a bit flat - it doesn't weigh different things with different priorities with respect to collections. There is clearly lower-hanging fruit with respect to collections on the roadmap. But I have had out-of-band correspondence with the PIs where we discussed this, and I think we all agreed the two biggest issues in scaling collections toward usable large-scale analyses are dealing with deeply nested collections and increasing the size of collections. I think that has been the case for a while, and we used to think scaling the size of collections was the bigger problem. But over 2016 you have the pairing of Moe's very nice presentation about how awesomely it scales with James' frustrations with ChipSeq and Anton's discussion with Rotterdam, where complexity and nesting were bigger issues. So I think the tide has turned a bit, and the PIs and I feel that increasing the complexity of analyses that can be represented with collections is now the priority over scaling the size.

And while it may seem like the presentation of the issue here restricts the usage of this to users who have sample sheets from their sequencing core or whatever, I think in fact some of the later user stories are important:

specify rules to apply to a column to generate a new "pseudo" column and assign a rule to that column.

So you can imagine taking any single, arbitrary directory structure of inputs and treating it as a starting column with this component. Then you can apply these rules and visualize and create nested structures from that. It would be a different view - but you'd be able to apply regex rules like with the paired list builder but then go well beyond that - nested structures, multiple collections, etc.... I think this really will have general purpose utility.

@jmchilton
Member Author

@bgruening

For me this sounds like a new standalone app/component. Maybe you can use either Vue or React and create a single app that can be hooked up later into Galaxy?

I don't hate the idea - clearly if it was a big Python thing I'd definitely be game for that. But there is some complexity with this being a Galaxy client thing. I want it to reuse Galaxy's look and feel and the way I'm imagining it there is some interaction with the server - since the data needs to be previewed and such. I don't know that we provide a clear path for dealing with either of these things from external apps yet.

This requires significant backend development also - since we probably want to be able to upload straight into collections (we are sort of hacking around that currently by putting them into the history first) and we want to build up this language of rules to apply to "sample sheets" to build collections from them - that will probably be a complex API. In order to scale - I'm hoping to avoid just loading all the data onto the client and building an explicit structure for the collection the way the collection APIs currently work.
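
To make the "rules instead of explicit structure" idea concrete, here is a purely hypothetical sketch: the client would send only a small, JSON-serializable description of which columns mean what, and the server would apply it to the full sheet. None of the keys or function names below are an existing Galaxy API.

```python
import json

# Hypothetical rules description built by the GUI from the previewed rows.
rules = {"columns": {"path": 0, "identifiers": [1, 2]}}


def apply_rules(rules, rows):
    """Server side: turn every row of the full sheet into an element description."""
    columns = rules["columns"]
    return [
        {
            "identifiers": [row[i] for i in columns["identifiers"]],
            "path": row[columns["path"]],
        }
        for row in rows
    ]


# Only this small payload crosses the wire, regardless of sheet size.
request_body = json.dumps(rules)
full_sheet = [["a.fq", "treated", "r1"], ["b.fq", "control", "r1"]]
print(apply_rules(json.loads(request_body), full_sheet))
```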

So I do appreciate the idea and I wish that we provided better mechanisms for doing that - but I'm not convinced we do currently, and so I'll probably build it directly into Galaxy. Having it still be, say, a stand-alone component within the framework, using Vue for instance, sounds appealing - I'm not sure how to implement that, but I can try?
