Root "data" or "datasets" #508

manzt · 2021-09-14T20:46:51Z

Beyond the reactive use-case, it would be interesting to allow for data to be referenced by ID from some root source.

{
    "data": {
        "my-id": { "url": ..., ... }
    },
    "tracks": [
        { "data": "my-id", ... },
        { "data": "my-id", ... }
    ]
}

This would remove a lot of duplication, and in gos we could "lift" data definitions to the root of the chart. We could then encode a DataFrame as JSON and embed the data in the chart only once.

gos.Track(df.gos.json())
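
A minimal sketch of what such "lifting" could look like on a plain spec dictionary. Nothing here is existing gos or Gosling API: the function name `lift_data`, the `data-N` ID scheme, and the assumption that the root "data" key is free to hold the ID map are all hypothetical.

```python
import json


def lift_data(spec):
    """Return a copy of `spec` with inline track data moved to a root "data" map.

    Hypothetical helper: identical data definitions are stored once and
    each track refers to them by a generated ID.
    """
    root_data = {}
    ids_by_content = {}
    tracks = []
    for track in spec.get("tracks", []):
        data = track.get("data")
        if isinstance(data, dict):
            # Canonical JSON text serves as a content key for deduplication.
            key = json.dumps(data, sort_keys=True)
            if key not in ids_by_content:
                new_id = f"data-{len(ids_by_content)}"  # hypothetical ID scheme
                ids_by_content[key] = new_id
                root_data[new_id] = data
            track = {**track, "data": ids_by_content[key]}
        tracks.append(track)
    return {**spec, "data": root_data, "tracks": tracks}


values = {"type": "json", "values": [{"x": 1}, {"x": 2}]}
spec = {
    "tracks": [
        {"data": values, "mark": "point"},
        {"data": dict(values), "mark": "line"},  # equal content, separate object
    ]
}
lifted = lift_data(spec)
```

Both tracks in `lifted` now carry the same ID string, and the dataset itself appears once under the root "data" key. The input spec is left untouched.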

sehilyi · 2021-09-17T21:13:15Z

I see the need for supporting data reference, but I am also a bit concerned about using IDs since keeping track of multiple IDs in the grammar can make writing/maintaining specs more complex.

If we want to support this only for overlaid tracks, which I think would be the major use case, would it address the issue to define the data in the parent and override it in child tracks where needed?

{
    "alignment": "overlay",
    "data": {
        "type": "json", "values": { ... }
    },
    "tracks": [
        { ... }, { ... } // use the data defined by the parent
    ]
}

This would not allow defining multiple datasets in the parent, but since I expect overlaid tracks to share a single dataset in most cases, it might be okay to support only a single dataset?
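
To make the parent/child rule concrete, here is a rough sketch on plain dictionaries. The helper name `inherit_data` and the example URL are placeholders, and this is not how Gosling necessarily resolves specs internally: a child keeps its own "data" if it has one, otherwise it picks up the parent's.

```python
def inherit_data(view):
    """Return the child tracks of `view`, each with its effective "data".

    Hypothetical helper: a child's own "data" wins, otherwise the parent's
    definition is used.
    """
    parent_data = view.get("data")
    resolved = []
    for track in view.get("tracks", []):
        if "data" in track or parent_data is None:
            # Child override, or nothing to inherit.
            resolved.append(dict(track))
        else:
            # The parent's data object is shared by reference, not copied.
            resolved.append({**track, "data": parent_data})
    return resolved


view = {
    "alignment": "overlay",
    "data": {"type": "json", "values": [{"pos": 10}]},
    "tracks": [
        {"mark": "bar"},
        # Placeholder URL, for illustration only.
        {"mark": "rule", "data": {"type": "csv", "url": "https://example.com/other.csv"}},
    ],
}
tracks = inherit_data(view)
```

The first track ends up with the parent's data and the second keeps its own definition. With a single parent "data" entry, a second shared dataset still has to be repeated in every child that needs it, which is the limitation raised above.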

manzt · 2022-05-19T15:01:59Z

> I see the need for supporting data reference, but I am also a bit concerned about using IDs since keeping track of multiple IDs in the grammar can make writing/maintaining specs more complex.

I think there is an argument to be made that root "datasets" allow for better re-use of Gosling specifications and easier maintenance. Users can replace a data definition in one place rather than needing to find and replace the same data definition throughout the specification (like using a variable in a programming language).

FWIW, Vega-Lite implements a top-level datasets property. From the docs:

> Vega-Lite supports a top-level datasets property. This can be useful when the same data should be inlined in different places in the spec. Instead of setting values inline, specify datasets at the top level and then refer to the named datasource in the rest of the spec. datasets is a mapping from name to an inline dataset.

    "datasets": {
      "somedata": [1,2,3]
    },
    "data": {
      "name": "somedata"
    }

This would reduce the size of the specifications and provide the ability to (optionally) identify datasets by a unique key. We could use this identifier to build an API to update track data on demand (like the Vega View API).
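
As a rough illustration of the "replace in one place" argument, the sketch below expands named references into inline data. It assumes a Gosling spec that borrows the root "datasets" mapping and the `{"name": ...}` reference from the Vega-Lite snippet above; neither is part of the current Gosling grammar, and `resolve_datasets` is a made-up helper name.

```python
def resolve_datasets(spec):
    """Expand {"name": ...} data references using a root "datasets" mapping.

    Hypothetical helper: each named dataset is assumed to hold inline
    values, which are wrapped in a json-type data definition.
    """
    datasets = spec.get("datasets", {})
    tracks = []
    for track in spec.get("tracks", []):
        data = track.get("data")
        if isinstance(data, dict) and set(data) == {"name"}:
            name = data["name"]
            if name not in datasets:
                raise KeyError(f"unknown dataset: {name!r}")
            track = {**track, "data": {"type": "json", "values": datasets[name]}}
        tracks.append(track)
    # Drop the root mapping once every reference has been expanded.
    resolved = {key: value for key, value in spec.items() if key != "datasets"}
    resolved["tracks"] = tracks
    return resolved


spec = {
    "datasets": {"somedata": [{"x": 1}]},
    "tracks": [
        {"mark": "point", "data": {"name": "somedata"}},
        {"mark": "line", "data": {"name": "somedata"}},
    ],
}
resolved = resolve_datasets(spec)

# Swapping the dataset in one place updates every track that refers to it.
updated = resolve_datasets({**spec, "datasets": {"somedata": [{"x": 2}]}})
```

The same name-to-dataset mapping is what an update API could key on: changing the entry for "somedata" is enough to change what both tracks render.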

sehilyi · 2022-05-23T22:05:11Z

> I think there is an argument to be made that root "datasets" allow for better re-use of Gosling specifications and easier maintenance. Users can replace a data definition in one place rather than needing to find and replace the same data definition throughout the specification (like using a variable in a programming language).

Agreed.

Do you have a specific use case in mind for re-using JSON data? I can think of a use case that renders 1D or 2D annotations (i.e., rule marks using JSON data) in the same way across multiple tracks. Perhaps there are more useful or more frequent use cases than this.

I also think extending this reusability functionality to other data specs (reusing data def.) or even beyond the "data" specs (reusing track/view def.) would be an interesting and useful function. (For example, #88)

ThHarbig · 2023-05-05T18:13:50Z

I ran into the same issue of having to include data sets multiple times.

For example, when having multiple stacked tracks of which two use data set A and two use data set B, I need to add the data to each track separately or define it in the view and overwrite it for one set of tracks.

Another way of reducing the number of data redefinitions could be grouping tracks or, in general, being more flexible with nesting tracks (related to #884).

e.g.

"tracks": [
  {
    "data": { },
    "tracks": [ ]
  },
  {
    "data": { },
    "tracks": [ ]
  }
]

manzt · 2023-05-06T15:54:11Z

> Another way of reducing the number of data redefinitions could be grouping tracks or in general being more flexible with nesting tracks (related to #884)

I'm open to this idea. One challenge with Gosling is that it adopted a very "flattened" spec (rather than nested fields), which removes boilerplate in some cases but makes things difficult in others. We've discussed adding an encoding field to group encodings in the past as well: gosling-lang/gos#34 (comment)

sehilyi · 2023-05-07T17:41:49Z

I am open to making the track nesting more flexible, and it will be beneficial to define shared encodings across tracks as well (#88).

Current Grammar

Track nesting is currently possible only when overlaid tracks are stacked.

"tracks": [
   { "alignment": "overlay", "tracks": [/* multiple track defs to be overlaid */] },
   { "alignment": "overlay", "tracks": [/* multiple track defs to be overlaid */] },
   { /* track def */ },
]

But a similar example that nests stacked tracks is not allowed. I think this restriction makes the spec less consistent and more complicated.

"tracks": [
   { "alignment": "stack", "tracks": [/* multiple track defs to be stacked */] },
   { "alignment": "stack", "tracks": [/* multiple track defs to be stacked */] },
   { /* track def */ },
]

This is due to the following schema, which we will need to update to provide more freedom:

gosling.js/src/core/gosling.schema.ts

Lines 53 to 56 in a087340

    
    export interface StackedTracks extends CommonViewDef, Partial<SingleTrack> {
        alignment?: 'stack';
        tracks: (PartialTrack | OverlaidTracks)[];
    }
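
The restriction expressed by this interface can be sketched as a small check on plain dictionaries. This is a simplification under stated assumptions, not the actual schema validation: an entry that has its own "tracks" is treated as a nested group, and only groups with "alignment": "overlay" are accepted. The function name `nesting_allowed` is made up.

```python
def nesting_allowed(stacked):
    """Check the one-level nesting rule of the StackedTracks interface above.

    Simplified and hypothetical: a nested group is any entry with its own
    "tracks" list, and it must declare "alignment": "overlay".
    """
    for entry in stacked.get("tracks", []):
        if "tracks" in entry and entry.get("alignment") != "overlay":
            return False  # nested group that is not an overlay
    return True


# Mirrors the first example: overlaid groups inside stacked tracks.
overlay_spec = {
    "tracks": [
        {"alignment": "overlay", "tracks": [{}, {}]},
        {"alignment": "overlay", "tracks": [{}, {}]},
        {"mark": "bar"},
    ]
}
# Mirrors the second example: stacked groups inside stacked tracks.
stack_spec = {
    "tracks": [
        {"alignment": "stack", "tracks": [{}, {}]},
        {"alignment": "stack", "tracks": [{}, {}]},
        {"mark": "bar"},
    ]
}
overlay_ok = nesting_allowed(overlay_spec)
stack_ok = nesting_allowed(stack_spec)
```

Relaxing the schema would amount to accepting the second shape as well, so that a group of stacked tracks can carry shared data or encodings.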

Open Issue

In theory, we could provide much more flexibility by allowing tracks to be nested more than two levels deep, but I think two levels (like the example above) already provide sufficient flexibility for defining shared encodings and data.
Ideally, tracks that share a data spec would also share a single data object in memory, which we do not currently support.

ThHarbig · 2023-05-08T13:11:33Z

I agree that going beyond two levels is probably not necessary. If we encounter more complicated setups, I think the suggested data ID approach might be more intuitive.

manzt mentioned this issue Sep 15, 2021: feat: serialize pd.Dataframe as csv gosling-lang/gos#63 (Merged)

manzt mentioned this issue May 19, 2022: Support custom chromsizes #582 (Closed)

sehilyi added the labels enhancement (New feature or request), P? (Priority needs to be decided), and D? (Difficulty not sure) May 23, 2022

sehilyi mentioned this issue Oct 30, 2022: Improve performance of json-type by removing deep-clones of spec/data #823 (Open, 1 task)

ThHarbig mentioned this issue May 5, 2023: Enhancements/Bug fixes for prokaryotic summary visualization project #888 (Open)

manzt mentioned this issue May 8, 2023: API change (CsvData): A single way to define genomic fields #890 (Open)
