Kedro-Viz to show preview of data #907

Closed
rashidakanchwala opened this issue Jun 13, 2022 · 18 comments
Labels
Issue: Feature Request Javascript Pull requests that update Javascript code


@rashidakanchwala
Contributor

rashidakanchwala commented Jun 13, 2022

Description

Kedro-viz supports Plotly.
Plotly has cool tables: https://plotly.com/python/table/

[Image: Screenshot 2022-06-13 at 16 22 19]

The idea is simply to show the first 5 or 10 rows of the dataset in Kedro-Viz.

Implementation

Since we already support Plotly, this would be easy to do: we just read the first 5 rows from the data and display them as a table.

One argument against this is that loading so many datasets might make Kedro-Viz slow. But loading only happens when the metadata panel is opened by clicking a dataset, which is one dataset at a time. Also maybe on Kedro we can allow users to specify which datasets they want to preview on Kedro-viz using catalog.yml preview = true
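A minimal sketch of the "display it as a table" step, assuming the backend sends the frontend a Plotly-style figure as JSON. The function name `table_figure` and the sample rows are made up for illustration; only the table trace layout (`header.values` and column-wise `cells.values`) comes from Plotly itself.

```python
import json


def table_figure(columns, rows):
    """Build a Plotly-style figure dict holding a single table trace.

    `columns` is a list of column names and `rows` is a list of row lists.
    Plotly's table trace expects cell values column by column, so the
    rows are transposed before being placed in the figure.
    """
    if rows:
        column_values = [list(col) for col in zip(*rows)]
    else:
        column_values = [[] for _ in columns]
    return {
        "data": [
            {
                "type": "table",
                "header": {"values": list(columns)},
                "cells": {"values": column_values},
            }
        ]
    }


# Hypothetical first rows of a dataset.
columns = ["id", "engine_type"]
rows = [[1, "Quantum"], [2, "Plasma"]]

figure = table_figure(columns, rows)
payload = json.dumps(figure)  # what a backend could return for one dataset
```

Because the figure is built per dataset, it only needs to be produced when that dataset's metadata panel is opened.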

@datajoely
Contributor

Would love this!

One note on implementation - we need a workflow to avoid opening enormous files for no reason.

  • The situation I'm worried about is specifically pandas.CSVDataSet being 1 begillion rows long and us loading that for 5 rows of data.
  • For spark.SparkDataSet we can append a .limit(5) on there to avoid this.
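To illustrate the concern, here is a rough sketch of reading only the first rows of a CSV source without consuming the rest. It uses the standard library rather than pandas or Spark, and `csv_head` and `endless_rows` are hypothetical names. With pandas, passing `nrows=5` to `pandas.read_csv` has a similar effect.

```python
import csv
from itertools import islice


def csv_head(lines, n=5):
    """Return (header, first n rows), reading no more records than needed."""
    reader = csv.reader(lines)
    header = next(reader, [])
    return header, list(islice(reader, n))


def endless_rows():
    """Simulate an enormous file: a header followed by unbounded rows."""
    yield "id,value\n"
    i = 0
    while True:
        yield f"{i},{i * i}\n"
        i += 1


# Terminates even though the source never ends, because only the
# header and five data rows are ever pulled from the iterator.
header, rows = csv_head(endless_rows(), n=5)
```

For a real file the same function could be called as `csv_head(open(path, newline=""))`, ideally inside a `with` block.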

@limdauto
Collaborator

@datajoely I think we should add an optional head API to Kedro Dataset if we were to do this. This allows viz to preview beyond pandas or spark and avoid performance bottleneck. The thing that knows how to optimise head is the dataset implementation, not viz.
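One way to picture an optional head API is sketched below. Every class here is a stand-in written for illustration, not Kedro's actual dataset interface; the point is only that the dataset decides how to produce its first rows and the viewer merely checks whether the capability exists.

```python
class Dataset:
    """Stand-in for a dataset base class (hypothetical, not Kedro's API)."""

    def load(self):
        raise NotImplementedError


class InMemoryRows(Dataset):
    """Toy dataset that knows how to return its first rows cheaply."""

    def __init__(self, rows):
        self._rows = rows

    def load(self):
        return list(self._rows)

    def head(self, n=5):
        # The dataset itself decides how to fetch a cheap preview.
        return self._rows[:n]


class OpaqueBlob(Dataset):
    """Toy dataset with no sensible notion of 'first rows'."""

    def load(self):
        return b"\x00\x01"


def preview_for_viz(dataset, n=5):
    """Return a preview if the dataset offers `head`, else None."""
    head = getattr(dataset, "head", None)
    return head(n) if callable(head) else None
```

Under this assumption, datasets that never implement `head` simply show no preview, so nothing breaks for existing dataset types.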

@datajoely
Contributor

Yeah agreed

@antonymilne
Contributor

I like this idea and have thought about similar schemes in the past. So since you've brought it up here, let me also dump some earlier thoughts here...

Two basic questions:

  1. Is Plotly the right thing to use for this? It's a good option since we already have it available, but maybe there are better libraries out there for handling tables (e.g. it doesn't look like Plotly would handle many hundreds of columns well, which is not at all uncommon in a Kedro pipeline).
  2. How general should we make this? As per @limdauto's comment, maybe we have a general head method that can be used for any dataset. Could we incorporate the current behaviour for matplotlib and plotly datasets into this more generic mechanism? Going beyond a dataset preview, what if I don't want to show the first n rows but would rather just show the size of the dataframe (rows and columns) in the metadata side panel? That seems equally useful to me and maybe more practical for large dataframes.

Just using plotly for pandas and/or spark dataframes would be totally great for an MVP and to get user feedback, but I just want to brainstorm how we might want to make this more generic in the longer term.


The question of adding custom properties to datasets comes up quite a bit, e.g. #662 (put number of rows in dataset on kedro-viz), https://github.com/quantumblacklabs/private-kedro/issues/1148 (add metadata to catalog entries that can be consumed by plugins), kedro-org/kedro#1076 (very long-standing issue on how to add metadata to catalog entries). This is not just limited to kedro-viz: there's a more general kedro question of how to attach metadata to a catalog entry. Let me just focus on the kedro-viz question here though.

#662 (comment) spells out my rough idea for this: user-customisable dataset widgets. This is quite similar to the idea of kedro-viz extensions, only:

  • these widgets are shown in the metadata panel rather than a whole new screen (which has both pros and cons but basically means there's much more limited space for them)
  • widgets are lighter weight and more restricted in how they must be written (unlike an extension, it doesn't start its own server etc.)

As a user, I might want to keep track of lots of different things about a dataset: number of rows/columns, number of unique entries in a particular column, number of N/As, etc. Enabling something that visualises the number of rows in a dataset of type pandas.* is just one particular example of this - in reality I might like to track any sort of thing for any sort of dataset. Let me call this a "trackable".

In the future I think there should be two possible methods for this:

  • via experiment tracking - this is already work in progress. You can write code to calculate whatever trackable you like in a node and then save it to a tracking dataset. Crucially this will give you a sense of how the trackable changes between one kedro run and the next, since I should be able to go back in time and visualise the pipeline and datasets of historic runs.
  • some kind of customisable "widget" which allows me to give, in the catalog, as many trackables as I like, e.g. (completely made up example syntax)
shuttles:
    type: pandas.CSVDataSet
    filepath: ...
    viz_widgets:
        - number_of_rows
        - number_of_na: [column1, column2, column3]
        - my_custom_widget

Where we supply with kedro viz a few common widgets like number_of_rows, but a user can define their own my_custom_widget also so it's very flexible. The natural place for this information to be shown on kedro viz would be the side panel on the right hand side that appears when you click on a dataset. But it would be super cool if somehow we could make the pipeline visualisation customisable with user-pluggable widgets too.

According to this scheme, previewing the first 5 rows of a dataset would be some kind of dataframe_head: {rows: 5} widget that we provide within kedro-viz. This could even be automatically applied to all the datasets of the right type. There could be some kind of marketplace for user-defined widgets (small javascript apps I guess?).
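A rough sketch of how such widgets might be registered and evaluated follows. The registry, the decorator and the way the `viz_widgets` entries are parsed are all assumptions made for illustration, and the data is a plain list of dicts rather than a dataframe. The real widgets would presumably also need a frontend part, which is not shown.

```python
WIDGETS = {}


def widget(name):
    """Register a function under `name` so catalog entries can refer to it."""
    def register(func):
        WIDGETS[name] = func
        return func
    return register


@widget("number_of_rows")
def number_of_rows(data):
    return len(data)


@widget("number_of_na")
def number_of_na(data, columns):
    # Count missing (None) values per requested column.
    return {c: sum(1 for row in data if row.get(c) is None) for c in columns}


@widget("dataframe_head")
def dataframe_head(data, rows=5):
    return data[:rows]


def run_widgets(data, viz_widgets):
    """Evaluate each configured widget against `data`.

    Each item is either a widget name or a one-key mapping of
    name -> keyword arguments.
    """
    results = {}
    for item in viz_widgets:
        if isinstance(item, str):
            name, kwargs = item, {}
        else:
            (name, kwargs), = item.items()
        results[name] = WIDGETS[name](data, **kwargs)
    return results


data = [
    {"column1": 1, "column2": None},
    {"column1": None, "column2": None},
    {"column1": 3, "column2": 4},
]
config = [
    "number_of_rows",
    {"number_of_na": {"columns": ["column1", "column2"]}},
    {"dataframe_head": {"rows": 2}},
]
results = run_widgets(data, config)
```

A user-defined widget would then just be another function decorated with `@widget("my_custom_widget")`.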

Is the idea of a marketplace of custom widgets for kedro-viz datasets a huge overkill for this? At the moment, absolutely yes. We could achieve what @rashidakanchwala describes much more simply. And at the moment I think kedro-viz extensions would be better to work on than dataset widgets. But I think it's worth thinking about where this might end up in future, since it might spark other people's ideas and potentially affects design decisions up front. e.g.

"Also maybe on Kedro we can allow users to specify which datasets they want to preview on Kedro-viz using catalog.yml preview = true"

This seems too ad-hoc and hacky to me, like the current implementation of layer which is a dataset property but only really used by kedro-viz. So if we end up with lots of such parameters I think we should consider exactly where they should live so that catalog entries don't become too bloated.

@yetudada changed the title from "Kedro-viz to show Data preview" to "Kedro-Viz to show preview of data" on Jun 20, 2022
@yetudada
Contributor

The exploration for seeing dataset statistics by @GabrielComymQB:

[Image: MetadataPanel_Transcoded_Datasets]

@merelcht
Member

Notes from Technical Design session:

The team discussed a possible solution to preview data in Viz both on the metadata panel and the experiment tracking panel.

Some questions raised around the goal of showing a preview:

  1. Do we want to show just a preview of the data, or perhaps insights (e.g. # of columns, mean, median..)?
  2. Should users be able to customise what is shown in such a preview?

The consensus is that just a blanket preview of showing the first 5-10 rows wouldn't be useful with all data, and thus the preview should be customisable.

Possible solution:
The solution discussed in the meeting is adding a _preview() method to datasets that specifies how data should be displayed on the Viz side. This _preview() method will be customisable so if a user doesn't like the default implementation they can override it to suit their needs. The result will be displayed in the metadata and experiment tracking panels.

A downside of this solution is that we would essentially be adding visualisation specific code to the framework side, blurring the boundaries between Kedro Viz and Kedro Framework. But the _preview() method could be useful in a jupyter flow as well.

Follow up questions/actions:

  • What types of data would the _preview() method return? What are the optimal types to display data in Viz?
  • Specifically, users have expressed the need to log CSV data, what do they want to see from this CSV data?
  • Are there any other solutions, perhaps with more of the heavy lifting on the Viz side, that would solve this issue?
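As one possible answer to the question of return types, the sketch below has `_preview()` return a small JSON-serialisable dict tagged with a `kind`, so the Viz side could choose a renderer. The class names, the `kind` values and `preview_response` are all hypothetical; nothing here is an agreed design.

```python
import json


class CSVLikeDataset:
    """Hypothetical dataset; `_preview` follows the idea in the notes above."""

    def __init__(self, columns, rows):
        self._columns = columns
        self._rows = rows

    def _preview(self, nrows=5):
        # Default behaviour: a table of the first rows.
        return {
            "kind": "table",
            "columns": self._columns,
            "rows": self._rows[:nrows],
        }


class RowCountDataset(CSVLikeDataset):
    """A user override that reports summary numbers instead of rows."""

    def _preview(self, nrows=5):
        return {
            "kind": "stats",
            "values": {"rows": len(self._rows), "columns": len(self._columns)},
        }


def preview_response(dataset):
    """Serialise a preview for the frontend; empty object if unsupported."""
    preview = getattr(dataset, "_preview", None)
    return json.dumps(preview() if callable(preview) else {})


table = preview_response(CSVLikeDataset(["id"], [[i] for i in range(10)]))
stats = preview_response(RowCountDataset(["id"], [[i] for i in range(10)]))
```

Restricting the return value to JSON-friendly types keeps the dataset free of any plotting dependency, which may ease the concern about visualisation code living on the framework side.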

@antonymilne
Contributor

antonymilne commented Jun 23, 2022

A few more thoughts on the preview method approach. Let's say that we solve the question of what types of data preview can return (shouldn't be too hard) and are happy with this living on kedro framework as a new dataset method (I'm more sceptical here). Here's a possibly representative example of what someone might want to do:

  • for some pandas.CSVDataSets in their pipeline, show number of rows
  • for some other pandas.CSVDataSets in their pipeline, show first 5 rows

The simplest way to implement this would be for the user to write two new sorts of dataset, something like this:

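from kedro.extras.datasets import pandas  # assumed import path (Kedro 0.18.x)

class CSVDataSetWithNumberOfRows(pandas.CSVDataSet):
    def preview(self):
        # Loads the full dataset just to count its rows.
        return len(self._load())

class CSVDataSetWithHead(pandas.CSVDataSet):
    def preview(self):
        # DataFrame.head() returns the first 5 rows by default.
        return self._load().head()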

Then in the catalog file you need to change the relevant dataset type from pandas.CSVDataSet to path.to.CSVDataSetWithNumberOfRows and path.to.CSVDataSetWithHead.

This seems quite unsatisfactory:

  • it feels heavy-handed to require a new dataset class just to alter how preview renders in kedro-viz. The load/save behaviour of the dataset is what really matters in kedro, and that's the same for all these classes
  • it doesn't scale well: even if you want every pandas.CSVDataSet to preview the same way, you have to change the type for all your catalog entries (might eventually be solved by improvements to kedro config system)

Fundamentally I think the problem here is that datasets are not easily composed. I cannot easily "mix in" a new behaviour without creating a whole new class. @limdauto mentioned once that Dmitrii had prototyped some new component-based dataset architecture that looks more like my widgets example above. This might be a major change to how kedro datasets work though, which I don't think is on the cards for the foreseeable future.
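To make the composition point concrete, here is a small sketch in which the preview behaviour is a plain function and the combined class is generated rather than hand-written. `with_preview` and `ToyCSVDataset` are invented for this example and are unrelated to the prototype mentioned above.

```python
def with_preview(dataset_cls, preview_fn):
    """Return a subclass of `dataset_cls` whose `preview` calls `preview_fn`.

    A new class is still created, but it is generated on demand, so one
    base dataset can be paired with many preview behaviours.
    """
    def preview(self):
        return preview_fn(self.load())

    suffix = preview_fn.__name__.title().replace("_", "")
    name = f"{dataset_cls.__name__}With{suffix}"
    return type(name, (dataset_cls,), {"preview": preview})


class ToyCSVDataset:
    """Stand-in for a real dataset class (hypothetical)."""

    def __init__(self, rows):
        self._rows = rows

    def load(self):
        return list(self._rows)


def number_of_rows(data):
    return len(data)


def head(data):
    return data[:5]


WithCount = with_preview(ToyCSVDataset, number_of_rows)
WithHead = with_preview(ToyCSVDataset, head)
```

Load and save behaviour stays in the base class, and only the preview differs between the generated classes.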

In reality, is this a problem? Possibly not; maybe we just hard code a sensible default preview into pandas.CSVDataSet and only a few advanced users who are happy writing custom classes would even think of trying to change this. If we value a user being able to customise the preview behaviour then a dataset preview method does feel awkward to me though.

Problem is, I'm not sure I have a better alternative... Maybe hooks + a viz.yml config file somehow? Certainly this would keep the functionality on the kedro-viz side much more. Let me ponder this and write it up as an alternative proposal.

@datajoely
Contributor

I think a [tool.kedro.viz] section in pyproject.toml would be helpful, you know. In fact, everything in the settings modal could be pre-defined there?

@rashidakanchwala
Contributor Author

rashidakanchwala commented Oct 10, 2022

Hi team,

I was thinking maybe the _preview method can be in Viz as it is a viz implementation. And within the Kedro project catalog.yml we define it like below so the Viz knows how/what to handle for different datasets?

feature_engineering_output:
  type: pandas.CSVDataSet
  filepath: ${base_location}/04_feature/feature_importance_output.csv
  layer: feature
  preview:
    enable: true
    showRows: 5
@MerelTheisenQB , @datajoely , @tynandebold , @idanov
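A minimal sketch of how the Viz side might read such an entry, assuming the catalog has already been parsed into a dict. The helper `preview_settings` and its default of 5 rows are hypothetical; the `enable` and `showRows` keys follow the proposal above.

```python
def preview_settings(entry, default_rows=5):
    """Return the number of rows to preview, or None if preview is off."""
    preview = entry.get("preview") or {}
    if not preview.get("enable", False):
        return None
    return int(preview.get("showRows", default_rows))


# Parsed catalog entries (values shortened for the example).
catalog = {
    "feature_engineering_output": {
        "type": "pandas.CSVDataSet",
        "layer": "feature",
        "preview": {"enable": True, "showRows": 5},
    },
    "raw_input": {"type": "pandas.CSVDataSet"},
}

settings = {name: preview_settings(entry) for name, entry in catalog.items()}
```

Datasets without a `preview` block are left alone, so existing catalogs would keep working unchanged.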

@datajoely
Contributor

What about adding preview logic to the AbstractDataSet class? And then also implementing it for the pandas and spark datasets today?

pandas -> .head(5)
spark -> .limit(5).toPandas().head()

@tynandebold
Member

Notes from Technical Design session:

  • We'll go with the use of transcoding and the @Preview symbol to denote in the catalog that this dataset will be both a normal dataset and have a preview attached to it.
  • In the Viz UI we'll only load the data on click when the metadata panel is rendered
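The two decisions above could look roughly like this. Kedro's transcoding separates a dataset name from its suffix with `@`; treating a `@preview` suffix as the marker is my reading of the notes, and the function names, the fake loader and the cache are illustrative only.

```python
from functools import lru_cache


def split_transcoded(name):
    """Split 'dataset@suffix' into (dataset, suffix); suffix is None if absent."""
    base, sep, suffix = name.partition("@")
    return base, (suffix if sep else None)


def has_preview(catalog_names, dataset):
    """True if the catalog holds a '<dataset>@preview' entry."""
    return any(split_transcoded(n) == (dataset, "preview") for n in catalog_names)


LOAD_CALLS = []  # records real loads so the caching is observable


@lru_cache(maxsize=None)
def load_preview(dataset):
    """Pretend to load preview data; meant to run only when a node is clicked."""
    LOAD_CALLS.append(dataset)
    return (("id", "value"), (1, "a"))


names = ["shuttles@pandas", "shuttles@preview", "companies"]

# Simulate clicking the same node twice: the data is loaded once.
if has_preview(names, "shuttles"):
    load_preview("shuttles")
    load_preview("shuttles")
```

Nothing is loaded at startup in this sketch, so the cost of a preview is only paid for datasets the user actually opens.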

A question: what icon would we have for a node with a data preview inside it?

  • We need to come up with a different way to show that this dataset has more information
  • If a dataset has multiple pieces of information, the icon could have some layers if there are multiple things to show

@rashidakanchwala moved this from In Progress to Done in Kedro-Viz Oct 18, 2022
@rashidakanchwala
Contributor Author

Closing this ticket as design and implementation work for the feature is mentioned on ticket #1136

@rashidakanchwala
Contributor Author

Update: I had a discussion with @merelcht, and the preview function will be written on the Kedro side. We are unsure whether it will provide only the preview or whether we will also share metadata about the dataset (number of rows/columns, etc.).

I am reopening this ticket: the front-end design is done, but there are still ongoing discussions around implementation.

@rashidakanchwala moved this from In Progress to To Do in Kedro Framework Mar 6, 2023
@rashidakanchwala moved this from Done to Todo in Kedro-Viz Mar 6, 2023
@tynandebold added Issue: Feature Request Python Pull requests that update Python code Javascript Pull requests that update Javascript code and removed Idea Type: Discussion Design: Research Technical Design Python Pull requests that update Python code labels Mar 6, 2023
@tynandebold
Member

This work will touch Kedro datasets as well as the backend and frontend of Viz.

The first dataset we should add a preview method to is pandas.CSVDataSet.

For the frontend work, the design was done in #1136, so check there for reference.

@Huongg moved this from Todo to In Progress in Kedro-Viz Mar 6, 2023
@Huongg self-assigned this Mar 6, 2023
@Huongg moved this from In Progress to In Review in Kedro-Viz Mar 16, 2023
@Huongg moved this from In Review to Done in Kedro-Viz Mar 24, 2023
@github-project-automation bot moved this from To Do to Done in Kedro Framework Mar 24, 2023