
Investigating 'preview' potential for other datasets (Spike) #1623

Closed
MehdiNV opened this issue Nov 2, 2023 · 4 comments
MehdiNV (Contributor) commented Nov 2, 2023

Description

Currently only two datasets in kedro-plugins/kedro-datasets (repo) have a 'preview' method. For CSV, this collates the first 5 rows to display in Viz.

The investigation is to see whether other datasets could have the same thing implemented for them: what should the 'preview data' look like? Will it necessarily be tabular? Is it even possible for a given dataset?

The idea is to have, after the spike investigation, an answer covering all the ways these datasets could have 'preview' implemented for them.

Context

Our aim has been to extend the 'preview' method past just 2 datasets - doing this investigation now will save us a lot of time later when we finally get around to doing it. The answers / document from this issue can guide how that implementation proceeds.

Possible Implementation

No implementation, just investigating each dataset and writing down what its preview format should be (e.g. for tabular datasets, it should be the first 5 rows).
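For the tabular case, the "first 5 rows" rule can be sketched as a small helper. This is a hypothetical illustration only (the function name and the `{'columns': ..., 'data': ...}` return shape are assumptions, not the actual Viz contract), using the stdlib csv module:

```python
import csv
import io

def preview_tabular(csv_text: str, nrows: int = 5) -> dict:
    """Hypothetical preview helper: return the column names plus the
    first `nrows` data rows of a CSV payload."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # zip with a range so we never read more than `nrows` rows
    rows = [row for _, row in zip(range(nrows), reader)]
    return {"columns": header, "data": rows}

sample = "a,b\n1,2\n3,4\n5,6\n7,8\n9,10\n11,12\n"
print(preview_tabular(sample))  # first data row is ['1', '2']
```

The same shape would apply to any dataset whose load result can be sliced into a header plus rows.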

MehdiNV (Contributor) commented Nov 6, 2023

Already implemented and functioning:

Excel

Tabular data type; preview displays first few rows


CSV

Same as above, tabular


MatplotlibWriter

Basically an image



Not yet implemented; dataset types where 'preview' could be added

API

Its main method fetches a JSON payload from an endpoint.

Suggestion: The data payload will be JSON, so it can be arranged in a tabular format (e.g. each JSON key becomes a column heading, and each row holds the 1st element, 2nd element, 3rd, and so on)
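A sketch of that suggestion (the function name and return shape are hypothetical, and it assumes every key maps to an equal-length list of values):

```python
import json

def json_to_rows(payload: dict, nrows: int = 5) -> dict:
    """Arrange a key -> list-of-values JSON payload into table rows:
    keys become column headings, the i-th value under each key forms
    row i. Hypothetical sketch, not the actual implementation."""
    columns = list(payload)
    # cap at the shortest column so ragged payloads don't raise
    length = min(nrows, min(len(v) for v in payload.values()))
    rows = [[payload[col][i] for col in columns] for i in range(length)]
    return {"columns": columns, "data": rows}

payload = json.loads('{"name": ["ada", "bob"], "score": [10, 7]}')
print(json_to_rows(payload))
# columns ['name', 'score']; rows [['ada', 10], ['bob', 7]]
```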


Biosequence

Dataset saved in POSIX format; either as a long string of bio-data (e.g. ttccttaccc) or in a tabular form.

Suggestion: Display the first few characters of the bio-data string (e.g. with '...' at the end), or the first few rows if it's tabular. If it's neither, it may be best to skip this dataset, as accommodating bio-data otherwise may prove too much work for just one specific dataset
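The string case of that suggestion is nearly a one-liner; a minimal sketch (helper name and the 40-character limit are arbitrary assumptions):

```python
def preview_sequence(seq: str, limit: int = 40) -> str:
    """Truncate a long biosequence string for display, appending
    '...' when it exceeds the limit. Hypothetical helper."""
    return seq if len(seq) <= limit else seq[:limit] + "..."

print(preview_sequence("ttcc"))              # short strings pass through
print(preview_sequence("ttccttaccc" * 10))   # long strings end in '...'
```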

Parquet

Tabular in nature; although Parquet's on-disk layout is columnar, it loads into a DataFrame just like an ordinary table.

Suggestion: Display the first few rows, given its close resemblance to a table format

Email Message

Saves email messages using the email library; data should be formatted as an object with keys like ["Subject"], ["From"], ["To"], etc.

Suggestion: Display the subject, from and to - and a truncated version of the email message content
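A rough sketch of that suggestion using the stdlib email library (the `preview_email` helper and the 80-character body limit are assumptions for illustration):

```python
from email.message import EmailMessage

def preview_email(msg: EmailMessage, body_limit: int = 80) -> dict:
    """Hypothetical preview: Subject/From/To headers plus a
    truncated version of the message body."""
    body = msg.get_content().strip()
    if len(body) > body_limit:
        body = body[:body_limit] + "..."
    return {
        "Subject": msg["Subject"],
        "From": msg["From"],
        "To": msg["To"],
        "Body": body,
    }

msg = EmailMessage()
msg["Subject"] = "Spike results"
msg["From"] = "a@example.com"
msg["To"] = "b@example.com"
msg.set_content("Findings attached. " * 10)
print(preview_email(msg))  # long body ends in '...'
```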


GeoJSONDataset

JSON payload ("col1", "col2", "col3" attributes), with an additional attribute called 'geometry' that's an array of 'Points'.

Suggestion: Display the first few rows of the columns, alongside a truncated left-to-right string representation of the 'geometry' array


HoloviewsWriter

Saves data as an image onto a specific file

Suggestion: Display the saved image, similar to how we display plotly's - a small, compressed image in the preview window


JSONDataSet

As the name suggests, a JSON payload and representation; several attribute keys, each with corresponding values.

Suggestion: Display in tabular format with the first few rows (each row holding the 1st values across all attributes, then the 2nd values, etc.), with the attributes as the column headings. If a value is itself an object (a rabbit hole), it can simply be displayed as {...}
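The '{...}' convention for nested values could be sketched like this (hypothetical helper; the placeholder strings are assumptions):

```python
def summarise_value(value):
    """Collapse nested structures so a preview cell stays readable:
    dicts render as '{...}', lists as '[...]'. Scalars pass through.
    The placeholder strings are an assumed convention."""
    if isinstance(value, dict):
        return "{...}"
    if isinstance(value, list):
        return "[...]"
    return value

record = {"id": 1, "meta": {"depth": "rabbit hole"}, "tags": ["a", "b"]}
preview_row = {k: summarise_value(v) for k, v in record.items()}
print(preview_row)  # {'id': 1, 'meta': '{...}', 'tags': '[...]'}
```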

GMLDataset

GML is essentially XML data (intended for geographical representation).

Suggestion: Display first few lines of the GML (XML) lines of code, with the rest truncated


GraphMLDataset

Same thing as above, but intended for graphs instead


Suggestion: Similar to above, simply display the first few lines. Alternatively, if possible, use the specifications in the GraphML format to render the graph visually for the user, though this may be too large in scope.

networkx.JSONDataSet

Same thing as the previous JSONDataset; no discernible difference was seen during investigation, though I recommend double-checking

FeatherDataset

Saved in POSIX format; this dataset is part of the pandas group (pandas.FeatherDataset), so it should be loadable via pandas in a tabular form.

Suggestion: Load the first few rows in tabular format and display them as an indication of what the data looks like. Even if its final form isn't tabular, it's still loadable as a pandas table, so we can choose the preview display that's most convenient for us technically

GBQQueryDataSet

"GBQQueryDataSet loads data from a provided SQL query from Google BigQuery. It uses pandas.read_gbq which itself uses pandas-gbq internally to read from BigQuery table. Therefore it supports all allowed pandas options on read_gbq." (Docs).

Hence data loaded from query should be obtainable as a table

Suggestion: Display first few rows of SQL table

GBQTableDataSet

"GBQTableDataSet loads and saves data from/to Google BigQuery. It uses pandas-gbq to read and write from/to BigQuery table." (Docs)

Data should be loadable as a table (e.g. with schema, data payload, etc)

Suggestion: Display first few rows, using schema to compose the table columns

GenericDataSet

'pandas.GenericDataSet loads/saves data from/to a data file using an underlying filesystem (e.g.: local, S3, GCS). It uses pandas to dynamically select the appropriate type of read/write target on a best effort basis.' (Docs).

Hard to ascertain what data type will be used, as it'll be decided dynamically by pandas when the generic data is loaded in.

Suggestion: Ignore, out of scope; otherwise we'd have to write several cases to accommodate all the possible ways pandas can choose to load generic data (whether it picks x format, y format, etc.)

HDFDataset

Handles HDF files, 'HDFDataSet loads/saves data from/to a hdf file using an underlying filesystem (e.g. local, S3, GCS). It uses pandas.HDFStore to handle the hdf file.' (Docs).

The file type seems amenable to a tabular display, especially since pandas would be used to load it.

Suggestion: If tabular, display a few rows

pandas.JSONDataset

Same as previous JSON datasets

pandas.ParquetDataset

Same as previous Parquet dataset

SQLQueryDataset

'SQLQueryDataSet loads data from a provided SQL query' (Docs). Effectively a SQL query provided by the user.

Suggestion: Getting the resulting query data may be difficult based on current examination of the code base - recommendation instead is to display the query used (with truncation if it's too long) in the Preview panel. This way, the user has at least some idea of the source of the data node / how the data was fetched.
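A minimal sketch of that query-display fallback (helper name and the 60-character limit are assumptions; whitespace is collapsed so the query fits on one line):

```python
def preview_query(sql: str, limit: int = 60) -> str:
    """Show the query itself in the preview panel, truncated when
    long. Hypothetical helper; normalises whitespace so a multi-line
    query reads as a single line."""
    one_line = " ".join(sql.split())
    return one_line if len(one_line) <= limit else one_line[:limit] + "..."

query = """
SELECT id, name, created_at
FROM users
WHERE active = true
ORDER BY created_at DESC
"""
print(preview_query(query))       # truncated, ends with '...'
print(preview_query("SELECT 1"))  # short queries pass through
```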

SQLTableDataset

'SQLTableDataSet loads data from a SQL table and saves a pandas dataframe to a table' (Docs).

Suggestion: Since the data is saved as '...a pandas dataframe to a table', the recommendation is a tabular display (only the first few rows)

pandas.XMLDataset

'XMLDataSet loads/saves data from/to a XML file using an underlying filesystem (e.g.: local, S3, GCS). It uses pandas to handle the XML file.' (Docs)

The dataset should nonetheless be in a tabular form, since pandas parses the XML into a DataFrame.

Suggestion: Display as a tabular format, with a few rows (each element being the 0th, 1st, 2nd indexes in the arrays linked to each column).

PickleDataset

Same as above; loading is effectively a column-to-array-of-values relationship, which is translatable to a table.

This, combined with the description:
'PickleDataSet loads/saves data from/to a Pickle file using an underlying filesystem (e.g.: local, S3, GCS). The underlying functionality is supported by the specified backend library passed in (defaults to the pickle library), so it supports all allowed options for loading and saving pickle files.' (Docs)

means that the data should be displayed as tabular.

Suggestion: Same as above

ImageDataset

'ImageDataSet loads/saves image data as numpy from an underlying filesystem (e.g.: local, S3, GCS). It uses Pillow to handle image file.' (Docs)

Suggestion: Display as image

plotly.JSONDataset

Display like previous JSON Datasets above

plotly.PlotlyDataset

'PlotlyDataSet generates a plot from a pandas DataFrame and saves it to a JSON file using an underlying filesystem (e.g.: local, S3, GCS). It loads the JSON into a plotly figure.' (Docs)

Suggestion: Display as an image / plot graph (the plotly figure); this is already done similarly to MatplotlibWriter, hence take the same approach as for the latter.

redis.PickleDataset

Same as previous PickleDataset

DeltaTableDataset

'DeltaTableDataSet loads data into DeltaTable objects.' (Docs).

From examination, the dataset is loaded and exposed as a table (a DeltaTable object).

Suggestion: Display first few rows as a tabular display

SparkDataset

True to its name, 'SparkDataSet loads and saves Spark dataframes.' (Docs).

The resulting output should resemble a table; if not, it is easily translatable into a table format.

Suggestion: Display as tabular format, first few rows

SparkHiveDataset

'SparkHiveDataSet loads and saves Spark dataframes stored on Hive. This data set also handles some incompatible file types such as using partitioned parquet on hive which will not normally allow upserts to existing data without a complete replacement of the existing file/partition.' (Docs).

Similar to above, except that Spark pulls its original data from Hive.

Suggestion: Same as above

SparkJDBCDataset

Same as above, except source is from JDBC; 'SparkJDBCDataSet loads data from a database table accessible via JDBC URL url and connection properties and saves the content of a PySpark DataFrame to an external database table via JDBC' (Docs).

Suggestion: Same as above - end result is the same, even if the source is different

SVMLightDataSet

'_SVMLightDataSet loads/saves data from/to a svmlight/libsvm file using an underlying filesystem (e.g.: local, S3, GCS). It uses sklearn functions dump_svmlight_file to save and load_svmlight_file to load a file.

Data is loaded as a tuple of features and labels. Labels is NumPy array, and features is Compressed Sparse Row matrix._' (Docs)

From examination, the data resembles a table in its display: effectively a NumPy array (the labels) alongside the sparse feature matrix, which can be translated into a few rows of a table.

Suggestion: Display first few rows in tabular format

TensorFlowDataset

'TensorflowModelDataset loads and saves TensorFlow models. The underlying functionality is supported by, and passes input arguments through to, TensorFlow 2.X load_model and save_model methods.' (Docs)

Suggestion: Ignore - displaying TensorFlow results will be a series of numbers denoting the performance of each layer or ML node; will likely be difficult to format into a table or display clearly, hence out of technical scope.

TextDataset

'TextDataSet loads/saves data from/to a text file using an underlying filesystem (e.g.: local, S3, GCS)' (Docs)

Effectively a string.

Suggestion: Display the string, truncated to a limit of X characters (however you see fit) so that at least the first few sentences are shown to the user. If the string is empty, simply display 'Empty file / no characters in string' or similar.

tracking.JSONDataset

Same as previous JSON datasets

tracking.MetricsDataset

'MetricsDataSet saves data to a JSON file using an underlying filesystem (e.g.: local, S3, GCS). It uses native json to handle the JSON file. The MetricsDataSet is part of Kedro Experiment Tracking. The dataset is write-only, it is versioned by default and only takes metrics of numeric values.' (Docs)

Data is effectively in JSON format, hence the same approach taken for previous JSON / JSON-based datasets can be replicated here as well.


Suggestion: Display the first few rows, after converting it to a tabular format

YAMLDataset

'YAMLDataSet loads/saves data from/to a YAML file using an underlying filesystem (e.g.: local, S3, GCS). It uses PyYAML to handle the YAML file.' (Docs)


The loaded data will likely be a JSON-like structure.

Suggestion: This will likely not be a large collection of data, so it may be best to display it as beautified JSON (left-to-right text), truncated at the bottom if it gets too long. This keeps it close to the original YAML, which is usually JSON-adjacent anyway, and so in line with user expectations and experience.
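A sketch of the beautified-JSON idea, assuming PyYAML has already parsed the YAML into a plain dict (so only stdlib json is needed here; the helper name and the 10-line cutoff are assumptions):

```python
import json

def preview_yaml(parsed: dict, max_lines: int = 10) -> str:
    """Render an already-parsed YAML document (a plain dict, since
    PyYAML would have done the parsing) as beautified JSON,
    truncated by line count. Hypothetical helper."""
    pretty = json.dumps(parsed, indent=2)
    lines = pretty.splitlines()
    if len(lines) > max_lines:
        lines = lines[:max_lines] + ["..."]
    return "\n".join(lines)

config = {"train": {"epochs": 10, "lr": 0.01}, "data": {"path": "raw.csv"}}
print(preview_yaml(config))
```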

datajoely (Contributor) commented:
I think GenericDataSet and all the pandas datasets should be easy to preview, as the output will always be a table.

If we did Pandas, Spark (and Polars for fun) that gets us most of the tables people write today.

As an added bonus, is there a route for custom datasets to provide the required functionality with no extra work on the Viz side?

astrojuanlu (Member) commented:

GeoJSON and GML could be represented as maps with a small embed of Leaflet.js, like http://geojson.io/

rashidakanchwala (Contributor) commented:

This ticket is done. The implementation ticket for this is #1622
