Investigating 'preview' potential for other datasets (Spike) #1623
Comments
**Already implemented and functioning:**

- **Excel**: tabular data type; the preview displays the first few rows.
- **CSV**: same as above, tabular.
- **MatplotlibWriter**: basically an image.

**Not implemented, potential dataset types that could be implemented:**

- **API**: its main method fetches a JSON payload from an endpoint. Suggestion: since the data payload will be JSON, arrange it in a tabular format (e.g. each JSON key becomes a column heading, with each row holding the 1st elements, then the 2nd, the 3rd and so on).
- **BiosequenceDataset**: saved in POSIX format, either as a long string of bio-data (e.g. `ttccttaccc`) or in a tabular form. Suggestion: display the first few characters of the bio-data string (e.g. with '...' at the end), or the first few rows if it's tabular. If it's neither, it may be best to ignore this dataset, as accommodating biodata otherwise may prove too much workload for one specific dataset.
- **Parquet**: should be tabular in nature; the most common storage layouts for Parquet use a row-based format identical to ordinary tables. Suggestion: display the first few rows, given its close resemblance to a table format.
- **EmailMessageDataset**: saves email messages using the `email` library; the data should be formatted as an object with keys like `["Subject"]`, `["From"]`, `["To"]`, etc. Suggestion: display the subject, from and to, plus a truncated version of the message content.
- **GeoJSONDataset**: a JSON payload (`"col1"`, `"col2"`, `"col3"` attributes), with an additional `geometry` attribute that is an array of `Point`s. Suggestion: display the first few rows of the columns, alongside a truncated left-to-right string representation of the `geometry` array.
- **HoloviewsWriter**: saves data as an image to a specific file. Suggestion: display the saved image, similar to how we display plotly's: a small, compressed image in the preview window.
- **JSONDataSet**: per the name, a JSON payload and representation; several attribute keys with corresponding values. Suggestion: display in tabular format with the first few rows (each row being the 1st values across all attributes, then the 2nd values, etc.), with the column headings being the attributes. If values are nested objects (a potential rabbit hole), they can simply be shown as `{...}`.
- **GMLDataset**: GML is essentially XML data, intended for geographical representation. Suggestion: display the first few lines of the GML (XML), with the rest truncated.
- **GraphMLDataset**: the same as above, but intended for graphs. Suggestion: similar to the above, simply display the first few lines. Alternatively, if possible, use the XML data to construct the graph visually for the user from the GraphML specification, though this may be too much scope.
- **networkx.JSONDataSet**: the same as the previous JSONDataset; no discernible difference was seen during investigation, though it's recommended to double-check.
- **FeatherDataset**: saved in POSIX format. This dataset is part of pandas (`pandas.FeatherDataset`), so it should be loadable via pandas in tabular form, as in the sketch below. Suggestion: load the first few rows in tabular format and display them as an indication of what the data looks like. Even if the data won't be tabular in its final form, it's still loadable as a pandas table, so we can choose whichever preview display is most convenient for us technically.
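A minimal sketch of the Feather idea, assuming a standard pandas round-trip; the file path and row count are illustrative assumptions, not values from the codebase:

```python
import pandas as pd

# Load the Feather file and keep only the first few rows as the
# preview payload. Path and row count are illustrative assumptions.
df = pd.read_feather("data/01_raw/example.feather")
preview = df.head(5).to_dict(orient="split")  # index/columns/data keys, easy to render
print(preview)
```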
- **GBQQueryDataSet**: "GBQQueryDataSet loads data from a provided SQL query from Google BigQuery. It uses pandas.read_gbq which itself uses pandas-gbq internally to read from BigQuery table. Therefore it supports all allowed pandas options on read_gbq." (Docs). Hence the data loaded from the query should be obtainable as a table. Suggestion: display the first few rows of the resulting table.
- **GBQTableDataSet**: "GBQTableDataSet loads and saves data from/to Google BigQuery. It uses pandas-gbq to read and write from/to BigQuery table." (Docs). The data should be loadable as a table (e.g. with schema, data payload, etc.). Suggestion: display the first few rows, using the schema to compose the table columns.
- **GenericDataSet**: "pandas.GenericDataSet loads/saves data from/to a data file using an underlying filesystem (e.g.: local, S3, GCS). It uses pandas to dynamically select the appropriate type of read/write target on a best effort basis." (Docs). It's hard to ascertain which data type will be used, as pandas decides dynamically when the data is loaded. Suggestion: ignore, out of scope; otherwise we'd have to write several cases to accommodate every format pandas might choose for generic data.
- **HDFDataset**: handles HDF files. "HDFDataSet loads/saves data from/to a hdf file using an underlying filesystem (e.g. local, S3, GCS). It uses pandas.HDFStore to handle the hdf file." (Docs). The file type seems similar to tabular displays, all the more so given that pandas is used to load it. Suggestion: if tabular, display a few rows.
- **pandas.JSONDataset**: same as the previous JSON datasets.
- **pandas.ParquetDataset**: same as the previous Parquet dataset.
- **SQLQueryDataset**: "SQLQueryDataSet loads data from a provided SQL query" (Docs). Effectively a SQL query provided by the user. Suggestion: getting the resulting query data may be difficult based on the current examination of the code base, so the recommendation is instead to display the query used (truncated if it's too long) in the preview panel. This way the user has at least some idea of the source of the data node and how the data was fetched.
- **SQLTableDataset**: "SQLTableDataSet loads data from a SQL table and saves a pandas dataframe to a table" (Docs). Suggestion: since the data is saved as "...a pandas dataframe to a table", the recommendation is a tabular display (only the initial first few rows).
- **pandas.XMLDataset**: "XMLDataSet loads/saves data from/to a XML file using an underlying filesystem (e.g.: local, S3, GCS). It uses pandas to handle the XML file." (Docs). The dataset should nevertheless be in a tabular form. Suggestion: display in tabular format with a few rows (each element being the 0th, 1st, 2nd indexes of the arrays linked to each column).
- **PickleDataset**: same as above; loading is effectively a column-to-array-of-values relationship, which translates directly to a table, and the docs description points the same way, so the data should be displayed as tabular. Suggestion: same as above.
- **ImageDataset**: "ImageDataSet loads/saves image data as numpy from an underlying filesystem (e.g.: local, S3, GCS). It uses Pillow to handle image file." (Docs). Suggestion: display as an image.
- **plotly.JSONDataset**: display like the previous JSON datasets above.
- **plotly.PlotlyDataset**: "PlotlyDataSet generates a plot from a pandas DataFrame and saves it to a JSON file using an underlying filesystem (e.g.: local, S3, GCS). It loads the JSON into a plotly figure." (Docs). Suggestion: display as an image / plotted graph (the plotly figure); something similar is already done for MatplotlibWriter, so take the same approach as for the latter. A sketch follows below.
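A hedged sketch of the plotly suggestion: load the saved JSON back into a figure and export a small static image. The path and size are assumptions, and static export relies on the optional kaleido package:

```python
import plotly.io as pio

# Read the saved plotly JSON back into a figure, then export a small
# static image for the preview window. Path and size are illustrative;
# write_image requires the kaleido package to be installed.
with open("data/08_reporting/plot.json", encoding="utf-8") as f:
    fig = pio.from_json(f.read())

fig.write_image("preview.png", width=320, height=240)
```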
- **redis.PickleDataset**: same as the previous PickleDataset.
- **DeltaTableDataset**: "DeltaTableDataSet loads data into DeltaTable objects." (Docs). From examination, the dataset appears to be loaded and configured as a table (a DeltaTable). Suggestion: display the first few rows as a tabular display.
- **SparkDataset**: true to its name, "SparkDataSet loads and saves Spark dataframes." (Docs). The resulting output should resemble a table; if not, it's easily translatable into a table format. Suggestion: display in tabular format, first few rows.
- **SparkHiveDataset**: "SparkHiveDataSet loads and saves Spark dataframes stored on Hive. This data set also handles some incompatible file types such as using partitioned parquet on hive which will not normally allow upserts to existing data without a complete replacement of the existing file/partition." (Docs). Similar to the above, except that Spark pulls its original data from Hive. Suggestion: same as above.
- **SparkJDBCDataset**: same as above, except the source is JDBC; "SparkJDBCDataSet loads data from a database table accessible via JDBC URL url and connection properties and saves the content of a PySpark DataFrame to an external database table via JDBC" (Docs). Suggestion: same as above; the end result is the same, even if the source is different.
- **SVMLightDataSet**: "SVMLightDataSet loads/saves data from/to a svmlight/libsvm file using an underlying filesystem (e.g.: local, S3, GCS). It uses sklearn functions dump_svmlight_file to save and load_svmlight_file to load a file. Data is loaded as a tuple of features and labels. Labels is NumPy array, and features is Compressed Sparse Row matrix." (Docs). From examination, the data resembles a table: effectively a NumPy array, which can be translated into a simple few rows of a table. Suggestion: display the first few rows in tabular format.
- **TensorFlowDataset**: "TensorflowModelDataset loads and saves TensorFlow models. The underlying functionality is supported by, and passes input arguments through to, TensorFlow 2.X load_model and save_model methods." (Docs). Suggestion: ignore. Displaying TensorFlow results would amount to a series of numbers denoting the performance of each layer or ML node, which would be difficult to format into a table or display clearly; out of technical scope.
- **TextDataset**: "TextDataSet loads/saves data from/to a text file using an underlying filesystem (e.g.: local, S3, GCS)" (Docs). Effectively a string; see the sketch below. Suggestion: display the string, truncated to a limit of X characters (however we see fit), so that at least the first few sentences are shown to the user. If the string is empty, simply display 'Empty file / no characters in string' or similar.
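A minimal sketch of that truncation idea; the character budget and file path are illustrative assumptions:

```python
# Read at most MAX_CHARS + 1 characters so we can tell whether the
# file was truncated. Both the limit and the path are made up.
MAX_CHARS = 500

with open("data/01_raw/notes.txt", encoding="utf-8") as f:
    text = f.read(MAX_CHARS + 1)

if not text:
    preview = "Empty file / no characters in string"
elif len(text) > MAX_CHARS:
    preview = text[:MAX_CHARS] + "..."
else:
    preview = text

print(preview)
```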
- **tracking.JSONDataset**: same as the previous JSON datasets.
- **tracking.MetricsDataset**: "MetricsDataSet saves data to a JSON file using an underlying filesystem (e.g.: local, S3, GCS). It uses native json to handle the JSON file. The MetricsDataSet is part of Kedro Experiment Tracking. The dataset is write-only, it is versioned by default and only takes metrics of numeric values." (Docs). The data is effectively JSON, so the approach taken for the previous JSON-based datasets can be replicated here as well. Suggestion: display the first few rows after converting to a tabular format.
- **YAMLDataSet**: "YAMLDataSet loads/saves data from/to a YAML file using an underlying filesystem (e.g.: local, S3, GCS). It uses PyYAML to handle the YAML file." (Docs). The loaded data will likely be JSON-like. Suggestion: this will likely not be a large collection of data, so it may be best to simply display it as a beautified JSON rendering (e.g. left-to-right text), truncated at the bottom if it gets too long. This keeps it close to the original YAML format, which is usually JSON-adjacent anyway, and so in line with user expectations and experience. A sketch follows below.
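A sketch of the YAML suggestion, assuming the file fits comfortably in memory; the path and truncation limit are illustrative:

```python
import json
import yaml  # PyYAML

# Load the YAML and render it as pretty-printed JSON for the preview
# panel. Path and character limit are illustrative assumptions.
with open("conf/base/parameters.yml", encoding="utf-8") as f:
    data = yaml.safe_load(f)

pretty = json.dumps(data, indent=2, default=str)
print(pretty[:500])  # truncate if it runs long
```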
I think if we did Pandas, Spark (and Polars for fun), that gets us most of the tables people write today. As an added bonus, is there a route for custom datasets to provide the required functionality with no extra work on the Viz side?
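One possible route, sketched here purely as an assumption rather than an agreed contract: Viz could duck-type on a preview method, so any custom dataset that returns a small JSON-serialisable payload gets support for free. The class and method names below are hypothetical:

```python
import pandas as pd

# Hypothetical convention: Viz checks for a `preview()` method and
# calls it if present, so custom datasets opt in with one method.
class MyCustomDataset:
    def _load(self) -> pd.DataFrame:
        # Stand-in for the dataset's real load logic.
        return pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

    def preview(self, nrows: int = 5) -> dict:
        # Small, JSON-serialisable payload for the Viz preview panel.
        return self._load().head(nrows).to_dict(orient="split")
```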
GeoJSON and GML could be represented as maps with a small embed of Leaflet.js, like http://geojson.io/
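From the Python side, a hedged sketch of that map idea using folium (a Python wrapper around Leaflet.js), assuming Viz could embed the generated HTML; the file paths are illustrative:

```python
import json
import folium  # Python wrapper around Leaflet.js

# Render a GeoJSON payload on a Leaflet map and save it as HTML that
# could be embedded (e.g. in an iframe). Paths are made up.
with open("data/01_raw/regions.geojson", encoding="utf-8") as f:
    geojson_data = json.load(f)

m = folium.Map(zoom_start=2)
folium.GeoJson(geojson_data).add_to(m)
m.save("preview_map.html")
```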
This ticket is done. The implementation ticket for this is #1622
Description
Currently only two datasets in kedro-plugins/kedro-datasets (repo) have a 'preview' method. This collates, say for CSV, the first 5 rows to display in Viz. The investigation is to see whether other datasets could have the same thing implemented for them: what should the 'preview data' look like? Will it necessarily be tabular? Is it even possible for a given dataset?
The idea is to have an answer, after the spike investigation, of all the ways those datasets could have 'preview' implemented for them.
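For reference, roughly what the existing CSV behaviour amounts to; a sketch only, with the path and row count as assumptions rather than the actual kedro-datasets implementation:

```python
import pandas as pd

# Read only the first few rows of the CSV and hand them to Viz in a
# serialisable form. Path and nrows are illustrative assumptions.
preview_df = pd.read_csv("data/01_raw/companies.csv", nrows=5)
print(preview_df.to_dict(orient="split"))
```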
Context
Our aim has been to extend the 'preview' method beyond just two datasets. Doing this investigation now will save us a lot of time later, when we finally get around to it. The answers / document from this issue can guide how that implementation proceeds.
Possible Implementation
No implementation, just investigating each dataset and writing down what its preview format should be (e.g. for tabular datasets, it should be the first 5 rows).