Eclipse Deeplearning4j: Data pipeline, DataVec Examples

This project contains a set of examples that demonstrate how raw data in various formats can be loaded, split and preprocessed to build serializable (and hence reproducible) ETL pipelines using the DataVec library.

Go back to the main repository page to explore other features/functionality of the Eclipse Deeplearning4J ecosystem. File an issue here to request new features.

The examples in this project and what they demonstrate are briefly described below. This is also the recommended order to explore them in.

Loading Data

InputSplit and its implementations are utility classes for defining and managing a catalog of loadable locations (paths/files), in memory, that can later be exposed through an Iterator. In simple terms, they define where your data comes from or should be saved to, when building a data pipeline with DataVec.

Ex01_FileSplitExample.java Using FileSplit which loads files in a given location. Constructor overloading allows for varying functionality like filtering files to load, loading recursively etc
Ex02_CollectionSplitExample.java Create a split from a collection of URIs
Ex03_NumberedFileInputSplitExample.java Create a split from numbered files, following a common pattern like file1.txt, file2.txt ... file100.txt
Ex04_TransformSplitExample.java Maps URIs of a given split to new URIs. Useful when features and labels are in different files sharing a common naming scheme, and the name of the output file can be determined given the name of the input file. Eg. a-in.csv and a-out.csv
Ex05_SamplingBaseInputSplitExample.java Generate several splits from the main split say for training, validation and testing.
Ex06_KFoldIteratorFromDataSet.java Generate a K-Fold iterator from a dataset

Cleaning, Transforming and Analysing Data

IrisCSVTransform.java A basic example that introduces users to important concepts like Schema and TransformProcess with categoricalToInteger.
CSVMixedDataTypesLocal.java Common preprocessing steps like removing unnecessary columns, filtering based on column value, replacing invalid values, parsing date time etc
CSVMixedDataTypes.java Same as the above but with Apache Spark
PrintSchemasAtEachStep.java How to print schema at each step which would be useful for debugging transform scripts in a complicated pipeline
IrisAnalysis.java Basic Analysis of the dataset saved and presented as an html file
IrisNormalizer.java Proper useage of preprocessors with min max scaler
JoinExample.java Perform joins on datasets
PivotExample.java Combine multiple independent records by key.
WebLogDataExample.java Preprocessing/aggregation operations on some web log data
CustomReduceExample.java Custom Reduction example for operations on some simple CSV data that involve a custom reduction.
MultiOpReduceExample.java Reduce example with multiple ops on one column

Formats

CSVtoMapFileConversion.java A simple example on how to convert a CSV text file to a Hadoop MapFile format for better performance and the convenience of randomization supported by the MapFileRecordReader
SVMLightExample.java MNIST SVMLight example
ImagePipelineExample.java An imagepipeline that also demonstrates using transforms to augment a small dataset

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Eclipse Deeplearning4j: Data pipeline, DataVec Examples

Loading Data

Cleaning, Transforming and Analysing Data

Formats

Files

README.md

Latest commit

History

README.md

File metadata and controls

Eclipse Deeplearning4j: Data pipeline, DataVec Examples

Loading Data

Cleaning, Transforming and Analysing Data

Formats