This project contains a set of examples that demonstrate how raw data in various formats can be loaded, split and preprocessed to build serializable (and hence reproducible) ETL pipelines using the DataVec library.
Go back to the main repository page to explore other features/functionality of the Eclipse Deeplearning4J ecosystem. File an issue here to request new features.
The examples in this project and what they demonstrate are briefly described below. This is also the recommended order to explore them in.
InputSplit and its implementations are utility classes for defining and managing a catalog of loadable locations (paths/files), in memory, that can later be exposed through an Iterator. In simple terms, they define where your data comes from or should be saved to, when building a data pipeline with DataVec.
- Ex01_FileSplitExample.java Using FileSplit which loads files in a given location. Constructor overloading allows for varying functionality like filtering files to load, loading recursively etc
- Ex02_CollectionSplitExample.java Create a split from a collection of URIs
- Ex03_NumberedFileInputSplitExample.java Create a split from numbered files, following a common pattern like file1.txt, file2.txt ... file100.txt
- Ex04_TransformSplitExample.java Maps URIs of a given split to new URIs. Useful when features and labels are in different files sharing a common naming scheme, and the name of the output file can be determined given the name of the input file. Eg. a-in.csv and a-out.csv
- Ex05_SamplingBaseInputSplitExample.java Generate several splits from the main split say for training, validation and testing.
- Ex06_KFoldIteratorFromDataSet.java Generate a K-Fold iterator from a dataset
- IrisCSVTransform.java A basic example that introduces users to important concepts like Schema and TransformProcess with categoricalToInteger.
- CSVMixedDataTypesLocal.java Common preprocessing steps like removing unnecessary columns, filtering based on column value, replacing invalid values, parsing date time etc
- CSVMixedDataTypes.java Same as the above but with Apache Spark
- PrintSchemasAtEachStep.java How to print schema at each step which would be useful for debugging transform scripts in a complicated pipeline
- IrisAnalysis.java Basic Analysis of the dataset saved and presented as an html file
- IrisNormalizer.java Proper useage of preprocessors with min max scaler
- JoinExample.java Perform joins on datasets
- PivotExample.java Combine multiple independent records by key.
- WebLogDataExample.java Preprocessing/aggregation operations on some web log data
- CustomReduceExample.java Custom Reduction example for operations on some simple CSV data that involve a custom reduction.
- MultiOpReduceExample.java Reduce example with multiple ops on one column
- CSVtoMapFileConversion.java A simple example on how to convert a CSV text file to a Hadoop MapFile format for better performance and the convenience of randomization supported by the MapFileRecordReader
- SVMLightExample.java MNIST SVMLight example
- ImagePipelineExample.java An imagepipeline that also demonstrates using transforms to augment a small dataset