A unified analytics + AI platform for distributed TensorFlow, Keras and BigDL on Apache Spark
Analytics Zoo provides a unified analytics + AI platform that seamlessly unites Spark, TensorFlow, Keras and BigDL programs into an integrated pipeline; the entire pipeline can then transparently scale out to a large Hadoop/Spark cluster for distributed training or inference.
- Data wrangling and analysis using PySpark
- Deep learning model development using TensorFlow or Keras
- Distributed training/inference on Spark and BigDL
- All within a single unified pipeline and in a user-transparent fashion!
In addition, Analytics Zoo also provides a rich set of analytics and AI support for the end-to-end pipeline, including:
- Easy-to-use abstractions and APIs (e.g., transfer learning support, autograd operations, Spark Dataframe and ML pipeline support, online model serving API, etc.)
- Common feature engineering operations (for image, text, 3D image, etc.)
- Built-in deep learning models (e.g., object detection, image classification, text classification, recommendation, etc.)
- Reference use cases (e.g., anomaly detection, sentiment analysis, fraud detection, image similarity, etc.)
- To get started, please refer to the Python install guide or Scala install guide.
- For running distributed TensorFlow on Spark and BigDL, please refer to the quick start here and more examples here.
- For more information, you may refer to the Analytics Zoo documentation website.
- For additional questions and discussions, you can join the Google User Group (or subscribe to the Mail List).
- Distributed TensorFlow and Keras on Spark/BigDL
  - Data wrangling and analysis using PySpark
  - Deep learning model development using TensorFlow or Keras
  - Distributed training/inference on Spark and BigDL
  - All within a single unified pipeline and in a user-transparent fashion!
- High level abstractions and APIs
  - Transfer learning: customize pretrained models for feature extraction or fine-tuning
  - `autograd`: build custom layer/loss using auto differentiation operations
  - `nnframes`: native deep learning support in Spark DataFrames and ML Pipelines
  - Model serving: productionize model serving and inference using POJO APIs
- Built-in deep learning models
  - Object detection API: high-level API and pretrained models (e.g., SSD and Faster-RCNN) for object detection
  - Image classification API: high-level API and pretrained models (e.g., VGG, Inception, ResNet, MobileNet, etc.) for image classification
  - Text classification API: high-level API and pre-defined models (using CNN, LSTM, etc.) for text classification
  - Recommendation API: high-level API and pre-defined models (e.g., Neural Collaborative Filtering, Wide and Deep Learning, etc.) for recommendation
- Reference use cases: a collection of end-to-end reference use cases (e.g., anomaly detection, sentiment analysis, fraud detection, image augmentation, object detection, variational autoencoder, etc.)
To make it easy to build and productionize deep learning applications for Big Data, Analytics Zoo provides a unified analytics + AI platform that seamlessly unites Spark, TensorFlow, Keras and BigDL programs into an integrated pipeline (as illustrated below), which can then transparently run on large-scale Hadoop/Spark clusters for distributed training and inference. (Please see more examples here.)
- Data wrangling and analysis using PySpark

```python
from zoo import init_nncontext
from zoo.pipeline.api.net import TFDataset
import tensorflow as tf

sc = init_nncontext()

# Each record in the train_rdd consists of a list of NumPy ndarrays
train_rdd = sc.parallelize(file_list) \
    .map(lambda x: read_image_and_label(x)) \
    .map(lambda image_label: decode_to_ndarrays(image_label))

# TFDataset represents a distributed set of elements,
# in which each element contains one or more TensorFlow Tensor objects.
dataset = TFDataset.from_rdd(train_rdd,
                             names=["features", "labels"],
                             shapes=[[28, 28, 1], [1]],
                             types=[tf.float32, tf.int32],
                             batch_size=BATCH_SIZE)
```
- Deep learning model development using TensorFlow

```python
import tensorflow as tf

slim = tf.contrib.slim

images, labels = dataset.tensors
labels = tf.squeeze(labels)
with slim.arg_scope(lenet.lenet_arg_scope()):
    logits, end_points = lenet.lenet(images, num_classes=10, is_training=True)

loss = tf.reduce_mean(tf.losses.sparse_softmax_cross_entropy(logits=logits, labels=labels))
```
- Distributed training on Spark and BigDL

```python
from zoo.pipeline.api.net import TFOptimizer
from bigdl.optim.optimizer import MaxIteration, Adam, MaxEpoch, TrainSummary

optimizer = TFOptimizer(loss, Adam(1e-3))
optimizer.set_train_summary(TrainSummary("/tmp/az_lenet", "lenet"))
optimizer.optimize(end_trigger=MaxEpoch(5))
```
- Alternatively, using Keras APIs for model development and distributed training

```python
from zoo.pipeline.api.keras.models import *
from zoo.pipeline.api.keras.layers import *

model = Sequential()
model.add(Reshape((1, 28, 28), input_shape=(28, 28, 1)))
model.add(Convolution2D(6, 5, 5, activation="tanh", name="conv1_5x5"))
model.add(MaxPooling2D())
model.add(Convolution2D(12, 5, 5, activation="tanh", name="conv2_5x5"))
model.add(MaxPooling2D())
model.add(Flatten())
model.add(Dense(100, activation="tanh", name="fc1"))
model.add(Dense(class_num, activation="softmax", name="fc2"))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.fit(train_rdd, batch_size=BATCH_SIZE, nb_epoch=5)
```
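Once trained, the same Keras-style model can also run distributed inference on an RDD. Below is a minimal sketch, assuming a `test_rdd` prepared in the same way as `train_rdd` above (the variable name is an illustration, not part of the example):

```python
# Minimal sketch (test_rdd is assumed to be prepared like train_rdd above):
# predict() runs distributed inference on Spark and returns an RDD of results.
predictions = model.predict(test_rdd)
print(predictions.take(5))
```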
Analytics Zoo provides a set of easy-to-use, high-level abstractions and APIs that natively support transfer learning, autograd and custom layer/loss, Spark DataFrames and ML Pipelines, online model serving, etc.
Using the high level transfer learning APIs, you can easily customize pretrained models for feature extraction or fine-tuning. (See more details here)
- Load an existing model (pretrained in Caffe)

```python
from zoo.pipeline.api.net import *

full_model = Net.load_caffe(def_path, model_path)
```
- Remove the last few layers

```python
# create a new model by removing layers after pool5/drop_7x7_s1
model = full_model.new_graph(["pool5/drop_7x7_s1"])
```
- Freeze the first few layers

```python
# freeze layers from input to pool4/3x3_s2 inclusive
model.freeze_up_to(["pool4/3x3_s2"])
```
- Add a few new layers

```python
from zoo.pipeline.api.keras.layers import *
from zoo.pipeline.api.keras.models import *

inputs = Input(name="input", shape=(3, 224, 224))
inception = model.to_keras()(inputs)
flatten = Flatten()(inception)
logits = Dense(2)(flatten)
newModel = Model(inputs, logits)
```
`autograd` provides automatic differentiation for mathematical operations, so that you can easily build your own custom loss and layer (in both Python and Scala), as illustrated below. (See more details here)
- Define model using Keras-style API and `autograd`

```python
import zoo.pipeline.api.autograd as A
from zoo.pipeline.api.keras.layers import *
from zoo.pipeline.api.keras.models import *

input = Input(shape=[2, 20])
features = TimeDistributed(layer=Dense(30))(input)
f1 = features.index_select(1, 0)
f2 = features.index_select(1, 1)
diff = A.abs(f1 - f2)
model = Model(input, diff)
```
- Optionally define custom loss function using `autograd`

```python
# mean and abs are autograd operations (using the alias A imported above)
def mean_absolute_error(y_true, y_pred):
    return A.mean(A.abs(y_true - y_pred), axis=1)
```
- Train model with custom loss function

```python
model.compile(optimizer=SGD(), loss=mean_absolute_error)
model.fit(x=..., y=...)
```
`nnframes` provides native deep learning support in Spark DataFrames and ML Pipelines, so that you can easily build complex deep learning pipelines in just a few lines, as illustrated below. (See more details here)
- Initialize NNContext and load images into DataFrames using `NNImageReader`

```python
from zoo.common.nncontext import *
from zoo.pipeline.nnframes import *
from zoo.feature.image import *

sc = init_nncontext()
imageDF = NNImageReader.readImages(image_path, sc)
```
- Process loaded data using DataFrame transformations

```python
from pyspark.sql.functions import col, udf

getName = udf(lambda row: ...)
getLabel = udf(lambda name: ...)
df = imageDF.withColumn("name", getName(col("image"))) \
    .withColumn("label", getLabel(col("name")))
```
- Process images using built-in feature engineering operations

```python
from zoo.feature.common import ChainedPreprocessing

transformer = ChainedPreprocessing(
    [RowToImageFeature(), ImageResize(64, 64), ImageChannelNormalize(123.0, 117.0, 104.0),
     ImageMatToTensor(), ImageFeatureToTensor()])
```
- Define model using Keras-style APIs

```python
from zoo.pipeline.api.keras.layers import *
from zoo.pipeline.api.keras.models import *

model = Sequential() \
    .add(Convolution2D(32, 3, 3, activation='relu', input_shape=(1, 28, 28))) \
    .add(MaxPooling2D(pool_size=(2, 2))) \
    .add(Flatten()) \
    .add(Dense(10, activation='softmax'))
```
- Train model using Spark ML Pipelines

```python
from bigdl.nn.criterion import CrossEntropyCriterion

classifier = NNClassifier(model, CrossEntropyCriterion(), transformer) \
    .setLearningRate(0.003).setBatchSize(40).setMaxEpoch(1) \
    .setFeaturesCol("image").setCachingSample(False)
nnModel = classifier.fit(df)
```
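The fitted `nnModel` is a standard Spark ML Transformer, so (as a rough sketch) predictions can be obtained on a DataFrame with `transform`; the selected columns below are just the ones defined earlier in this example:

```python
# nnModel is a Spark ML Transformer; transform() appends a "prediction" column.
predictionDF = nnModel.transform(df).cache()
predictionDF.select("name", "label", "prediction").show()
```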
Using the POJO model serving API, you can productionize model serving and inference in any Java-based framework (e.g., Spring Framework, Apache Storm, Kafka or Flink, etc.), as illustrated below:
```java
import com.intel.analytics.zoo.pipeline.inference.AbstractInferenceModel;
import com.intel.analytics.zoo.pipeline.inference.JTensor;

public class TextClassificationModel extends AbstractInferenceModel {
    public TextClassificationModel() {
        super();
    }
}

TextClassificationModel model = new TextClassificationModel();
model.load(modelPath, weightPath);
List<JTensor> inputs = preprocess(...);
List<List<JTensor>> result = model.predict(inputs);
...
```
Analytics Zoo provides several built-in deep learning models that you can use for a variety of problem types, such as object detection, image classification, text classification, recommendation, etc.
Using Analytics Zoo Object Detection API (including a set of pretrained detection models such as SSD and Faster-RCNN), you can easily build your object detection applications (e.g., localizing and identifying multiple objects in images and videos), as illustrated below. (See more details here)
- Download object detection models in Analytics Zoo

You can download a collection of detection models (pretrained on the PASCAL VOC dataset and COCO dataset) from the detection model zoo.
- Use Object Detection API for off-the-shelf inference

```python
from zoo.models.image.objectdetection import *

model = ObjectDetector.load_model(model_path)
image_set = ImageSet.read(img_path, sc)
output = model.predict_image_set(image_set)
```
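The raw detections can then be drawn back onto the images. The snippet below is only a rough sketch along the lines of the object detection examples; the `Visualizer` usage and `label_map()` call are assumptions, not part of the code above:

```python
# Sketch only (assumed API): visualize predicted boxes and labels on the images.
config = model.get_config()
visualizer = Visualizer(config.label_map(), encoding="jpg")
visualized = visualizer(output).get_image(to_chw=False).collect()
```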
Using Analytics Zoo Image Classification API (including a set of pretrained models such as VGG, Inception, ResNet, MobileNet, etc.), you can easily build your image classification applications, as illustrated below. (See more details here)
- Download image classification models in Analytics Zoo

You can download a collection of image classification models (pretrained on the ImageNet dataset) from the image classification model zoo.
- Use Image Classification API for off-the-shelf inference

```python
from zoo.models.image.imageclassification import *

model = ImageClassifier.load_model(model_path)
image_set = ImageSet.read(img_path, sc)
output = model.predict_image_set(image_set)
```
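To map the raw prediction tensors to human-readable labels, a rough sketch along the lines of the image classification examples is shown below; `LabelOutput` and its arguments are assumptions, not guaranteed by the snippet above:

```python
# Sketch only (assumed API): decode network output into (class, probability) pairs
# using the label map shipped with the pretrained model.
label_output = LabelOutput(model.get_config().label_map(), "clses", "probs")
results = label_output(output).get_predict().collect()
```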
Analytics Zoo Text Classification API provides a set of pre-defined models (using CNN, LSTM, etc.) for text classification, as sketched below. (See more details here)
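As a rough illustration only: the `TextClassifier` constructor arguments, `class_num`, `token_length`, and the training/validation RDDs below are assumptions for the sketch, not taken from this document.

```python
# Hypothetical sketch: train a pre-defined CNN text classifier on an RDD of Samples.
# class_num, token_length, train_rdd and val_rdd are placeholders for illustration.
from zoo.models.textclassification import TextClassifier

text_classifier = TextClassifier(class_num, token_length, sequence_length=500, encoder="cnn")
text_classifier.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
text_classifier.fit(train_rdd, batch_size=BATCH_SIZE, nb_epoch=20, validation_data=val_rdd)
```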
Analytics Zoo Recommendation API provides a set of pre-defined models (such as Neural Collaborative Filtering, Wide and Deep Learning, etc.) for recommendations. (See more details here)
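Similarly, a rough sketch of the Recommendation API; the `NeuralCF` arguments and the `train_rdd` of user-item Samples are assumptions for illustration:

```python
# Hypothetical sketch: train a Neural Collaborative Filtering model on (user, item, rating) data.
# max_user_id, max_item_id and train_rdd are placeholders for illustration.
from zoo.models.recommendation import NeuralCF

ncf = NeuralCF(user_count=max_user_id, item_count=max_item_id, class_num=5, hidden_layers=[20, 10])
ncf.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
ncf.fit(train_rdd, batch_size=BATCH_SIZE, nb_epoch=10)
```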
Analytics Zoo provides a collection of end-to-end reference use cases, including time series anomaly detection, sentiment analysis, fraud detection, image similarity, etc. (See more details here)