Releases: microsoft/SynapseML
v0.13
New Functionality:
- Export trained LightGBM models for evaluation outside of Spark
- LightGBM on Spark supports multiple cores per executor
- `CNTKModel` works with multi-input multi-output models of any CNTK datatype
- Added Minibatching and Flattening transformers for adding flexible batching logic to pipelines, deep networks, and web clients
- Added a `Benchmark` test API for tracking model performance across versions
- Added a `PartitionConsolidator` function for aggregating streaming data onto one partition per executor (for use with connection/rate-limited HTTP services)
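The batching idea behind the new minibatching and flattening transformers can be sketched in plain Python. This is only an illustration of the concept, not the MMLSpark API; the `minibatch` and `flatten` helper names below are hypothetical:

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def minibatch(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group an incoming stream of rows into fixed-size batches."""
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch

def flatten(batches: Iterable[List[T]]) -> Iterator[T]:
    """Undo the batching, restoring one element per row."""
    for batch in batches:
        yield from batch

batches = list(minibatch(range(7), 3))
restored = list(flatten(batches))
```

In a pipeline, a batching stage like this lets a deep network or web client process many rows per call, while a flattening stage restores the original row-per-element shape afterwards.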
Updates and Improvements:
- Updated to Spark 2.3.0
- Added Databricks notebook tests to the build system
- `CNTKModel` uses significantly less memory
- Simplified example notebooks
- Simplified APIs for MMLSpark Serving
- Simplified APIs for CNTK on Spark
- LightGBM stability improvements
- `ComputeModelStatistics` stability improvements
Acknowledgements:
We would like to acknowledge the external contributors who helped create this version of MMLSpark (in order of commit history):
- 严伟, @terrytangyuan, @ywskycn, @dvanasseldonk, Jilong Liao, @chappers, @ekaterina-sereda-rf
v0.11
New functionality:
- `TuneHyperparameters`: parallel distributed randomized grid search over SparkML and `TrainClassifier`/`TrainRegressor` parameters. A sample notebook and Python wrappers will be added in the near future.
- Added `PowerBIWriter` for writing and streaming data frames to PowerBI.
- Expanded image reading and writing capabilities, including using images with Spark Structured Streaming. Images can be read from and written to paths specified in a dataframe.
- New functionality for convenient plotting in Python.
- UDF transformer and additional UDFs.
- Expanded pipeline support for arbitrary user code and libraries such as NLTK through `UDFTransformer`.
- Refactored the fuzzing system and added test coverage.
- GPU training supports multiple VMs.
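The randomized grid search performed by `TuneHyperparameters` can be sketched as follows. This is a minimal single-machine illustration of the idea, not the actual API; `param_space`, `evaluate`, and the sequential loop stand in for the distributed search:

```python
import random

def randomized_grid_search(param_space, evaluate, n_samples, seed=0):
    """Sample random points from a discrete parameter grid and keep the best."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_samples):
        # Draw one random combination from the full cartesian grid.
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective whose best point is numLeaves=31, learningRate=0.1.
space = {"numLeaves": [15, 31, 63], "learningRate": [0.01, 0.1, 0.3]}
def score(p):
    return -abs(p["numLeaves"] - 31) - abs(p["learningRate"] - 0.1)

best, _ = randomized_grid_search(space, score, n_samples=20)
```

Randomized search covers a large grid with a fixed evaluation budget; the distributed version evaluates the sampled candidates in parallel across the cluster rather than in a loop.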
Updates:
- Updated to Conda 4.3.31, which comes with Python 3.6.3.
- Also updated SBT and the JVM.
Improvements:
- Additional bugfixes, stability, and notebook improvements.
v0.10
New functionality:
- We now provide initial support for training on a GPU VM, and an ARM template to deploy an HDI cluster with an associated GPU machine. See `docs/gpu-setup.md` for instructions on setting this up.
- New auto-generated R wrappers for estimators and transformers. To import them into R, you can use devtools to import from the uploaded zip file. Tests and sample notebooks to come.
- A new `RenameColumn` transformer for renaming columns within a pipeline.
New notebooks:
- Notebook 104: An experiment demonstrating regression models to predict automobile prices. This notebook demonstrates the use of `Pipeline` stages, `CleanMissingData`, and `ComputePerInstanceStatistics`.
- Notebook 105: Demonstrates `DataConversion` to make some columns categorical.
- There is a 401 notebook in `notebooks/gpu` which demonstrates CNTK training when using a GPU VM. (It is not shown with the rest of the notebooks yet.)
Updates:
- Updated to use CNTK 2.2. Note that this version of CNTK depends on
libpng12 and libjasper1 -- which are included in our docker images.
(This should get resolved in the upcoming CNTK 2.3 release.)
Improvements:
- Local builds will always use a "0.0" version instead of a version based on the git repository. This should simplify the build process for developers and avoid hard-to-resolve update issues.
- The `TextPreprocessor` transformer can be used to find and replace all key-value pairs in an input map.
- Fixed a regression in the image reader where zip files with images no longer displayed the full path to the image inside the zip file.
- Additional minor bug and stability fixes.
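The find-and-replace behavior described for `TextPreprocessor` can be sketched in plain Python. This is an illustration of the idea only; the real transformer operates on DataFrame columns, and the `replace_all` helper below is a hypothetical name:

```python
def replace_all(text: str, mapping: dict) -> str:
    """Replace every occurrence of each key in `mapping` with its value.

    Longer keys are applied first so that overlapping keys behave
    predictably; replaced text is not re-scanned against earlier keys.
    """
    for key in sorted(mapping, key=len, reverse=True):
        text = text.replace(key, mapping[key])
    return text

replace_all("the quick brown fox", {"quick": "slow", "brown": "red"})
```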
v0.9
New functionality:
- Refactored `ImageReader` and `BinaryFileReader` to support streaming images, including a Python API, and improved the performance of the readers. See the 302 notebook for a usage example.
- Added a `ClassBalancer` estimator for improving classification performance on highly imbalanced datasets.
- Created an infrastructure for automated fuzzing, serialization, and Python wrapper tests.
- Added a `DropColumns` pipeline stage.
New notebooks:
- 305: A Flowers sample notebook demonstrating deep transfer learning with `ImageFeaturizer`.
Updates:
- Our main build is now based on Spark 2.2.
Improvements:
- Enabled streaming through the `EnsembleByKey` transformer.
- Fixes for `ImageReader`, an HDFS issue, and other minor problems.
v0.8
New functionality:
- We now upload MMLSpark as an `Azure/mmlspark` Spark package; use `--packages Azure:mmlspark:0.8` with the Spark command-line tools.
- Added a bidirectional LSTM medical entity extractor to the `ModelDownloader`, and a new Jupyter notebook for medical entity extraction using NLTK, PubMed word embeddings, and the Bi-LSTM.
- Added `ImageSetAugmenter` for easy dataset augmentation within image processing pipelines.
Improvements:
- Optimized the performance of `CNTKModel`. It now broadcasts a loaded model to the workers and shares model weights between partitions on the same worker. Minibatch padding (an internal workaround for a CNTK bug) is no longer used, eliminating excess computation when there is a mismatch between the partition size and the minibatch size.
- Bugfix: `CNTKModel` now works with models with unnamed outputs.
Docker image improvements:
- Environment variables are now part of the docker image (in addition to being set in bash).
- New docker images:
  - `microsoft/mmlspark:latest`: plain image, as always
  - `microsoft/mmlspark:gpu`: GPU variant based on an `nvidia/cuda` image
  - `microsoft/mmlspark:plus` and `microsoft/mmlspark:plus-gpu`: these images contain additional packages for internal use; they will probably be based on an older Conda version too in future releases.
Updates:
- The Conda environment now includes NLTK.
- Updated Java and SBT versions.
v0.7
New functionality:
- New transforms: `EnsembleByKey`, `Cacher`, and `Timer`; see the documentation.
Updates:
- Miniconda version 4.3.21, including Python 3.6.
- CNTK version 2.1, using Maven Central.
- Use OpenCV from the OpenPnP project, from Maven Central.
Improvements:
- Spark's `binaryFiles` function had a performance regression in version 2.1 relative to version 2.0; we work around it for now. Data frame operations after a use of `BinaryFileReader` (e.g., reading images) are significantly faster with this.
- The Spark installation is now patched with `hadoop-azure` and `azure-storage`.
- Additional bug fixes and improvements.
v0.6
New functionality:
- Similar to Spark's `StringIndexer`, we have a `ValueIndexer` that can be used for indexing values of any type, not only strings. We also provide the reverse mapping via `IndexToValue`, similar to Spark's `IndexToString` transform.
- A new "clean missing data" estimator, for example:

  ```scala
  val cmd = new CleanMissingData()
    .setInputCols(Array("some-column"))
    .setOutputCols(Array("some-column"))
    .setCleaningMode(CleanMissingData.customOpt)
    .setCustomValue(someCustomValue)
  val cmdModel = cmd.fit(dataset)
  val result = cmdModel.transform(dataset)
  ```

- New default featurization for date and timestamp Spark types and our internal image type. Date columns are converted to the double features: year, day of week, month, day of month. Timestamp columns get the same features as dates, plus: hour of day, minute of hour, second of minute. Image columns use the image data converted to double, with width and height info.
- Starting the docker image without an `ACCEPT_EULA` variable setting used to throw an error. Instead, we now start a tiny web server that shows the EULA and replaces itself with the Jupyter interface when you click the `AGREE` button.
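The default date and timestamp featurization described above can be sketched with Python's `datetime` module. This is only an illustration of the listed feature set, assuming the ordering given in the notes; the exact encoding in MMLSpark (e.g., which day of the week is 0) may differ:

```python
from datetime import date, datetime

def featurize_date(d: date) -> list:
    """Date columns become doubles: year, day of week, month, day of month."""
    # weekday() numbers Monday as 0; the real encoding may use another convention.
    return [float(d.year), float(d.weekday()), float(d.month), float(d.day)]

def featurize_timestamp(t: datetime) -> list:
    """Timestamps add: hour of day, minute of hour, second of minute."""
    return featurize_date(t.date()) + [float(t.hour), float(t.minute), float(t.second)]

featurize_date(date(2017, 5, 3))
featurize_timestamp(datetime(2017, 5, 3, 14, 30, 15))
```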
Breaking changes:
- Renamed `ImageTransform` to `ImageTransformer`.
Notable bug fixes and other changes:
- Improved sample notebooks, and a new one: "303 - Transfer Learning by DNN Featurization - Airplane or Automobile".
- Fixed serialization bugs in generated Python `PipelineStage`s.
Acknowledgments
Thanks to Ali Zaidi for some notebook beautifications.
v0.5
Initial release.