Releases: microsoft/SynapseML
v0.13
New Functionality:
- Export trained LightGBM models for evaluation outside of Spark
- LightGBM on Spark supports multiple cores per executor
- `CNTKModel` works with multi-input multi-output models of any CNTK datatype
- Added Minibatching and Flattening transformers for adding flexible batching logic to pipelines, deep networks, and web clients
- Added a `Benchmark` test API for tracking model performance across versions
- Added a `PartitionConsolidator` function for aggregating streaming data onto one partition per executor (for use with connection/rate-limited HTTP services)
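The batching idea behind the new minibatching and flattening transformers can be sketched in plain Python. This is only an illustration of the concept, not the MMLSpark API; the `minibatch` and `flatten` helper names below are hypothetical:

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def minibatch(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group an incoming stream of rows into fixed-size batches."""
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch

def flatten(batches: Iterable[List[T]]) -> Iterator[T]:
    """Undo the batching, restoring one element per row."""
    for batch in batches:
        yield from batch

batches = list(minibatch(range(7), 3))
restored = list(flatten(batches))
```

In a pipeline, a batching stage like this lets a deep network or web client process many rows per call, while a flattening stage restores the original row-per-element shape afterwards.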
Updates and Improvements:
- Updated to Spark 2.3.0
- Added Databricks notebook tests to the build system
- `CNTKModel` uses significantly less memory
- Simplified example notebooks
- Simplified APIs for MMLSpark Serving
- Simplified APIs for CNTK on Spark
- LightGBM stability improvements
- `ComputeModelStatistics` stability improvements
Acknowledgements:
We would like to acknowledge the external contributors who helped create this version of MMLSpark (in order of commit history):
- 严伟, @terrytangyuan, @ywskycn, @dvanasseldonk, Jilong Liao, @chappers, @ekaterina-sereda-rf
v0.11
New functionality:
- `TuneHyperparameters`: parallel distributed randomized grid search over SparkML and `TrainClassifier`/`TrainRegressor` parameters. A sample notebook and Python wrappers will be added in the near future.
- Added `PowerBIWriter` for writing and streaming data frames to PowerBI.
- Expanded image reading and writing capabilities, including using images with Spark Structured Streaming. Images can be read from and written to paths specified in a dataframe.
- New functionality for convenient plotting in Python.
- UDF transformer and additional UDFs.
- Expanded pipeline support for arbitrary user code and libraries such as NLTK through `UDFTransformer`.
- Refactored the fuzzing system and added test coverage.
- GPU training supports multiple VMs.
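The randomized grid search performed by `TuneHyperparameters` can be sketched as follows. This is a minimal single-machine illustration of the idea, not the actual API; `param_space`, `evaluate`, and the sequential loop stand in for the distributed search:

```python
import random

def randomized_grid_search(param_space, evaluate, n_samples, seed=0):
    """Sample random points from a discrete parameter grid and keep the best."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_samples):
        # Draw one random combination from the full cartesian grid.
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective whose best point is numLeaves=31, learningRate=0.1.
space = {"numLeaves": [15, 31, 63], "learningRate": [0.01, 0.1, 0.3]}
def score(p):
    return -abs(p["numLeaves"] - 31) - abs(p["learningRate"] - 0.1)

best, _ = randomized_grid_search(space, score, n_samples=20)
```

Randomized search covers a large grid with a fixed evaluation budget; the distributed version evaluates the sampled candidates in parallel across the cluster rather than in a loop.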
Updates:
- Updated to Conda 4.3.31, which comes with Python 3.6.3.
- Also updated SBT and the JVM.
Improvements:
- Additional bugfixes, stability, and notebook improvements.
v0.10
New functionality:
- We now provide initial support for training on a GPU VM, and an ARM template to deploy an HDI cluster with an associated GPU machine. See `docs/gpu-setup.md` for instructions on setting this up.
- New auto-generated R wrappers for estimators and transformers. To import them into R, you can use devtools to import from the uploaded zip file. Tests and sample notebooks to come.
- A new `RenameColumn` transformer for renaming columns within a pipeline.
New notebooks:
- Notebook 104: An experiment demonstrating regression models to predict automobile prices. This notebook demonstrates the use of `Pipeline` stages, `CleanMissingData`, and `ComputePerInstanceStatistics`.
- Notebook 105: Demonstrates `DataConversion` to make some columns categorical.
- There is a 401 notebook in `notebooks/gpu` which demonstrates CNTK training when using a GPU VM. (It is not shown with the rest of the notebooks yet.)
Updates:
- Updated to use CNTK 2.2. Note that this version of CNTK depends on
libpng12 and libjasper1 -- which are included in our docker images.
(This should get resolved in the upcoming CNTK 2.3 release.)
Improvements:
- Local builds will always use a "0.0" version instead of a version based on the git repository. This should simplify the build process for developers and avoid hard-to-resolve update issues.
- The `TextPreprocessor` transformer can be used to find and replace all key-value pairs in an input map.
- Fixed a regression in the image reader where zip files with images no longer displayed the full path to the image inside the zip file.
- Additional minor bug and stability fixes.
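The find-and-replace behavior described for `TextPreprocessor` can be sketched in plain Python. This is an illustration of the idea only; the real transformer operates on DataFrame columns, and the `replace_all` helper below is a hypothetical name:

```python
def replace_all(text: str, mapping: dict) -> str:
    """Replace every occurrence of each key in `mapping` with its value.

    Longer keys are applied first so that overlapping keys behave
    predictably; replaced text is not re-scanned against earlier keys.
    """
    for key in sorted(mapping, key=len, reverse=True):
        text = text.replace(key, mapping[key])
    return text

replace_all("the quick brown fox", {"quick": "slow", "brown": "red"})
```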
v0.9
New functionality:
- Refactored `ImageReader` and `BinaryFileReader` to support streaming images, including a Python API, and improved the performance of the readers. See the 302 notebook for a usage example.
- Added a `ClassBalancer` estimator for improving classification performance on highly imbalanced datasets.
- Created an infrastructure for automated fuzzing, serialization, and Python wrapper tests.
- Added a `DropColumns` pipeline stage.
New notebooks:
- 305: A Flowers sample notebook demonstrating deep transfer learning with `ImageFeaturizer`.
Updates:
- Our main build is now based on Spark 2.2.
Improvements:
- Enabled streaming through the `EnsembleByKey` transformer.
- Fixes for `ImageReader`, an HDFS issue, and other minor problems.
v0.8
New functionality:
- We now upload MMLSpark as an `Azure/mmlspark` Spark package; use `--packages Azure:mmlspark:0.8` with the Spark command-line tools.
- Added a bidirectional LSTM medical entity extractor to the `ModelDownloader`, and a new Jupyter notebook for medical entity extraction using NLTK, PubMed word embeddings, and the Bi-LSTM.
- Added `ImageSetAugmenter` for easy dataset augmentation within image processing pipelines.
Improvements:
- Optimized the performance of `CNTKModel`. It now broadcasts a loaded model to the workers and shares model weights between partitions on the same worker. Minibatch padding (an internal workaround for a CNTK bug) is no longer used, eliminating excess computation when there is a mismatch between the partition size and the minibatch size.
- Bugfix: `CNTKModel` now works with models with unnamed outputs.
Docker image improvements:
- Environment variables are now part of the docker image (in addition to being set in bash).
- New docker images:
  - `microsoft/mmlspark:latest`: plain image, as always
  - `microsoft/mmlspark:gpu`: GPU variant based on an `nvidia/cuda` image
  - `microsoft/mmlspark:plus` and `microsoft/mmlspark:plus-gpu`: these images contain additional packages for internal use; they will probably be based on an older Conda version too in future releases.
Updates:
- The Conda environment now includes NLTK.
- Updated Java and SBT versions.
v0.7
New functionality:
- New transforms: `EnsembleByKey`, `Cacher`, and `Timer`; see the documentation.
Updates:
- Miniconda version 4.3.21, including Python 3.6.
- CNTK version 2.1, using Maven Central.
- Use OpenCV from the OpenPnP project, from Maven Central.
Improvements:
- Spark's `binaryFiles` function had a performance regression in version 2.1 relative to version 2.0; we work around it for now. Data frame operations after a use of `BinaryFileReader` (e.g., reading images) are significantly faster with this.
- The Spark installation is now patched with `hadoop-azure` and `azure-storage`.
- Additional bug fixes and improvements.
v0.6
New functionality:
- Similar to Spark's `StringIndexer`, we have a `ValueIndexer` that can be used for indexing values of any type, not only strings. We also provide the reverse mapping via `IndexToValue`, similar to Spark's `IndexToString` transform.
- A new "clean missing data" estimator, for example:

  ```scala
  val cmd = new CleanMissingData()
    .setInputCols(Array("some-column"))
    .setOutputCols(Array("some-column"))
    .setCleaningMode(CleanMissingData.customOpt)
    .setCustomValue(someCustomValue)
  val cmdModel = cmd.fit(dataset)
  val result = cmdModel.transform(dataset)
  ```

- New default featurization for date and timestamp Spark types and our internal image type. Date columns are converted to the double features: year, day of week, month, day of month. Timestamp columns get the same features as dates, plus: hour of day, minute of hour, second of minute. Image columns use the image data converted to double, with width and height info.
- Starting the docker image without an `ACCEPT_EULA` variable setting used to throw an error. Instead, we now start a tiny web server that shows the EULA and replaces itself with the Jupyter interface when you click the `AGREE` button.
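The default date and timestamp featurization described above can be sketched with Python's `datetime` module. This is only an illustration of the listed feature set, assuming the ordering given in the notes; the exact encoding in MMLSpark (e.g., which day of the week is 0) may differ:

```python
from datetime import date, datetime

def featurize_date(d: date) -> list:
    """Date columns become doubles: year, day of week, month, day of month."""
    # weekday() numbers Monday as 0; the real encoding may use another convention.
    return [float(d.year), float(d.weekday()), float(d.month), float(d.day)]

def featurize_timestamp(t: datetime) -> list:
    """Timestamps add: hour of day, minute of hour, second of minute."""
    return featurize_date(t.date()) + [float(t.hour), float(t.minute), float(t.second)]

featurize_date(date(2017, 5, 3))
featurize_timestamp(datetime(2017, 5, 3, 14, 30, 15))
```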
Breaking changes:
- Renamed `ImageTransform` to `ImageTransformer`.
Notable bug fixes and other changes:
- Improved sample notebooks, and a new one: "303 - Transfer Learning by DNN Featurization - Airplane or Automobile".
- Fixed serialization bugs in generated Python `PipelineStage`s.
Acknowledgments
Thanks to Ali Zaidi for some notebook beautifications.
v0.5
Initial release.