Tags: KaylaTek/mmlspark

v0.13

New Functionality:

* Export trained LightGBM models for evaluation outside of Spark.

* LightGBM on Spark supports multiple cores per executor

* `CNTKModel` works with multi-input multi-output models of any CNTK datatype

* Added Minibatching and Flattening transformers for adding flexible batching logic to pipelines, deep networks, and web clients.

* Added `Benchmark` test API for tracking model performance across versions

* Added `PartitionConsolidator` function for aggregating streaming data onto one partition per executor (for use with connection/rate-limited HTTP services)

Updates and Improvements:

* Updated to Spark 2.3.0

* Added Databricks notebook tests to build system

* `CNTKModel` uses significantly less memory

* Simplified example notebooks

* Simplified APIs for MMLSpark Serving

* Simplified APIs for CNTK on Spark

* LightGBM stability improvements

* `ComputeModelStatistics` stability improvements

Acknowledgements:

We would like to acknowledge the external contributors who helped create this version of MMLSpark (in order of commit history):

* 严伟, @terrytangyuan, @ywskycn, @dvanasseldonk, Jilong Liao, @chappers, @ekaterina-sereda-rf

v0.12

New functionality:

* MMLSpark Serving: a RESTful computation engine built on Spark
  streaming.  See `docs/mmlspark-serving.md` for details.

* New LightGBM Binary Classification and Regression learners and
  infrastructure with a Python notebook for examples.

* MMLSpark Clients: a general-purpose, distributed, and fault-tolerant
  HTTP library usable from Spark, PySpark, and SparklyR.  See
  `docs/http.md`.

* Add `MinibatchTransformer` and `FlattenBatch` to enable efficient,
  buffered, minibatch processing in Spark.

* Added Python wrappers and a notebook example for the
  `TuneHyperparameters` module, demonstrating parallel distributed
  hyperparameter tuning through randomized grid search.

* Add a `MultiNGram` transformer for efficiently computing variable
  length n-grams.

* Added DataType parameter for building models that are parameterized by
  Spark data types.
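
The minibatching and flattening idea behind `MinibatchTransformer` and `FlattenBatch` can be illustrated in plain Python. This is a minimal sketch of the concept only, not the MMLSpark API: the helper names `minibatch`, `flatten`, and `score_batch` are hypothetical, and the real transformers operate on Spark data frames.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def minibatch(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group consecutive rows into lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def flatten(batches: Iterable[List[T]]) -> Iterator[T]:
    """Undo minibatching: emit the rows of each batch one at a time."""
    for batch in batches:
        yield from batch


def score_batch(batch: List[float]) -> List[float]:
    # Hypothetical stand-in for work that is cheaper per batch than per row,
    # such as a model evaluation or a single HTTP request.
    return [x * 2 for x in batch]


rows = [1.0, 2.0, 3.0, 4.0, 5.0]
batches = list(minibatch(rows, 2))
scored = list(flatten(score_batch(b) for b in batches))
```

Batching amortizes per-call overhead, and flattening restores the one-row-per-record shape that downstream stages expect.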

Updates:

* Update per-instance statistics module so it works for any Spark ML
  estimators.

* Update CNTK to version 2.4.

* Updated Spark to version v2.2.1 (the following release is likely to be
  based on Spark 2.3).

* Also updated SBT and JVM.

* Refactored readers directory into `io` directory

Improvements:

* Fix the Conda installation in our Docker image, resolving issues with
  importing `numpy`.

* Fix a regression in R wrappers with the latest SparklyR version.

* Additional bugfixes, stability, and notebook improvements.

v0.11

New functionality:

* `TuneHyperparameters`: parallel distributed randomized grid search for
  SparkML and `TrainClassifier`/`TrainRegressor` parameters.  A sample
  notebook and Python wrappers will be added in the near future.

* Added `PowerBIWriter` for writing and streaming data frames to
  [PowerBI](http://powerbi.microsoft.com/).

* Expanded image reading and writing capabilities, including using
  images with Spark Structured Streaming.  Images can be read from and
  written to paths specified in a dataframe.

* New functionality for convenient plotting in Python.

* UDF transformer and additional UDFs.

* Expanded pipeline support for arbitrary user code and libraries such
  as NLTK through UDFTransformer.

* Refactored fuzzing system and added test coverage.

* GPU training supports multiple VMs.
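
As a rough illustration of the randomized grid search that `TuneHyperparameters` performs, the sketch below samples parameter combinations at random from a grid and keeps the best one. It is a sequential, plain-Python approximation under stated assumptions: `random_search`, `toy_score`, and the parameter names are hypothetical, and the parallel, distributed evaluation on Spark is not shown.

```python
import random
from typing import Callable, Dict, Sequence, Tuple


def random_search(
    space: Dict[str, Sequence],
    evaluate: Callable[[Dict], float],
    num_runs: int,
    seed: int = 0,
) -> Tuple[Dict, float]:
    """Sample num_runs random points from the grid; return the best one."""
    if num_runs < 1:
        raise ValueError("num_runs must be positive")
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(num_runs):
        # Pick one value per parameter, independently and uniformly.
        params = {name: rng.choice(list(values)) for name, values in space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params: Dict) -> float:
    # Hypothetical objective standing in for a cross-validated metric.
    return -((params["learning_rate"] - 0.1) ** 2) - (params["max_depth"] - 5) ** 2


space = {"learning_rate": [0.01, 0.05, 0.1, 0.5], "max_depth": [3, 5, 7, 9]}
best, best_score = random_search(space, toy_score, num_runs=20)
```

Because each sampled point is evaluated independently, the loop body is the part that a distributed implementation can run in parallel.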

Updates:

* Updated to Conda 4.3.31, which comes with Python 3.6.3.

* Also updated SBT and JVM.

Improvements:

* Additional bugfixes, stability, and notebook improvements.

v0.10.9

Same as v0.11, but using an older Spark v2.1.0 installation.

v0.10

New functionality:

* We now provide initial support for training on a GPU VM, and an ARM
  template to deploy an HDI Cluster with an associated GPU machine.  See
  `docs/gpu-setup.md` for instructions on setting this up.

* New auto-generated R wrappers for estimators and transformers.  To
  import them into R, you can use devtools to import from the uploaded
  zip file.  Tests and sample notebooks to come.

* A new `RenameColumn` transformer for renaming columns within a
  pipeline.

New notebooks:

* Notebook 104: An experiment to demonstrate regression models to
  predict automobile prices.  This notebook demonstrates the use of
  `Pipeline` stages, `CleanMissingData`, and
  `ComputePerInstanceStatistics`.

* Notebook 105: Demonstrates `DataConversion` to make some columns Categorical.

* There is a 401 notebook in `notebooks/gpu` which demonstrates CNTK
  training when using a GPU VM.  (It is not shown with the rest of the
  notebooks yet.)

Updates:

* Updated to use CNTK 2.2.  Note that this version of CNTK depends on
  libpng12 and libjasper1 -- which are included in our docker images.
  (This should get resolved in the upcoming CNTK 2.3 release.)

Improvements:

* Local builds will always use a "0.0" version instead of a version
  based on the git repository.  This should simplify the build process
  for developers and avoid hard-to-resolve update issues.

* The `TextPreprocessor` transformer can be used to find and replace all
  key value pairs in an input map.

* Fixed a regression in the image reader where zip files with images no
  longer displayed the full path to the image inside a zip file.

* Additional minor bug and stability fixes.

v0.9.9

Same as v0.10, but using an older Conda installation with Python 3.5.2.

v0.9

New functionality:

* Refactor `ImageReader` and `BinaryFileReader` to support streaming
  images, including a Python API.  Also improved performance of the
  readers.  Check the 302 notebook for a usage example.

* Add `ClassBalancer` estimator for improving classification performance
  on highly imbalanced datasets.

* Create an infrastructure for automated fuzzing, serialization, and
  Python wrapper tests.

* Added a `DropColumns` pipeline stage.
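
To make the class-balancing idea concrete, the sketch below computes per-class weights that give every class the same total weight. The notes above do not say how `ClassBalancer` derives its weights, so the formula here (largest class count divided by each class count) is an assumption, and `balance_weights` is a hypothetical helper rather than part of the MMLSpark API.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balance_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by (largest class count) / (its own count)."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest / n for label, n in counts.items()}


# Imbalanced toy data: eight negatives, two positives.
labels = [0] * 8 + [1] * 2
weights = balance_weights(labels)

# One weight per row, usable as a sample-weight column by a learner.
row_weights = [weights[y] for y in labels]
```

With these weights, rare classes contribute as much to a weighted loss as the most common class.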

New notebooks:

* 305: A Flowers sample notebook demonstrating deep transfer learning
  with `ImageFeaturizer`.

Updates:

* Our main build is now based on Spark 2.2.

Improvements:

* Enable streaming through the `EnsembleByKey` transformer.

* Miscellaneous fixes, including `ImageReader` and HDFS issues.

v0.8.9

Same as v0.9, but using an older Conda installation with Python 3.5.2.

v0.8

New functionality:

* We are now uploading MMLSpark as an "Azure/mmlspark" Spark package.
  Use `--packages Azure:mmlspark:0.8` with the Spark command-line tools.

* Add a bi-directional LSTM medical entity extractor to the
  `ModelDownloader`, and a new Jupyter notebook for medical entity
  extraction using NLTK, PubMed word embeddings, and the Bi-LSTM.

* Add `ImageSetAugmenter` for easy dataset augmentation within image
  processing pipelines.

Improvements:

* Optimize the performance of `CNTKModel`.  It now broadcasts a loaded
  model to workers and shares model weights between partitions on the
  same worker.  Minibatch padding (an internal workaround for a CNTK bug)
  is no longer used, eliminating excess computation when there is a
  mismatch between the partition size and minibatch size.

* Bugfix: `CNTKModel` now works with models that have unnamed outputs.

Docker image improvements:

* Environment variables are now part of the docker image (in addition to
  being set in bash).

* New docker images:
  - `microsoft/mmlspark:latest`: plain image, as always.
  - `microsoft/mmlspark:gpu`: GPU variant based on an `nvidia/cuda` image.
  - `microsoft/mmlspark:plus` and `microsoft/mmlspark:plus-gpu`: these
    images contain additional packages for internal use; in future
    releases they will probably also be based on an older Conda version.

Updates:

* The Conda environment now includes NLTK.

* Updated Java and SBT versions.

v0.7.91

Same as v0.8, but using an older Conda installation with Python 3.5.2.