## v0.13

New functionality:
* Export trained LightGBM models for evaluation outside of Spark (see the sketch at the end of these notes).
* LightGBM on Spark supports multiple cores per executor.
* `CNTKModel` works with multi-input multi-output models of any CNTK datatype.
* Added Minibatching and Flattening transformers for adding flexible batching logic to pipelines, deep networks, and web clients.
* Added a `Benchmark` test API for tracking model performance across versions.
* Added a `PartitionConsolidator` function for aggregating streaming data onto one partition per executor (for use with connection/rate-limited HTTP services).

Updates and improvements:
* Updated to Spark 2.3.0.
* Added Databricks notebook tests to the build system.
* `CNTKModel` uses significantly less memory.
* Simplified example notebooks.
* Simplified APIs for MMLSpark Serving.
* Simplified APIs for CNTK on Spark.
* LightGBM stability improvements.
* `ComputeModelStatistics` stability improvements.

Acknowledgements: We would like to acknowledge the external contributors who helped create this version of MMLSpark (in order of commit history): 严伟, @terrytangyuan, @ywskycn, @dvanasseldonk, Jilong Liao, @chappers, @ekaterina-sereda-rf.
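A minimal sketch of the new model-export path. The column names are illustrative, and the `saveNativeModel` method on the fitted model is an assumption that may differ slightly in this release:

```python
from mmlspark import LightGBMClassifier

# Train a distributed LightGBM classifier; column names are illustrative.
lgbm = (LightGBMClassifier()
        .setLabelCol("label")
        .setFeaturesCol("features")
        .setNumIterations(100))
model = lgbm.fit(train_df)  # train_df: DataFrame with label/features columns

# Export the trained booster as a native LightGBM model file, which can
# then be evaluated outside of Spark.
model.saveNativeModel("/tmp/lgbm_native_model.txt")
```

Outside of Spark, a file in LightGBM's native format can be loaded for single-node scoring, for example with `lightgbm.Booster(model_file=...)` from the `lightgbm` Python package.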
## v0.12

New functionality:
* MMLSpark Serving: a RESTful computation engine built on Spark Streaming. See `docs/mmlspark-serving.md` for details.
* New LightGBM binary classification and regression learners and infrastructure, with a Python notebook for examples.
* MMLSpark Clients: a general-purpose, distributed, and fault-tolerant HTTP library usable from Spark, PySpark, and SparklyR. See `docs/http.md`.
* Added `MinibatchTransformer` and `FlattenBatch` to enable efficient, buffered, minibatch processing in Spark.
* Added Python wrappers and a notebook example for the `TuneHyperparameters` module, demonstrating parallel distributed hyperparameter tuning through randomized grid search (see the sketch after these notes).
* Added a `MultiNGram` transformer for efficiently computing variable-length n-grams.
* Added a DataType parameter for building models that are parameterized by Spark data types.

Updates:
* Updated the per-instance statistics module so it works with any SparkML estimator.
* Updated CNTK to version 2.4.
* Updated Spark to version 2.2.1 (the following release is likely to be based on Spark 2.3).
* Also updated SBT and the JVM.
* Refactored the readers directory into the `io` directory.

Improvements:
* Fixed the Conda installation in our Docker image, resolving issues with importing `numpy`.
* Fixed a regression in the R wrappers with the latest SparklyR version.
* Additional bugfixes, stability, and notebook improvements.
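A minimal sketch of the `TuneHyperparameters` Python API, patterned on its example notebook; the helper names (`HyperparamBuilder`, `RangeHyperParam`, `RandomSpace`) and the exact parameter list are assumptions that may vary by version:

```python
from pyspark.ml.classification import LogisticRegression
from mmlspark import (TrainClassifier, TuneHyperparameters,
                      HyperparamBuilder, RangeHyperParam, RandomSpace)

# Wrap a SparkML learner so TuneHyperparameters can drive it.
logReg = LogisticRegression()
models = [TrainClassifier(model=logReg, labelCol="label")]

# Define a random search space over the learner's hyperparameters.
space = RandomSpace(HyperparamBuilder()
                    .addHyperparam(logReg, logReg.regParam,
                                   RangeHyperParam(0.01, 0.3))
                    .build())

# Run cross-validated random search in parallel and keep the best model.
bestModel = TuneHyperparameters(
    evaluationMetric="accuracy", models=models,
    numFolds=3, numRuns=6, parallelism=2,
    paramSpace=space.space(), seed=0).fit(train_df)
```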
## v0.11

New functionality:
* `TuneHyperparameters`: parallel distributed randomized grid search for SparkML and `TrainClassifier`/`TrainRegressor` parameters. A sample notebook and Python wrappers will be added in the near future.
* Added `PowerBIWriter` for writing and streaming data frames to [PowerBI](http://powerbi.microsoft.com/).
* Expanded image reading and writing capabilities, including using images with Spark Structured Streaming. Images can be read from and written to paths specified in a dataframe.
* New functionality for convenient plotting in Python.
* Added a UDF transformer and additional UDFs.
* Expanded pipeline support for arbitrary user code and libraries such as NLTK through `UDFTransformer` (see the sketch after these notes).
* Refactored the fuzzing system and added test coverage.
* GPU training supports multiple VMs.

Updates:
* Updated to Conda 4.3.31, which comes with Python 3.6.3.
* Also updated SBT and the JVM.

Improvements:
* Additional bugfixes, stability, and notebook improvements.
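A minimal sketch of lifting arbitrary Python code into a pipeline stage with `UDFTransformer`; the column names and the normalization function are illustrative, and the `setUDF` setter is an assumption:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from mmlspark import UDFTransformer

# Any Python function (e.g. one calling into NLTK) can be wrapped as a UDF.
normalize = udf(lambda s: s.lower().strip() if s is not None else None,
                StringType())

stage = (UDFTransformer()
         .setInputCol("text")
         .setOutputCol("textNormalized")
         .setUDF(normalize))

cleaned = stage.transform(df)  # df: a DataFrame with a "text" column
```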
## v0.10

New functionality:
* We now provide initial support for training on a GPU VM, and an ARM template to deploy an HDI cluster with an associated GPU machine. See `docs/gpu-setup.md` for instructions on setting this up.
* New auto-generated R wrappers for estimators and transformers. To import them into R, you can use devtools to import from the uploaded zip file. Tests and sample notebooks to come.
* A new `RenameColumn` transformer for renaming columns within a pipeline (see the sketch after these notes).

New notebooks:
* Notebook 104: an experiment demonstrating regression models to predict automobile prices. This notebook demonstrates the use of `Pipeline` stages, `CleanMissingData`, and `ComputePerInstanceStatistics`.
* Notebook 105: demonstrates `DataConversion` to make some columns categorical.
* There is a 401 notebook in `notebooks/gpu` which demonstrates CNTK training when using a GPU VM. (It is not shown with the rest of the notebooks yet.)

Updates:
* Updated to use CNTK 2.2. Note that this version of CNTK depends on libpng12 and libjasper1, which are included in our Docker images. (This should get resolved in the upcoming CNTK 2.3 release.)

Improvements:
* Local builds will always use a "0.0" version instead of a version based on the git repository. This should simplify the build process for developers and avoid hard-to-resolve update issues.
* The `TextPreprocessor` transformer can be used to find and replace all key-value pairs in an input map.
* Fixed a regression in the image reader where zip files with images no longer displayed the full path to the image inside a zip file.
* Additional minor bug and stability fixes.
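A minimal sketch of `RenameColumn` used as a pipeline stage (column names are illustrative):

```python
from pyspark.ml import Pipeline
from mmlspark import RenameColumn

# Rename a column as a pipeline stage rather than with ad-hoc DataFrame
# code, so the renaming is serialized and replayed with the pipeline.
rename = RenameColumn().setInputCol("price_raw").setOutputCol("price")
renamed_df = Pipeline(stages=[rename]).fit(df).transform(df)
```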
## v0.9

New functionality:
* Refactored `ImageReader` and `BinaryFileReader` to support streaming images, including a Python API, and improved the readers' performance. See the 302 notebook for a usage example.
* Added a `ClassBalancer` estimator for improving classification performance on highly imbalanced datasets (see the sketch after these notes).
* Created an infrastructure for automated fuzzing, serialization, and Python wrapper tests.
* Added a `DropColumns` pipeline stage.

New notebooks:
* 305: a Flowers sample notebook demonstrating deep transfer learning with `ImageFeaturizer`.

Updates:
* Our main build is now based on Spark 2.2.

Improvements:
* Enabled streaming through the `EnsembleByKey` transformer.
* Fixes for `ImageReader`, an HDFS issue, and other minor problems.
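A minimal sketch of `ClassBalancer`, assuming it fits per-class weights from the label column and appends them to each row (the appended column name is an assumption):

```python
from mmlspark import ClassBalancer

# Fit per-class weights inversely proportional to class frequency,
# then append a weight to each row of the training data.
balancer = ClassBalancer().setInputCol("label")
weighted_df = balancer.fit(train_df).transform(train_df)

# The appended weight column can be fed to weight-aware learners,
# e.g. pyspark.ml's LogisticRegression(weightCol="weight").
```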
## v0.8

New functionality:
* We are now uploading MMLSpark as an "Azure/mmlspark" Spark package. Use `--packages Azure:mmlspark:0.8` with the Spark command-line tools.
* Added a bi-directional LSTM medical entity extractor to the `ModelDownloader`, and a new Jupyter notebook for medical entity extraction using NLTK, PubMed word embeddings, and the Bi-LSTM (see the sketch after these notes).
* Added `ImageSetAugmenter` for easy dataset augmentation within image processing pipelines.

Improvements:
* Optimized the performance of `CNTKModel`. It now broadcasts a loaded model to workers and shares model weights between partitions on the same worker. Minibatch padding (an internal workaround for a CNTK bug) is no longer used, eliminating excess computation when there is a mismatch between the partition size and minibatch size.
* Bugfix: `CNTKModel` can work with models with unnamed outputs.

Docker image improvements:
* Environment variables are now part of the Docker image (in addition to being set in bash).
* New Docker images:
  - `microsoft/mmlspark:latest`: plain image, as always.
  - `microsoft/mmlspark:gpu`: GPU variant based on an `nvidia/cuda` image.
  - `microsoft/mmlspark:plus` and `microsoft/mmlspark:plus-gpu`: these images contain additional packages for internal use; they will probably be based on an older Conda version too in future releases.

Updates:
* The Conda environment now includes NLTK.
* Updated Java and SBT versions.
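A minimal sketch of fetching the new Bi-LSTM model with `ModelDownloader`; the model name string, the target URI, and the `uri` field on the returned schema are assumptions:

```python
from mmlspark import ModelDownloader

# Download a pretrained model to local (or cluster) storage; the
# returned schema describes where the downloaded model lives.
downloader = ModelDownloader(spark, "file:///tmp/models/")
model_schema = downloader.downloadByName("BiLSTM")  # name is an assumption
print(model_schema.uri)  # location usable by downstream stages
```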