Merge pull request #1817 from shauheen/release/v0.8RC2

Cherry-pick for release 0.8
dotnet · Dec 4, 2018 · 51ea627 · 51ea627
2 parents 2e8c723 + 7e3edd5
commit 51ea627
Showing 19 changed files with 778 additions and 101 deletions.
diff --git a/docs/release-notes/0.8/dataPreview.gif b/docs/release-notes/0.8/dataPreview.gif
diff --git a/docs/release-notes/0.8/release-0.8.md b/docs/release-notes/0.8/release-0.8.md
@@ -0,0 +1,126 @@
+# ML.NET 0.8 Release Notes
+
+Today we are excited to release ML.NET 0.8 and we can finally explain why it
+is the best version so far! This release adds model explainability to show
+which features (inputs) are most important, improved debuggability,
+easier-to-use time series predictions, several API improvements, a new
+recommendation use case, and more.
+
+### Installation
+
+ML.NET supports Windows, macOS, and Linux. See [supported OS versions of .NET
+Core
+2.0](https://github.com/dotnet/core/blob/master/release-notes/2.0/2.0-supported-os.md)
+for more details.
+
+You can install the ML.NET NuGet package from the CLI using:
+```
+dotnet add package Microsoft.ML
+```
+
+From package manager:
+```
+Install-Package Microsoft.ML
+```
+
+### Release Notes
+
+Below are some of the highlights from this release.
+
+* Added first steps towards model explainability
+  ([#1735](https://github.com/dotnet/machinelearning/pull/1735),
+  [#1692](https://github.com/dotnet/machinelearning/pull/1692))
+
+    * Enabled explainability in the form of overall feature importance and
+      generalized additive models. 
+    * Overall feature importance gives a sense of which features matter
+      most to the model. For example, when predicting the sentiment
+      of a tweet, the presence of "amazing" might be more important than
+      whether the tweet contains "bird". This is enabled through Permutation
+      Feature Importance. Example usage can be found
+      [here](https://github.com/dotnet/machinelearning/blob/3d33e20f33da70cdd3da2ad9e0b2b03df929bef4/docs/samples/Microsoft.ML.Samples/Dynamic/PermutationFeatureImportance.cs).
+    * Generalized Additive Models have very explainable predictions. They are
+      similar to linear models in terms of ease of understanding but are more
+      flexible and can have better performance. Example usage can be found
+      [here](https://github.com/dotnet/machinelearning/blob/3d33e20f33da70cdd3da2ad9e0b2b03df929bef4/docs/samples/Microsoft.ML.Samples/Dynamic/GeneralizedAdditiveModels.cs).
+
+* Improved debuggability by previewing IDataViews
+  ([#1518](https://github.com/dotnet/machinelearning/pull/1518))
+
+    * It is often useful to peek at the data that is read into an ML.NET
+      pipeline and even look at it after some intermediate steps to ensure the
+      data is transformed as expected. 
+    * You can now preview an IDataView by going to the Watch window in the VS
+      debugger, entering the name of the variable you want to preview, and
+      calling its `Preview()` method. 
+
+    ![](dataPreview.gif)
+
+* Enabled a stateful prediction engine for time series problems
+  ([#1727](https://github.com/dotnet/machinelearning/pull/1727))
+
+    * [ML.NET
+      0.7](https://github.com/dotnet/machinelearning/blob/483ec04a11fbdc056a88bc581d7e5cee9092a936/docs/release-notes/0.7/release-0.7.md)
+      enabled anomaly detection scenarios. However, the prediction engine was
+      stateless, which means that every time you want to figure out whether
+      the latest data point is anomalous, you need to provide historical data
+      as well. This is unnatural.
+    * The prediction engine can now keep the state of the time series data
+      seen so far, so you can get predictions by providing only the latest data
+      point. This is enabled by using `CreateTimeSeriesPredictionFunction`
+      instead of `MakePredictionFunction`. Example usage can be found
+      [here](https://github.com/dotnet/machinelearning/blob/3d33e20f33da70cdd3da2ad9e0b2b03df929bef4/test/Microsoft.ML.TimeSeries.Tests/TimeSeriesDirectApi.cs#L141).
+      You'll need to add the Microsoft.ML.TimeSeries NuGet to your project.
+
+* Improved support for recommendation scenarios with implicit feedback
+  ([#1664](https://github.com/dotnet/machinelearning/pull/1664))  
+
+    * [ML.NET
+      0.7](https://github.com/dotnet/machinelearning/blob/483ec04a11fbdc056a88bc581d7e5cee9092a936/docs/release-notes/0.7/release-0.7.md)
+      included Matrix Factorization which enables using ratings provided by
+      users to recommend other items they might like.
+    * In some cases, you don't have specific ratings from users but only
+      implicit feedback (e.g. they watched the movie but didn't rate it).
+    * Matrix Factorization in ML.NET can now use this type of implicit data to
+      train models for recommendation scenarios. 
+    * Example usage can be found
+      [here](https://github.com/dotnet/machinelearning/blob/71d58fa83f77abb630d815e5cf8aa9dd3390aa65/test/Microsoft.ML.Tests/TrainerEstimators/MatrixFactorizationTests.cs#L335).
+      You'll need to add the Microsoft.ML.MatrixFactorization NuGet to your
+      project.
+
+* Enabled saving and loading data as a binary file (IDataView/IDV)
+  ([#1678](https://github.com/dotnet/machinelearning/pull/1678))
+
+    * It is sometimes useful to save data after it has been transformed. For
+      example, you might have featurized all the text into sparse vectors and
+      want to perform repeated experimentation with different trainers without
+      continuously repeating the data transformation.
+    * Saving and loading files in ML.NET's binary format can improve efficiency
+      because the format is compressed and already schematized.
+    * Reading a binary data file can be done using
+      `mlContext.Data.ReadFromBinary("pathToFile")` and writing a binary data
+      file can be done using `mlContext.Data.SaveAsBinary("pathToFile")`.
+
+* Added filtering and caching APIs
+  ([#1569](https://github.com/dotnet/machinelearning/pull/1569))
+
+    * There is sometimes a need to filter the data used for training a model.
+      For example, you might need to remove rows that don't have a label, or focus
+      your model on certain categories of inputs. This can now be done with
+      additional filters as shown
+      [here](https://github.com/dotnet/machinelearning/blob/71d58fa83f77abb630d815e5cf8aa9dd3390aa65/test/Microsoft.ML.Tests/RangeFilterTests.cs#L30).
+
+    * Some estimators iterate over the data multiple times. Instead of always
+      reading from file, you can choose to cache the data to potentially speed
+      things up. An example can be found
+      [here](https://github.com/dotnet/machinelearning/blob/71d58fa83f77abb630d815e5cf8aa9dd3390aa65/test/Microsoft.ML.Tests/CachingTests.cs#L56).
+
+### Acknowledgements
+
+Shoutout to [jwood803](https://github.com/jwood803),
+[feiyun0112](https://github.com/feiyun0112),
+[bojanmisic](https://github.com/bojanmisic),
+[rantri](https://github.com/rantri), [Caraul](https://github.com/Caraul),
+[van-tienhoang](https://github.com/van-tienhoang),
+[Thomas-S-B](https://github.com/Thomas-S-B), and the ML.NET team for their
+contributions as part of this release! 
diff --git a/docs/samples/Microsoft.ML.Samples/Dynamic/IidChangePointDetectorTransform.cs b/docs/samples/Microsoft.ML.Samples/Dynamic/IidChangePointDetectorTransform.cs
@@ -4,6 +4,10 @@
 using Microsoft.ML.Runtime.Data;
 using Microsoft.ML.Runtime.Api;
 using Microsoft.ML.Runtime.TimeSeriesProcessing;
+using Microsoft.ML.Core.Data;
+using Microsoft.ML.TimeSeries;
+using System.IO;
+using Microsoft.ML.Data;
 
 namespace Microsoft.ML.Samples.Dynamic
 {
@@ -34,26 +38,26 @@ public static void IidChangePointDetectorTransform()
             var ml = new MLContext();
 
             // Generate sample series data with a change
-            const int size = 16;
-            var data = new List<IidChangePointData>(size);
-            for (int i = 0; i < size / 2; i++)
+            const int Size = 16;
+            var data = new List<IidChangePointData>(Size);
+            for (int i = 0; i < Size / 2; i++)
                 data.Add(new IidChangePointData(5));
             // This is a change point
-            for (int i = 0; i < size / 2; i++)
+            for (int i = 0; i < Size / 2; i++)
                 data.Add(new IidChangePointData(7));
 
             // Convert data to IDataView.
             var dataView = ml.CreateStreamingDataView(data);
 
             // Setup IidSpikeDetector arguments
-            string outputColumnName = "Prediction";
-            string inputColumnName = "Value";
+            string outputColumnName = nameof(ChangePointPrediction.Prediction);
+            string inputColumnName = nameof(IidChangePointData.Value);
             var args = new IidChangePointDetector.Arguments()
             {
                 Source = inputColumnName,
                 Name = outputColumnName,
                 Confidence = 95,                // The confidence for spike detection in the range [0, 100]
-                ChangeHistoryLength = size / 4, // The length of the sliding window on p-values for computing the martingale score. 
+                ChangeHistoryLength = Size / 4, // The length of the sliding window on p-values for computing the martingale score. 
             };
 
             // The transformed data.
@@ -88,5 +92,116 @@ public static void IidChangePointDetectorTransform()
             // 7       0       7.00    0.50    0.00
             // 7       0       7.00    0.50    0.00
         }
+
+        // This example creates a time series (list of Data with the i-th element corresponding to the i-th time slot). 
+        // IidChangePointDetector is then applied to identify points where the data distribution changed, using a time series 
+        // prediction engine. The engine is checkpointed, then loaded back from disk into memory and used for prediction.
+        public static void IidChangePointDetectorPrediction()
+        {
+            // Create a new ML context, for ML.NET operations. It can be used for exception tracking and logging, 
+            // as well as the source of randomness.
+            var ml = new MLContext();
+
+            // Generate sample series data with a change
+            const int Size = 16;
+            var data = new List<IidChangePointData>(Size);
+            for (int i = 0; i < Size / 2; i++)
+                data.Add(new IidChangePointData(5));
+            // This is a change point
+            for (int i = 0; i < Size / 2; i++)
+                data.Add(new IidChangePointData(7));
+
+            // Convert data to IDataView.
+            var dataView = ml.CreateStreamingDataView(data);
+
+            // Set up IidChangePointDetector arguments
+            string outputColumnName = nameof(ChangePointPrediction.Prediction);
+            string inputColumnName = nameof(IidChangePointData.Value);
+            var args = new IidChangePointDetector.Arguments()
+            {
+                Source = inputColumnName,
+                Name = outputColumnName,
+                Confidence = 95,                // The confidence for change point detection in the range [0, 100]
+                ChangeHistoryLength = Size / 4, // The length of the sliding window on p-values for computing the martingale score. 
+            };
+
+            // Time Series model.
+            ITransformer model = new IidChangePointEstimator(ml, args).Fit(dataView);
+
+            // Create a time series prediction engine from the model.
+            var engine = model.CreateTimeSeriesPredictionFunction<IidChangePointData, ChangePointPrediction>(ml);
+            for (int index = 0; index < 8; index++)
+            {
+                // Anomaly change point detection.
+                var prediction = engine.Predict(new IidChangePointData(5));
+                Console.WriteLine("{0}\t{1}\t{2:0.00}\t{3:0.00}\t{4:0.00}", 5, prediction.Prediction[0], 
+                    prediction.Prediction[1], prediction.Prediction[2], prediction.Prediction[3]);
+            }
+
+            // Change point
+            var changePointPrediction = engine.Predict(new IidChangePointData(7));
+            Console.WriteLine("{0}\t{1}\t{2:0.00}\t{3:0.00}\t{4:0.00}", 7, changePointPrediction.Prediction[0],
+                changePointPrediction.Prediction[1], changePointPrediction.Prediction[2], changePointPrediction.Prediction[3]);
+
+            // Checkpoint the model.
+            var modelPath = "temp.zip";
+            engine.CheckPoint(ml, modelPath);
+
+            // Keep a reference to the current time series engine, because in the next step "engine" will point to the
+            // checkpointed model being loaded from disk.
+            var timeseries1 = engine;
+
+            // Load the model.
+            using (var file = File.OpenRead(modelPath))
+                model = TransformerChain.LoadFrom(ml, file);
+
+            // Create a time series prediction engine from the checkpointed model.
+            engine = model.CreateTimeSeriesPredictionFunction<IidChangePointData, ChangePointPrediction>(ml);
+            for (int index = 0; index < 8; index++)
+            {
+                // Anomaly change point detection.
+                var prediction = engine.Predict(new IidChangePointData(7));
+                Console.WriteLine("{0}\t{1}\t{2:0.00}\t{3:0.00}\t{4:0.00}", 7, prediction.Prediction[0],
+                    prediction.Prediction[1], prediction.Prediction[2], prediction.Prediction[3]);
+            }
+
+            // Predictions from the original time series engine should match the predictions from the
+            // checkpointed model.
+            engine = timeseries1;
+            for (int index = 0; index < 8; index++)
+            {
+                // Anomaly change point detection.
+                var prediction = engine.Predict(new IidChangePointData(7));
+                Console.WriteLine("{0}\t{1}\t{2:0.00}\t{3:0.00}\t{4:0.00}", 7, prediction.Prediction[0],
+                    prediction.Prediction[1], prediction.Prediction[2], prediction.Prediction[3]);
+            }
+
+            // Data Alert      Score   P-Value Martingale value
+            // 5       0       5.00    0.50    0.00       <-- Time Series 1.
+            // 5       0       5.00    0.50    0.00
+            // 5       0       5.00    0.50    0.00
+            // 5       0       5.00    0.50    0.00
+            // 5       0       5.00    0.50    0.00
+            // 5       0       5.00    0.50    0.00
+            // 5       0       5.00    0.50    0.00
+            // 5       0       5.00    0.50    0.00
+            // 7       1       7.00    0.00    10298.67   <-- alert is on, predicted changepoint (and model is checkpointed).
+
+            // 7       0       7.00    0.13    33950.16   <-- Time Series 2 : Model loaded back from disk and prediction is made.
+            // 7       0       7.00    0.26    60866.34
+            // 7       0       7.00    0.38    78362.04
+            // 7       0       7.00    0.50    0.01
+            // 7       0       7.00    0.50    0.00
+            // 7       0       7.00    0.50    0.00
+            // 7       0       7.00    0.50    0.00
+
+            // 7       0       7.00    0.13    33950.16   <-- Time Series 1 and prediction is made.
+            // 7       0       7.00    0.26    60866.34
+            // 7       0       7.00    0.38    78362.04
+            // 7       0       7.00    0.50    0.01
+            // 7       0       7.00    0.50    0.00
+            // 7       0       7.00    0.50    0.00
+            // 7       0       7.00    0.50    0.00
+        }
     }
 }