Compare Two Samples
Here's a common statistical question: are two samples compatible? You've measured some quantity (height, income, lifespan) for two groups (male/female, treated/untreated) and you want to know: Is the measured quantity really different for the two groups? Or are whatever differences you see explainable by random fluctuations? Meta.Numerics can perform the statistical test you need.
The APIs that follow expect samples as IReadOnlyCollection&lt;double&gt; (or, for a few APIs, IReadOnlyList&lt;double&gt;). You can use any container class that implements the required interface. For simplicity, here are a couple of Lists you can use as samples in the following examples:
using System.Collections.Generic;
List<double> a = new List<double>() { 130.0, 140.0, 150.0, 150.0, 160.0, 190.0 };
List<double> b = new List<double>() { 120.0, 150.0, 180.0, 170.0, 185.0, 175.0, 190.0, 200.0 };
If you use the Meta.Numerics.Data framework for data wrangling, you can also use any double-compatible column of a FrameView.
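For example, a FrameView column can be handed directly to the test methods. Here is a minimal sketch, assuming a hypothetical CSV file heights.csv with a numeric "height" column; FrameTable.FromCsv and As&lt;double&gt; reflect our reading of the Meta.Numerics.Data API, so check the class reference for your version:

```csharp
using System.Collections.Generic;
using System.IO;
using Meta.Numerics.Data;

// Read a CSV file into a FrameTable (file and column names are hypothetical).
FrameTable table;
using (TextReader reader = File.OpenText("heights.csv")) {
    table = FrameTable.FromCsv(reader);
}

// Expose the "height" column as doubles; the result implements
// IReadOnlyList<double>, so it can be passed to any of the test methods below.
IReadOnlyList<double> heights = table["height"].As<double>();
```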
Student's t-test looks for a shift in the mean between the two samples.
using System;
using Meta.Numerics.Statistics;
TestResult student = Univariate.StudentTTest(a, b);
Console.WriteLine($"{student.Statistic.Name} = {student.Statistic.Value}");
Console.WriteLine($"{student.Type} P = {student.Probability}");
The P-value is the chance that the observed means differ by as much as they do if the two samples actually are drawn from the same population, so a low P-value indicates a statistically significant difference.
The Mann-Whitney test looks for a shift in median between the two samples.
TestResult mannWhitney = Univariate.MannWhitneyTest(a, b);
Console.WriteLine($"{mannWhitney.Statistic.Name} = {mannWhitney.Statistic.Value}");
Console.WriteLine($"{mannWhitney.Type} P = {mannWhitney.Probability}");
The P-value is the chance that the observed medians differ by as much as they do if the samples actually are drawn from the same population.
Unlike Student's t-test, the Mann-Whitney test does not depend on the assumption that the distribution of sample means is normal.
The Kolmogorov-Smirnov (KS) test looks for any change in the shape of the distribution of values between the two samples.
TestResult kolmogorov = Univariate.KolmogorovSmirnovTest(a, b);
Console.WriteLine($"{kolmogorov.Statistic.Name} = {kolmogorov.Statistic.Value}");
Console.WriteLine($"{kolmogorov.Type} P = {kolmogorov.Probability}");
The P-value is the chance that the D-statistic measuring the difference between the two distributions has a value as large as it does, if the samples are actually drawn from the same population.
The TestResult object returned by a statistical test method has its Type property set to the most commonly desired sidedness for that particular test. But nothing prevents you from changing it to the sidedness you want. For example, if you are interested in testing not just whether a and b have different means, but specifically whether a has a lower mean than b, you can use the following code to change the sidedness of the t-test to LeftTailed before computing the P-value:
student.Type = TestType.LeftTailed;
Console.WriteLine($"{student.Type} P = {student.Probability}");
Indulge us in a reminder that, for the P-values to have meaning, you must decide the sidedness you will use before analyzing the data. Using the data to find the sign of t, then picking the appropriate sidedness to halve your P-value, is cheating.
The generalization of the t-test to more than two samples is the ANOVA. Meta.Numerics supports both one-way and two-way ANOVAs. The generalization of the Mann-Whitney test to more than two samples is the Kruskal-Wallis test. Meta.Numerics can do that, too.
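Here is a sketch of both multi-sample tests using the samples from above plus a third hypothetical one. We assume OneWayAnovaTest and KruskalWallisTest accept multiple sample collections and that OneWayAnovaResult exposes the overall F-test through a Result property; consult the Univariate class reference to confirm the exact signatures:

```csharp
using System;
using System.Collections.Generic;
using Meta.Numerics.Statistics;

List<double> a = new List<double>() { 130.0, 140.0, 150.0, 150.0, 160.0, 190.0 };
List<double> b = new List<double>() { 120.0, 150.0, 180.0, 170.0, 185.0, 175.0, 190.0, 200.0 };
// A third, hypothetical sample for illustration.
List<double> c = new List<double>() { 115.0, 135.0, 155.0, 165.0, 175.0 };

// One-way ANOVA across three samples; the F-test tells you whether
// any of the sample means differ significantly.
OneWayAnovaResult anova = Univariate.OneWayAnovaTest(a, b, c);
Console.WriteLine($"F = {anova.Result.Statistic.Value}, P = {anova.Result.Probability}");

// Kruskal-Wallis is the non-parametric analog, sensitive to shifts in median.
TestResult kruskalWallis = Univariate.KruskalWallisTest(a, b, c);
Console.WriteLine($"{kruskalWallis.Statistic.Name} = {kruskalWallis.Statistic.Value}, P = {kruskalWallis.Probability}");
```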
When it matters, Meta.Numerics uses exact distributions. It has machinery to calculate exact distributions of the Mann-Whitney and Kolmogorov-Smirnov statistics under the null hypothesis for any sample size. Unfortunately, the time and memory required to calculate those distributions increase rapidly with sample size. Fortunately, asymptotic approximations that can be quickly computed are available for large sample sizes. For small sample sizes, we use the exact machinery. For large sample sizes, we use the asymptotic approximations. We pick the crossover point so that P-values anywhere near conventional critical values will be off by less than 10^{-4}.