Compare Two Samples

David Wright edited this page Apr 16, 2018 · 6 revisions

Here's a common statistical question: are two samples compatible? You've measured some quantity (height, income, lifespan) for two groups (male/female, treated/untreated) and you want to know: Is the measured quantity really different for the two groups? Or are whatever differences you see explainable by random fluctuations? Meta.Numerics can perform the statistical test you need.

The APIs that follow expect samples as IReadOnlyCollection&lt;double&gt; (or, for a few APIs, IReadOnlyList&lt;double&gt;). You can use any container class that implements the required interface. For simplicity, here are a couple of Lists you can use as samples in the following examples:

using System.Collections.Generic;

List<double> a = new List<double>() { 130.0, 140.0, 150.0, 150.0, 160.0, 190.0 };
List<double> b = new List<double>() { 120.0, 150.0, 180.0, 170.0, 185.0, 175.0, 190.0, 200.0 };

If you use the Meta.Numerics.Data framework for data wrangling, you can also use any double-compatible column of a FrameView.
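As a sketch of how that might look: the file name and column name below are hypothetical, and the reader and column-accessor calls (FrameTable.FromCsv, As&lt;double&gt;) reflect our reading of the Meta.Numerics.Data API, so check the documentation for your version.

```csharp
using System.Collections.Generic;
using System.IO;
using Meta.Numerics.Data;

// Read a data frame from a CSV file (hypothetical file and column names).
FrameTable table;
using (TextReader reader = File.OpenText("measurements.csv")) {
    table = FrameTable.FromCsv(reader);
}

// A double-typed column view implements IReadOnlyList<double>,
// so it can be passed directly to the test methods below.
IReadOnlyList<double> heights = table["height"].As<double>();
```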

Student's t-Test

Student's t-test looks for a shift in the mean between the two samples.

using System;
using Meta.Numerics.Statistics;

TestResult student = Univariate.StudentTTest(a, b);
Console.WriteLine($"{student.Statistic.Name} = {student.Statistic.Value}");
Console.WriteLine($"{student.Type} P = {student.Probability}");

The P-value is the chance that the observed means differ by as much as they do if the two samples actually are drawn from the same population, so a low P-value indicates a statistically significant difference.

Mann-Whitney Test

The Mann-Whitney test looks for a shift in median between the two samples.

TestResult mannWhitney = Univariate.MannWhitneyTest(a, b);
Console.WriteLine($"{mannWhitney.Statistic.Name} = {mannWhitney.Statistic.Value}");
Console.WriteLine($"{mannWhitney.Type} P = {mannWhitney.Probability}");

The P-value is the chance that the observed medians differ by as much as they do if the samples actually are drawn from the same population.

Unlike Student's t-test, the Mann-Whitney test does not depend on the assumption that the distribution of sample means is normal.

Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov (KS) test looks for any change in the shape of the distribution of values between the two samples.

TestResult kolmogorov = Univariate.KolmogorovSmirnovTest(a, b);
Console.WriteLine($"{kolmogorov.Statistic.Name} = {kolmogorov.Statistic.Value}");
Console.WriteLine($"{kolmogorov.Type} P = {kolmogorov.Probability}");

The P-value is the chance that the D-statistic measuring the difference between the two distributions has a value as large as it does, if the samples are actually drawn from the same population.

What about test sidedness?

The TestResult object returned by a statistical test method has its Type property set to the most commonly desired sidedness for that particular test, but nothing prevents you from changing it to the sidedness you want. For example, if you are interested not just in whether a and b have different means, but specifically in whether a has a lower mean than b, you can change the sidedness of the t-test to LeftTailed before computing the P-value:

student.Type = TestType.LeftTailed;
Console.WriteLine($"{student.Type} P = {student.Probability}");

Indulge us in a reminder that, for the P-values to have meaning, you must decide the sidedness you will use before analyzing the data. Using the data to find the sign of t, then picking the appropriate sidedness to halve your P-value, is cheating.

What about more than two samples?

The generalization of the t-test to more than two samples is the ANOVA. Meta.Numerics supports both one-way and two-way ANOVAs. The generalization of the Mann-Whitney test to more than two samples is the Kruskal-Wallis test. Meta.Numerics can do that, too.
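A minimal sketch of the multi-sample case, reusing the samples from above plus a hypothetical third sample c. This assumes Univariate.OneWayAnovaTest and Univariate.KruskalWallisTest overloads that accept multiple samples, and that the ANOVA result exposes its overall F-test via a Result property; verify these names against the API documentation for your version.

```csharp
using System;
using System.Collections.Generic;
using Meta.Numerics.Statistics;

List<double> a = new List<double>() { 130.0, 140.0, 150.0, 150.0, 160.0, 190.0 };
List<double> b = new List<double>() { 120.0, 150.0, 180.0, 170.0, 185.0, 175.0, 190.0, 200.0 };
List<double> c = new List<double>() { 115.0, 135.0, 155.0, 165.0 };

// One-way ANOVA: tests for any difference among the group means.
OneWayAnovaResult anova = Univariate.OneWayAnovaTest(a, b, c);
Console.WriteLine($"F = {anova.Result.Statistic.Value}, P = {anova.Result.Probability}");

// Kruskal-Wallis: the non-parametric analog, sensitive to shifts among group medians.
TestResult kruskalWallis = Univariate.KruskalWallisTest(a, b, c);
Console.WriteLine($"{kruskalWallis.Statistic.Name} = {kruskalWallis.Statistic.Value}");
Console.WriteLine($"P = {kruskalWallis.Probability}");
```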

Are Null Distributions Exact?

Yes, when it matters. Meta.Numerics has machinery to calculate exact distributions of the Mann-Whitney and Kolmogorov-Smirnov statistics under the null hypothesis, for any sample size. Unfortunately, the time and memory required to calculate those distributions increase rapidly with sample size. Fortunately, asymptotic approximations that can be quickly computed are available for large sample sizes. For small sample sizes, we use the exact machinery; for large sample sizes, we use the asymptotic approximations. We pick the crossover point so that P-values anywhere near conventional critical values will be off by less than 10^{-4}.
