Add Generalised Linear Models to FSharpStats #334

Merged
merged 54 commits into from
Oct 24, 2024
6288d55
Implement IRLS solver for GLMs
LibraChris Feb 8, 2024
425d12b
Rename variables
LibraChris May 8, 2024
d1c9c41
add qr based GLM
LibraChris May 10, 2024
d637be7
add inital tests for the glm
LibraChris May 10, 2024
d0fc5ee
Update glm QR Solver
LibraChris May 14, 2024
d24955a
Add new Test for GLMs using Gamma Distribution
LibraChris May 14, 2024
ddcf09c
Add tests for the Poisson linker functions
LibraChris May 14, 2024
60c3ec1
Add tests for the Gamma linker functions
LibraChris May 14, 2024
0f0661c
Rename testcases to Reflect their log function
LibraChris May 14, 2024
fe83ba6
Add tests for the LogitLinkFunction
LibraChris May 14, 2024
ab44068
Add tests for the InverseSquaredLinkFunction
LibraChris May 15, 2024
c00980e
Add tests by example for glm IrLS solver
LibraChris May 15, 2024
1e6a524
Add tests for the IdentityLinkFunction
LibraChris May 15, 2024
ac416bf
Add tests groudwork for the BinomialLinkFunction
LibraChris May 15, 2024
8f88c1e
Add tests for the variance of Binominal Family
LibraChris May 16, 2024
148a933
Add tests for the variance of Poisson Family
LibraChris May 16, 2024
a8b5f00
Add tests for the variance of Gaussian/Normal Family
LibraChris May 16, 2024
2cbef3c
Fix test implemetation for familyVarianceFunctions
LibraChris May 16, 2024
a73a07e
Add tests for the variance of Gamma Family
LibraChris May 16, 2024
4d03d46
Add tests for the variance of Inv.Gaussian Family
LibraChris May 16, 2024
4465115
Rename test Cases based on their DistributionFamily
LibraChris May 16, 2024
a6e6568
Fix LogitLinkFunction
LibraChris May 16, 2024
de1fcd7
remove redundant BinomialLinkFunction
LibraChris May 16, 2024
3554a02
Remove redundant LinkFunction
LibraChris May 16, 2024
c1f38f1
Fix InverseSquaredLinkFunction
LibraChris May 16, 2024
2a3b096
Updated Gamma Distribution Variance function
LibraChris May 16, 2024
3f5a349
add Deriv Functions
LibraChris May 18, 2024
2787fbd
add Tests for Link and deriv
LibraChris May 18, 2024
3ee33e3
fix various Linkfunctions
LibraChris May 18, 2024
43cea23
Rework GLM QR Solver
LibraChris May 22, 2024
3e83833
Modify tests
LibraChris May 22, 2024
2816155
Add tests prototype for QR-Stepwise iteration
LibraChris May 22, 2024
c5ced84
Fix QR based solver for GLMs
LibraChris May 22, 2024
5029c3a
Modify Variance tests
LibraChris May 22, 2024
c3dddcb
Update statistics
LibraChris May 28, 2024
a7c5c1b
Update GeneralisedLinearModel.fs
LibraChris May 28, 2024
d8877b7
Update GeneralisedLinearModel.fs
LibraChris May 29, 2024
3cd68a8
Update GeneralisedLinearModel.fs
LibraChris May 30, 2024
253ac91
Rework GLMStatistics
LibraChris May 31, 2024
19cad0f
Remove deprecated GLM.Irls
LibraChris May 31, 2024
1b3336f
Fix minor testing issue
LibraChris May 31, 2024
a1d0ee4
add getFamilyReisualDeviance for more families
LibraChris Jun 2, 2024
37d03e0
Write code comments and documentation
LibraChris Jun 5, 2024
5e9a1b6
add Documentation for GLM Usage
LibraChris Jun 7, 2024
72bfb83
Update formating for documentation
LibraChris Jun 10, 2024
a8a0004
added data for Documentation
LibraChris Jun 10, 2024
f694340
remote tests for binominal family variance
LibraChris Jun 10, 2024
8dcd8ab
Adress changes requested in #344
LibraChris Jun 12, 2024
170519e
Adress changes requested in #334
LibraChris Jun 18, 2024
ba5ae9c
Update xml comments
LibraChris Jun 19, 2024
6c3a235
fix building error
LibraChris Jul 3, 2024
2e80081
Fix Typo
LibraChris Jul 3, 2024
13b3de9
Fix indentations
LibraChris Aug 26, 2024
df24c3f
Updated XML documentation
LibraChris Oct 22, 2024
Update formating for documentation
LibraChris committed Jul 3, 2024
commit 72bfb83ad0719b282631e65b6907a984c083c38a
58 changes: 31 additions & 27 deletions docs/GeneralisedLinearModels.fsx
@@ -82,14 +82,16 @@ In the context of programming, this equation could be implemented in a variety o
## Loading the Dataset
First, let's read some data to learn how to utilize Generalized Linear Models (GLMs). Below is the code to read the cheeseDataset, which is sourced from David S. Moore and George P. McCabe's "Introduction to the Practice of Statistics" (1993), second edition, published by W. H. Freeman and Company, available on the [Statlib database](https://dasl.datadescription.com). It contains information on the taste and concentration of various chemical components in 30 matured cheddar cheeses from the LaTrobe Valley in Victoria, Australia. The final Taste score is an aggregate of the scores given by several tasters.
*)
open Plotly.NET
open FSharp.Stats
open Deedle

let cheeseDataset : Frame<int,string> =
    Frame.ReadCsv "/Users/lux/Library/CloudStorage/OneDrive-ComputationalSystemsBiology/Projects/GeneralLinearModel/data/cheese.csv"
    |> Frame.indexRows "Column1"

(***include-value:cheeseDataset***)


(**
## Creating Histograms

@@ -123,7 +125,8 @@ let histograms =
|> Chart.withTraceInfo("Lactic")
Chart.Grid(2,2) [histogramTaste; histogramAcetic; histogramH2S; histogramLactic]

-histograms
+(***include-value:histograms***)


(**
## Preparing Data for GLM
@@ -156,40 +159,40 @@ let updatedIndependentMatrix =
The next step is to determine which distribution family and link function to use in our model.
Generalized Linear Models extend linear models to allow for response variables that have error distribution models other than a normal distribution. The choice of distribution family in a GLM depends on the nature of the response variable (dependent variable). Here is a summary of when to use each GLM distribution family:

-**Normal (Gaussian) Distribution**:
-- **Use when**: The response variable is continuous and normally distributed.
-- **Common applications**: Linear regression, ANOVA, ANCOVA.
+**Normal (Gaussian) Distribution**: <br>
+- **Use when**: The response variable is continuous and normally distributed. <br>
+- **Common applications**: Linear regression, ANOVA, ANCOVA. <br>
- **Examples**: Heights, weights, test scores.

-**Binomial Distribution**:
-- **Use when**: The response variable is binary (0 or 1) or proportion data.
-- **Common applications**: Logistic regression, probit regression.
+**Binomial Distribution**: <br>
+- **Use when**: The response variable is binary (0 or 1) or proportion data. <br>
+- **Common applications**: Logistic regression, probit regression. <br>
- **Examples**: Yes/No outcomes, success/failure data.

-**Poisson Distribution**:
-- **Use when**: The response variable represents count data, especially counts of rare events.
-- **Common applications**: Poisson regression.
+**Poisson Distribution**: <br>
+- **Use when**: The response variable represents count data, especially counts of rare events. <br>
+- **Common applications**: Poisson regression. <br>
- **Examples**: Number of customer complaints, number of accidents.

-**Negative Binomial Distribution**:
-- **Use when**: The response variable is count data with overdispersion (variance greater than the mean).
-- **Common applications**: Negative binomial regression.
+**Negative Binomial Distribution**: <br>
+- **Use when**: The response variable is count data with overdispersion (variance greater than the mean). <br>
+- **Common applications**: Negative binomial regression. <br>
- **Examples**: Number of insurance claims, number of hospital visits.

-**Gamma Distribution**:
-- **Use when**: The response variable is continuous and positive, often for skewed distributions.
-- **Common applications**: Gamma regression.
+**Gamma Distribution**: <br>
+- **Use when**: The response variable is continuous and positive, often for skewed distributions. <br>
+- **Common applications**: Gamma regression. <br>
- **Examples**: Insurance claims costs, time until an event occurs.

-**Inverse Gaussian Distribution**:
-- **Use when**: The response variable is continuous and positive, and particularly when the data has a long right tail.
-- **Common applications**: Inverse Gaussian regression.
+**Inverse Gaussian Distribution**: <br>
+- **Use when**: The response variable is continuous and positive, and particularly when the data has a long right tail. <br>
+- **Common applications**: Inverse Gaussian regression. <br>
- **Examples**: Reaction times, survival times.


-**Multinomial Distribution**:
-- **Use when**: The response variable represents categorical data with more than two categories.
-- **Common applications**: Multinomial logistic regression.
+**Multinomial Distribution**: <br>
+- **Use when**: The response variable represents categorical data with more than two categories. <br>
+- **Common applications**: Multinomial logistic regression. <br>
- **Examples**: Survey responses with multiple choices, type of disease diagnosis.

Each distribution family has a corresponding link function that relates the linear predictor to the mean of the distribution. The choice of link function can also be tailored to better fit the specific characteristics of the data. Common link functions include the identity link, log link, logit link, and inverse link, among others.
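
To make the idea of a link function concrete, here is a minimal sketch of the four links named above, each paired with its inverse. The `LinkSketch` type and the value names are hypothetical and exist only for illustration; they are not the FSharp.Stats API.

```fsharp
// Illustrative sketch: LinkSketch and the values below are hypothetical names,
// not part of the FSharp.Stats API.
type LinkSketch =
    { Name    : string
      Forward : float -> float   // g(mu): maps the mean to the linear predictor
      Inverse : float -> float } // g^-1(eta): maps the linear predictor back to the mean

let identityLink =
    { Name = "identity"; Forward = (fun mu -> mu); Inverse = (fun eta -> eta) }

let logLink =
    { Name = "log"; Forward = (fun mu -> log mu); Inverse = (fun eta -> exp eta) }

let logitLink =
    { Name    = "logit"
      Forward = fun mu  -> log (mu / (1.0 - mu))
      Inverse = fun eta -> 1.0 / (1.0 + exp (-eta)) }

let inverseLink =
    { Name = "inverse"; Forward = (fun mu -> 1.0 / mu); Inverse = (fun eta -> 1.0 / eta) }

// Applying a link and then its inverse should recover the original mean
// (up to floating-point rounding).
let roundTrip (link: LinkSketch) (mu: float) = link.Inverse (link.Forward mu)
```

For example, `roundTrip logitLink 0.3` should return a value very close to `0.3`, which is a quick sanity check that a link and its inverse actually belong together.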
@@ -207,14 +210,14 @@ let b = dependentVector
let maxIter = 100

// Distribution family of the dependent variable
-let distributionFamily = Fitting.GLM.GlmDistributionFamily.Poisson
+let distributionFamily = FSharp.Stats.Fitting.GLM.GlmDistributionFamily.Poisson

// Tolerance for the convergence of the algorithm, usually 1e-11 or 1e-6
let mTol = 1e-6

// Fit the model
let glm =
-    Fitting.GLM.QR.solveQrNewton A b maxIter distributionFamily mTol
+    FSharp.Stats.Fitting.GLM.QR.solveQrNewton A b maxIter distributionFamily mTol

glm
(***include-value:glm***)
@@ -234,7 +237,7 @@ Using this map we can also access the zScore and Pearson scores of each of the p
*)

let glmPredictions =
-    Fitting.GLM.QR.getGLMParameterStatistics A b glm ["Intercept"; "Acetic"; "H2S"; "Lactic"]
+    FSharp.Stats.Fitting.GLM.QR.getGLMParameterStatistics A b glm ["Intercept"; "Acetic"; "H2S"; "Lactic"]
    |> Map.ofSeq

(***include-value:glmPredictions***)
@@ -304,4 +307,5 @@ Pearson Chi-Square is another measure of goodness of fit. It assesses how well t
These statistics together give us a comprehensive view of the model's performance and its ability to explain the variability in the data.
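
As a rough illustration of what the Pearson Chi-Square measures, the sketch below computes it by hand for a Poisson model, where the variance function is V(mu) = mu, so each observation contributes (y - mu)^2 / mu. The function name `pearsonChiSquarePoisson` is hypothetical and is not how FSharp.Stats exposes or necessarily computes this statistic.

```fsharp
// Hypothetical helper for illustration; not the FSharp.Stats implementation.
// Pearson chi-square for a Poisson model: sum over observations of (y - mu)^2 / V(mu),
// with the Poisson variance function V(mu) = mu.
let pearsonChiSquarePoisson (observed: float[]) (fitted: float[]) =
    Array.map2 (fun y mu -> (y - mu) ** 2.0 / mu) observed fitted
    |> Array.sum

// Two observations, each one unit away from a fitted mean of 2:
// (1 - 2)^2 / 2 + (3 - 2)^2 / 2 = 0.5 + 0.5
let example = pearsonChiSquarePoisson [| 1.0; 3.0 |] [| 2.0; 2.0 |]
```

A perfect fit, where every fitted mean equals the observed value, gives a statistic of zero. Larger values indicate a larger gap between the observations and the fitted means, relative to the variance the family expects.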
*)

-Fitting.GLM.QR.getGLMModelStatistics b glm distributionFamily
+let glmStats = FSharp.Stats.Fitting.GLM.QR.getGLMModelStatistics b glm distributionFamily
+(***include-value:glmStats***)