Commit 2ac56d4 (Update README.md), 1 parent 2e1746c

README.md: 1 file changed, 108 additions & 33 deletions
- Terrell-Scott’s Rule (1985)
- Rice University Rule

## Requirements

This library requires PHP 8.3 or newer. Support for older PHP versions, such as [markrogoyski/math-php](https://github.com/markrogoyski/math-php) provides for PHP 7.2+, is not planned.

## Installation

```bash
composer require tomkyle/binning
```

## Usage

The **BinSelection** class provides several methods for determining the optimal number of bins and the optimal bin width for histogram creation. You can either call a specific method directly or use the general `suggestBins()` and `suggestBinWidth()` methods with different strategies.

### Determine Bin Width

Use the **suggestBinWidth** method to get the *optimal bin width* based on the selected method. It returns the bin width, often referred to as 𝒉, as a float value.

---

### Explicit method calls

You can also call the specific methods directly to get the bin width 𝒉 or number of bins 𝒌.

---

#### 1. Pearson’s Square Root Rule (1892)

Simple rule using the square root of the sample size.

$$
k = \left \lceil \sqrt{n} \; \right \rceil
$$

```php
$k = BinSelection::squareRoot($data);
```
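
As a quick sanity check, the rule can be evaluated by hand. The following standalone sketch reproduces the arithmetic in plain PHP (it is not a library call, and the sample size of 150 is hypothetical):

```php
<?php
// Square Root Rule by hand: k = ceil(sqrt(n)).
$n = 150;                       // hypothetical sample size
$k = (int) ceil(sqrt($n));      // sqrt(150) ≈ 12.25, so k = 13
echo $k, PHP_EOL;
```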

---

#### 2. Sturges’s Rule (1926)

Based on the logarithm of the sample size. Good for normal distributions.

$$
k = 1 + \left \lceil \; \log_2(n) \; \right \rceil
$$

```php
$k = BinSelection::sturges($data);
```
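
To see the formula at work, here is the arithmetic in plain PHP (not a library call), with a hypothetical sample size of 1000:

```php
<?php
// Sturges's Rule by hand: k = 1 + ceil(log2(n)).
$n = 1000;                          // hypothetical sample size
$k = 1 + (int) ceil(log($n, 2));    // log2(1000) ≈ 9.97, so k = 1 + 10 = 11
echo $k, PHP_EOL;
```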

---

#### 3. Doane’s Rule (1976)

Improvement of *Sturges*’ rule that accounts for data skewness.

$$
k = 1 + \left\lceil \; \log_2(n) + \log_2\left(1 + \frac{|g_1|}{\sigma_{g_1}}\right) \; \right \rceil
$$

```php
// Using sample-based calculation (default)
$k = BinSelection::doane($data);

// Using population-based calculation
$k = BinSelection::doane($data, population: true);
```
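
The skewness correction can be reproduced in plain PHP. This sketch assumes $g_1$ is the moment skewness $m_3 / m_2^{3/2}$ and $\sigma_{g_1} = \sqrt{6(n-2)/((n+1)(n+3))}$; the library’s estimators (sample vs. population) may differ in detail, and the dataset is made up for illustration:

```php
<?php
// Doane's Rule by hand on a small right-skewed sample.
$data = [1, 1, 2, 2, 3, 10];       // hypothetical dataset
$n    = count($data);
$mean = array_sum($data) / $n;

$m2 = $m3 = 0.0;
foreach ($data as $x) {
    $m2 += ($x - $mean) ** 2;
    $m3 += ($x - $mean) ** 3;
}
$m2 /= $n;
$m3 /= $n;

$g1      = $m3 / $m2 ** 1.5;                            // moment skewness
$sigmaG1 = sqrt(6 * ($n - 2) / (($n + 1) * ($n + 3)));  // std. error of g1

$k = 1 + (int) ceil(log($n, 2) + log(1 + abs($g1) / $sigmaG1, 2));
echo $k, PHP_EOL;
```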

---

#### 4. Scott’s Rule (1979)

Based on the standard deviation and sample size. Good for continuous data.

$$
h = \frac{3.49\,\hat{\sigma}}{\sqrt[3]{n}}
$$

$$
R = \max_i x_i - \min_i x_i
$$

$$
k = \left \lceil \frac{R}{h} \right \rceil
$$

The result is an array with keys `width`, `bins`, `range`, and `stddev`. Map them to variables like so:

```php
list($h, $k, $R, $stddev) = BinSelection::scott($data);
```
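
The rule is easy to verify by hand. This standalone sketch assumes $\hat{\sigma}$ is the sample standard deviation (with $n-1$ denominator); the library’s estimator may differ, and the dataset is made up for illustration:

```php
<?php
// Scott's Rule by hand: h = 3.49 * sigma / n^(1/3), k = ceil(R / h).
$data = [2, 4, 4, 4, 5, 5, 7, 9];      // hypothetical dataset
$n    = count($data);
$mean = array_sum($data) / $n;

$ss = 0.0;
foreach ($data as $x) {
    $ss += ($x - $mean) ** 2;
}
$sigma = sqrt($ss / ($n - 1));          // sample std. deviation ≈ 2.14

$h = 3.49 * $sigma / $n ** (1 / 3);     // bin width ≈ 3.73
$R = max($data) - min($data);           // range = 7
$k = (int) ceil($R / $h);               // ceil(7 / 3.73) = 2
echo $k, PHP_EOL;
```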

---

#### 5. Freedman-Diaconis Rule (1981)

Based on the interquartile range (IQR). Robust against outliers.

$$
IQR = Q_3 - Q_1
$$

$$
h = 2 \times \frac{\mathrm{IQR}}{\sqrt[3]{n}}
$$

$$
R = \max_i x_i - \min_i x_i
$$

$$
k = \left \lceil \frac{R}{h} \right \rceil
$$

The result is an array with keys `width`, `bins`, `range`, and `IQR`. Map them to variables like so:

```php
list($h, $k, $R, $IQR) = BinSelection::freedmanDiaconis($data);
```
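
Quantile conventions vary between implementations, so results can differ slightly. This standalone sketch takes the quartiles as the medians of the lower and upper halves of the sorted data; the library’s quantile method may use a different convention, and the dataset and the `$median` helper are made up for illustration:

```php
<?php
// Freedman-Diaconis by hand: h = 2 * IQR / n^(1/3), k = ceil(R / h).
$data = [1, 2, 3, 4, 5, 6, 7, 8];      // hypothetical dataset
sort($data);
$n    = count($data);
$half = intdiv($n, 2);

// Median of a sorted array (helper defined just for this sketch).
$median = function (array $a): float {
    $m = count($a);
    $i = intdiv($m, 2);
    return $m % 2 ? (float) $a[$i] : ($a[$i - 1] + $a[$i]) / 2;
};

$q1  = $median(array_slice($data, 0, $half));    // 2.5
$q3  = $median(array_slice($data, $n - $half));  // 6.5
$iqr = $q3 - $q1;                                // 4

$h = 2 * $iqr / $n ** (1 / 3);                   // 8 / 2 = 4
$R = max($data) - min($data);                    // 7
$k = (int) ceil($R / $h);                        // ceil(1.75) = 2
echo $k, PHP_EOL;
```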
181232

182233

183234

235+
---
236+
237+
238+
184239
#### 6. Terrell-Scott’s Rule (1985)

Uses the cube root of the sample size and generally provides more bins than *Sturges*. This is the original *Rice Rule*:

$$
k = \left \lceil \; \sqrt[3]{2n} \enspace \right \rceil = \left \lceil \; (2n)^{1/3} \; \right \rceil
$$

```php
$k = BinSelection::terrellScott($data);
```
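
In plain PHP, the arithmetic looks like this (the sample size of 100 is hypothetical):

```php
<?php
// Terrell-Scott's Rule by hand: k = ceil((2n)^(1/3)).
$n = 100;                                // hypothetical sample size
$k = (int) ceil((2 * $n) ** (1 / 3));    // 200^(1/3) ≈ 5.85, so k = 6
echo $k, PHP_EOL;
```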

---

#### 7. Rice University Rule

Uses the cube root of the sample size and generally provides more bins than *Sturges*. This is the formula as taught by David M. Lane at Rice University. **N.B.** This *Rice Rule* does not appear to be the original one; *Terrell-Scott’s* (1985) seems to be. The two variants can yield different results under certain circumstances, but Lane’s variant from the early 2000s is the more commonly cited:

$$
k = 2 \times \left \lceil \; \sqrt[3]{n} \enspace \right \rceil = 2 \times \left \lceil \; n^{1/3} \; \right \rceil
$$

```php
$k = BinSelection::rice($data);
```
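
The divergence between the two variants is easy to demonstrate in plain PHP. For a hypothetical sample size of 100, Lane’s variant yields 10 bins while Terrell-Scott yields 6:

```php
<?php
// Lane's Rice Rule vs. the original Terrell-Scott rule for the same n.
$n = 100;                                          // hypothetical sample size

$lane         = 2 * (int) ceil($n ** (1 / 3));     // 2 * ceil(4.64) = 10
$terrellScott = (int) ceil((2 * $n) ** (1 / 3));   // ceil(5.85)     = 6

echo $lane, ' vs ', $terrellScott, PHP_EOL;
```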

---

## Method Selection Guidelines

| Rule | Strengths & Weaknesses |
| --------------------- | ------------------------------------------------------------ |
| **Freedman–Diaconis** | Uses the IQR to set 𝒉, so it is robust against outliers and adapts to data spread. <br />⚠️ May over‐smooth heavily skewed or multi‐modal data when the IQR is small. |
| **Sturges’ Rule** | Very simple; works well for roughly normal, moderate-sized datasets. <br />⚠️ Ignores outliers and underestimates the bin count for large or skewed samples. |
| **Rice Rule** | Independent of data shape and easy to compute. <br />⚠️ Prone to over‐ or under‐smoothing when the distribution is heavy‐tailed or skewed. |
| **Terrell–Scott** | Similar approach to the *Rice Rule* but with asymptotically optimal MISE properties; gives more bins than Sturges and adapts better at large 𝒏. <br />⚠️ Still ignores skewness and outliers. |
| **Square Root Rule** | Simply the square root of 𝒏, so it requires no distributional estimates. <br />⚠️ May produce too few bins for complex distributions, or too many for very noisy data. |
| **Doane’s Rule** | Extends *Sturges’ Rule* with a skewness correction, improving performance on asymmetric data. <br />⚠️ Requires estimating the third moment (skewness), which can be unstable for small 𝒏. |
| **Scott’s Rule** | Uses the standard deviation to minimize MISE, providing a good balance for unimodal, symmetric data. <br />⚠️ Sensitive to outliers (inflated $\sigma$) and may underperform on skewed distributions. |

## Literature

Rubia, J.M.D.L. (2024):
**Rice University Rule to Determine the Number of Bins.**
Open Journal of Statistics, 14, 119–149.
DOI: [10.4236/ojs.2024.141006](https://doi.org/10.4236/ojs.2024.141006)

Wikipedia:
**Histogram / Number of bins and width**
https://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width

## Practical Example

```php
// …
foreach ($methods as $name => $method) {
    // …
}
```

## Error Handling

All methods will throw `InvalidArgumentException` for invalid inputs:

```php
try {
    // …
} catch (InvalidArgumentException $e) {
    // …
}
```