[BUGZILLA #17475] lm.wfit issue with small weights #6649

MichaelChirico · 2020-05-19T02:25:28Z

I have recently realised that there is major instability in lm.wf for values of weights that are smaller than the machine epsilon. I clarify this issue with an example.

The input data is as below. I should stress that the weights are intentionally designed to reflect some structures in the data

df

y x weight
1.51156139 0.55209240 2.117337e-34
-0.63653132 -0.12599316 2.117337e-34
0.37782776 0.42095384 4.934135e-31
3.03792318 1.40315446 2.679495e-24
1.53646523 0.46076858 2.679495e-24
-2.37727874 -0.73963576 6.244160e-21
0.37183065 0.20407468 1.455107e-17
-1.53917553 -0.95519361 1.455107e-17
1.10926675 0.03897129 3.390908e-14
-0.37786333 -0.17523593 3.390908e-14
2.43973603 0.97970095 7.902000e-11
-0.35432394 -0.03742559 7.902000e-11
2.19296613 1.00355263 4.289362e-04
0.49845532 0.34816207 4.289362e-04
1.25005260 0.76306225 5.000000e-01
0.84360691 0.45152356 5.000000e-01
0.29565993 0.53880068 5.000000e-01
-0.54081334 -0.28104525 5.000000e-01
0.83612836 -0.12885659 9.995711e-01
-1.42526769 -0.87107631 9.999998e-01
0.10204789 -0.11649899 1.000000e+00
1.14292898 0.37249631 1.000000e+00
-3.02942081 -1.28966997 1.000000e+00
-1.37549764 -0.74676145 1.000000e+00
-2.00118016 -0.55182759 1.000000e+00
-4.24441674 -1.94603608 1.000000e+00
1.17168144 1.00868008 1.000000e+00
2.64007761 1.26333069 1.000000e+00
1.98550114 1.18509599 1.000000e+00
-0.58941683 -0.61972416 9.999998e-01
-4.57559611 -2.30914920 9.995711e-01
-0.82610544 -0.39347576 9.995711e-01
-0.02768220 0.20076910 9.995711e-01
0.78186399 0.25690215 9.995711e-01
-0.88314153 -0.20200148 5.000000e-01
-4.17076452 -2.03547588 5.000000e-01
0.93373070 0.54190626 4.289362e-04
-0.08517734 0.17692491 4.289362e-04
-4.47546619 -2.14876688 4.289362e-04
-1.65509103 -0.76898087 4.289362e-04
-0.39403030 -0.12689705 4.289362e-04
0.01203300 -0.18689898 1.841442e-07
-4.82762639 -2.31391121 1.841442e-07
-0.72658380 -0.39751171 3.397282e-14
-2.35886866 -1.01082109 0.000000e+00
-2.03762707 -0.96439902 0.000000e+00
0.90115123 0.60172286 0.000000e+00
1.55999194 0.83433953 0.000000e+00
3.07994058 1.30942776 0.000000e+00
1.78871462 1.10605530 0.000000e+00

Running simple linear model returns:

lm(y~x,data=df)

Call:
lm(formula = y∼ x, data = df)

Coefficients:
(Intercept) x
-0.04173 2.03790

and

max(resid(lm(y~x,data=df)))

[1] 1.14046

HOWEVER if I use the weighted model then:

lm(formula = y∼ x, data = df, weights = df$weights)

Coefficients:
(Intercept) x
-0.05786 1.96087

and

max(resid(lm(y~x,data=df,weights=df$weights)))

[1] 60.91888

as you see, the estimation of the coefficients are nearly the same but the resid() function returns a giant residual.

I am aware that the source of instability is small weights and I understand that the issue can be resolved by zeroing the small weights. However, the cuttof must be controlled by the algorithm in lm.wfit(), either by introducing a user-controlled parameter (such as tol) or setting the cutoff internally proportional to the machine epsilon.

METADATA

Bug author - hamedhaseli
Creation time - 2018-09-20 14:28:18 UTC
Bugzilla link
Status - UNCONFIRMED
Alias - None
Component - Accuracy
Version - R 3.5.0
Hardware - x86_64/x64/amd64 (64-bit) Windows 64-bit
Importance - P5 major
Assignee - R-core
URL -
Modification time - 2018-09-22 17:30 UTC

The text was updated successfully, but these errors were encountered:

MichaelChirico · 2020-05-19T02:25:29Z

So this is about lm.wft, not lme.wfit? In any case please run:

dput(df)

and paste in the result, to make it easier for folks to reproduce. Bonus points if you do it under a more recent version of R.

METADATA

Comment author - Benjamin Tyner
Timestamp - 2018-09-21 11:33:07 UTC

MichaelChirico · 2020-05-19T02:25:30Z

For the df pleae use the structure below:

dput(df)
structure(list(y = c(1.51156139, -0.63653132, 0.37782776, 3.03792318,
1.53646523, -2.37727874, 0.37183065, -1.53917553, 1.10926675,
-0.37786333, 2.43973603, -0.35432394, 2.19296613, 0.49845532,
1.2500526, 0.84360691, 0.29565993, -0.54081334, 0.83612836, -1.42526769,
0.10204789, 1.14292898, -3.02942081, -1.37549764, -2.00118016,
-4.24441674, 1.17168144, 2.64007761, 1.98550114, -0.58941683,
-4.57559611, -0.82610544, -0.0276822, 0.78186399, -0.88314153,
-4.17076452, 0.9337307, -0.08517734, -4.47546619, -1.65509103,
-0.3940303, 0.012033, -4.82762639, -0.7265838, -2.35886866, -2.03762707,
0.90115123, 1.55999194, 3.07994058, 1.78871462), x = c(0.5520924,
-0.12599316, 0.42095384, 1.40315446, 0.46076858, -0.73963576,
0.20407468, -0.95519361, 0.03897129, -0.17523593, 0.97970095,
-0.03742559, 1.00355263, 0.34816207, 0.76306225, 0.45152356,
0.53880068, -0.28104525, -0.12885659, -0.87107631, -0.11649899,
0.37249631, -1.28966997, -0.74676145, -0.55182759, -1.94603608,
1.00868008, 1.26333069, 1.18509599, -0.61972416, -2.3091492,
-0.39347576, 0.2007691, 0.25690215, -0.20200148, -2.03547588,
0.54190626, 0.17692491, -2.14876688, -0.76898087, -0.12689705,
-0.18689898, -2.31391121, -0.39751171, -1.01082109, -0.96439902,
0.60172286, 0.83433953, 1.30942776, 1.1060553), weight = c(2.117337e-34,
2.117337e-34, 4.934135e-31, 2.679495e-24, 2.679495e-24, 6.24416e-21,
1.455107e-17, 1.455107e-17, 3.390908e-14, 3.390908e-14, 7.902e-11,
7.902e-11, 0.0004289362, 0.0004289362, 0.5, 0.5, 0.5, 0.5, 0.9995711,
0.9999998, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.9999998, 0.9995711, 0.9995711,
0.9995711, 0.9995711, 0.5, 0.5, 0.0004289362, 0.0004289362, 0.0004289362,
0.0004289362, 0.0004289362, 1.841442e-07, 1.841442e-07, 3.397282e-14,
0, 0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA, -50L
))

METADATA

Comment author - hamedhaseli
Timestamp - 2018-09-21 12:00:42 UTC

MichaelChirico · 2020-05-19T02:25:31Z

(In reply to Benjamin Tyner from comment #1)

So this is about lm.wft, not lme.wfit? In any case please run: 

dput(df) 

and paste in the result, to make it easier for folks to reproduce. Bonus
points if you do it under a more recent version of R.

I fixed the title thanks.

METADATA

Comment author - hamedhaseli
Timestamp - 2018-09-21 12:01:32 UTC

MichaelChirico · 2020-05-19T02:25:31Z

Thank you. Arguably, a more serious concern is that the predict() and fitted() methods return divergent results:

fit <- lm(y~x,data=df,weights=df$weight)
min(predict(fit) - fitted(fit))

[1] -10.02918

In any case, if you care about the accuracy of the residual for an observation with such a small weight, you'll need to compute it yourself e.g. via:

df$resid.bad <- resid(fit)
df$resid <- df$y - predict(fit)
head(df)

          y          x       weight   resid.bad      resid

1 1.5115614 0.5520924 2.117337e-34 -7.2436891 0.4868415
2 -0.6365313 -0.1259932 2.117337e-34 -10.3607889 -0.3316117
3 0.3778278 0.4209538 4.934135e-31 -0.3897461 -0.3897461
4 3.0379232 1.4031545 2.679495e-24 0.3443788 0.3443788
5 1.5364652 0.4607686 2.679495e-24 0.6908197 0.6908197
6 -2.3772787 -0.7396358 6.244160e-21 -0.8690840 -0.8690840

METADATA

Comment author - Benjamin Tyner
Timestamp - 2018-09-22 12:01:32 UTC

MichaelChirico · 2020-05-19T02:25:32Z

(In reply to Benjamin Tyner from comment #4)

Thank you. Arguably, a more serious concern is that the predict() and
fitted() methods return divergent results:

> fit <- lm(y~x,data=df,weights=df$weight)
> min(predict(fit) - fitted(fit))
[1] -10.02918

In any case, if you care about the accuracy of the residual for an
observation with such a small weight, you'll need to compute it yourself
e.g. via:


> df$resid.bad <- resid(fit)
> df$resid <- df$y - predict(fit)
> head(df)
y          x       weight   resid.bad      resid
1  1.5115614  0.5520924 2.117337e-34  -7.2436891  0.4868415
2 -0.6365313 -0.1259932 2.117337e-34 -10.3607889 -0.3316117
3  0.3778278  0.4209538 4.934135e-31  -0.3897461 -0.3897461
4  3.0379232  1.4031545 2.679495e-24   0.3443788  0.3443788
5  1.5364652  0.4607686 2.679495e-24   0.6908197  0.6908197
6 -2.3772787 -0.7396358 6.244160e-21  -0.8690840 -0.8690840

Yes, your point is even more serious.

There are some ways of rounding this issue but all need the knowledge of the issue. Other words, we know this bug exists and then we can solve it by alternative methods as you pointed out, however, the other R users may not even know that this issue exists. Then I think that the R core team should fix that on the next release.

METADATA

Comment author - hamedhaseli
Timestamp - 2018-09-22 17:30:48 UTC

MichaelChirico added Component - Accuracy Hardware - x86_64/x64/amd64 (64-bit) Windows 64-bi Importance - P5 major Status - UNCONFIRMED Version - R 3.5.0 labels May 19, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUGZILLA #17475] lm.wfit issue with small weights #6649

[BUGZILLA #17475] lm.wfit issue with small weights #6649

MichaelChirico commented May 19, 2020

MichaelChirico commented May 19, 2020

MichaelChirico commented May 19, 2020

MichaelChirico commented May 19, 2020

MichaelChirico commented May 19, 2020

MichaelChirico commented May 19, 2020

[BUGZILLA #17475] lm.wfit issue with small weights #6649

[BUGZILLA #17475] lm.wfit issue with small weights #6649

Comments

MichaelChirico commented May 19, 2020

METADATA

MichaelChirico commented May 19, 2020

METADATA

MichaelChirico commented May 19, 2020

METADATA

MichaelChirico commented May 19, 2020

METADATA

MichaelChirico commented May 19, 2020

METADATA

MichaelChirico commented May 19, 2020

METADATA