Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUGZILLA #17475] lm.wfit issue with small weights #6649

Open
MichaelChirico opened this issue May 19, 2020 · 5 comments
Open

[BUGZILLA #17475] lm.wfit issue with small weights #6649

MichaelChirico opened this issue May 19, 2020 · 5 comments

Comments

@MichaelChirico
Copy link
Owner

I have recently realised that there is major instability in lm.wf for values of weights that are smaller than the machine epsilon. I clarify this issue with an example.

The input data is as below. I should stress that the weights are intentionally designed to reflect some structures in the data

df

y x weight
1.51156139 0.55209240 2.117337e-34
-0.63653132 -0.12599316 2.117337e-34
0.37782776 0.42095384 4.934135e-31
3.03792318 1.40315446 2.679495e-24
1.53646523 0.46076858 2.679495e-24
-2.37727874 -0.73963576 6.244160e-21
0.37183065 0.20407468 1.455107e-17
-1.53917553 -0.95519361 1.455107e-17
1.10926675 0.03897129 3.390908e-14
-0.37786333 -0.17523593 3.390908e-14
2.43973603 0.97970095 7.902000e-11
-0.35432394 -0.03742559 7.902000e-11
2.19296613 1.00355263 4.289362e-04
0.49845532 0.34816207 4.289362e-04
1.25005260 0.76306225 5.000000e-01
0.84360691 0.45152356 5.000000e-01
0.29565993 0.53880068 5.000000e-01
-0.54081334 -0.28104525 5.000000e-01
0.83612836 -0.12885659 9.995711e-01
-1.42526769 -0.87107631 9.999998e-01
0.10204789 -0.11649899 1.000000e+00
1.14292898 0.37249631 1.000000e+00
-3.02942081 -1.28966997 1.000000e+00
-1.37549764 -0.74676145 1.000000e+00
-2.00118016 -0.55182759 1.000000e+00
-4.24441674 -1.94603608 1.000000e+00
1.17168144 1.00868008 1.000000e+00
2.64007761 1.26333069 1.000000e+00
1.98550114 1.18509599 1.000000e+00
-0.58941683 -0.61972416 9.999998e-01
-4.57559611 -2.30914920 9.995711e-01
-0.82610544 -0.39347576 9.995711e-01
-0.02768220 0.20076910 9.995711e-01
0.78186399 0.25690215 9.995711e-01
-0.88314153 -0.20200148 5.000000e-01
-4.17076452 -2.03547588 5.000000e-01
0.93373070 0.54190626 4.289362e-04
-0.08517734 0.17692491 4.289362e-04
-4.47546619 -2.14876688 4.289362e-04
-1.65509103 -0.76898087 4.289362e-04
-0.39403030 -0.12689705 4.289362e-04
0.01203300 -0.18689898 1.841442e-07
-4.82762639 -2.31391121 1.841442e-07
-0.72658380 -0.39751171 3.397282e-14
-2.35886866 -1.01082109 0.000000e+00
-2.03762707 -0.96439902 0.000000e+00
0.90115123 0.60172286 0.000000e+00
1.55999194 0.83433953 0.000000e+00
3.07994058 1.30942776 0.000000e+00
1.78871462 1.10605530 0.000000e+00

Running simple linear model returns:

lm(y~x,data=df)

Call:
lm(formula = y∼ x, data = df)

Coefficients:
(Intercept) x
-0.04173 2.03790

and

max(resid(lm(y~x,data=df)))

[1] 1.14046

HOWEVER if I use the weighted model then:

lm(formula = y∼ x, data = df, weights = df$weights)

Coefficients:
(Intercept) x
-0.05786 1.96087

and

max(resid(lm(y~x,data=df,weights=df$weights)))

[1] 60.91888

as you see, the estimation of the coefficients are nearly the same but the resid() function returns a giant residual.

I am aware that the source of instability is small weights and I understand that the issue can be resolved by zeroing the small weights. However, the cuttof must be controlled by the algorithm in lm.wfit(), either by introducing a user-controlled parameter (such as tol) or setting the cutoff internally proportional to the machine epsilon.


METADATA

  • Bug author - hamedhaseli
  • Creation time - 2018-09-20 14:28:18 UTC
  • Bugzilla link
  • Status - UNCONFIRMED
  • Alias - None
  • Component - Accuracy
  • Version - R 3.5.0
  • Hardware - x86_64/x64/amd64 (64-bit) Windows 64-bit
  • Importance - P5 major
  • Assignee - R-core
  • URL -
  • Modification time - 2018-09-22 17:30 UTC
@MichaelChirico
Copy link
Owner Author

So this is about lm.wft, not lme.wfit? In any case please run:

dput(df)

and paste in the result, to make it easier for folks to reproduce. Bonus points if you do it under a more recent version of R.


METADATA

  • Comment author - Benjamin Tyner
  • Timestamp - 2018-09-21 11:33:07 UTC

@MichaelChirico
Copy link
Owner Author

For the df pleae use the structure below:

dput(df)
structure(list(y = c(1.51156139, -0.63653132, 0.37782776, 3.03792318,
1.53646523, -2.37727874, 0.37183065, -1.53917553, 1.10926675,
-0.37786333, 2.43973603, -0.35432394, 2.19296613, 0.49845532,
1.2500526, 0.84360691, 0.29565993, -0.54081334, 0.83612836, -1.42526769,
0.10204789, 1.14292898, -3.02942081, -1.37549764, -2.00118016,
-4.24441674, 1.17168144, 2.64007761, 1.98550114, -0.58941683,
-4.57559611, -0.82610544, -0.0276822, 0.78186399, -0.88314153,
-4.17076452, 0.9337307, -0.08517734, -4.47546619, -1.65509103,
-0.3940303, 0.012033, -4.82762639, -0.7265838, -2.35886866, -2.03762707,
0.90115123, 1.55999194, 3.07994058, 1.78871462), x = c(0.5520924,
-0.12599316, 0.42095384, 1.40315446, 0.46076858, -0.73963576,
0.20407468, -0.95519361, 0.03897129, -0.17523593, 0.97970095,
-0.03742559, 1.00355263, 0.34816207, 0.76306225, 0.45152356,
0.53880068, -0.28104525, -0.12885659, -0.87107631, -0.11649899,
0.37249631, -1.28966997, -0.74676145, -0.55182759, -1.94603608,
1.00868008, 1.26333069, 1.18509599, -0.61972416, -2.3091492,
-0.39347576, 0.2007691, 0.25690215, -0.20200148, -2.03547588,
0.54190626, 0.17692491, -2.14876688, -0.76898087, -0.12689705,
-0.18689898, -2.31391121, -0.39751171, -1.01082109, -0.96439902,
0.60172286, 0.83433953, 1.30942776, 1.1060553), weight = c(2.117337e-34,
2.117337e-34, 4.934135e-31, 2.679495e-24, 2.679495e-24, 6.24416e-21,
1.455107e-17, 1.455107e-17, 3.390908e-14, 3.390908e-14, 7.902e-11,
7.902e-11, 0.0004289362, 0.0004289362, 0.5, 0.5, 0.5, 0.5, 0.9995711,
0.9999998, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.9999998, 0.9995711, 0.9995711,
0.9995711, 0.9995711, 0.5, 0.5, 0.0004289362, 0.0004289362, 0.0004289362,
0.0004289362, 0.0004289362, 1.841442e-07, 1.841442e-07, 3.397282e-14,
0, 0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA, -50L
))


METADATA

  • Comment author - hamedhaseli
  • Timestamp - 2018-09-21 12:00:42 UTC

@MichaelChirico
Copy link
Owner Author

(In reply to Benjamin Tyner from comment #1)

So this is about lm.wft, not lme.wfit? In any case please run: 

dput(df) 

and paste in the result, to make it easier for folks to reproduce. Bonus
points if you do it under a more recent version of R.

I fixed the title thanks.


METADATA

  • Comment author - hamedhaseli
  • Timestamp - 2018-09-21 12:01:32 UTC

@MichaelChirico
Copy link
Owner Author

Thank you. Arguably, a more serious concern is that the predict() and fitted() methods return divergent results:

fit <- lm(y~x,data=df,weights=df$weight)
min(predict(fit) - fitted(fit))

[1] -10.02918

In any case, if you care about the accuracy of the residual for an observation with such a small weight, you'll need to compute it yourself e.g. via:

df$resid.bad <- resid(fit)
df$resid <- df$y - predict(fit)
head(df)
          y          x       weight   resid.bad      resid

1 1.5115614 0.5520924 2.117337e-34 -7.2436891 0.4868415
2 -0.6365313 -0.1259932 2.117337e-34 -10.3607889 -0.3316117
3 0.3778278 0.4209538 4.934135e-31 -0.3897461 -0.3897461
4 3.0379232 1.4031545 2.679495e-24 0.3443788 0.3443788
5 1.5364652 0.4607686 2.679495e-24 0.6908197 0.6908197
6 -2.3772787 -0.7396358 6.244160e-21 -0.8690840 -0.8690840


METADATA

  • Comment author - Benjamin Tyner
  • Timestamp - 2018-09-22 12:01:32 UTC

@MichaelChirico
Copy link
Owner Author

(In reply to Benjamin Tyner from comment #4)

Thank you. Arguably, a more serious concern is that the predict() and
fitted() methods return divergent results:

> fit <- lm(y~x,data=df,weights=df$weight)
> min(predict(fit) - fitted(fit))
[1] -10.02918

In any case, if you care about the accuracy of the residual for an
observation with such a small weight, you'll need to compute it yourself
e.g. via:


> df$resid.bad <- resid(fit)
> df$resid <- df$y - predict(fit)
> head(df)
y          x       weight   resid.bad      resid
1  1.5115614  0.5520924 2.117337e-34  -7.2436891  0.4868415
2 -0.6365313 -0.1259932 2.117337e-34 -10.3607889 -0.3316117
3  0.3778278  0.4209538 4.934135e-31  -0.3897461 -0.3897461
4  3.0379232  1.4031545 2.679495e-24   0.3443788  0.3443788
5  1.5364652  0.4607686 2.679495e-24   0.6908197  0.6908197
6 -2.3772787 -0.7396358 6.244160e-21  -0.8690840 -0.8690840

Yes, your point is even more serious.

There are some ways of rounding this issue but all need the knowledge of the issue. Other words, we know this bug exists and then we can solve it by alternative methods as you pointed out, however, the other R users may not even know that this issue exists. Then I think that the R core team should fix that on the next release.


METADATA

  • Comment author - hamedhaseli
  • Timestamp - 2018-09-22 17:30:48 UTC

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant