* Quiz
You are analyzing a dataset where each observation consists of the age, height, length, and width of a particular turtle. You want to know if the data can be well described by fewer than four dimensions (maybe for plotting), so you decide to do Principal Component Analysis. Which of the following is most likely to be the loadings of the first Principal Component?
answer:
[] (1, 1, 1, 1)
[x] (.5, .5, .5, .5) correct
[] (.71, -.71, 0, 0)
[] (1, -1, -1, -1)
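Loadings must form a unit vector (their squares sum to 1), which rules out (1, 1, 1, 1) and (1, -1, -1, -1); and since the four size-related measurements of a turtle are presumably all positively correlated, the first PC should weight them roughly equally in the same direction. A minimal sketch in R with simulated data (not the actual quiz dataset) illustrating both points:
# Simulated "turtle" data: four positively correlated measurements
# driven by a latent overall-size variable.
set.seed(1)
size <- rnorm(100)
turtles <- data.frame(
  age    = size + rnorm(100, sd = 0.3),
  height = size + rnorm(100, sd = 0.3),
  length = size + rnorm(100, sd = 0.3),
  width  = size + rnorm(100, sd = 0.3)
)
pc <- prcomp(turtles, scale = TRUE)
pc$rotation[, 1]        # loadings all close to +/-0.5
sum(pc$rotation[, 1]^2) # always 1: loadings are a unit vector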
Higher-Order Principal Components
* Quiz
Suppose we have a data set where each data point represents a single student's scores on a math test, a physics test, a reading comprehension test, and a vocabulary test.
We find the first two principal components, which capture 90% of the variability in the data, and interpret their loadings. We conclude that the first principal component represents overall academic ability, and the second represents a contrast between quantitative ability and verbal ability.
What loadings would be consistent with that interpretation? Choose all that apply.
[] (0.5, 0.5, 0.5, 0.5) and (0.71, 0.71, 0, 0)
[] (0.5, 0.5, 0.5, 0.5) and (0, 0, -0.71, -0.71)
[x] (0.5, 0.5, 0.5, 0.5) and (0.5, 0.5, -0.5, -0.5)
[x] (0.5, 0.5, 0.5, 0.5) and (-0.5, -0.5, 0.5, 0.5)
[] (0.71, 0.71, 0, 0) and (0, 0, 0.71, 0.71)
[] (0.71, 0, -0.71, 0) and (0, 0.71, 0, -0.71)
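Valid loadings for two principal components must each have unit length and be mutually orthogonal, and the signs must match the interpretation: all the same sign for "overall ability", and opposite signs for the quantitative (math, physics) versus verbal (reading, vocabulary) scores. A quick sanity check in R (a sketch; the vector names are mine):
v1 <- c(0.5, 0.5, 0.5, 0.5)   # "overall academic ability"
v2 <- c(0.5, 0.5, -0.5, -0.5) # "quantitative vs. verbal contrast"
sum(v1^2); sum(v2^2)          # both 1: each is a unit vector
sum(v1 * v2)                  # 0: the pair is orthogonal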
K-Means Clustering
1/ True or False: If we use k-means clustering, we will get the same cluster assignments for each point whether or not we standardize the variables.
[] True
[x] False correct
Hierarchical Clustering
* Quiz
True or False: If we cut the dendrogram at a lower point, we will tend to get more clusters (and cannot get fewer clusters).
[x] True correct
[] False
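Lowering the cut height can only split existing clusters, never merge them. A sketch in R on random data:
# Cut the same dendrogram at two heights; the lower cut never yields
# fewer clusters than the higher one.
set.seed(3)
x <- matrix(rnorm(60), ncol = 2)
hc <- hclust(dist(x), method = "complete")
length(unique(cutree(hc, h = 4))) # fewer clusters at a high cut
length(unique(cutree(hc, h = 1))) # more clusters at a lower cut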
Breast cancer example
In the heat map for the breast cancer data, which of the following depended on the output of hierarchical clustering?
[x] The ordering of the rows
[x] The ordering of the columns
[] The coloring of the cells as red or green
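In R's heatmap() this is easy to see: the row and column orderings come from hierarchical clustering dendrograms, while the cell colors come directly from the data values. A sketch on random data (not the breast cancer set):
# heatmap() clusters rows and columns with hclust by default; the
# red/green coloring is determined only by the matrix entries.
set.seed(4)
m <- matrix(rnorm(200), nrow = 20)
heatmap(m, col = colorRampPalette(c("red", "black", "green"))(50))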
Unsupervised in R
* Quiz
0/ Suppose we want to fit a linear regression, but the number of variables is much larger than the number of observations. In some cases, we may improve the fit by reducing the dimension of the features before fitting the model.
In this problem, we use a data set with n = 300 and p = 200, so we have more observations than variables, but not by much. Load the data x, y, x.test, and y.test from 10.R.RData.
First, concatenate x and x.test using the rbind function and perform a principal components analysis on the concatenated data frame (use the "scale=TRUE" option). To within 10% relative error, what proportion of the variance is explained by the first five principal components?
Answer: 0.3498565
1/ The previous answer suggests that a relatively small number of "latent variables" account for a substantial fraction of the features' variability. We might believe that these latent variables are more important than linear combinations of the features that have low variance.
We can try forgetting about the raw features and using the first five principal components (computed on rbind(x,x.test)) instead as low-dimensional derived features. What is the mean-squared test error if we regress y on the first five principal components, and use the resulting model to predict y.test?
Answer: 1
In the actual data generating model for this example, the features may be noisy proxies for a few latent variables that actually drive the response. This is not an uncommon situation when we have high-dimensional data.
2/ Now, try an OLS linear regression of y on the matrix x. What is the mean squared prediction error if we use the fitted model to predict y.test from x.test?
Answer: 3.657 ; The mean squared error is worse because of the variance involved in fitting a very high-dimensional model. Here, the large-variance directions of x turned out to be the important ones for predicting y. This need not always be the case, but it often is.
-- Run code:
> load("/Users/tungthanhle/Box Sync/MOOCs/Statistical-Learning-Stanford/R session/10.R.RData")
> ls()
[1] "alpha" "alpha.fn" "boot.out"
[4] "cv.error" "cv.error10" "d"
[7] "degree" "df" "df.melt"
[10] "df1" "df2" "df3"
[13] "df4" "df9a1" "df9a2"
[16] "Direction.2005" "esoph_melt" "fit1"
[19] "g" "glm.fit" "glm.pred"
[22] "glm.probs" "hc.average" "hc.complete"
[25] "hc.cut" "hc.single" "km.out"
[28] "legend.key" "legend.key.size" "legend.title"
[31] "loocv" "p" "pca.out"
[34] "regplot" "test" "train"
[37] "which" "x" "x.test"
[40] "xmean" "Xy" "y"
[43] "y.test"
> xvars = rbind(x, x.test) # bind the training and test predictors
> dataset1 = data.frame(xvars) # create a data frame
> pca.out = prcomp(dataset1, scale = TRUE) # principal components analysis
> pca.out$sdev # gives standard deviations
[1] 5.0564664 4.5965404 3.7229223 2.6971281 1.4630811 1.1682700
[7] 1.1584782 1.1554447 1.1459109 1.1393333 1.1361861 1.1318983
[13] 1.1104748 1.1081031 1.1022576 1.0996006 1.0925670 1.0862245
[19] 1.0851878 1.0747535 1.0694183 1.0626451 1.0596740 1.0579027
[25] 1.0543595 1.0497312 1.0461509 1.0451297 1.0355655 1.0293788
[31] 1.0246770 1.0236170 1.0212631 1.0174942 1.0143562 1.0111483
[37] 1.0051017 1.0002184 0.9971100 0.9942072 0.9913248 0.9854912
[43] 0.9803071 0.9796830 0.9771657 0.9738685 0.9681446 0.9576192
[49] 0.9560902 0.9531962 0.9516221 0.9496676 0.9464778 0.9408281
[55] 0.9393607 0.9318866 0.9263620 0.9257830 0.9229648 0.9195461
[61] 0.9177641 0.9152911 0.9101935 0.9082005 0.9019688 0.8981175
[67] 0.8965373 0.8946101 0.8870541 0.8824777 0.8798454 0.8781490
[73] 0.8760034 0.8751288 0.8733763 0.8693101 0.8651374 0.8610424
[79] 0.8569558 0.8562169 0.8548111 0.8498707 0.8464987 0.8451881
[85] 0.8424197 0.8376215 0.8363309 0.8338686 0.8289554 0.8241035
[91] 0.8221352 0.8177683 0.8173337 0.8110224 0.8101821 0.8086392
[97] 0.8031119 0.8009746 0.7976889 0.7960114 0.7931180 0.7894603
[103] 0.7865821 0.7849213 0.7819922 0.7772328 0.7740821 0.7733873
[109] 0.7677353 0.7668402 0.7639293 0.7607650 0.7591061 0.7551551
[115] 0.7548301 0.7502366 0.7477301 0.7439305 0.7408162 0.7380059
[121] 0.7335575 0.7320845 0.7276506 0.7266989 0.7233278 0.7218310
[127] 0.7191747 0.7163773 0.7105583 0.7095616 0.7038146 0.7029900
[133] 0.6993003 0.6973292 0.6937365 0.6912564 0.6903477 0.6885033
[139] 0.6857440 0.6830917 0.6820117 0.6776269 0.6748846 0.6733988
[145] 0.6660594 0.6600808 0.6579576 0.6551212 0.6498976 0.6485745
[151] 0.6471284 0.6455747 0.6444500 0.6401032 0.6363138 0.6345960
[157] 0.6314878 0.6304404 0.6255655 0.6201083 0.6182002 0.6150316
[163] 0.6117353 0.6103122 0.6060084 0.6056634 0.6012351 0.5991488
[169] 0.5948291 0.5906015 0.5859830 0.5836761 0.5811308 0.5745975
[175] 0.5729187 0.5705985 0.5674488 0.5648463 0.5606856 0.5573744
[181] 0.5502935 0.5491188 0.5467863 0.5449206 0.5413659 0.5396379
[187] 0.5353621 0.5264635 0.5244417 0.5207019 0.5153204 0.5124382
[193] 0.5106356 0.5030117 0.5011382 0.4911595 0.4872318 0.4848873
[199] 0.4809934 0.4338230
> screeplot(pca.out) # Scree plot shows variance explained per principal component
> (pca.out$sdev)^2/ sum(pca.out$sdev^2)
[1] 0.1278392623 0.1056409183 0.0693007523 0.0363725007
[5] 0.0107030317 0.0068242745 0.0067103583 0.0066752627
[9] 0.0065655588 0.0064904024 0.0064545937 0.0064059692
[13] 0.0061657709 0.0061394622 0.0060748595 0.0060456069
[17] 0.0059685132 0.0058994183 0.0058881627 0.0057754755
[21] 0.0057182773 0.0056460726 0.0056145445 0.0055957908
[25] 0.0055583696 0.0055096776 0.0054721590 0.0054614799
[29] 0.0053619795 0.0052981033 0.0052498150 0.0052389584
[33] 0.0052148915 0.0051764724 0.0051445927 0.0051121049
[37] 0.0050511467 0.0050021844 0.0049711415 0.0049422401
[41] 0.0049136244 0.0048559649 0.0048050097 0.0047988943
[45] 0.0047742642 0.0047420998 0.0046865195 0.0045851723
[49] 0.0045705424 0.0045429154 0.0045279231 0.0045093423
[53] 0.0044791009 0.0044257876 0.0044119930 0.0043420631
[57] 0.0042907331 0.0042853704 0.0042593199 0.0042278251
[61] 0.0042114544 0.0041887894 0.0041422607 0.0041241409
[65] 0.0040677386 0.0040330751 0.0040188957 0.0040016357
[69] 0.0039343249 0.0038938347 0.0038706398 0.0038557287
[73] 0.0038369096 0.0038292523 0.0038139310 0.0037785000
[77] 0.0037423137 0.0037069703 0.0036718658 0.0036655366
[81] 0.0036535101 0.0036114012 0.0035828003 0.0035717146
[85] 0.0035483547 0.0035080490 0.0034972468 0.0034766843
[89] 0.0034358354 0.0033957331 0.0033795316 0.0033437247
[93] 0.0033401718 0.0032887867 0.0032819748 0.0032694866
[97] 0.0032249440 0.0032078015 0.0031815380 0.0031681707
[101] 0.0031451808 0.0031162379 0.0030935568 0.0030805076
[105] 0.0030575590 0.0030204542 0.0029960152 0.0029906398
[109] 0.0029470877 0.0029402197 0.0029179396 0.0028938172
[113] 0.0028812100 0.0028512962 0.0028488426 0.0028142748
[117] 0.0027955018 0.0027671627 0.0027440431 0.0027232635
[121] 0.0026905334 0.0026797384 0.0026473773 0.0026404564
[125] 0.0026160158 0.0026052002 0.0025860610 0.0025659821
[129] 0.0025244652 0.0025173882 0.0024767750 0.0024709747
[133] 0.0024451047 0.0024313399 0.0024063519 0.0023891770
[137] 0.0023828997 0.0023701842 0.0023512243 0.0023330710
[141] 0.0023257001 0.0022958914 0.0022773462 0.0022673295
[145] 0.0022181756 0.0021785331 0.0021645410 0.0021459191
[149] 0.0021118347 0.0021032444 0.0020938758 0.0020838333
[153] 0.0020765788 0.0020486608 0.0020244761 0.0020135604
[157] 0.0019938841 0.0019872753 0.0019566612 0.0019226713
[161] 0.0019108571 0.0018913195 0.0018711006 0.0018624051
[165] 0.0018362310 0.0018341405 0.0018074183 0.0017948964
[169] 0.0017691084 0.0017440507 0.0017168802 0.0017033890
[173] 0.0016885651 0.0016508115 0.0016411794 0.0016279131
[177] 0.0016099909 0.0015952569 0.0015718418 0.0015533311
[181] 0.0015141146 0.0015076574 0.0014948762 0.0014846925
[185] 0.0014653853 0.0014560455 0.0014330627 0.0013858190
[189] 0.0013751954 0.0013556525 0.0013277753 0.0013129643
[193] 0.0013037434 0.0012651038 0.0012556974 0.0012061884
[197] 0.0011869744 0.0011755787 0.0011567731 0.0009410121
> sum(0.1278392623, 0.1056409183, 0.0693007523, 0.0363725007, 0.0107030317) #Take sum of first five
[1] 0.3498565
> #alternative method is to use the cumulative sum function, which gives the same 0.3498565 on fifth entry
> cumsum((pca.out$sdev)^2) / sum(pca.out$sdev^2)
[1] 0.1278393 0.2334802 0.3027809 0.3391534 0.3498565 0.3566807
[7] 0.3633911 0.3700664 0.3766319 0.3831223 0.3895769 0.3959829
[13] 0.4021487 0.4082881 0.4143630 0.4204086 0.4263771 0.4322765
[19] 0.4381647 0.4439402 0.4496584 0.4553045 0.4609190 0.4665148
[25] 0.4720732 0.4775829 0.4830550 0.4885165 0.4938785 0.4991766
[31] 0.5044264 0.5096654 0.5148803 0.5200567 0.5252013 0.5303134
[37] 0.5353646 0.5403668 0.5453379 0.5502802 0.5551938 0.5600497
[43] 0.5648548 0.5696536 0.5744279 0.5791700 0.5838565 0.5884417
[49] 0.5930122 0.5975552 0.6020831 0.6065924 0.6110715 0.6154973
[55] 0.6199093 0.6242514 0.6285421 0.6328275 0.6370868 0.6413146
[61] 0.6455261 0.6497149 0.6538571 0.6579813 0.6620490 0.6660821
[67] 0.6701010 0.6741026 0.6780369 0.6819308 0.6858014 0.6896571
[73] 0.6934940 0.6973233 0.7011372 0.7049157 0.7086580 0.7123650
[79] 0.7160369 0.7197024 0.7233559 0.7269673 0.7305501 0.7341218
[85] 0.7376702 0.7411782 0.7446755 0.7481522 0.7515880 0.7549837
[91] 0.7583633 0.7617070 0.7650472 0.7683360 0.7716179 0.7748874
[97] 0.7781124 0.7813202 0.7845017 0.7876699 0.7908151 0.7939313
[103] 0.7970249 0.8001054 0.8031629 0.8061834 0.8091794 0.8121700
[109] 0.8151171 0.8180573 0.8209753 0.8238691 0.8267503 0.8296016
[115] 0.8324504 0.8352647 0.8380602 0.8408274 0.8435714 0.8462947
[121] 0.8489852 0.8516650 0.8543123 0.8569528 0.8595688 0.8621740
[127] 0.8647601 0.8673261 0.8698505 0.8723679 0.8748447 0.8773157
[133] 0.8797608 0.8821921 0.8845985 0.8869876 0.8893705 0.8917407
[139] 0.8940919 0.8964250 0.8987507 0.9010466 0.9033239 0.9055913
[145] 0.9078095 0.9099880 0.9121525 0.9142984 0.9164103 0.9185135
[151] 0.9206074 0.9226912 0.9247678 0.9268165 0.9288409 0.9308545
[157] 0.9328484 0.9348357 0.9367923 0.9387150 0.9406259 0.9425172
[163] 0.9443883 0.9462507 0.9480869 0.9499211 0.9517285 0.9535234
[169] 0.9552925 0.9570365 0.9587534 0.9604568 0.9621454 0.9637962
[175] 0.9654374 0.9670653 0.9686753 0.9702705 0.9718424 0.9733957
[181] 0.9749098 0.9764175 0.9779123 0.9793970 0.9808624 0.9823185
[187] 0.9837515 0.9851373 0.9865125 0.9878682 0.9891960 0.9905089
[193] 0.9918127 0.9930778 0.9943335 0.9955397 0.9967266 0.9979022
[199] 0.9990590 1.0000000
> summary(pca.out)
Importance of components:
PC1 PC2 PC3 PC4 PC5
Standard deviation 5.0565 4.5965 3.7229 2.69713 1.4631
Proportion of Variance 0.1278 0.1056 0.0693 0.03637 0.0107
Cumulative Proportion 0.1278 0.2335 0.3028 0.33915 0.3499
PC6 PC7 PC8 PC9 PC10
Standard deviation 1.16827 1.15848 1.15544 1.14591 1.13933
Proportion of Variance 0.00682 0.00671 0.00668 0.00657 0.00649
Cumulative Proportion 0.35668 0.36339 0.37007 0.37663 0.38312
PC11 PC12 PC13 PC14 PC15
Standard deviation 1.13619 1.13190 1.11047 1.10810 1.10226
Proportion of Variance 0.00645 0.00641 0.00617 0.00614 0.00607
Cumulative Proportion 0.38958 0.39598 0.40215 0.40829 0.41436
PC16 PC17 PC18 PC19 PC20
Standard deviation 1.09960 1.09257 1.0862 1.08519 1.07475
Proportion of Variance 0.00605 0.00597 0.0059 0.00589 0.00578
Cumulative Proportion 0.42041 0.42638 0.4323 0.43816 0.44394
PC21 PC22 PC23 PC24 PC25
Standard deviation 1.06942 1.06265 1.05967 1.0579 1.05436
Proportion of Variance 0.00572 0.00565 0.00561 0.0056 0.00556
Cumulative Proportion 0.44966 0.45530 0.46092 0.4665 0.47207
PC26 PC27 PC28 PC29 PC30
Standard deviation 1.04973 1.04615 1.04513 1.03557 1.0294
Proportion of Variance 0.00551 0.00547 0.00546 0.00536 0.0053
Cumulative Proportion 0.47758 0.48306 0.48852 0.49388 0.4992
PC31 PC32 PC33 PC34 PC35
Standard deviation 1.02468 1.02362 1.02126 1.01749 1.01436
Proportion of Variance 0.00525 0.00524 0.00521 0.00518 0.00514
Cumulative Proportion 0.50443 0.50967 0.51488 0.52006 0.52520
PC36 PC37 PC38 PC39 PC40
Standard deviation 1.01115 1.00510 1.0002 0.99711 0.99421
Proportion of Variance 0.00511 0.00505 0.0050 0.00497 0.00494
Cumulative Proportion 0.53031 0.53536 0.5404 0.54534 0.55028
PC41 PC42 PC43 PC44 PC45
Standard deviation 0.99132 0.98549 0.98031 0.9797 0.97717
Proportion of Variance 0.00491 0.00486 0.00481 0.0048 0.00477
Cumulative Proportion 0.55519 0.56005 0.56485 0.5696 0.57443
PC46 PC47 PC48 PC49 PC50
Standard deviation 0.97387 0.96814 0.95762 0.95609 0.95320
Proportion of Variance 0.00474 0.00469 0.00459 0.00457 0.00454
Cumulative Proportion 0.57917 0.58386 0.58844 0.59301 0.59756
PC51 PC52 PC53 PC54 PC55
Standard deviation 0.95162 0.94967 0.94648 0.94083 0.93936
Proportion of Variance 0.00453 0.00451 0.00448 0.00443 0.00441
Cumulative Proportion 0.60208 0.60659 0.61107 0.61550 0.61991
PC56 PC57 PC58 PC59 PC60
Standard deviation 0.93189 0.92636 0.92578 0.92296 0.91955
Proportion of Variance 0.00434 0.00429 0.00429 0.00426 0.00423
Cumulative Proportion 0.62425 0.62854 0.63283 0.63709 0.64131
PC61 PC62 PC63 PC64 PC65
Standard deviation 0.91776 0.91529 0.91019 0.90820 0.90197
Proportion of Variance 0.00421 0.00419 0.00414 0.00412 0.00407
Cumulative Proportion 0.64553 0.64971 0.65386 0.65798 0.66205
PC66 PC67 PC68 PC69 PC70
Standard deviation 0.89812 0.89654 0.8946 0.88705 0.88248
Proportion of Variance 0.00403 0.00402 0.0040 0.00393 0.00389
Cumulative Proportion 0.66608 0.67010 0.6741 0.67804 0.68193
PC71 PC72 PC73 PC74 PC75
Standard deviation 0.87985 0.87815 0.87600 0.87513 0.87338
Proportion of Variance 0.00387 0.00386 0.00384 0.00383 0.00381
Cumulative Proportion 0.68580 0.68966 0.69349 0.69732 0.70114
PC76 PC77 PC78 PC79 PC80
Standard deviation 0.86931 0.86514 0.86104 0.85696 0.85622
Proportion of Variance 0.00378 0.00374 0.00371 0.00367 0.00367
Cumulative Proportion 0.70492 0.70866 0.71237 0.71604 0.71970
PC81 PC82 PC83 PC84 PC85
Standard deviation 0.85481 0.84987 0.84650 0.84519 0.84242
Proportion of Variance 0.00365 0.00361 0.00358 0.00357 0.00355
Cumulative Proportion 0.72336 0.72697 0.73055 0.73412 0.73767
PC86 PC87 PC88 PC89 PC90
Standard deviation 0.83762 0.8363 0.83387 0.82896 0.8241
Proportion of Variance 0.00351 0.0035 0.00348 0.00344 0.0034
Cumulative Proportion 0.74118 0.7447 0.74815 0.75159 0.7550
PC91 PC92 PC93 PC94 PC95
Standard deviation 0.82214 0.81777 0.81733 0.81102 0.81018
Proportion of Variance 0.00338 0.00334 0.00334 0.00329 0.00328
Cumulative Proportion 0.75836 0.76171 0.76505 0.76834 0.77162
PC96 PC97 PC98 PC99 PC100
Standard deviation 0.80864 0.80311 0.80097 0.79769 0.79601
Proportion of Variance 0.00327 0.00322 0.00321 0.00318 0.00317
Cumulative Proportion 0.77489 0.77811 0.78132 0.78450 0.78767
PC101 PC102 PC103 PC104 PC105
Standard deviation 0.79312 0.78946 0.78658 0.78492 0.78199
Proportion of Variance 0.00315 0.00312 0.00309 0.00308 0.00306
Cumulative Proportion 0.79082 0.79393 0.79702 0.80011 0.80316
PC106 PC107 PC108 PC109 PC110
Standard deviation 0.77723 0.7741 0.77339 0.76774 0.76684
Proportion of Variance 0.00302 0.0030 0.00299 0.00295 0.00294
Cumulative Proportion 0.80618 0.8092 0.81217 0.81512 0.81806
PC111 PC112 PC113 PC114 PC115
Standard deviation 0.76393 0.76077 0.75911 0.75516 0.75483
Proportion of Variance 0.00292 0.00289 0.00288 0.00285 0.00285
Cumulative Proportion 0.82098 0.82387 0.82675 0.82960 0.83245
PC116 PC117 PC118 PC119 PC120
Standard deviation 0.75024 0.7477 0.74393 0.74082 0.73801
Proportion of Variance 0.00281 0.0028 0.00277 0.00274 0.00272
Cumulative Proportion 0.83526 0.8381 0.84083 0.84357 0.84629
PC121 PC122 PC123 PC124 PC125
Standard deviation 0.73356 0.73208 0.72765 0.72670 0.72333
Proportion of Variance 0.00269 0.00268 0.00265 0.00264 0.00262
Cumulative Proportion 0.84899 0.85166 0.85431 0.85695 0.85957
PC126 PC127 PC128 PC129 PC130
Standard deviation 0.72183 0.71917 0.71638 0.71056 0.70956
Proportion of Variance 0.00261 0.00259 0.00257 0.00252 0.00252
Cumulative Proportion 0.86217 0.86476 0.86733 0.86985 0.87237
PC131 PC132 PC133 PC134 PC135
Standard deviation 0.70381 0.70299 0.69930 0.69733 0.69374
Proportion of Variance 0.00248 0.00247 0.00245 0.00243 0.00241
Cumulative Proportion 0.87484 0.87732 0.87976 0.88219 0.88460
PC136 PC137 PC138 PC139 PC140
Standard deviation 0.69126 0.69035 0.68850 0.68574 0.68309
Proportion of Variance 0.00239 0.00238 0.00237 0.00235 0.00233
Cumulative Proportion 0.88699 0.88937 0.89174 0.89409 0.89643
PC141 PC142 PC143 PC144 PC145
Standard deviation 0.68201 0.6776 0.67488 0.67340 0.66606
Proportion of Variance 0.00233 0.0023 0.00228 0.00227 0.00222
Cumulative Proportion 0.89875 0.9011 0.90332 0.90559 0.90781
PC146 PC147 PC148 PC149 PC150
Standard deviation 0.66008 0.65796 0.65512 0.64990 0.6486
Proportion of Variance 0.00218 0.00216 0.00215 0.00211 0.0021
Cumulative Proportion 0.90999 0.91215 0.91430 0.91641 0.9185
PC151 PC152 PC153 PC154 PC155
Standard deviation 0.64713 0.64557 0.64445 0.64010 0.63631
Proportion of Variance 0.00209 0.00208 0.00208 0.00205 0.00202
Cumulative Proportion 0.92061 0.92269 0.92477 0.92682 0.92884
PC156 PC157 PC158 PC159 PC160
Standard deviation 0.63460 0.63149 0.63044 0.62557 0.62011
Proportion of Variance 0.00201 0.00199 0.00199 0.00196 0.00192
Cumulative Proportion 0.93085 0.93285 0.93484 0.93679 0.93872
PC161 PC162 PC163 PC164 PC165
Standard deviation 0.61820 0.61503 0.61174 0.61031 0.60601
Proportion of Variance 0.00191 0.00189 0.00187 0.00186 0.00184
Cumulative Proportion 0.94063 0.94252 0.94439 0.94625 0.94809
PC166 PC167 PC168 PC169 PC170
Standard deviation 0.60566 0.60124 0.59915 0.59483 0.59060
Proportion of Variance 0.00183 0.00181 0.00179 0.00177 0.00174
Cumulative Proportion 0.94992 0.95173 0.95352 0.95529 0.95704
PC171 PC172 PC173 PC174 PC175
Standard deviation 0.58598 0.5837 0.58113 0.57460 0.57292
Proportion of Variance 0.00172 0.0017 0.00169 0.00165 0.00164
Cumulative Proportion 0.95875 0.9605 0.96215 0.96380 0.96544
PC176 PC177 PC178 PC179 PC180
Standard deviation 0.57060 0.56745 0.5648 0.56069 0.55737
Proportion of Variance 0.00163 0.00161 0.0016 0.00157 0.00155
Cumulative Proportion 0.96707 0.96868 0.9703 0.97184 0.97340
PC181 PC182 PC183 PC184 PC185
Standard deviation 0.55029 0.54912 0.54679 0.54492 0.54137
Proportion of Variance 0.00151 0.00151 0.00149 0.00148 0.00147
Cumulative Proportion 0.97491 0.97642 0.97791 0.97940 0.98086
PC186 PC187 PC188 PC189 PC190
Standard deviation 0.53964 0.53536 0.52646 0.52444 0.52070
Proportion of Variance 0.00146 0.00143 0.00139 0.00138 0.00136
Cumulative Proportion 0.98232 0.98375 0.98514 0.98651 0.98787
PC191 PC192 PC193 PC194 PC195
Standard deviation 0.51532 0.51244 0.5106 0.50301 0.50114
Proportion of Variance 0.00133 0.00131 0.0013 0.00127 0.00126
Cumulative Proportion 0.98920 0.99051 0.9918 0.99308 0.99433
PC196 PC197 PC198 PC199 PC200
Standard deviation 0.49116 0.48723 0.48489 0.48099 0.43382
Proportion of Variance 0.00121 0.00119 0.00118 0.00116 0.00094
Cumulative Proportion 0.99554 0.99673 0.99790 0.99906 1.00000
> # Q.2: Regress y on the first five principal components (computed on
> # rbind(x, x.test)) and compute the mean-squared test error on y.test.
> xols <- pca.out$x[1:300, 1:5] # PC scores for the 300 training observations
> fit0 <- lm(y ~ xols)
> summary(fit0)
Call:
lm(formula = y ~ xols)
Residuals:
Min 1Q Median 3Q Max
-3.3289 -0.6992 0.0319 0.8075 2.5240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.09541 0.06107 1.562 0.119314
xolsPC1 0.07608 0.01159 6.564 2.36e-10 ***
xolsPC2 -0.02276 0.01314 -1.732 0.084309 .
xolsPC3 -0.04023 0.01538 -2.616 0.009352 **
xolsPC4 -0.06368 0.02237 -2.847 0.004722 **
xolsPC5 -0.16069 0.04299 -3.738 0.000223 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.056 on 294 degrees of freedom
Multiple R-squared: 0.1906, Adjusted R-squared: 0.1769
F-statistic: 13.85 on 5 and 294 DF, p-value: 3.704e-12
> yhat0 = predict(fit0, x.test) # bug: fit0 was fit on the matrix xols, so newdata is ignored
Warning message:
'newdata' had 1000 rows but variables found have 300 rows
> mean((yhat0-y.test)^2)
[1] 1.413063
Warning message:
In yhat0 - y.test :
longer object length is not a multiple of shorter object length
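# The two warnings above mean predict() ignored x.test and returned the
# 300 training fits, so 1.413063 is not a real test error. A corrected
# sketch (variable names are mine): keep the PC scores in data frames
# with matching column names, then predict on the 1000 test rows.
pcs.train <- data.frame(pca.out$x[1:300, 1:5])    # rows 1-300 are x
pcs.test  <- data.frame(pca.out$x[301:1300, 1:5]) # rows 301-1300 are x.test
fit.pcr <- lm(y ~ ., data = pcs.train)
yhat.pcr <- predict(fit.pcr, newdata = pcs.test)
mean((yhat.pcr - y.test)^2) # should come out near the quiz answer of 1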
> fit <- lm(y ~ ., data = x) # OLS of y on all 200 raw features
> summary(fit)
Call:
lm(formula = y ~ ., data = x)
Residuals:
Min 1Q Median 3Q Max
-2.07787 -0.39188 -0.01094 0.46463 2.07281
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2038165 0.1084088 1.880 0.06304 .
X1 0.0956991 0.1083461 0.883 0.37923
X2 -0.0768577 0.0905108 -0.849 0.39784
X3 -0.1450540 0.1006050 -1.442 0.15251
X4 -0.1129411 0.0977190 -1.156 0.25056
X5 0.1998471 0.1028714 1.943 0.05489 .
X6 0.0995771 0.1132813 0.879 0.38152
X7 -0.0597859 0.1082694 -0.552 0.58206
X8 0.0943231 0.1134184 0.832 0.40761
X9 -0.1473773 0.0997019 -1.478 0.14253
X10 -0.0678923 0.1179652 -0.576 0.56624
X11 0.1484546 0.1060616 1.400 0.16473
X12 0.1061323 0.1119930 0.948 0.34561
X13 -0.0478415 0.0997639 -0.480 0.63261
X14 -0.1972534 0.1166065 -1.692 0.09386 .
X15 0.1791277 0.1011492 1.771 0.07965 .
X16 0.0628750 0.1145631 0.549 0.58436
X17 0.0485600 0.1023332 0.475 0.63617
X18 0.1903937 0.1062388 1.792 0.07617 .
X19 -0.1722703 0.1022425 -1.685 0.09515 .
X20 -0.2073261 0.0918588 -2.257 0.02621 *
X21 -0.0036302 0.1181700 -0.031 0.97555
X22 -0.1129806 0.1121757 -1.007 0.31631
X23 0.0923017 0.1201585 0.768 0.44422
X24 0.3016380 0.1136175 2.655 0.00925 **
X25 -0.0468127 0.1221786 -0.383 0.70243
X26 -0.0017065 0.1148268 -0.015 0.98817
X27 -0.1411393 0.1041841 -1.355 0.17859
X28 0.0285920 0.1065200 0.268 0.78893
X29 0.0237577 0.1093413 0.217 0.82844
X30 0.2165345 0.1053510 2.055 0.04248 *
X31 -0.0848819 0.1105101 -0.768 0.44426
X32 0.0720145 0.1049882 0.686 0.49436
X33 0.1120851 0.1057827 1.060 0.29191
X34 -0.1106502 0.0980774 -1.128 0.26197
X35 0.0884572 0.1164613 0.760 0.44933
X36 -0.1026070 0.1063570 -0.965 0.33703
X37 0.1097935 0.1074168 1.022 0.30921
X38 0.0768076 0.1022340 0.751 0.45426
X39 0.0054946 0.1086705 0.051 0.95978
X40 0.0550993 0.1040337 0.530 0.59755
X41 -0.1938989 0.1126176 -1.722 0.08824 .
X42 0.0344070 0.1087861 0.316 0.75245
X43 -0.0238439 0.1154725 -0.206 0.83683
X44 -0.1549566 0.1021085 -1.518 0.13231
X45 0.1881682 0.1122008 1.677 0.09668 .
X46 0.0405429 0.0991426 0.409 0.68347
X47 -0.0910576 0.1065085 -0.855 0.39465
X48 0.0466373 0.1059641 0.440 0.66081
X49 -0.0617196 0.0922411 -0.669 0.50498
X50 0.0596620 0.1091763 0.546 0.58597
X51 -0.0434226 0.1042990 -0.416 0.67807
X52 0.2683813 0.1145112 2.344 0.02109 *
X53 0.0489790 0.1164433 0.421 0.67494
X54 0.0111122 0.1049820 0.106 0.91592
X55 -0.1492262 0.1219943 -1.223 0.22415
X56 0.1014138 0.1054024 0.962 0.33831
X57 -0.0943702 0.1168626 -0.808 0.42130
X58 -0.0054378 0.1076259 -0.051 0.95981
X59 -0.0273607 0.1050842 -0.260 0.79512
X60 0.1303712 0.0915173 1.425 0.15743
X61 -0.0759084 0.1035465 -0.733 0.46524
X62 -0.1552165 0.0965579 -1.607 0.11113
X63 -0.0593877 0.1174564 -0.506 0.61425
X64 0.0379123 0.1078096 0.352 0.72584
X65 0.0074897 0.1024193 0.073 0.94185
X66 -0.0156056 0.1166708 -0.134 0.89387
X67 0.2668173 0.1068497 2.497 0.01417 *
X68 0.0892880 0.1050937 0.850 0.39760
X69 -0.0107980 0.1137296 -0.095 0.92455
X70 0.1842993 0.1116974 1.650 0.10211
X71 0.0785456 0.1126727 0.697 0.48737
X72 -0.0682916 0.1125840 -0.607 0.54552
X73 -0.1886139 0.1080375 -1.746 0.08394 .
X74 0.0001625 0.1019790 0.002 0.99873
X75 0.0185058 0.1007334 0.184 0.85462
X76 0.1129122 0.1024367 1.102 0.27302
X77 0.0319910 0.1054799 0.303 0.76231
X78 0.1498720 0.1034477 1.449 0.15056
X79 -0.1118129 0.1021143 -1.095 0.27618
X80 0.1393635 0.0942044 1.479 0.14222
X81 0.1237867 0.1154491 1.072 0.28623
X82 -0.1098310 0.1010234 -1.087 0.27960
X83 -0.0546254 0.1146525 -0.476 0.63481
X84 0.0167891 0.0940203 0.179 0.85864
X85 0.1235006 0.1206492 1.024 0.30850
X86 -0.0863701 0.1013823 -0.852 0.39631
X87 -0.1943033 0.0979702 -1.983 0.05011 .
X88 0.0260281 0.0967675 0.269 0.78851
X89 0.0537113 0.1027550 0.523 0.60234
X90 0.1756692 0.1116231 1.574 0.11873
X91 -0.0270830 0.0985053 -0.275 0.78394
X92 0.0845792 0.1086375 0.779 0.43810
X93 0.1764917 0.1088027 1.622 0.10796
X94 -0.0659248 0.0985649 -0.669 0.50515
X95 -0.0129704 0.1120624 -0.116 0.90809
X96 -0.0436070 0.0936786 -0.465 0.64260
X97 0.0398414 0.1238983 0.322 0.74846
X98 -0.0441317 0.1022291 -0.432 0.66690
X99 -0.1223492 0.1226529 -0.998 0.32094
X100 0.0403547 0.1075952 0.375 0.70842
X101 0.0417642 0.0996903 0.419 0.67617
X102 -0.1840968 0.1154864 -1.594 0.11410
X103 -0.1982751 0.1051755 -1.885 0.06234 .
X104 0.1093796 0.1032898 1.059 0.29220
X105 -0.1108043 0.1120751 -0.989 0.32524
X106 0.0236541 0.1114778 0.212 0.83240
X107 -0.2165246 0.1064929 -2.033 0.04471 *
X108 -0.0733669 0.0986834 -0.743 0.45897
X109 0.1511465 0.1105149 1.368 0.17452
X110 -0.0411139 0.1131549 -0.363 0.71712
X111 -0.0013255 0.1143815 -0.012 0.99078
X112 -0.0338146 0.1161045 -0.291 0.77148
X113 -0.1075306 0.1042257 -1.032 0.30472
X114 0.0781963 0.1062654 0.736 0.46356
X115 0.1904687 0.1115838 1.707 0.09096 .
X116 0.1119430 0.1213317 0.923 0.35845
X117 -0.0598707 0.0967647 -0.619 0.53752
X118 0.0412772 0.1059136 0.390 0.69758
X119 -0.1395021 0.0998264 -1.397 0.16540
X120 -0.0214465 0.0952355 -0.225 0.82229
X121 -0.0061630 0.1136857 -0.054 0.95688
X122 0.1363865 0.1092217 1.249 0.21471
X123 -0.0715526 0.1069072 -0.669 0.50486
X124 -0.0094576 0.1009250 -0.094 0.92553
X125 -0.0527586 0.1184331 -0.445 0.65695
X126 -0.1747779 0.1083930 -1.612 0.11005
X127 -0.1506615 0.0950797 -1.585 0.11625
X128 -0.1332091 0.1074574 -1.240 0.21804
X129 -0.0227213 0.1078456 -0.211 0.83357
X130 -0.0066388 0.1065831 -0.062 0.95046
X131 0.0056137 0.1008164 0.056 0.95571
X132 -0.1126923 0.1058529 -1.065 0.28964
X133 -0.0478067 0.1131040 -0.423 0.67345
X134 -0.0187423 0.1062751 -0.176 0.86037
X135 -0.0126841 0.1186105 -0.107 0.91505
X136 0.0566899 0.1196393 0.474 0.63666
X137 -0.1878898 0.0999338 -1.880 0.06303 .
X138 -0.0499201 0.0939563 -0.531 0.59639
X139 0.0604688 0.0913343 0.662 0.50947
X140 0.0812187 0.1059396 0.767 0.44511
X141 0.0797878 0.0993465 0.803 0.42383
X142 -0.0936362 0.1125129 -0.832 0.40728
X143 0.0724414 0.1023761 0.708 0.48086
X144 -0.0496059 0.1073318 -0.462 0.64497
X145 -0.1134710 0.1133926 -1.001 0.31942
X146 0.0251678 0.1082066 0.233 0.81656
X147 -0.0618058 0.0991916 -0.623 0.53465
X148 0.0630591 0.1006678 0.626 0.53249
X149 0.1186819 0.1070596 1.109 0.27031
X150 0.2332173 0.1180847 1.975 0.05105 .
X151 -0.0522802 0.1047342 -0.499 0.61877
X152 -0.0024918 0.1146501 -0.022 0.98270
X153 -0.0400985 0.1056010 -0.380 0.70497
X154 -0.0186653 0.1067954 -0.175 0.86161
X155 -0.0713475 0.0961410 -0.742 0.45978
X156 0.0931428 0.0968907 0.961 0.33873
X157 0.1201675 0.1245571 0.965 0.33702
X158 0.0977367 0.1123059 0.870 0.38626
X159 -0.0424229 0.1134814 -0.374 0.70933
X160 -0.1991074 0.1060539 -1.877 0.06341 .
X161 0.0940010 0.1078042 0.872 0.38534
X162 0.0565452 0.1026372 0.551 0.58293
X163 -0.0123934 0.0958702 -0.129 0.89740
X164 0.0930203 0.1087025 0.856 0.39421
X165 -0.0537499 0.1082785 -0.496 0.62071
X166 -0.1673161 0.1210329 -1.382 0.16996
X167 -0.2352736 0.1100655 -2.138 0.03501 *
X168 0.0784007 0.1240198 0.632 0.52874
X169 0.1044123 0.1061787 0.983 0.32783
X170 -0.1811877 0.1059420 -1.710 0.09035 .
X171 0.1611620 0.0952439 1.692 0.09377 .
X172 -0.0506896 0.1122349 -0.452 0.65252
X173 -0.3030179 0.1010782 -2.998 0.00344 **
X174 -0.1222357 0.0984114 -1.242 0.21714
X175 0.0133538 0.0937384 0.142 0.88701
X176 -0.0488612 0.1024626 -0.477 0.63451
X177 -0.1311311 0.1177862 -1.113 0.26828
X178 -0.0731235 0.1077644 -0.679 0.49901
X179 -0.1856217 0.1228441 -1.511 0.13396
X180 -0.2979048 0.1177915 -2.529 0.01301 *
X181 -0.0012166 0.1162361 -0.010 0.99167
X182 -0.2258966 0.1081018 -2.090 0.03921 *
X183 -0.1205862 0.1113622 -1.083 0.28151
X184 -0.0673809 0.1006523 -0.669 0.50477
X185 -0.0071448 0.1259900 -0.057 0.95489
X186 -0.0614695 0.1172649 -0.524 0.60132
X187 -0.0305385 0.1047558 -0.292 0.77126
X188 -0.0613496 0.1057987 -0.580 0.56332
X189 -0.1565336 0.0995340 -1.573 0.11899
X190 -0.0196097 0.0999919 -0.196 0.84492
X191 -0.0235088 0.1087213 -0.216 0.82925
X192 0.0319547 0.0969477 0.330 0.74239
X193 -0.0226363 0.1002944 -0.226 0.82190
X194 0.0274039 0.1121933 0.244 0.80754
X195 0.0265554 0.1049318 0.253 0.80074
X196 -0.0802388 0.1082499 -0.741 0.46030
X197 0.0732440 0.1040338 0.704 0.48306
X198 -0.0222079 0.1008502 -0.220 0.82616
X199 -0.0403121 0.1030368 -0.391 0.69646
[ reached getOption("max.print") -- omitted 1 row ]
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.072 on 99 degrees of freedom
Multiple R-squared: 0.7195, Adjusted R-squared: 0.1527
F-statistic: 1.269 on 200 and 99 DF, p-value: 0.09151
> yhat = predict(fit, newdata=x.test) # Use linear model and new dataset to get the predicted values, yhat
> mean((yhat-y.test)^2) # mean squared prediction error on the test set
[1] 3.657197
4/ K-means is a seemingly complicated clustering algorithm. Here is a simpler one:
Given k, the number of clusters, and n, the number of observations, try all possible assignments of the n observations into k clusters. Then, select one of the assignments that minimizes Within-Cluster Variation as defined on page 30.
Assume that you implemented the most naive version of the above algorithm. Here, by naive we mean that you try all possible assignments even though some of them might be redundant (for example, the algorithm tries assigning all of the observations to cluster 1 and it also tries to assign them all to cluster 2 even though those are effectively the same solution).
In terms of n and k, how many potential solutions will your algorithm try?
Answer: k^n
For each of the n observations there are k possible cluster assignments, and the assignments are made independently, so there are k^n potential solutions.
Note: this exponential explosion in the number of potential solutions is the reason we need greedy algorithms like K-means to perform clustering.
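For small n and k we can verify the count by brute force in R (a sketch; the names are mine):
# Enumerate every assignment of n observations to k clusters; the number
# of enumerated rows is exactly k^n (here 3^4 = 81).
n <- 4; k <- 3
assignments <- expand.grid(rep(list(1:k), n))
nrow(assignments) # 81
k^n               # 81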