Solutions to ISLR (An Introduction to Statistical Learning) Exercises

Chapter 2

1,

a) Better. With an extremely large sample size, a flexible method can fit the data closely without overfitting, so it will generally outperform an inflexible one.

b) Worse. With only a small number of observations, a more flexible statistical method is more likely to overfit.

c) Better. A highly non-linear relationship needs a flexible method to capture it; an inflexible method would have high bias.

d) Worse. When the variance of the error terms is extremely high, a more flexible method will chase the noise and overfit.

2,

a) Regression and inference. n = 500 (the top 500 firms in the US); p = 3 (profit, number of employees, industry). The response is CEO salary.

b) Classification and prediction. n = 20 (products that were previously launched); p = 13 (price, marketing budget, competition price, and ten other variables).

c) Regression and prediction. n = 52 (weekly data for all of 2012); p = 3 (% change in the US market, % change in the British market, % change in the German market).

3,

a) Bayes error: a fixed horizontal line parallel to the x-axis, since it is a constant that does not depend on flexibility.

Training error: decreases steadily as flexibility increases, since the method fits the training data more and more closely.

Test error: U-shaped; it first decreases, then increases again past a point near the Bayes error line, due to overfitting.

Variance: starts low and rises steadily, because the more closely a method fits the training data, the more its fit changes when the training set changes.

Squared bias: starts high and decreases, since a more flexible method can approximate the true f more closely.

b) As explained for each curve in (a).

4,

a) i. Recognizing the digit in an image (prediction). ii. Judging from historical data whether a project will succeed (prediction). iii. A recommendation system that categorizes users (prediction).

b) i. Fitting a function to given data in order to understand how the predictors affect the response (inference). ii. Predicting future stock prices (prediction). iii. Estimating the likely price of a house next year (prediction).

c) i. Finding subtypes of a virus. ii. Identifying the likely geographical boundaries between groups holding different political beliefs. iii. Grouping users by their taste in books.

5,

Advantages: a very flexible approach can capture complex, non-linear relationships and fit the data better. Disadvantages: it overfits easily when the data are scarce; it has high variance; and it requires more computation to estimate its parameters.

A non-linear and complicated model calls for a more flexible approach; otherwise a less flexible approach is preferred.

6,

Parametric statistical learning methods (assume a functional form for f and reduce the problem to estimating a fixed set of parameters):

pros: simpler, faster, need less data; cons: constrained by the assumed form, limited complexity, poor fit when the assumption is far from the true f.

Non-parametric statistical learning methods (make no explicit assumption about the functional form of f):

pros: flexibility, power, performance; cons: need more data, slower, prone to overfitting.

7,

a) Obs. 1: 3; Obs. 2: 2; Obs. 3: √10 ≈ 3.16; Obs. 4: √5 ≈ 2.24; Obs. 5: √2 ≈ 1.41; Obs. 6: √3 ≈ 1.73.

b) Green, since Obs. 5 is the closest point to the test point (0, 0, 0).

c) Red, since Obs. 2, 5, and 6 are the three closest points to the test point, and two of them (2 and 6) are Red.
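
As a quick numerical check, the distances and the KNN votes can be reproduced in R (a minimal sketch; the data frame is typed in by hand from the exercise table):

# Exercise 7 data: three predictors and the class label
df <- data.frame(X1 = c(0, 2, 0, 0, -1, 1),
                 X2 = c(3, 0, 1, 1, 0, 1),
                 X3 = c(0, 0, 3, 2, 1, 1),
                 Y  = c("Red", "Red", "Red", "Green", "Green", "Red"))
# Euclidean distance from each observation to the test point (0, 0, 0)
d <- sqrt(df$X1^2 + df$X2^2 + df$X3^2)
round(d, 2)
df$Y[order(d)[1]]    # K = 1: nearest neighbour is Obs. 5 -> Green
df$Y[order(d)[1:3]]  # K = 3: neighbours 5, 6, 2 vote Green, Red, Red -> Red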

9,
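
For reference, a minimal sketch of how the data might have been loaded (an assumption; Auto.csv is the file from the book's website). Read this way, the "?" entries keep horsepower from being parsed as a numeric column, which matches the note in (a):

auto <- read.csv("Auto.csv", header = TRUE)  # horsepower contains "?" entries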

a) str(auto)  # all variables are quantitative except name and horsepower
b) summary(auto[, -c(4,9)])

c) sapply(auto[, -c(4, 9)], mean)

sapply(auto[, -c(4, 9)], sd)

         mpg    cylinders displacement       weight acceleration
   7.8258039    1.7015770  104.3795833  847.9041195    2.7499953
        year       origin
   3.6900049    0.8025495

d)

sapply(auto[-c(10:85), -c(4, 9)], range)  # with the 10th through 85th observations removed

sapply(auto[-c(10:85), -c(4, 9)], mean)

sapply(auto[-c(10:85), -c(4, 9)], sd)

Chapter 3

1,

The null hypotheses are that the predictors "TV", "radio", and "newspaper" individually have no effect on sales, i.e. that their coefficients are zero.
Conclusion: the p-values for "TV" and "radio" are highly significant, so we reject the null hypothesis for them; they very probably have an effect on sales. The p-value for "newspaper" is large, so we cannot reject the null hypothesis that newspaper advertising has no effect.

2,

The KNN classifier is typically used for classification problems (those with a qualitative response): it identifies the K observations nearest to x0 and estimates the conditional probability P(Y = j | X = x0) for class j as the fraction of those neighbours whose response equals j. The KNN regression method is used for regression problems (those with a quantitative response): it again identifies the K observations nearest to x0 and estimates f(x0) as the average of their training responses.
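
A minimal sketch contrasting the two on toy data (the class package ships with R; FNN is a CRAN package, and all variable names here are made up for illustration):

# KNN classification vs. KNN regression on simulated data
set.seed(1)
x  <- matrix(rnorm(100), ncol = 1)
yc <- factor(ifelse(x > 0, "A", "B"))          # qualitative response
yr <- 2 * as.vector(x) + rnorm(100, sd = 0.1)  # quantitative response
x0 <- matrix(0.5)                              # test point

# Classifier: majority vote among the K nearest neighbours of x0
class::knn(train = x, test = x0, cl = yc, k = 5)
# Regression: average of the K nearest training responses
FNN::knn.reg(train = x, test = x0, y = yr, k = 5)$pred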

3,

a) iii is right: at fixed IQ and GPA, the female-minus-male salary gap is 35 - 10 * GPA, which is negative once GPA exceeds 3.5, so males earn more on average provided the GPA is high enough.

b) 137.1k: 50 + 20 * 4.0 + 0.07 * 110 + 35 + 0.01 * 4.0 * 110 - 10 * 4.0 = 137.1 (thousand dollars).

c) False. The size of a coefficient says nothing by itself about significance; to decide whether the interaction has an effect we should look at its p-value (or t-statistic).

4,

a) The training RSS of the cubic regression will be lower (or at least no higher), since its extra flexibility lets it follow the training data more closely even though the true relationship is linear.

b) The reverse: the test RSS of the linear regression will be lower, since the cubic model's extra flexibility mostly fits noise and hurts on new data.

c) The polynomial regression has lower training RSS than the linear fit, again because of its higher flexibility.

d) There is not enough information to tell: it depends on which model is closer to the true (unknown) relationship.

6,
Since b0 = ave(y) - b1 * ave(x), we have f(ave(x)) = b0 + b1 * ave(x) = ave(y) - b1 * ave(x) + b1 * ave(x) = ave(y), so the least squares line always passes through the point (ave(x), ave(y)).
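
A quick numerical check on simulated data (a sketch; the names are arbitrary):

set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
predict(fit, data.frame(x = mean(x)))  # fitted value at ave(x) ...
mean(y)                                # ... equals ave(y)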

7,
a)
library(ISLR)
attach(Auto)
lm.fit = lm(mpg~horsepower)
summary(lm.fit)
i.
Since the p-value is essentially zero (highly significant), we reject the null hypothesis: there is a relationship between the predictor and the response.
ii.
R^2 is 0.6059, which indicates that 60.59% of the variability of mpg can be explained by horsepower.
iii.
Negative: the coefficient of horsepower is negative, so mpg decreases as horsepower increases.
iv.
predict(lm.fit, data.frame(horsepower=98), interval="confidence")
       fit      lwr      upr
1 24.46708 23.97308 24.96108

predict(lm.fit, data.frame(horsepower=98), interval="prediction")
       fit     lwr      upr
1 24.46708 14.8094 34.12476

As expected, the prediction interval is wider than the confidence interval, since it also accounts for the irreducible error of an individual response.
b)
plot(horsepower,mpg)
abline(lm.fit)

10,
c) Sales = 13.0434689 - 0.0544588 * Price - 0.0219162 * Urban + 1.2005727 * US + ε,
with Urban = 1 if the store is in an urban location and 0 if not, and US = 1 if the store is in the US and 0 if not.
d) Price and US: their p-values are small, so we can reject H0: βj = 0 for them (but not for Urban).
f) The model in (e) explains about 23.93% of the variance of Sales.
g)
confint(fit2)
                  2.5 %      97.5 %
(Intercept) 11.79032020 14.27126531
Price       -0.06475984 -0.04419543
USYes        0.69151957  1.70776632
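
For reference, a minimal sketch of the fits behind (c)-(g), assuming the Carseats data from the ISLR package (fit2 is the reduced model whose confidence intervals are shown above):

library(ISLR)
fit1 <- lm(Sales ~ Price + Urban + US, data = Carseats)  # full model for (a)-(d)
fit2 <- lm(Sales ~ Price + US, data = Carseats)          # reduced model for (e)-(g)
summary(fit2)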

11,

c) We obtain the same value for the t-statistic and consequently the same p-value. Both results in (a) and (b) reflect the same line created in (a); in other words, y = 2x + ε could also be written as x = 0.5(y - ε).

15,

c)

# Slopes from the thirteen simple regressions fitted in (a)
fits <- list(fit.zn, fit.indus, fit.chas, fit.nox, fit.rm, fit.age,
             fit.dis, fit.rad, fit.tax, fit.ptratio, fit.black, fit.lstat,
             fit.medv)
simple.reg <- sapply(fits, function(f) coef(f)[2])
# Coefficients from the multiple regression in (b), dropping the intercept
mult.reg <- coef(fit.all)[-1]
# Plot each predictor's simple-regression slope against its multiple-regression coefficient
plot(simple.reg, mult.reg, col = "red")

Chapter 4

4,

a)

For X in [0.05, 0.95] we use the observations in [X - 0.05, X + 0.05], i.e. 10% of them.

For X < 0.05 we use [0, X + 0.05], a fraction of (100X + 5)%, which averages 7.5% over that region.

For X > 0.95 we use [X - 0.05, 1], a fraction of (105 - 100X)%, which also averages 7.5%.

So, with X uniform on [0, 1], the average fraction of the available observations used to make the prediction is

10% * 0.9 + 7.5% * 0.05 + 7.5% * 0.05 = 9% + 0.375% + 0.375% = 9.75%.

b)

0.0975 * 0.0975 = 0.00950625, i.e. about 0.95% of the available observations (assuming X1 and X2 are independent).

c)

0.0975^100 ≈ 0, i.e. essentially none of the observations.

d)

lim_{p→∞} 0.0975^p = 0: as p grows, the fraction of training observations "near" any test point shrinks exponentially, which is the curse of dimensionality for KNN.

e)

For p = 1 the side length is 0.10; for p = 2 it is 0.10^(1/2) ≈ 0.316; for p = 100 it is 0.10^(1/100) ≈ 0.977. In high dimensions, a hypercube containing just 10% of the observations must span nearly the entire range of every feature.
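
These values are easy to check numerically in R:

p <- c(1, 2, 100)
0.0975^p      # (b)-(d): fraction of observations available near a test point
0.10^(1 / p)  # (e): side length of a hypercube covering 10% of the observations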

5,

a) QDA will perform better on the training set (it is more flexible), while LDA will perform better on the test set, since the Bayes boundary is linear.

b) QDA on both the training and the test set, since the Bayes boundary is non-linear.

c) Improve, in general: as the training set grows, the variance cost of QDA's extra flexibility shrinks, so its test accuracy relative to LDA improves (though if the true boundary is exactly linear, LDA may still win).

d) False. If the Bayes decision boundary is linear, QDA's extra flexibility adds variance without reducing bias, so even with a large training set it is unlikely to achieve a superior test error rate to LDA.

6,

a)

P = exp(-6 + 0.05 * 40 + 1 * 3.5) / (1 + exp(-6 + 0.05 * 40 + 1 * 3.5)) = exp(-0.5) / (1 + exp(-0.5)) ≈ 0.3775.

b)

We need P = 0.5, i.e. log-odds of zero: -6 + 0.05 * x + 3.5 * 1 = 0 --> x = 50 hours.
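
A one-line check in R (plogis is the standard logistic function):

plogis(-6 + 0.05 * 40 + 1 * 3.5)  # (a): 0.3775
plogis(-6 + 0.05 * 50 + 1 * 3.5)  # (b): 0.5, so 50 hours gives a 50% chance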

7,

P(dividend | X = 4) = 0.8 * f_yes(4) / (0.8 * f_yes(4) + 0.2 * f_no(4)), where f_yes and f_no are normal densities with means 10 and 0 and common variance 36. The normalizing constants cancel, leaving 0.8 * exp(-(4 - 10)^2 / 72) / (0.8 * exp(-(4 - 10)^2 / 72) + 0.2 * exp(-4^2 / 72)) ≈ 0.752.
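
Checked numerically in R with the normal density dnorm:

num <- 0.8 * dnorm(4, mean = 10, sd = 6)       # P(yes) * f_yes(4)
den <- num + 0.2 * dnorm(4, mean = 0, sd = 6)  # total density at X = 4
num / den                                      # 0.752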

8,

KNN with K = 1 has a 0% training error rate, so if its error rate averaged over the training and test sets is 18%, its test error rate must be 36%. We should therefore choose logistic regression, whose test error rate is only 30%.

9,

a)

p(x) / (1 - p(x)) = 0.37 --> p(x) = 0.37 / 1.37 ≈ 0.27

b)

16% / (1 - 16%) = 0.16 / 0.84 ≈ 0.19

10,

b) Yes, Lag2 appears to be statistically significant, since its p-value is below 0.05.
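
For reference, a sketch of the logistic regression fit used in (b) and below (assuming the Weekly data from the ISLR package; note that the name lm.fit here actually holds a glm):

library(ISLR)
attach(Weekly)
# logistic regression of Direction on the five lag variables and Volume
lm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
              family = binomial, data = Weekly)
summary(lm.fit)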

c)

pred = predict(lm.fit, type="response")

lm.pred = rep("Down", length(pred))

lm.pred[pred>0.5] = "Up"

table(lm.pred, Direction)

We may conclude that the percentage of correct predictions on the training data is (54 + 557)/1089 ≈ 56.1%; in other words, the training error rate is about 43.9%, which is often overly optimistic. We can also say that for weeks when the market goes up, the model is right 557/(48 + 557) ≈ 92.1% of the time, while for weeks when the market goes down it is right only 54/(54 + 430) ≈ 11.2% of the time.

Chapter 5

The solutions to Chapter 5 were lost due to a mishap of mine on CNBlog. The exercises in Chapter 5 are not complicated, so I will not rewrite them.

Chapter 6
