R语言练习

R语言练习题

1.请将数据hw1_a和hw1_b分别读入R，查看数据并指出各个变量的形式，最小值，最大值，中值，均值，标准差。

解（1）准备工作

将hw1_a及hw1_b用wps office打开后另存为xls 格式，放入F盘homework1文件夹里面。

（2）编写代码

library(readxl)

library(tidyverse)

library(modelr)

hw1_a <- read_xls("F:/homework1/hw1_a.xls")

hw1_b<- read_xls("F:/homework1/hw1_b.xls")

view(hw1_a)

#运行结果，Income数据有问题，建议用这个代码读取数据：

hw1_a <- read_csv("F:/homework1/hw1_a.csv")

hw1_b <- read_csv("F:/homework1/hw1_b.csv")

view(hw1_a)

#或者

str(hw1_a)

#运行结果

view(hw1_b)

#运行结果

summary(hw1_a)

#查看hw1_a各个变量的最小值、最大值、均值、中值（中位数），运行结果

sd(hw1_a$Income)

#运行结果

summary(hw1_b)

#查看hw1_b各个变量的最小值、最大值、均值、中值（中位数），运行结果

#或者

install.packages("psych")

library(psych)

describe(hw1_a)

#运行结果

2. 结合上课我们所学的几种数据join 的形式，将两个数据集进行合并。对于每种数据合并的方式，请说明key, 并且报告合并后的数据样本总行数。

解 #内连接,key=ID

inner_join(hw1_a,hw1_b,by = "ID")

#左连接，保留hw1_a中的所有观测,key=ID

left_join(hw1_a,hw1_b,by = "ID")

#右连接，保留hw1_b中的所有观测,key=ID

right_join(hw1_a,hw1_b,by = "ID")

#全连接，保留hw1_a、hw1_b中的所有观测,key=ID

full_join(hw1_a,hw1_b,by = "ID")

#保留hw1_a中与hw1_b中的观测相匹配的所有观测,key=ID

semi_join(hw1_a,hw1_b,by = "ID")

#丢弃hw1_a中与hw1_b中的观测相匹配的所有观测,key=ID

anti_join(hw1_a,hw1_b,by = "ID")

3.请筛选出hw1_a 中收入大于4000的样本，并将此样本和hw1_b 中Is_Default=1的样本合并，你可以使用inner join的方式。这一问中你可以用pipe的书写形式。

解 hw1_a1=filter(hw1_a,Income>40000)

hw1_b1=filter(hw1_b,Is_Default==1)

inner_join(hw1_a1,hw1_b1,by="ID")

#运行结果

4.在第2问的基础上, 请给出Income对Years_at_Employer的散点图，你发现了哪些趋势和现象?

解 b<-inner_join(hw1_a,hw1_b,by = "ID")

ggplot()+

geom_point(data=b,mapping = aes(x=Years_at_Employer,y= Income))

#运行结果

结论：工作时间越长，收入水平一般越高。

5.在第4问的基础上按照Is_Default 增加一个维度，请展示两变量在不同违约状态的散点图。请使用明暗程度作为区分方式

解 ggplot()+

geom_point(data=b,mapping = aes(x=Years_at_Employer,y=Income,alpha= Is_Default))

#运行结果

6.对于第5问，请使用形状作为另外一种区分方式。

解 ggplot()+

geom_point(data=b,mapping= aes(x=Years_at_Employer,y=Income,shape=factor( Is_Default)))

7.请找出各个列的缺失值，并删除相应的行。请报告每一变量的缺失值个数，以及所有缺失值总数。

解 c<-full_join(hw1_a,hw1_b,by = "ID")

sum(is.na(c[1]))

sum(is.na(c[2]))

sum(is.na(c[3]))

sum(is.na(c[4]))

sum(is.na(c[5]))

sum(is.na(c[6]))

sum(is.na(c[7]))

#运行结果

sum(is.na(c))

filter(c,c!= "NA" )

c<-full_join(hw1_a,hw1_b,by = "ID")

c1<-filter(c,!is.na(c[1]))

c2=filter(c1,!is.na(c1[2]))

c3=filter(c2,!is.na(c2[3]))

c4=filter(c3,!is.na(c3[4]))

c5=filter(c4,!is.na(c4[5]))

c6=filter(c5,!is.na(c5[6]))

c7=filter(c6,!is.na(c6[7]))

filter(c7,!is.na(c7[8]))

#或者

filter(c,!is.na(c[1]),!is.na(c[2]),!is.na(c[3]),!is.na(c[4]),!is.na(c[5]),!is.na(c[6]),!is.na(c[7]),!is.na(c[8]))

8.找出Income中的极端值并滤掉对应行的数据

解 quantile(hw1_a$Income,c(0.025,0.975))

hw1_a2=filter(hw1_a,(Income>14168.81)&(Income<173030.9))

#找出p(x<k)=0.025,p(x<l)=0.975对应的x,l，找出收入介于x,l的行数据。

9.将Income对数化，并画出直方图和density curve.

解j<-hw1_a$Income

k<-log(j)

hist(k,prob=T)

#prob=T,纵坐标为所占比例，矩形总面积等于1

lines(density(k),col="blue")

#画密度曲线，曲线颜色为蓝色。

10.以Income作为因变量，Years at Employer作为自变量，进行OLS回归，写出回归的方程，并指出自变量系数是否在某一显著性水平上显著。同时，解释你的结果(这一问你自己发挥可以找code解决)。

解 z<-lm(data=hw1_a,Income~Years_at_Employer)

coef(z)

summary(z)

# P值<0.01差异性极显著

这说明，一般，工作年限越长，收入水平越高。工作时间越长，经验越丰富，能为企业创造更多的价值，薪酬水平自然越高。

1.(1)编写函数get.root(a,b,c),求解一元二次方程ax^2+bx+c=0的实根。

解 get.root<-function(a,b,c){

p=b^2-4*a*c

if(a==0) {print("错误，方程ax^2+bx=c=0不是一元二次方程" ) }

else {

if(p<0){print("错误，一元二次方程ax^2+bx=c=0无实根")}

if(p==0){print("一元二次方程ax^2+bx=c=0有2个相等实根")

x<- (-b)/(2*a)

print(x)

}

if(p>0){print("一元二次方程ax^2+bx=c=0有2个不相等实根")

x1<-(-b+sqrt(p))/(2*a)

x2<-(-b-sqrt(p))/(2*a)

print(x1)

print(x2)

}}

}

#测试程序

get.root(0,1,2)

get.root(1,1,2)

get.root(1,2,1)

get.root(1,-3,2)

#运行结果

1.(2)已知某一元二次方程ax^2+bx+c=0的三个系数都是随机变量，其中a服从[1,5]上的均匀分布，b服从正态分布N(3,10),c服从均值为1的指数分布。请编写函数get.prob(),计算该方程有实根的概率。

解get.prob<-function(n){

set.seed(12) #设立随机种子，让结果可重复出现

a <- runif(n,min = 1,max = 5)

b <- rnorm(n,mean = 3,sd = sqrt(10))

c <- rexp(n,rate=1) #rate为率参数，等于均值分之一

d<-sum((b^2-4*a*c)>=0)

p<-d/n

print(p)

}

#a,b,c各取100万个数计算结果。

get.prob(1000000)

#运行结果

2.(1)请读入数据，使用软件分别给出 price, marketshare，和brand的缺失值数量。请按照每一个brand, 将数据按照先marketshare 后price 进行从高到低排序。

解（1）准备工作

将data for HW2用wps office打开后另存为xls格式，放入F盘homework2文件夹里面。

（2）编写代码

library(readxl)

library(tidyverse)

library(modelr)

hw2 <- read_xls("F:/homework2/data for HW2.xls")

view(hw2)

#运行结果

a1 <-sum(is.na(hw2$price)) #计算price缺失值数量

a2 <-sum(is.na(hw2$marketshare)) #计算marketshare缺失值数量

a3 <-sum(is.na(hw2$brand)) #计算brand缺失值数量

#运行结果

arrange(hw2,brand,desc(marketshare),desc(price))

#运行结果

(2)请按照brand 的种类，对price和marketshare 求均值。

解hw2_1 <- filter(hw2, brand==1)#求brand==1时price.marketshare均值

b1= mean(hw2_1$price, na.rm = TRUE)

c1= mean(hw2_1$marketshare, na.rm = TRUE)

#运行结果

hw2_2 <- filter(hw2, brand==2) #求brand==2时price.marketshare均值

b2= mean(hw2_2$price, na.rm = TRUE)

c2= mean(hw2_2$marketshare, na.rm = TRUE)

#运行结果

hw2_3 <- filter(hw2, brand==3) #求brand==3时price.marketshare均值

b3= mean(hw2_3$price, na.rm = TRUE)

c3= mean(hw2_3$marketshare, na.rm = TRUE)

#运行结果

hw2_4 <- filter(hw2, brand==4) #求brand==4时price.marketshare均值

b4= mean(hw2_4$price, na.rm = TRUE)

c4= mean(hw2_4$marketshare, na.rm = TRUE)

#运行结果

(3)请按照brand 的种类，对price和marketshare 画散点图。

解hw2_5 <- filter(hw2, (brand!="NA")&(price!="NA")&(marketshare!="NA"))

#将数据框hw2中含有空值的行删去，组成新的数据框hw2_5

ggplot(data = hw2_5) +

geom_point(mapping = aes(x = price,y = marketshare, color=brand))

#运行结果

#如需对brand分面绘图显示，可用以下代码：

ggplot(data =hw2_5) +

geom_point(mapping = aes(x = price,y = marketshare))+facet_wrap(~brand,nrow = 2)

#运行结果

[if !supportLists](4) [endif]请按照价格的均值，产生新的变量price_new, 低于均值为“低价格”，高于均值为“高价格”。同样对市场份额也是，产生变量marketshare_new, 数值为“低市场份额”和“高市场份额”。

解#对数据框hw2中含有空值的行删去后得到的新数据框hw2_5进行研究，按要求产生新变量。

d<-NA

e<-NA

i<-1

for(i in 1:nrow(hw2_5)) #nrow(hw2_5)为hw2_5的行数

{

if(hw2_5$brand[i]==1)

{ if(hw2_5$price[i]>b1)

{d[i]="高价格"}

if(hw2_5$price[i]==b1)

{d[i]="中价格"}

if(hw2_5$price[i]<b1)

{d[i]="低价格"}

}

if(hw2_5$brand[i]==2)

{ if(hw2_5$price[i]>b2)

{d[i]="高价格"}

if(hw2_5$price[i]==b2)

{d[i]="中价格"}

if(hw2_5$price[i]<b2)

{d[i]="低价格"}

}

if(hw2_5$brand[i]==3)

{ if(hw2_5$price[i]>b3)

{d[i]="高价格"}

if(hw2_5$price[i]==b3)

{d[i]="中价格"}

if(hw2_5$price[i]<b3)

{d[i]="低价格"}

}

if(hw2_5$brand[i]==4)

{ if(hw2_5$price[i]>b4)

{d[i]="高价格"}

if(hw2_5$price[i]==b4)

{d[i]="中价格"}

if(hw2_5$price[i]<b4)

{d[i]="低价格"}

}

if(hw2_5$brand[i]==1)

{ if(hw2_5$marketshare[i]>c1)

{e[i]="高市场份额"}

if(hw2_5$marketshare[i]==c1)

{e[i]="中市场份额"}

if(hw2_5$marketshare[i]<c1)

{e[i]="低市场份额"}

}

if(hw2_5$brand[i]==2)

{ if(hw2_5$marketshare[i]>c2)

{e[i]="高市场份额"}

if(hw2_5$marketshare[i]==c2)

{e[i]="中市场份额"}

if(hw2_5$marketshare[i]<c2)

{e[i]="低市场份额"}

}

if(hw2_5$brand[i]==3)

{ if(hw2_5$marketshare[i]>c3)

{e[i]="高市场份额"}

if(hw2_5$marketshare[i]==c3)

{e[i]="中市场份额"}

if(hw2_5$marketshare[i]<c3)

{e[i]="低市场份额"}

}

if(hw2_5$brand[i]==4)

{ if(hw2_5$marketshare[i]>c4)

{e[i]="高市场份额"}

if(hw2_5$marketshare[i]==c4)

{e[i]="中市场份额"}

if(hw2_5$marketshare[i]<c4)

{e[i]="低市场份额"}

}

i=i+1}

hw2_4_0<-mutate(hw2_5,

price_new = d,

marketshare_new=e

)

arrange(hw2_4_0,brand,desc(marketshare),desc(price))

#运行结果

（5）请估计模型，marketshare为Y，price为X.

解#预估线性模型

x1 <- lm(marketshare ~ price, data = hw2_5)

coef(x1)

#运行结果

[if !supportLists]（6）[endif]请画出（5）的拟合直线。

解x2<-coef(x1)

ggplot(data = hw2_5) +

geom_point(mapping = aes(x = price,y = marketshare, color=brand))+ geom_abline(data= x1,color= "blue")

#运行结果

[if !supportLists]（7）[endif]请随机产生若干直线，验证（5）的结果是最优的

解models <- tibble(

a1=runif(200,-0.1,0.1),

a2=runif(200,-5,5)

)

ggplot(data = hw2_5)+

geom_point(mapping = aes(x = price, y = marketshare))+

geom_abline(

aes(intercept=a1,slope=a2),

data=models,alpha=1/4

)

#随机产生200条直线

#运行结果

model1 <- function(a,hw2_5){

a[1]+hw2_5$price*a[2]

}

measure_distance <- function(mod,data){

diff <- hw2_5$marketshare-model1(mod,data)##真实y值减去y的预测值

sqrt(mean(diff^2))

}

hw2_5_dist <- function(a1,a2){

measure_distance(c(a1,a2),hw2_5)

}

models <- models %>%

mutate(dist=purrr::map2_dbl(a1,a2,hw2_5_dist))

ggplot(hw2_5,aes(price,marketshare))+

geom_point(size=2,color="grey30")+

geom_abline(

aes(intercept=a1,slope=a2,color=-dist),

data=filter(models,rank(dist)<=10)

)

#运行结果

best <- optim(par=c(0,0),measure_distance, data=hw2_5)

best$par

#运行结果

ggplot(hw2_5,aes(price,marketshare))+

geom_point(size=2,color="grey30")+

geom_abline(intercept=best$par[1],slope = best$par[2], col="green")

[if !supportLists]（8）[endif]请估计模型，marketshare为Y，price和brand 为X.

解mod1 <- lm(marketshare~price+brand,hw2_5)

mod2 <- lm(marketshare~price*brand,hw2_5)

grid <- hw2_5 %>%

data_grid(price,brand) %>%

gather_predictions(mod1,mod2)

ggplot(hw2_5,aes(price,marketshare,color=as.factor(brand)))+

geom_point()+

geom_line(data=grid,aes(y=pred))+

facet_wrap(~model)

hw2_5 <- hw2_5 %>%

gather_residuals(mod1,mod2)

ggplot(hw2_5,aes(price,resid,color=as.factor(brand)))+

geom_point()+

facet_grid(model~as.factor(brand))

问题1：

A 和 B 约定在某篮球场见面。他俩都不太守时，出现时间服从均匀分布。他俩也都没有耐心，每个人都会只等对方十分钟就会离开。已知 A 到篮球场的时间为下午 4 点到 5点之间。

(1) 如果 B 到达篮球场的时间也为下午 4 点到 5 点之间，模拟运行 50000 次，看看他们成功相遇的概率。

解 set.seed(12) #设定随机数种子,让模拟可重复出现

x <- runif(50000,min = 0,max = 60)

y <- runif(50000,min = 0,max = 60)

a <- sum(abs(x-y)<=10)

p=a/50000

#运行结果

(2) 对上一问的 50000 次模拟，用不同颜色在一张图中展示成功相遇与否。

解 #引入变量z，当A与B相遇时候，取值为1，否则，取值为0

z<-NA

i<-1

for(i in 1:50000){

if(abs(x[i]-y[i])<=10)

z[i]=1

else

z[i]=0

i=i+1

}

b<-data.frame(x,y,z,stringsAsFactors =TRUE)

library(tidyverse)

ggplot(data = b) + geom_point(mapping = aes(x = x,y = y, color=z))

#运行结果,中间蓝色区域代表相遇，黑色区域代表不相遇

(3) B 应该如何选择 4 点到 5 点之间的哪个时间段，来提升他们成功相遇的概率？用模拟展示你的理由

解#将4点到5点，平均分成12个时间段，计算B在这个时间段与A相遇概率。

y<-NA

c<-NA

d<-NA

j<-1

k<-1

for(j in 1:12){

y <- runif(50000,min = (j-1)*5,max =j*5)

c<- sum(abs(x-y)<=10)

d[j]=c/50000

j=j+1

}

#运行结果

#由上述结果可以看出，d的第三项到第十项较大，结果上也基本上相近，说明B 应该选择 4 点10分到 4点50分钟之间到达篮球场，相遇概率较大。

问题2：

请使用nycflights13 和 pipe 语法

（1）从 flights 数据表中挑选出以下变量：(year, month, day, hour, origin, dep_delay, distance, carrier)，将生产的新表保存为 flight1。

解library(nycflights13)

flight1<-select(flights,year,month,day,hour,origin,dep_delay, distance,carrier)

flight1

#运行结果

（2）从 weather 数据表中挑选出以下变量： (year, month, day, hour, origin, humid, wind_speed)，将生产的新表保存为weather1。

解

weather1<-select(weather,year, month, day, hour, origin, humid, wind_speed)

weather1

#运行结果

（3）将 flight1 表和 weather1 表根据共同变量进行内连接，随机抽取 100000 行数据，将生产的结果保存为 flight_weather。 (提示：sample_n()函数，不用重复抽取)

解a<-inner_join(flight1,weather1, by=c("year", "month", "day", "hour", "origin"))

flight_weather<-sample_n(a,10000,replace = FALSE)

flight_weather

#运行结果

（4）从 flight_weather 表中对三个出发机场按照平均出发延误时间排降序，并将结果保留在 longest_delay 表中。把结果展示出来。

解longest_delay<-arrange(flight_weather,desc(dep_delay))

longest_delay #运行结果

（5）根据出发地（origin）在同一个图中画出风速wind_speed（x 轴）和出发延误时间dep_delay（y 轴）的平滑曲线图。

解ggplot(data=flight_weather)+geom_smooth(mapping=aes(x=wind_speed,y=dep_delay,linetype=origin),se=FALSE)

#运行结果

（6）根据不同出发地（origin）在平行的 3 个图中画出风速 wind_speed（x 轴）和出发延误时间 dep_delay（y 轴）的散点图。

解ggplot(data =flight_weather) +

geom_point(mapping = aes(x = wind_speed,y =dep_delay)) +

facet_wrap(~origin,nrow = 1)

#运行结果

（7）根据 flight_weather 表，画出每个月航班数的直方分布图，x 轴为月份，y 轴是每个月份航班数所占的比例。

解month1<-flight_weather$month

hist(month1,prob=T)

#运行结果

（8）根据 flight_weather 表，画出每个月航班距离的 boxplot 图，x 轴为月份，y 轴为航行距离, 根据的航行距离的中位数从低到高对 x 轴的月份进行重新排序。

解ggplot(data =flight_weather) +

geom_boxplot(mapping = aes(x = month,y =distance))

#运行结果

ggplot(data =flight_weather) +

geom_boxplot(mapping=aes(x=reorder(month,distance,FUN=median),y=distance))

#运行结果

解(1)

H <- function(p){

-p[1]*log(p[1])-p[2]*log(p[2])-p[3]*log(p[3])-p[4]*log(p[4])-p[5]*log(p[5])}

(2)

KLD <- function(p,q){

p[1]*(log(p[1])-log(q[1]))+p[2]*(log(p[2])-log(q[2]))+p[3]*(log(p[3])-log(q[3]))+p[4]*(log(p[4])-log(q[4]))+p[5]*(log(p[5])-log(q[5]))}

(3)

a1<-c(0.2,0.2,0.2,0.2,0.2)

a2<-c(0.8,0.1,0.05,0.025,0.025)

a3<-c(0.05,0.15,0.7,0.05,0.05)

b<-data.frame(a1,a2,a3,stringsAsFactors =TRUE)

purrr::map(b,H)

#运行结果

#地区1的熵最大，其次是地区3的熵，最小的是地区2的熵。说明，地区1病毒分布是最混乱的。

(4)

c<-data.frame(a1,a1,a1,stringsAsFactors =TRUE)

d<-data.frame(a2,a2,a2,stringsAsFactors =TRUE)

e<-data.frame(a3,a3,a3,stringsAsFactors =TRUE)

f<-purrr::map2(b,c,KLD)

g<-purrr::map2(b,d,KLD)

h<-purrr::map2(b,e,KLD)

i<-data.frame(f,g,h,stringsAsFactors =TRUE)

rownames<-c("区域1","区域2","区域3")

colnames<-c("区域1","区域2","区域3")

matrix(i,nrow=3,ncol=3,byrow=FALSE,dimnames = list(rownames,colnames))

#运行结果

#矩阵第i行第j列，表示区域i到区域j变种分布距离

R语言练习

推荐阅读更多精彩内容