k-fold cross validation with the modelr and broom packages

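3 December 2016
k-fold cross validation with modelr and broom
@drsimonj here to discuss how to conduct k-fold cross validation, with an emphasis on evaluating models supported by David Robinson's broom package. Full credit also goes to David, as this is a slightly more detailed version of an earlier post of his, which I read some time ago and felt like unpacking.
Assumed knowledge: K-fold cross validation
This post assumes you know what k-fold cross validation is. If you want to brush up, here's a fantastic tutorial from Stanford University professors Trevor Hastie and Rob Tibshirani.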
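Creating folds
Before worrying about models, we can generate k folds using crossv_kfold from the modelr package. Let's practice with the mtcars data to keep things simple.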

library(modelr)
set.seed(1)  # Run to replicate this post
folds <- crossv_kfold(mtcars, k = 5)
folds
#> # A tibble: 5 × 3
#>            train           test   .id
#>           <list>         <list> <chr>
#> 1 <S3: resample> <S3: resample>     1
#> 2 <S3: resample> <S3: resample>     2
#> 3 <S3: resample> <S3: resample>     3
#> 4 <S3: resample> <S3: resample>     4
#> 5 <S3: resample> <S3: resample>     5

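This function takes a data frame and randomly partitions its rows (1 to 32 for mtcars) into k roughly equal groups. Here we've partitioned the row numbers into k = 5 groups. The results are returned as a tibble (data frame) like the one above.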
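To make the partitioning concrete, here is a minimal base R sketch of one way to split row numbers into k roughly equal random groups. It is only an illustration of the idea, not necessarily what crossv_kfold does internally, and the names n, k, fold_id and test_rows are made up for this example.

```r
set.seed(1)
n <- nrow(mtcars)  # 32 rows
k <- 5

# Repeat the labels 1..k until there is one label per row, then shuffle them
fold_id <- sample(rep(seq_len(k), length.out = n))

# Row numbers belonging to each fold (these play the role of the test sets)
test_rows <- split(seq_len(n), fold_id)

# 32 rows cannot be split evenly into 5 groups, so the sizes are 7, 7, 6, 6, 6
lengths(test_rows)
```

Every row lands in exactly one group, which is what lets each group serve as a test set once.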
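Each cell in the test column contains a resample object, which is an efficient way of referencing a subset of rows in a data frame (see ?resample to learn more). Think of each cell as a reference to the rows of the data frame belonging to each partition. For example, the following tells us that the first partition of the data references rows 5, 9, 17, 20, 27, 28 and 29, which accounts for roughly 1 / k of the total data set (7 of the 32 rows).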
folds$test[[1]]
#> <resample [7 x 11]> 5, 9, 17, 20, 27, 28, 29

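Each cell in train also contains a resample object, but one referencing the rows in all the other partitions. For example, the first train object references all rows except 5, 9, 17, 20, 27, 28 and 29: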

folds$train[[1]] 
#> <resample [25 x 11]> 1, 2, 3, 4, 6, 7, 8, 10, 11, 12, ...
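If you want to check this relationship yourself, a short sketch along the following lines may help. It assumes the as.integer() and as.data.frame() methods that modelr provides for resample objects; the variable names test_idx, train_idx and test_df are just for illustration. The exact row numbers can differ from those printed in this post depending on your R version's random number generator.

```r
library(modelr)

set.seed(1)
folds <- crossv_kfold(mtcars, k = 5)

# Row numbers referenced by the first test and train objects
test_idx  <- as.integer(folds$test[[1]])
train_idx <- as.integer(folds$train[[1]])

# The two sets do not overlap, and together they cover all 32 rows
length(intersect(test_idx, train_idx))
sort(c(test_idx, train_idx))

# Materialise the referenced test rows as an ordinary data frame
test_df <- as.data.frame(folds$test[[1]])
dim(test_df)
```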

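We can now run a model on the data referenced by each train object, and validate the model results on each corresponding partition in test.
Fitting models to training data
Say we're interested in predicting miles per gallon (mpg) with all the other variables. With the whole data set, we'd do this via: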

lm(mpg ~ ., data = mtcars)

Instead, we want to run this model on each set of training data (the data referenced in each train cell). We can do this as follows:

library(dplyr)
library(purrr)

folds <- folds %>% mutate(model = map(train, ~ lm(mpg ~ ., data = .)))
folds
#> # A tibble: 5 × 4
#>            train           test   .id    model
#>           <list>         <list> <chr>   <list>
#> 1 <S3: resample> <S3: resample>     1 <S3: lm>
#> 2 <S3: resample> <S3: resample>     2 <S3: lm>
#> 3 <S3: resample> <S3: resample>     3 <S3: lm>
#> 4 <S3: resample> <S3: resample>     4 <S3: lm>
#> 5 <S3: resample> <S3: resample>     5 <S3: lm>

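folds %>% mutate(model = ...) adds a new model column to the folds tibble.
map(train, ...) applies a function to each of the cells in train.
~ lm(...) is the regression model applied to each train cell.
data = . specifies that the data for the regression model will be the data referenced by each train object.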
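If the formula shorthand feels opaque, the same step can be sketched with a named function and base R's lapply(). This is just an equivalent way of writing it under my own naming (fit_on_train and models are hypothetical), with the resample object converted explicitly via as.data.frame() before fitting.

```r
library(modelr)

set.seed(1)
folds <- crossv_kfold(mtcars, k = 5)

# Same idea as map(train, ~ lm(mpg ~ ., data = .)), written as a named function
fit_on_train <- function(train) {
  # Convert the resample reference to a data frame before fitting
  lm(mpg ~ ., data = as.data.frame(train))
}

# One fitted model per fold
models <- lapply(folds$train, fit_on_train)
length(models)
```

Keeping the models in a list column of the tibble, as the post does, has the advantage that each model stays on the same row as its train and test data.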

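The result is a new model column containing the regression models fitted to each of the train data sets (i.e., the whole data set excluding each partition).
For example, the model fitted to our first set of training data is: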

folds$model[[1]] %>% summary()
#> 
#> Call:
#> lm(formula = mpg ~ ., data = .)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.6540 -0.9116  0.0439  0.9520  4.2811 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)  
#> (Intercept) -44.243933  31.884363  -1.388   0.1869  
#> cyl           0.844966   1.064141   0.794   0.4404  
#> disp          0.016800   0.015984   1.051   0.3110  
#> hp            0.004685   0.022741   0.206   0.8398  
#> drat          3.950410   1.989177   1.986   0.0670 .
#> wt           -4.487007   2.016341  -2.225   0.0430 *
#> qsec          2.327131   1.243095   1.872   0.0822 .
#> vs           -3.963492   3.217176  -1.232   0.2382  
#> am           -0.550804   2.333252  -0.236   0.8168  
#> gear          5.476604   2.648708   2.068   0.0577 .
#> carb         -1.595979   1.104272  -1.445   0.1704  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.159 on 14 degrees of freedom
#> Multiple R-squared:  0.9092, Adjusted R-squared:  0.8444 
#> F-statistic: 14.02 on 10 and 14 DF,  p-value: 1.205e-05

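Predicting the test data
The next step is to use each model to predict the outcome variable in the corresponding test data. There are many ways to achieve this. One general approach might be: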

folds %>% mutate(predicted = map2(model, test, <FUNCTION_TO_PREDICT_TEST_DATA> ))

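map2(model, test, ...) iterates through each model and set of test data in parallel. By referencing these in the function for predicting the test data, this would add a predicted column with the predicted results.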
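As one possible way to fill in that placeholder, here is a sketch using base R's predict(). The helper name predict_test is hypothetical, and converting the test resample with as.data.frame() is my own precaution rather than something the original code requires.

```r
library(modelr)
library(dplyr)
library(purrr)

set.seed(1)
folds <- crossv_kfold(mtcars, k = 5) %>%
  mutate(model = map(train, ~ lm(mpg ~ ., data = .)))

# Hypothetical helper: predict the outcome for the rows referenced by `test`
predict_test <- function(model, test) {
  predict(model, newdata = as.data.frame(test))
}

folds <- folds %>% mutate(predicted = map2(model, test, predict_test))

# Each cell of `predicted` is a numeric vector with one value per test row
lengths(folds$predicted)
```

The drawback is that the predictions are now separated from the observed mpg values, so computing residuals takes an extra join or extraction step.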
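For many common models, an elegant alternative is to use augment from broom. For regression, augment will take a fitted model and a new data frame, and return a data frame of the predicted results, which is what we want! Following on from above, we can use augment as follows: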

library(broom)

folds %>% mutate(predicted = map2(model, test, ~ augment(.x, newdata = .y)))
#> # A tibble: 5 × 5
#>            train           test   .id    model             predicted
#>           <list>         <list> <chr>   <list>                <list>
#> 1 <S3: resample> <S3: resample>     1 <S3: lm> <data.frame [7 × 13]>
#> 2 <S3: resample> <S3: resample>     2 <S3: lm> <data.frame [7 × 13]>
#> 3 <S3: resample> <S3: resample>     3 <S3: lm> <data.frame [6 × 13]>
#> 4 <S3: resample> <S3: resample>     4 <S3: lm> <data.frame [6 × 13]>
#> 5 <S3: resample> <S3: resample>     5 <S3: lm> <data.frame [6 × 13]>
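To see what augment returns for a single model, outside of any folds, a small sketch like this may be useful. The split into the first 20 and last 12 rows and the choice of predictors are arbitrary and only for illustration.

```r
library(broom)

# Fit on the first 20 rows of mtcars, then predict the remaining 12
fit      <- lm(mpg ~ wt + hp, data = mtcars[1:20, ])
new_rows <- mtcars[21:32, ]

out <- augment(fit, newdata = new_rows)

# The original columns are kept and a .fitted column is added
names(out)

# .fitted should agree with what predict() returns for the same rows
all.equal(as.numeric(out$.fitted),
          as.numeric(predict(fit, newdata = new_rows)))
```

Because the observed outcome and the prediction sit side by side in one data frame, the residuals are a single subtraction away.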

To extract the relevant information from these predicted results, we'll unnest the data frames thanks to the tidyr package:

library(tidyr)

folds %>%
  mutate(predicted = map2(model, test, ~ augment(.x, newdata = .y))) %>% 
  unnest(predicted)
#> # A tibble: 32 × 14
#>      .id   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1      1  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
#> 2      1  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
#> 3      1  14.7     8 440.0   230  3.23 5.345 17.42     0     0     3     4
#> 4      1  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
#> 5      1  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2
#> 6      1  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2
#> 7      1  15.8     8 351.0   264  4.22 3.170 14.50     0     1     5     4
#> 8      2  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
#> 9      2  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
#> 10     2  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
#> # ... with 22 more rows, and 2 more variables: .fitted <dbl>,
#> #   .se.fit <dbl>

This was to show you the intermediate steps. In practice we can skip the mutate step:

predicted <- folds %>% unnest(map2(model, test, ~ augment(.x, newdata = .y)))
predicted
#> # A tibble: 32 × 14
#>      .id   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1      1  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
#> 2      1  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
#> 3      1  14.7     8 440.0   230  3.23 5.345 17.42     0     0     3     4
#> 4      1  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
#> 5      1  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2
#> 6      1  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2
#> 7      1  15.8     8 351.0   264  4.22 3.170 14.50     0     1     5     4
#> 8      2  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
#> 9      2  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
#> 10     2  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
#> # ... with 22 more rows, and 2 more variables: .fitted <dbl>,
#> #   .se.fit <dbl>
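A note for readers on newer package versions: as far as I know, tidyr 1.0 and later expect unnest() to be given column names, so passing an expression as above may no longer work there. The sketch below is one way to get the same kind of result under that assumption. It keeps the mutate step, converts the test resample with as.data.frame() as a precaution, and drops the list columns before unnesting. The exact set of extra columns returned by augment can vary between broom versions.

```r
library(modelr)
library(dplyr)
library(purrr)
library(tidyr)
library(broom)

set.seed(1)
predicted <- crossv_kfold(mtcars, k = 5) %>%
  mutate(
    model     = map(train, ~ lm(mpg ~ ., data = .)),
    predicted = map2(model, test, ~ augment(.x, newdata = as.data.frame(.y)))
  ) %>%
  # Keep only the fold id and the nested predictions, then unnest by name
  select(.id, predicted) %>%
  unnest(predicted)

# One row per car across all folds
nrow(predicted)
```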

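We now have a tibble of the test data for each fold (.id = fold number) and the corresponding .fitted, or predicted, values for the outcome variable (mpg) in each case.
Validating the model
We can compute and examine the residuals: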

# Compute the residuals
predicted <- predicted %>% 
  mutate(residual = .fitted - mpg)

# Plot actual v residual values
library(ggplot2)
predicted %>%
  ggplot(aes(mpg, residual)) +
    geom_hline(yintercept = 0) +
    geom_point() +
    stat_smooth(method = "loess") +
    theme_minimal()

[Figure: residuals plotted against actual mpg, with a loess smooth line]

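It looks like our models could be overestimating mpg around 20-30 and underestimating higher mpg. But there are clearly fewer data points at the higher values, which makes prediction difficult.
We can also use these values to calculate the overall proportion of variance accounted for by each model: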

rs <- predicted %>%
  group_by(.id) %>% 
  summarise(
    sst = sum((mpg - mean(mpg)) ^ 2), # Sum of Squares Total
    sse = sum(residual ^ 2),          # Sum of Squares Residual/Error
    r.squared = 1 - sse / sst         # Proportion of variance accounted for
    )
rs
#> # A tibble: 5 × 4
#>     .id      sst       sse r.squared
#>   <chr>    <dbl>     <dbl>     <dbl>
#> 1     1 321.5886 249.51370 0.2241214
#> 2     2 127.4371  31.86994 0.7499164
#> 3     3 202.6600  42.19842 0.7917773
#> 4     4 108.4733  50.79684 0.5317113
#> 5     5 277.3283  59.55946 0.7852385

# Overall
mean(rs$r.squared)
#> [1] 0.616553
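The per-fold calculation above can also be wrapped in a small helper to see how it behaves on toy numbers. The function name r_squared and the example vectors are made up for illustration. Note that on test data the value is not bounded below by 0: predictions worse than simply guessing the test mean give a negative result.

```r
# Proportion of variance accounted for: 1 - SSE / SST
r_squared <- function(actual, fitted) {
  sst <- sum((actual - mean(actual))^2)  # Sum of Squares Total
  sse <- sum((actual - fitted)^2)        # Sum of Squares Residual/Error
  1 - sse / sst
}

actual <- c(10, 20, 30)

r_squared(actual, c(10, 20, 30))  # perfect predictions: 1
r_squared(actual, rep(20, 3))     # always predicting the mean: 0
r_squared(actual, c(30, 20, 10))  # worse than the mean: -3
```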

So, across the folds, the regression models have accounted for an average of 61.66% of the variance of mpg in new, test data.
Plotting these results:

rs %>% 
  ggplot(aes(r.squared, fill  = .id)) +
    geom_histogram() +
    geom_vline(aes(xintercept = mean(r.squared)))  # Overall mean

[Figure: histogram of r.squared for each fold, with a vertical line at the overall mean]

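Although the model performed well on average, it performed pretty poorly when fold 1 was used as test data.
All at once
With this new knowledge, we can do something similar to the k = 20 case shown in David's post. See if you can understand what's going on here: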

set.seed(1)
# Select four variables from the mpg data set in ggplot2
ggplot2::mpg %>% select(year, cyl, drv, hwy) %>% 
  # Create 20 folds (5% of the data in each partition)
  crossv_kfold(k = 20) %>%
  # Fit a model to training data
  mutate(model = map(train, ~ lm(hwy ~ ., data = .))) %>%
  # Unnest predicted values on test data
  unnest(map2(model, test, ~ augment(.x, newdata = .y))) %>% 
  # Compute R-squared values for each partition
  group_by(.id) %>%
  summarise(
    sst = sum((hwy - mean(hwy)) ^ 2),
    sse = sum((hwy - .fitted) ^ 2),
    r.squared = 1 - sse / sst
  ) %>% 
  # Plot
  ggplot(aes(r.squared)) +
    geom_density() +
    geom_vline(aes(xintercept = mean(r.squared))) +
    theme_minimal()

Last edited: 2017.12.04 22:41:09
