2020-09-01-Dealing With Missing Values Using Amelia

Homework #11 Dealing With Missing Values Using Amelia

Robert Perez

April 25th, 2018

Introduction

The dataset used is airquality, which contains daily air quality measurements in New York from May 1, 1973 (a Tuesday) to September 30, 1973. The data were obtained from the New York State Department of Conservation (ozone data) and the National Weather Service (meteorological data). I will use this dataset to examine the factors that affect air quality. The dataset is incomplete, so we will deal with its missing values using the Amelia package.

Variables

Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island

Solar.R: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park

Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport

Temp: Maximum daily temperature in degrees Fahrenheit at LaGuardia Airport

Month: Month of the observation (5 to 9 in this dataset)

Day: Day of the month (1 to 31)

Research Question

Which factors affect air quality in New York between May and September 1973?

library(tidyverse)
library(Zelig)
library(Amelia)
library(pander)
library(texreg)
library(visreg)
library(lmtest)
library(sjmisc)
library(radiant.data)
library(datasets)

data(airquality)
head(airquality)
require(graphics)
pairs(airquality, panel = panel.smooth, main = "airquality data")
[Figure: scatterplot matrix of the airquality variables with smoothed trend lines, titled "airquality data"]

m1 <- lm(Ozone ~ Solar.R + Wind, data = airquality)
m2 <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
htmlreg(list(m1, m2))
summary(airquality)
     Ozone           Solar.R           Wind             Temp           Month            Day      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000   Min.   : 1.0  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000   1st Qu.: 8.0  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000   Median :16.0  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993   Mean   :15.8  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000   Max.   :31.0  
 NA's   :37       NA's   :7                                                                      

Listwise Deletion Method

summary(lm(Ozone ~ Solar.R + Wind + Temp + Month + Day,  data = airquality, na.action = na.omit))

Call:
lm(formula = Ozone ~ Solar.R + Wind + Temp + Month + Day, data = airquality, 
    na.action = na.omit)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.014 -12.284  -3.302   8.454  95.348 

Coefficients:
             Estimate Std. Error t value       Pr(>|t|)    
(Intercept) -64.11632   23.48249  -2.730        0.00742 ** 
Solar.R       0.05027    0.02342   2.147        0.03411 *  
Wind         -3.31844    0.64451  -5.149 0.000001231276 ***
Temp          1.89579    0.27389   6.922 0.000000000366 ***
Month        -3.03996    1.51346  -2.009        0.04714 *  
Day           0.27388    0.22967   1.192        0.23576    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 20.86 on 105 degrees of freedom
  (42 observations deleted due to missingness)
Multiple R-squared:  0.6249,    Adjusted R-squared:  0.6071 
F-statistic: 34.99 on 5 and 105 DF,  p-value: < 0.00000000000000022

From the summary shown above we can see that 42 observations were deleted due to missingness, so the model is fitted on only 111 of the 153 days. There is never a substitute for a complete dataset. Deleting these observations also discards the values that were observed on those days, which can distort the estimated relationships between the variables. Imputation, and multiple imputation in particular, is a better way to deal with missing data. The Amelia package will fill in the missing values so that we can work with a complete dataset and make better inferences from the data.
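To make the cost of listwise deletion concrete, the rows that lm() drops can be counted directly. This is a minimal base-R sketch using complete.cases(); the names n_total, n_complete, and n_dropped are introduced here for illustration only.

```r
library(datasets)
data(airquality)

# complete.cases() is TRUE for rows that have no NA in any column
n_total    <- nrow(airquality)
n_complete <- sum(complete.cases(airquality))
n_dropped  <- n_total - n_complete

# Rows available in total, kept, and dropped under listwise deletion
c(total = n_total, complete = n_complete, dropped = n_dropped)
```

The dropped count should match the 42 observations reported in the lm() summary above.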

Visualizing the Percentage of Missing Data

V1 <- function(x){sum(is.na(x))/length(x)*100}
apply(airquality,2,V1)
    Ozone   Solar.R      Wind      Temp     Month       Day 
24.183007  4.575163  0.000000  0.000000  0.000000  0.000000 

apply(airquality,1,V1)
  [1]  0.00000  0.00000  0.00000  0.00000 33.33333 16.66667  0.00000  0.00000  0.00000 16.66667
 [11] 16.66667  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
 [21]  0.00000  0.00000  0.00000  0.00000 16.66667 16.66667 33.33333  0.00000  0.00000  0.00000
 [31]  0.00000 16.66667 16.66667 16.66667 16.66667 16.66667 16.66667  0.00000 16.66667  0.00000
 [41]  0.00000 16.66667 16.66667  0.00000 16.66667 16.66667  0.00000  0.00000  0.00000  0.00000
 [51]  0.00000 16.66667 16.66667 16.66667 16.66667 16.66667 16.66667 16.66667 16.66667 16.66667
 [61] 16.66667  0.00000  0.00000  0.00000 16.66667  0.00000  0.00000  0.00000  0.00000  0.00000
 [71]  0.00000 16.66667  0.00000  0.00000 16.66667  0.00000  0.00000  0.00000  0.00000  0.00000
 [81]  0.00000  0.00000 16.66667 16.66667  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
 [91]  0.00000  0.00000  0.00000  0.00000  0.00000 16.66667 16.66667 16.66667  0.00000  0.00000
[101]  0.00000 16.66667 16.66667  0.00000  0.00000  0.00000 16.66667  0.00000  0.00000  0.00000
[111]  0.00000  0.00000  0.00000  0.00000 16.66667  0.00000  0.00000  0.00000 16.66667  0.00000
[121]  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
[131]  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
[141]  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 16.66667
[151]  0.00000  0.00000  0.00000

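In the output above we can see that the variable Ozone is missing about 24.2% of its values and the variable Solar.R is missing about 4.6%; the other variables are complete. The row-wise output shows that each incomplete day is missing either one of its six values (16.67%) or two of them (33.33%).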
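The percentages can be complemented with raw counts and with the pattern of missingness, that is, whether Ozone and Solar.R tend to be missing on the same days. The following is a base-R sketch; the object names miss, na_per_column, na_per_row_table, and pattern are assumptions made for this example.

```r
library(datasets)
data(airquality)

# Logical matrix: TRUE where a value is missing
miss <- is.na(airquality)

# Number of missing values per variable
na_per_column <- colSums(miss)

# How many days are missing 0, 1, or 2 of their six values
na_per_row_table <- table(rowSums(miss))

# Cross-tabulate the two incomplete variables against each other
pattern <- table(Ozone_missing   = miss[, "Ozone"],
                 Solar.R_missing = miss[, "Solar.R"])

na_per_column
na_per_row_table
pattern
```

The TRUE/TRUE cell of the cross-tabulation counts the days on which both variables are missing, which correspond to the 33.33% rows in the output above.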

Imputation Using Amelia Package

The Amelia package will take care of the imputing process for us.

aq1 <- amelia(x=airquality,  m = 20)
aq1$imputations$imp1[1:6, ]

When we viewed the first rows of the dataset above, the 5th value of Ozone was NA. Here we can see the value that Amelia imputed in its place. Amelia generated 20 imputed datasets (m = 20), but only the first one is shown here.
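Because each of the 20 imputed datasets holds a different draw for the same missing cell, it can help to look at all of them side by side. The sketch below defines a hypothetical helper, cell_across_imputations(), and runs it on a small made-up list (toy_imps) that stands in for aq1$imputations, so it works without Amelia installed; the imputed numbers in the toy list are invented for illustration.

```r
# Collect the value of one cell from every imputed dataset.
# `imps` is a list of data frames, such as aq1$imputations from amelia().
cell_across_imputations <- function(imps, row, column) {
  vapply(imps, function(d) as.numeric(d[row, column]), numeric(1))
}

# Toy stand-in for aq1$imputations: the 5th Ozone value is made up
toy_imps <- list(
  imp1 = data.frame(Ozone = c(41, 36, 12, 18, 20.5)),
  imp2 = data.frame(Ozone = c(41, 36, 12, 18, 31.0)),
  imp3 = data.frame(Ozone = c(41, 36, 12, 18, 14.5))
)

vals <- cell_across_imputations(toy_imps, row = 5, column = "Ozone")
vals
c(mean = mean(vals), sd = sd(vals))

# With the real object created above, the call would be:
# cell_across_imputations(aq1$imputations, row = 5, column = "Ozone")
```

The spread of these values gives a rough sense of how uncertain the imputation is for that particular day.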

ggplot(data=airquality) + geom_histogram(mapping=aes(Ozone))
[Figure: histogram of Ozone in the airquality data]

z.out <- zelig(Ozone ~ Solar.R + Wind + Temp + Month + Day, model = "ls", data = aq1, cite = FALSE)
summary(z.out, subset = 1)
Imputed Dataset 1
Call:
z5$zelig(formula = Ozone ~ Solar.R + Wind + Temp + Month + Day, 
    data = aq1)

Residuals:
    Min      1Q  Median      3Q     Max 
-46.300 -13.073  -3.004  12.885  98.904 

Coefficients:
             Estimate Std. Error t value             Pr(>|t|)
(Intercept) -78.10871   19.85289  -3.934             0.000128
Solar.R       0.02592    0.02057   1.260             0.209607
Wind         -2.70016    0.54658  -4.940 0.000002098565416779
Temp          2.14191    0.23466   9.128 0.000000000000000502
Month        -3.87204    1.35901  -2.849             0.005013
Day           0.30787    0.19645   1.567             0.119235

Residual standard error: 21.03 on 147 degrees of freedom
Multiple R-squared:  0.5925,    Adjusted R-squared:  0.5786 
F-statistic: 42.74 on 5 and 147 DF,  p-value: < 0.00000000000000022

Next step: Use 'setx' method

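This model differs from the listwise-deletion model shown above. With the imputed values in the dataset, the negative effect of Wind on Ozone shrinks in magnitude (from -3.32 to -2.70), the Temp coefficient rises from 1.90 to 2.14, and Solar.R is no longer significant at the 5% level (p = 0.21). Note that these figures come from the first imputed dataset only.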
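A single imputed dataset understates the uncertainty that comes from imputing, so the results from all m datasets are usually combined with Rubin's rules: the pooled estimate is the mean of the m estimates, and its variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function pool_rubin() below is a hypothetical helper written for this example, not part of Amelia or Zelig, and the input numbers are made up. Amelia provides mi.meld() for the same purpose.

```r
# Pool one coefficient across m imputed datasets with Rubin's rules.
# est: vector of m point estimates; se: vector of m standard errors.
pool_rubin <- function(est, se) {
  m       <- length(est)
  q_bar   <- mean(est)     # pooled point estimate
  within  <- mean(se^2)    # average within-imputation variance
  between <- var(est)      # between-imputation variance (m - 1 denominator)
  total   <- within + (1 + 1 / m) * between
  c(estimate = q_bar, std.error = sqrt(total))
}

# Made-up estimates and standard errors, for illustration only
pooled <- pool_rubin(est = c(-2.7, -3.1, -2.9), se = c(0.55, 0.60, 0.58))
pooled

# With the aq1 object created above, the inputs could be built like this:
# fits <- lapply(aq1$imputations, function(d)
#   lm(Ozone ~ Solar.R + Wind + Temp + Month + Day, data = d))
# est <- sapply(fits, function(f) coef(f)[["Wind"]])
# se  <- sapply(fits, function(f) sqrt(diag(vcov(f)))[["Wind"]])
# pool_rubin(est, se)
```

The pooled standard error is never smaller than the square root of the average within-imputation variance, because the between-imputation term can only add to it.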

z.out$setx()
z.out$sim()
plot(z.out)
[Figure: output of plot(z.out), the Zelig simulation plots for the fitted model]

Conclusion

The imputed values that replaced the NA values in the dataset had an effect on the models we ran above. By using the Amelia package we were able to recover some information for the two variables that contained NA values. This allowed us to work with completed datasets built from imputed values, which is a better way to deal with missing values than simply deleting the incomplete observations.

Original link: https://www.rpubs.com/RobertPerez63/384791
