2016 Weather Data Exploratory Analysis - Part 5
NOAA Yearly Weather Data Processing
Processing 2016 Temperature Data
Project Synopsis
According to a report published on January 18, 2017 by the National Aeronautics and Space Administration (NASA) and the National Oceanic and Atmospheric Administration (NOAA):
…(the) Earth’s 2016 surface temperatures were the warmest since modern record keeping began in 1880.
Globally-averaged temperatures in 2016 were 1.78 degrees Fahrenheit (0.99 degrees Celsius) warmer than the mid-20th century mean. This makes 2016 the third year in a row to set a new record for global average surface temperatures.
Source: https://www.nasa.gov/press-release/nasa-noaa-data-show-2016-warmest-year-on-record-globally
The 2016 Weather Data Exploratory Analysis project was started to review the raw data from NOAA and identify areas of uncertainty and their potential impact on reaching a greater than 95% scientific certainty.
This is Part 5 of the 2016 Weather Data Exploratory Analysis.
This project is designed to review the NOAA yearly weather data files to be used in the temperature analysis portion of the overall project.
The goals of the project are:
- Review the data structure of the yearly weather data observation file.
- Filter for temperature observations.
- Process the data for future use.
- Perform data cleaning as needed.
- Create and save a tidy and clean data set.
This project uses the yearly weather data from 2016 as the input; the methodology detailed here will be integrated into the follow-up project, in which the yearly weather data from 1900-2016 will be processed.
The yearly weather data is located at the following:
The files range from the year 1763 through 2017. These files are gzip-compressed .csv files.
The specific file we will be using is:
The description of the file can be found at:
Dates and Times of Downloads
- 2016.csv.gz - 3/1/2017 2:51 PM
Libraries Required
library(dplyr) # Data manipulation
library(knitr) # Dynamic report generation
library(lubridate) # Date and time processing
library(reshape2) # Data transposition
Yearly Weather Data
The 2016 Daily Weather Data file contains the following:
FORMAT OF 2016.csv.gz
Variable | Length | Type |
---|---|---|
ID | 11 | Character |
YEAR/MONTH/DAY | 8 | Character |
ELEMENT | 4 | Character |
DATA VALUE | 5 | Character |
M-FLAG | 1 | Character |
Q-FLAG | 1 | Character |
S-FLAG | 1 | Character |
OBS-TIME | 4 | Character |
Read Weather Data File
The 2016 Daily Weather Data file is a gzip-compressed .csv file. This is a large file and will take a few moments to load.
Read 2016 Daily Weather Data File
wd_data <- read.csv("~/NOAA Data/RAW WD 1900-2016/2016.csv.gz", sep=",",
header=FALSE, stringsAsFactors = FALSE)
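As an aside, read.csv decompresses .gz files transparently, and on a file this size supplying colClasses up front can speed the read by skipping per-column type guessing. A minimal, self-contained sketch using a hypothetical temporary file (not the actual NOAA file):

```r
# Hypothetical demo: read.csv handles gzip transparently via a gzfile() connection,
# and an explicit colClasses avoids type guessing on large files.
tmp <- tempfile(fileext = ".csv.gz")
con <- gzfile(tmp, "w")
writeLines(c("US1FLSL0019,20160101,PRCP,3",
             "NOE00133566,20160101,TMAX,95"), con)
close(con)
demo <- read.csv(tmp, header = FALSE, stringsAsFactors = FALSE,
                 colClasses = c("character", "integer", "character", "integer"))
str(demo)  # V2 and V4 arrive as integers without guessing
```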
Data Exploration and Processing
Brief Exploration of the data
kable(head(wd_data, 10), caption = "wd_data")
V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 |
---|---|---|---|---|---|---|---|
US1FLSL0019 | 20160101 | PRCP | 3 | | | N | NA |
NOE00133566 | 20160101 | TMAX | 95 | | | E | NA |
NOE00133566 | 20160101 | TMIN | 23 | | | E | NA |
NOE00133566 | 20160101 | PRCP | 37 | | | E | NA |
USC00141761 | 20160101 | TMAX | 22 | | | 7 | 700 |
USC00141761 | 20160101 | TMIN | -89 | | | 7 | 700 |
USC00141761 | 20160101 | TOBS | -89 | | | 7 | 700 |
USC00141761 | 20160101 | PRCP | 0 | | | 7 | 700 |
USC00141761 | 20160101 | SNOW | 0 | | | 7 | NA |
USC00141761 | 20160101 | SNWD | 0 | | | 7 | 700 |
str(wd_data)
## 'data.frame': 32423277 obs. of 8 variables:
## $ V1: chr "US1FLSL0019" "NOE00133566" "NOE00133566" "NOE00133566" ...
## $ V2: int 20160101 20160101 20160101 20160101 20160101 20160101 20160101 20160101 20160101 20160101 ...
## $ V3: chr "PRCP" "TMAX" "TMIN" "PRCP" ...
## $ V4: int 3 95 23 37 22 -89 -89 0 0 0 ...
## $ V5: chr "" "" "" "" ...
## $ V6: chr "" "" "" "" ...
## $ V7: chr "N" "E" "E" "E" ...
## $ V8: int NA NA NA NA 700 700 700 700 NA 700 ...
read.csv has converted V2, V4, and V8 to integers.
The flag variables V5 (M-FLAG), V6 (Q-FLAG), and V7 (S-FLAG) record measurement, quality, and source information about each observation that is not needed for this analysis, so they can be removed from wd_data. We also do not need V8 (OBS-TIME), since we will be working with daily temperature readings.
Remove V5, V6, V7, V8
wd_data <- wd_data %>%
select(-V5,-V6,-V7,-V8)
kable(head(wd_data), caption = "wd_data")
V1 | V2 | V3 | V4 |
---|---|---|---|
US1FLSL0019 | 20160101 | PRCP | 3 |
NOE00133566 | 20160101 | TMAX | 95 |
NOE00133566 | 20160101 | TMIN | 23 |
NOE00133566 | 20160101 | PRCP | 37 |
USC00141761 | 20160101 | TMAX | 22 |
USC00141761 | 20160101 | TMIN | -89 |
Number of Observations and Number of Unique Stations
format(nrow(wd_data), big.mark = ",")
## [1] "32,423,277"
format(length(unique(wd_data$V1)), big.mark = ",")
## [1] "37,876"
Weather stations collect many different types of data including temperature, precipitation, wind speed, and so on. This project will focus on temperature-related observations only. The type of observation made is recorded in V3, the Element variable, and the value of the observation is in V4.
There are five core Elements:
- PRCP = Precipitation (tenths of mm)
- SNOW = Snowfall (mm)
- SNWD = Snow depth (mm)
- TMAX = Maximum temperature (tenths of degrees C)
- TMIN = Minimum temperature (tenths of degrees C)
Other Temperature related Elements are:
- TAVG = Average temperature (tenths of degrees C)
- TOBS = Temperature at the time of observation (tenths of degrees C)
The temperature at the time of observation is not of value to the study. Therefore, we can explore TMAX, TMIN, and TAVG.
Filter For Temperature Data Elements
temp_vals <- c("TMAX", "TMIN", "TAVG")
wd_data <- wd_data %>%
filter(V3 %in% temp_vals)
Number of Unique Temperature Stations and Element Counts
format(length(unique(wd_data$V1)), big.mark = ",")
## [1] "14,297"
format(table(wd_data$V3), big.mark = ",")
##
## TAVG TMAX TMIN
## "2,165,569" "4,178,373" "4,172,297"
The TAVG Element has approximately 2,000,000 fewer observations than TMAX or TMIN. It appears that TAVG is not consistently recorded by some of the observation stations. For that reason, we might want to focus on TMAX and TMIN only. However, since average temperature is the criterion we want to establish for a complete observation, and it is possible that TAVG is recorded by some stations when either TMAX or TMIN is not recorded, we need to set some rules for observation acceptance:
- If TAVG is recorded, keep the observation
- If TAVG is not recorded, use the average temperature from TMAX and TMIN
- If TAVG is not recorded and either TMAX or TMIN is not recorded, then the observation is removed
These rules will allow us to keep as many observations as possible while eliminating incomplete observations.
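As a quick illustration, the three acceptance rules can be sketched in base R on a toy data frame (hypothetical values, not the NOAA data); the actual imputation is performed later in the Data Imputing section:

```r
# Toy observations (hypothetical values) illustrating the acceptance rules
obs <- data.frame(TAVG = c(100, NA,  NA),
                  TMAX = c(NA, 200, 150),
                  TMIN = c(50, 100,  NA))
# Rules 1 and 2: keep TAVG when recorded, otherwise average TMAX and TMIN
obs$TAVG <- ifelse(is.na(obs$TAVG), (obs$TMAX + obs$TMIN) / 2, obs$TAVG)
# Rule 3: the third row has no TAVG and a missing TMIN, so it drops out
obs <- obs[!is.na(obs$TAVG), ]
obs$TAVG  # 100 (kept as recorded) and 150 (derived from TMAX and TMIN)
```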
Review Filtered Data Frame
kable(head(wd_data), caption = "wd_data")
V1 | V2 | V3 | V4 |
---|---|---|---|
NOE00133566 | 20160101 | TMAX | 95 |
NOE00133566 | 20160101 | TMIN | 23 |
USC00141761 | 20160101 | TMAX | 22 |
USC00141761 | 20160101 | TMIN | -89 |
USS0018D08S | 20160101 | TMAX | -25 |
USS0018D08S | 20160101 | TMIN | -177 |
The definition of Tidy Data by Hadley Wickham is:
Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.
Source: http://vita.had.co.nz/papers/tidy-data.html
In order for the weather data to be tidy, we need to make TAVG, TMAX, and TMIN individual variables with their own observation values.
We can accomplish this by reshaping the data so that TAVG, TMAX, and TMIN are variables and their corresponding values are observations, using dcast from the reshape2 package.
Reshape Data Frame
wd_data <- dcast(wd_data, V1 + V2 ~ V3, value.var = "V4")
kable(head(wd_data), caption = "wd_data")
V1 | V2 | TAVG | TMAX | TMIN |
---|---|---|---|---|
AE000041196 | 20160101 | 229 | NA | 155 |
AE000041196 | 20160102 | 243 | 293 | 188 |
AE000041196 | 20160103 | 204 | 235 | NA |
AE000041196 | 20160104 | 192 | 224 | NA |
AE000041196 | 20160105 | 188 | 226 | 139 |
AE000041196 | 20160106 | 201 | NA | NA |
Next we can rename the variables V1 and V2 to something meaningful.
Rename Variables
colnames(wd_data)[1] <- "ID"
colnames(wd_data)[2] <- "Date"
kable(head(wd_data), caption = "wd_data")
ID | Date | TAVG | TMAX | TMIN |
---|---|---|---|---|
AE000041196 | 20160101 | 229 | NA | 155 |
AE000041196 | 20160102 | 243 | 293 | 188 |
AE000041196 | 20160103 | 204 | 235 | NA |
AE000041196 | 20160104 | 192 | 224 | NA |
AE000041196 | 20160105 | 188 | 226 | 139 |
AE000041196 | 20160106 | 201 | NA | NA |
Review Completeness of Data
summary(wd_data[3:5])
## TAVG TMAX TMIN
## Min. :-999.0 Min. :-999.0 Min. :-999.0
## 1st Qu.: 25.0 1st Qu.: 91.0 1st Qu.: -11.0
## Median : 121.0 Median : 194.0 Median : 67.0
## Mean : 110.1 Mean : 177.5 Mean : 61.2
## 3rd Qu.: 214.0 3rd Qu.: 278.0 3rd Qu.: 144.0
## Max. :1356.0 Max. :1833.0 Max. :1356.0
## NA's :2422090 NA's :409286 NA's :415362
As expected, there are a large number of NA values in the TAVG, TMAX, and TMIN variables. NA refers to values that are Not Available. From an accuracy perspective, we want to have complete observations.
A complete observation, for the purposes of the project, is a valid TAVG value along with a station ID and observation Date.
Let’s start by looking at the number of observations compared to the number of complete observations we have right now.
Note: I have included a display of the code for reporting complete observation statistics using cat(paste()). For future reporting of information I will not be displaying the code, in order to remove unnecessary code chunks from the report.
Complete Observation Statistics
nbr_day <- 366 ## 2016 has 366 days
tot_obs <- nrow(wd_data)
com_obs <- nrow(subset(wd_data, !is.na(wd_data$TAVG)))
tot_sta <- length(unique(wd_data$ID))
pot_obs <- tot_sta * nbr_day
Report Complete Observation Statistics
cat(paste(" Number of Observation Stations: ", " ", format(tot_sta, big.mark = ","), "\n",
"Number of Potential Observations: ", format(pot_obs, big.mark = ","), "\n",
" Number of Actual Observations: ", format(tot_obs, big.mark = ","), "\n",
" Number of Complete Observations: ", format(com_obs, big.mark = ","), sep = ""))
## Number of Observation Stations: 14,297
## Number of Potential Observations: 5,232,702
## Number of Actual Observations: 4,587,659
## Number of Complete Observations: 2,165,569
The number of potential observations is the total number of observation stations times the potential number of observations for the year. In the case of 2016, the number of days is 366.
Note: The number of complete observations does not imply that they are accurate.
At this point, we will need to derive, or impute, an average temperature for observations in which TAVG is NA.
In statistics, imputation is the process of replacing missing data with substituted values.
Source: https://en.wikipedia.org/wiki/Imputation_(statistics)
Data Cleaning
The first step in cleaning the data is to remove values that are invalid. The first invalid value we will look for is -999. When a temperature observation cannot be made, the recorded value will either be missing (NA) or contain -999. To make the upcoming processing and imputing steps cleaner, we will substitute -999 with NA.
Replace Invalid Values
summary(wd_data[3:5])
## TAVG TMAX TMIN
## Min. :-999.0 Min. :-999.0 Min. :-999.0
## 1st Qu.: 25.0 1st Qu.: 91.0 1st Qu.: -11.0
## Median : 121.0 Median : 194.0 Median : 67.0
## Mean : 110.1 Mean : 177.5 Mean : 61.2
## 3rd Qu.: 214.0 3rd Qu.: 278.0 3rd Qu.: 144.0
## Max. :1356.0 Max. :1833.0 Max. :1356.0
## NA's :2422090 NA's :409286 NA's :415362
wd_data$TAVG <- replace(wd_data$TAVG, wd_data$TAVG == -999, NA)
wd_data$TMAX <- replace(wd_data$TMAX, wd_data$TMAX == -999, NA)
wd_data$TMIN <- replace(wd_data$TMIN, wd_data$TMIN == -999, NA)
summary(wd_data[3:5])
## TAVG TMAX TMIN
## Min. :-930.0 Min. :-721.0 Min. :-816.0
## 1st Qu.: 25.0 1st Qu.: 91.0 1st Qu.: -11.0
## Median : 121.0 Median : 194.0 Median : 67.0
## Mean : 110.2 Mean : 177.5 Mean : 61.2
## 3rd Qu.: 214.0 3rd Qu.: 278.0 3rd Qu.: 144.0
## Max. :1356.0 Max. :1833.0 Max. :1356.0
## NA's :2422210 NA's :409370 NA's :415451
The NA counts for the three temperature values increased slightly.
The next cleaning step is to check for temperatures at the extremes to determine if they are greater than established record highs or record lows.
The Elements TAVG, TMAX, and TMIN record observations in tenths of a degree Celsius. This project will convert these observations to Fahrenheit. This step can be modified to keep Celsius as the standard of measurement. If you wish to have the Celsius observations in the normal format, simply divide the values by 10 to get the full Celsius value. When checking the range of valid temperatures against all-time high and low values, make sure to adjust the temperatures detailed below to their Celsius equivalents.
Celsius Adjustment (not executed, code example only)
wd_data$TAVG <- wd_data$TAVG/10
wd_data$TMAX <- wd_data$TMAX/10
wd_data$TMIN <- wd_data$TMIN/10
Fahrenheit Adjustment (executed)
wd_data$TAVG <- ((wd_data$TAVG/10*1.8) + 32)
wd_data$TMAX <- ((wd_data$TMAX/10*1.8) + 32)
wd_data$TMIN <- ((wd_data$TMIN/10*1.8) + 32)
summary(wd_data[3:5])
## TAVG TMAX TMIN
## Min. :-135.4 Min. :-97.8 Min. :-114.9
## 1st Qu.: 36.5 1st Qu.: 48.4 1st Qu.: 30.0
## Median : 53.8 Median : 66.9 Median : 44.1
## Mean : 51.8 Mean : 63.9 Mean : 43.0
## 3rd Qu.: 70.5 3rd Qu.: 82.0 3rd Qu.: 57.9
## Max. : 276.1 Max. :361.9 Max. : 276.1
## NA's :2422210 NA's :409370 NA's :415451
The highest recorded temperature was 134.1 degrees Fahrenheit on July 10, 1913 in Death Valley, California, United States according to the World Meteorological Organisation.
The WMO has observed higher maximum temperatures, but they have been dismissed due to special circumstances such as asphalt surfaces near the station. The standard measuring conditions for temperature are in the air, 1.5 meters above the ground, and shielded from direct sunlight. Temperatures measured directly on the ground may exceed air temperatures by over 80 degrees Fahrenheit.
Source: http://www.wmo.int/
The lowest recorded temperature was -128.6 degrees Fahrenheit on July 21, 1983 at Vostok Station, Antarctica, according to the World Meteorological Organisation.
Let’s take a look at how many observations exceed the maximum and minimum records according to the WMO.
Check for Extreme Observations
nrow(filter(wd_data, TAVG > 134.1 | TMAX > 134.1 | TMIN > 134.1))
## [1] 55
nrow(filter(wd_data, TAVG < -128.6 | TMAX < -128.6 | TMIN < -128.6))
## [1] 2
A total of 55 observations have temperature readings greater than the record high and 2 have readings less than the record low, as observed by the WMO.
This is a very small number when compared against the total number of observations. We will consider these observations invalid and substitute them with NA.
Replace Extreme Observations
wd_data$TAVG <- replace(wd_data$TAVG, wd_data$TAVG > 134.1, NA)
wd_data$TMAX <- replace(wd_data$TMAX, wd_data$TMAX > 134.1, NA)
wd_data$TMIN <- replace(wd_data$TMIN, wd_data$TMIN > 134.1, NA)
wd_data$TAVG <- replace(wd_data$TAVG, wd_data$TAVG < -128.6, NA)
wd_data$TMAX <- replace(wd_data$TMAX, wd_data$TMAX < -128.6, NA)
wd_data$TMIN <- replace(wd_data$TMIN, wd_data$TMIN < -128.6, NA)
The final cleaning step is a comparison check between TMAX and TMIN. Since we will be imputing the average daily temperature for observations in which TAVG is not recorded, we should check for conditions in which TMIN is recorded as being greater than TMAX.
Condition Check for Temperatures
nrow(filter(wd_data, TMIN > TMAX))
## [1] 259
Again, there are a small number of observations, 259, that meet this condition. We will invalidate these observations as before. For our purposes, we will consider TMIN as the invalid value. The same result can be accomplished by invalidating TMAX, or both values. This will be explained later in the Data Imputing section.
Replace Invalid Observations
wd_data$TMIN <- replace(wd_data$TMIN, wd_data$TMIN > wd_data$TMAX, NA)
summary(wd_data[3:5])
## TAVG TMAX TMIN
## Min. :-111.6 Min. :-97.8 Min. :-114.9
## 1st Qu.: 36.5 1st Qu.: 48.4 1st Qu.: 30.0
## Median : 53.8 Median : 66.9 Median : 44.1
## Mean : 51.8 Mean : 63.9 Mean : 43.0
## 3rd Qu.: 70.5 3rd Qu.: 82.0 3rd Qu.: 57.9
## Max. : 134.1 Max. :134.1 Max. : 134.1
## NA's :2422243 NA's :409425 NA's :415738
Let’s review the updated number of observations compared to the number of complete observations.
Complete Observation Statistics
tot_obs <- nrow(wd_data)
com_obs <- nrow(subset(wd_data, !is.na(wd_data$TAVG)))
tot_sta <- length(unique(wd_data$ID))
pot_obs <- tot_sta * nbr_day
Report Complete Observation Statistics
## Number of Observation Stations: 14,297
## Number of Potential Observations: 5,232,702
## Number of Actual Observations: 4,587,659
## Number of Complete Observations: 2,165,416
At this point, we have completed our preliminary cleaning of the observation data. We can move on to calculating an average daily temperature for observations in which TAVG is NA.
Data Imputing
Brief Review of the Data
kable(head(wd_data), caption = "wd_data")
ID | Date | TAVG | TMAX | TMIN |
---|---|---|---|---|
AE000041196 | 20160101 | 73.22 | NA | 59.90 |
AE000041196 | 20160102 | 75.74 | 84.74 | 65.84 |
AE000041196 | 20160103 | 68.72 | 74.30 | NA |
AE000041196 | 20160104 | 66.56 | 72.32 | NA |
AE000041196 | 20160105 | 65.84 | 72.68 | 57.02 |
AE000041196 | 20160106 | 68.18 | NA | NA |
Roughly half of the observations have a valid value for TAVG. For the remaining observations we will derive a daily average temperature based on TMAX and TMIN.
There are roughly five different methods for calculating average temperatures. Since we do not have temperature readings by hour or minute, the most common method remaining is to take the mean of the daily high and daily low, TMAX and TMIN respectively.
There are some biases associated with this method, however. The National Research Center for Statistics and the Environment at the University of Washington produced a very good paper on the differing methods for computing mean daily temperature.
http://www.nrcse.washington.edu/NordicNetwork/reports/temp.pdf
However, calculating the mean daily temperature this way is the same process that NOAA uses.
The most common processing steps used to turn daily weather data into climate data products involve straightforward mathematical operations such as averaging or adding measurements. For instance, a station’s daily average (mean) temperature is calculated by averaging the minimum and maximum temperatures for the day: daily average = (minimum + maximum) / 2.
Source: https://www.climate.gov/maps-data/primer/processing-climate-data
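NOAA's formula is trivial to express in R. For example, with hypothetical minimum and maximum readings (already converted to Fahrenheit):

```r
# NOAA's daily mean: the average of the day's minimum and maximum temperatures
daily_avg <- function(tmin, tmax) (tmin + tmax) / 2
daily_avg(tmin = 56.48, tmax = 80.24)  # 68.36
```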
When the average temperature is created from TMAX and TMIN, and one or both of the values is NA, the result will be NA as well.
The first step will be to create a new variable for the derived average temperature, DAVG.
Create Derived Average Temperature
wd_data <- wd_data %>%
mutate(DAVG = (TMAX + TMIN) / 2)
kable(head(wd_data, 10), caption = "wd_data")
ID | Date | TAVG | TMAX | TMIN | DAVG |
---|---|---|---|---|---|
AE000041196 | 20160101 | 73.22 | NA | 59.90 | NA |
AE000041196 | 20160102 | 75.74 | 84.74 | 65.84 | 75.29 |
AE000041196 | 20160103 | 68.72 | 74.30 | NA | NA |
AE000041196 | 20160104 | 66.56 | 72.32 | NA | NA |
AE000041196 | 20160105 | 65.84 | 72.68 | 57.02 | 64.85 |
AE000041196 | 20160106 | 68.18 | NA | NA | NA |
AE000041196 | 20160107 | 67.28 | 74.12 | NA | NA |
AE000041196 | 20160108 | 66.74 | 77.00 | 58.46 | 67.73 |
AE000041196 | 20160109 | 67.28 | 80.24 | 56.48 | 68.36 |
AE000041196 | 20160110 | 69.26 | 79.34 | 57.92 | 68.63 |
Note: The derived average temperature, DAVG, does not need to be equal to TAVG. This is because TAVG is the 24-hour average temperature recorded for the period ending at 2400 UTC rather than local midnight.
The next step will replace the value of TAVG with the value of DAVG if the current value of TAVG is NA. If the values of both TAVG and DAVG are NA, then the observation is invalid or incomplete, with NA as its observation.
Update Average Temperature
summary(wd_data$TAVG)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -111.6 36.5 53.8 51.8 70.5 134.1 2422243
wd_data <- wd_data %>%
transform(TAVG = ifelse(is.na(TAVG), DAVG, TAVG))
summary(wd_data$TAVG)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -111.64 39.47 55.49 53.53 70.34 134.06 25579
Let’s review the results of the imputation on the number of observations compared to the number of complete observations.
Complete observation statistics
tot_obs <- nrow(wd_data)
com_obs <- nrow(subset(wd_data, !is.na(wd_data$TAVG)))
tot_sta <- length(unique(wd_data$ID))
pot_obs <- tot_sta * nbr_day
Report complete observation statistics
## Number of Observation Stations: 14,297
## Number of Potential Observations: 5,232,702
## Number of Actual Observations: 4,587,659
## Number of Complete Observations: 4,562,080
The final steps remaining are to remove the variables in wd_data no longer needed (TMAX, TMIN, DAVG) and to remove the remaining invalid observations. The removal of the invalid observations will be accomplished by filtering the data based on complete.cases, meaning there are no NA values present in any of the observational variables. This is why it didn’t matter whether we replaced TMIN or TMAX values previously: the average calculation would have resulted in NA either way, and thus the observation would be removed during the complete.cases filtering. This is also why we need to remove the unneeded variables first: they may contain NA values, and complete.cases would remove the entire observation even though we have a complete observation by our earlier definition.
Remove Unneeded Variables
wd_data <- wd_data %>%
select(-TMAX, -TMIN, -DAVG)
Remove Invalid Observations
summary(wd_data$TAVG)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -111.64 39.47 55.49 53.53 70.34 134.06 25579
wd_data <- wd_data[complete.cases(wd_data),]
summary(wd_data$TAVG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -111.64 39.47 55.49 53.53 70.34 134.06
Let’s review the final statistics for the cleaned data and include the overall percentage of completeness.
Complete Observation Statistics
tot_obs <- nrow(wd_data)
com_obs <- nrow(subset(wd_data, !is.na(wd_data$TAVG)))
tot_sta <- length(unique(wd_data$ID))
pot_obs <- tot_sta * nbr_day
pct_com <- round(com_obs/pot_obs * 100, 2)
Report Complete Observation Statistics
## Number of Observation Stations: 14,285
## Number of Potential Observations: 5,228,310
## Number of Actual Observations: 4,562,080
## Number of Complete Observations: 4,562,080
## Completeness Percentage: 87.26%
- We lost 12 observation stations in the process of cleaning the data
- The completeness percentage falls well below 95%
Data Completion
We will be summarizing the 2016 Daily Weather Data by:
- Day of the year
- Week of the year
- Month of the year
The current Date variable is in integer format, so we will use the lubridate package to perform some date coercions. The first step is to convert Date to date format. Next, we can get the Yday, Week, and Month values.
Manipulate Dates
wd_data$Date <- ymd(wd_data$Date)
wd_data$Month <- month(wd_data$Date)
wd_data$Week <- week(wd_data$Date)
wd_data$Yday <- yday(wd_data$Date)
kable(head(wd_data), caption = "wd_data")
ID | Date | TAVG | Month | Week | Yday |
---|---|---|---|---|---|
AE000041196 | 2016-01-01 | 73.22 | 1 | 1 | 1 |
AE000041196 | 2016-01-02 | 75.74 | 1 | 1 | 2 |
AE000041196 | 2016-01-03 | 68.72 | 1 | 1 | 3 |
AE000041196 | 2016-01-04 | 66.56 | 1 | 1 | 4 |
AE000041196 | 2016-01-05 | 65.84 | 1 | 1 | 5 |
AE000041196 | 2016-01-06 | 68.18 | 1 | 1 | 6 |
We now have the needed variables and format required for analyzing and summarizing the data. We can remove the variable Date and reorder the variables in wd_data for cleaner viewing. We can accomplish both steps with one command.
Remove Date and Reorder
wd_data <- wd_data %>%
select(ID, Month, Week, Yday, TAVG)
kable(head(wd_data), caption = "wd_data")
ID | Month | Week | Yday | TAVG |
---|---|---|---|---|
AE000041196 | 1 | 1 | 1 | 73.22 |
AE000041196 | 1 | 1 | 2 | 75.74 |
AE000041196 | 1 | 1 | 3 | 68.72 |
AE000041196 | 1 | 1 | 4 | 66.56 |
AE000041196 | 1 | 1 | 5 | 65.84 |
AE000041196 | 1 | 1 | 6 | 68.18 |
Saving Temperature Data
The completed yearly temperature data can now be saved in .Rds format to be used in subsequent projects.
Save Temperature Data
saveRDS(wd_data, "~/NOAA Data/RDS WDT 1900-2016/ww_data_2016.rds")
Key Findings
The most important finding from processing the 2016 yearly weather data is the level of completeness for temperature observations. The final number of temperature stations in the 2016 data collection is 14,285 and the percentage of completion is 87.26%.
Using very liberal and accommodating acceptance criteria, more than 1 out of 10 observations is missing from the record. When comparing that statistic to the claim that NOAA and NASA believe worldwide temperatures are increasing with a scientific certainty greater than 95%, one has to wonder about the validity of the assertion.
This doesn’t mean that the claim that temperatures are increasing is necessarily false. It simply means that the missing data would hinder a true scientific conclusion. Therefore, it has to be assumed that the missing data is ignored or that it is “filled in” using imputational means.
Based on the findings from the previous projects in this study, a key item to explore will be the completeness of observations across Hemispheres and Quadrants. Since each area has its own unique historical temperature patterns, the completeness of observations will be critical to determine the impact on global averages.
In addition, determining if there is any pattern related to observation completion from a seasonal perspective (fewer observations in winter periods) will be valuable as well.
These items will be covered in future projects as part of the study.
Next Steps
The next project will create a processing loop that will perform the cleaning and preparation of the RAW yearly weather data using the criteria from this project and apply it to the years 1900-2016.
In addition, the project will record pertinent statistics by year for overall reporting of temperature observation data.
sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 14393)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] reshape2_1.4.2 lubridate_1.6.0 knitr_1.15.1 dplyr_0.5.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.10 digest_0.6.12 rprojroot_1.2 assertthat_0.2.0
## [5] plyr_1.8.4 R6_2.2.0 DBI_0.6-1 backports_1.0.5
## [9] magrittr_1.5 evaluate_0.10 highr_0.6 stringi_1.1.5
## [13] lazyeval_0.2.0 rmarkdown_1.5 tools_3.4.0 stringr_1.2.0
## [17] yaml_2.1.14 compiler_3.4.0 htmltools_0.3.6 tibble_1.3.0
Original article: https://rstudio-pubs-static.s3.amazonaws.com/275033_a87f8c52176d4aed8880d56dbf6c3ada.html