2020-09-01-Weather Data Exploratory Analysis - Part 5

2016 Weather Data Exploratory Analysis - Part 5

image.png


NOAA Yearly Weather Data Processing

Processing 2016 Temperature Data



Project Synopsis

According to a report published on January 18, 2017 by the National Aeronautics and Space Administration (NASA) and the National Oceanic and Atmospheric Administration (NOAA):

…(the) Earth’s 2016 surface temperatures were the warmest since modern record keeping began in 1880.

Globally-averaged temperatures in 2016 were 1.78 degrees Fahrenheit (0.99 degrees Celsius) warmer than the mid-20th century mean. This makes 2016 the third year in a row to set a new record for global average surface temperatures.

Source: https://www.nasa.gov/press-release/nasa-noaa-data-show-2016-warmest-year-on-record-globally


The 2016 Weather Data Exploratory Analysis project was started to review the raw data from NOAA and identify areas of uncertainty and their potential impact on reaching a greater than 95% scientific certainty.

This is Part 5 of the 2016 Weather Data Exploratory Analysis.


This project is designed to review the NOAA yearly weather data files to be used in the temperature analysis portion of the overall project.

The goals of the project are:

  • Review the data structure of the yearly weather data observation file.
  • Filter for temperature observations.
  • Process the data for future use.
  • Perform data cleaning as needed.
  • Create and save a tidy and clean data set.

This project will use the yearly weather data from 2016 as the input and the methodology detailed will be integrated into the follow up project in which the yearly weather data from 1900-2016 will be processed.


The yearly weather data is located at the following:

The files range from the year 1763 through 2017. These files are gzip compressed .csv files.

The specific file we will be using is:

The description of the file can be found at:

Dates and Times of Downloads

  • 2016.csv.gz - 3/1/2017 2:51 PM

Libraries Required

library(dplyr)        # Data manipulation
library(knitr)        # Dynamic report generation
library(lubridate)    # Date and time processing
library(reshape2)     # Data transposition

Yearly Weather Data

The 2016 Daily Weather Data file contains the following:

FORMAT OF 2016.csv.gz

Variable Length Type
ID 11 Character
YEAR/MONTH/DAY 8 Character
ELEMENT 4 Character
DATA VALUE 5 Character
M-FLAG 1 Character
Q-FLAG 1 Character
S-FLAG 1 Character
OBS-TIME 4 Character

Read Weather Data File

The 2016 Daily Weather Data file is a gzip compressed .csv file. This is a large file and will take a few moments to load.

Read 2016 Daily Weather Data File

wd_data <- read.csv("~/NOAA Data/RAW WD 1900-2016/2016.csv.gz", sep=",", 
                    header=FALSE, stringsAsFactors = FALSE)

Data Exploration and Processing

Brief Exploration of the data

kable(head(wd_data, 10), caption = "wd_data")

<caption style="box-sizing: border-box; padding-top: 8px; padding-bottom: 8px; color: rgb(153, 153, 153); text-align: left;">wd_data</caption>

V1 V2 V3 V4 V5 V6 V7 V8
US1FLSL0019 20160101 PRCP 3 N NA
NOE00133566 20160101 TMAX 95 E NA
NOE00133566 20160101 TMIN 23 E NA
NOE00133566 20160101 PRCP 37 E NA
USC00141761 20160101 TMAX 22 7 700
USC00141761 20160101 TMIN -89 7 700
USC00141761 20160101 TOBS -89 7 700
USC00141761 20160101 PRCP 0 7 700
USC00141761 20160101 SNOW 0 7 NA
USC00141761 20160101 SNWD 0 7 700
str(wd_data)
## 'data.frame':    32423277 obs. of  8 variables:
##  $ V1: chr  "US1FLSL0019" "NOE00133566" "NOE00133566" "NOE00133566" ...
##  $ V2: int  20160101 20160101 20160101 20160101 20160101 20160101 20160101 20160101 20160101 20160101 ...
##  $ V3: chr  "PRCP" "TMAX" "TMIN" "PRCP" ...
##  $ V4: int  3 95 23 37 22 -89 -89 0 0 0 ...
##  $ V5: chr  "" "" "" "" ...
##  $ V6: chr  "" "" "" "" ...
##  $ V7: chr  "N" "E" "E" "E" ...
##  $ V8: int  NA NA NA NA 700 700 700 700 NA 700 ...

The read.csv has converted V2, V4, and V8 to integers.

The FLAG variables V5, V6, V7 are for specific statuses recorded on the first day of each month so they can be removed from wd_data. We also do not need V8 (OBS-TIME), since we will be working with daily temperature readings.

Remove V5, V6, V7, V8

wd_data <- wd_data %>%
           select(-V5,-V6,-V7,-V8)

kable(head(wd_data), caption = "wd_data")

<caption style="box-sizing: border-box; padding-top: 8px; padding-bottom: 8px; color: rgb(153, 153, 153); text-align: left;">wd_data</caption>

V1 V2 V3 V4
US1FLSL0019 20160101 PRCP 3
NOE00133566 20160101 TMAX 95
NOE00133566 20160101 TMIN 23
NOE00133566 20160101 PRCP 37
USC00141761 20160101 TMAX 22
USC00141761 20160101 TMIN -89

Number of Observations and Number of Unique Stations

format(nrow(wd_data), big.mark = ",")
## [1] "32,423,277"
format(length(unique(wd_data$V1)), big.mark = ",")
## [1] "37,876"

Weather stations collect many different types of data including temperature, precipitation, wind speeds, and so on. This project will focus on temperature related observations only. The type of observation made is recorded in V3, the Elements variable and the value of the observation is in V4.


There are five core Elements:

  • PRCP = Precipitation (tenths of mm)
  • SNOW = Snowfall (mm)
  • SNWD = Snow depth (mm)
  • TMAX = Maximum temperature (tenths of degrees C)
  • TMIN = Minimum temperature (tenths of degrees C)

Other Temperature related Elements are:

  • TAVG = Average temperature (tenths of degrees C)
  • TOBS = Temperature at the time of observation (tenths of degrees C)

The temperature at the time of observation is not of value to the study. Therefore, we can explore TMAX, TMIN, and TAVG.

Filter For Temperature Data Elements

temp_vals <- c("TMAX", "TMIN", "TAVG")

wd_data <- wd_data %>%
           filter(V3 %in% temp_vals)

Number of Unique Temperature Stations and Element Counts

format(length(unique(wd_data$V1)), big.mark = ",")
## [1] "14,297"
format(table(wd_data$V3), big.mark = ",")
## 
##        TAVG        TMAX        TMIN 
## "2,165,569" "4,178,373" "4,172,297"

The TAVG Element has approximately 2,000,000 fewer observations than TMAX or TMIN. It appears that TAVG is not consistently recorded by some of the observation stations. For that reason, we might want to focus on TMAX and TMIN only. However, since average temperature is the criteria we want to establish for a complete observation, and it is possible that TAVG is recorded by some stations when either TMAX or TMIN is not recorded, we need to set some rules for observation acceptance:

  • If TAVG is recorded, keep the observation
  • If TAVG is not recorded, use the average temperature from TMAX and TMIN
  • If TAVG is not recorded and either TMAX or TMIN is not recorded then the observation is removed

The rules we set for acceptance will allow us to keep as many observations as possible while eliminating incomplete observations.


Review Filtered Data Frame

kable(head(wd_data), caption = "wd_data")

<caption style="box-sizing: border-box; padding-top: 8px; padding-bottom: 8px; color: rgb(153, 153, 153); text-align: left;">wd_data</caption>

V1 V2 V3 V4
NOE00133566 20160101 TMAX 95
NOE00133566 20160101 TMIN 23
USC00141761 20160101 TMAX 22
USC00141761 20160101 TMIN -89
USS0018D08S 20160101 TMAX -25
USS0018D08S 20160101 TMIN -177

The definition of Tidy Data by Hadley Wickham is:

Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.

Source: http://vita.had.co.nz/papers/tidy-data.html

In order for the weather data to be Tidy we need to make TAVG, TMAX, and TMIN individual variables with their own observation values.

We can accomplish this by reshaping the data so that TAVG, TMAX and TMIN are variables and their corresponding values are observations using dcast from the reshape2 package.

Reshape Data Frame

wd_data <- dcast(wd_data, V1 + V2 ~ V3, value.var = "V4")

kable(head(wd_data), caption = "wd_data")

<caption style="box-sizing: border-box; padding-top: 8px; padding-bottom: 8px; color: rgb(153, 153, 153); text-align: left;">wd_data</caption>

V1 V2 TAVG TMAX TMIN
AE000041196 20160101 229 NA 155
AE000041196 20160102 243 293 188
AE000041196 20160103 204 235 NA
AE000041196 20160104 192 224 NA
AE000041196 20160105 188 226 139
AE000041196 20160106 201 NA NA

Next we can rename the variables V1 and V2 to something meaningful.

Rename Variables

colnames(wd_data)[1] <- "ID"
colnames(wd_data)[2] <- "Date"

kable(head(wd_data), caption = "wd_data")

<caption style="box-sizing: border-box; padding-top: 8px; padding-bottom: 8px; color: rgb(153, 153, 153); text-align: left;">wd_data</caption>

ID Date TAVG TMAX TMIN
AE000041196 20160101 229 NA 155
AE000041196 20160102 243 293 188
AE000041196 20160103 204 235 NA
AE000041196 20160104 192 224 NA
AE000041196 20160105 188 226 139
AE000041196 20160106 201 NA NA

Review Completeness of Data

summary(wd_data[3:5])
##       TAVG              TMAX             TMIN       
##  Min.   :-999.0    Min.   :-999.0   Min.   :-999.0  
##  1st Qu.:  25.0    1st Qu.:  91.0   1st Qu.: -11.0  
##  Median : 121.0    Median : 194.0   Median :  67.0  
##  Mean   : 110.1    Mean   : 177.5   Mean   :  61.2  
##  3rd Qu.: 214.0    3rd Qu.: 278.0   3rd Qu.: 144.0  
##  Max.   :1356.0    Max.   :1833.0   Max.   :1356.0  
##  NA's   :2422090   NA's   :409286   NA's   :415362

As expected, there are a large number of NA values in the TAVG,TMAX, and TMIN variables. NA refers to values that are Not Available. From an accuracy perspective, we want to have complete observations.

A complete observation, for the purposes of the project, is a valid TAVG value along with station ID and observation Date.

Let’s start with looking at the number of observations compared to the number of complete observations we have right now.

Note: I have included a display of the code for reporting complete observation statistics using cat(paste()). For future reporting of information I will not be displaying the code in order to remove unnecessary code chunks to the report.

Complete Observation Statistics

nbr_day <- 366                                          ## 2016 has 366 days

tot_obs <- nrow(wd_data)
com_obs <- nrow(subset(wd_data, !is.na(wd_data$TAVG)))

tot_sta <- length(unique(wd_data$ID))
pot_obs <- tot_sta * nbr_day

Report Complete Observation Statistics

cat(paste("  Number of Observation Stations:   ", "   ", format(tot_sta, big.mark = ","),  "\n",
          "Number of Potential Observations:   ",        format(pot_obs, big.mark = ","),  "\n",
          "   Number of Actual Observations:   ",        format(tot_obs, big.mark = ","),  "\n",
          " Number of Complete Observations:   ",        format(com_obs, big.mark = ","),  sep = ""))
##   Number of Observation Stations:      14,297
## Number of Potential Observations:   5,232,702
##    Number of Actual Observations:   4,587,659
##  Number of Complete Observations:   2,165,569

The number of potential observations is the total number of observation stations times the potential number of observations for the year. In the case of 2016, the number of days is 366.

Note: The number of complete observations does not imply that they are accurate.

At this point, we will need to derive, or impute, an average temperature for observations in which TAVG is NA.

In statistics, imputation is the process of replacing missing data with substituted values.

Source: https://en.wikipedia.org/wiki/Imputation_(statistics)


Data Cleaning

The first step in cleaning the data is to remove values that are invalid. The first invalid value we will look for is -999. When a temperature observation can not be made the recorded value will either be missing, NA, or contain -999. To make the upcoming processing and imputing steps cleaner we will substitute -999 with NA.

Replace Invalid Values

summary(wd_data[3:5])
##       TAVG              TMAX             TMIN       
##  Min.   :-999.0    Min.   :-999.0   Min.   :-999.0  
##  1st Qu.:  25.0    1st Qu.:  91.0   1st Qu.: -11.0  
##  Median : 121.0    Median : 194.0   Median :  67.0  
##  Mean   : 110.1    Mean   : 177.5   Mean   :  61.2  
##  3rd Qu.: 214.0    3rd Qu.: 278.0   3rd Qu.: 144.0  
##  Max.   :1356.0    Max.   :1833.0   Max.   :1356.0  
##  NA's   :2422090   NA's   :409286   NA's   :415362
wd_data$TAVG <- replace(wd_data$TAVG, wd_data$TAVG == -999, NA)
wd_data$TMAX <- replace(wd_data$TMAX, wd_data$TMAX == -999, NA)
wd_data$TMIN <- replace(wd_data$TMIN, wd_data$TMIN == -999, NA)

summary(wd_data[3:5])
##       TAVG              TMAX             TMIN       
##  Min.   :-930.0    Min.   :-721.0   Min.   :-816.0  
##  1st Qu.:  25.0    1st Qu.:  91.0   1st Qu.: -11.0  
##  Median : 121.0    Median : 194.0   Median :  67.0  
##  Mean   : 110.2    Mean   : 177.5   Mean   :  61.2  
##  3rd Qu.: 214.0    3rd Qu.: 278.0   3rd Qu.: 144.0  
##  Max.   :1356.0    Max.   :1833.0   Max.   :1356.0  
##  NA's   :2422210   NA's   :409370   NA's   :415451

The number of NA counts for the three temperature values increased slightly.


The next cleaning step is to check for temperatures at the extremes to determine if they are greater than established record highs or record lows.

The Elements TAVG, TMAX, and TMIN record observations in tenths of a degree Celsius. This project will convert these observations to Fahrenheit. This step can be modified to keep Celsius as the standard of measurement. If you wish to have the Celsius observations in the normal format, simply divide the values by 10 to get the full Celsius value. When checking the range of valid temperatures against all time high and low values, make sure to adjust the temperatures detailed below to their Celsius equivalent.

Celsius Adjustment (not executed, code example only)

wd_data$TMAX <- wd_data$TMAX/10
wd_data$TMIN <- wd_data$TMIN/10

Fahrenheit Adjustment (executed)

wd_data$TAVG <- ((wd_data$TAVG/10*1.8) + 32)
wd_data$TMAX <- ((wd_data$TMAX/10*1.8) + 32)
wd_data$TMIN <- ((wd_data$TMIN/10*1.8) + 32)

summary(wd_data[3:5])
##       TAVG              TMAX             TMIN       
##  Min.   :-135.4    Min.   :-97.8    Min.   :-114.9  
##  1st Qu.:  36.5    1st Qu.: 48.4    1st Qu.:  30.0  
##  Median :  53.8    Median : 66.9    Median :  44.1  
##  Mean   :  51.8    Mean   : 63.9    Mean   :  43.0  
##  3rd Qu.:  70.5    3rd Qu.: 82.0    3rd Qu.:  57.9  
##  Max.   : 276.1    Max.   :361.9    Max.   : 276.1  
##  NA's   :2422210   NA's   :409370   NA's   :415451

The highest recorded temperature was 134.1 degrees Fahrenheit on July 10, 1913 in Death Valley, California, United States according to the World Meteorological Organisation.

The WMO has observed higher maximum temperatures but have been dismissed due to special circumstances such as asphalt surfaces near the station. The standard measuring conditions for temperature are in the air, 1.5 meters above the ground, and shielded from direct sunlight. Temperatures measured directly on the ground may exceed air temperatures by (over 80 degrees Fahrenheit).

Source: http://www.wmo.int/

The lowest recorded temperature was -128.6 degrees Fahrenheit on July 21, 1983 at Vostok Station, Antarctica, according to the World Meteorological Organisation.

Let’s take a look at how many observations exceed the maximum and minimum records according to the WMO.

Check for Extreme Observations

nrow(filter(wd_data, TAVG > 134.1  | TMAX > 134.1  | TMIN > 134.1))
## [1] 55
nrow(filter(wd_data, TAVG < -128.6 | TMAX < -128.6 | TMIN < -128.6))
## [1] 2

A total of 55 observations have temperature readings greater than record highs and 2 that are less than record lows, as observed by the WMO.

This is very small number when compared against the total number of observations. We will consider these observations as invalid and substitute them with NA.

Replace Extreme Observations

wd_data$TAVG <- replace(wd_data$TAVG, wd_data$TAVG > 134.1, NA)
wd_data$TMAX <- replace(wd_data$TMAX, wd_data$TMAX > 134.1, NA)
wd_data$TMIN <- replace(wd_data$TMIN, wd_data$TMIN > 134.1, NA)

wd_data$TAVG <- replace(wd_data$TAVG, wd_data$TAVG < -128.6, NA)
wd_data$TMAX <- replace(wd_data$TMAX, wd_data$TMAX < -128.6, NA)
wd_data$TMIN <- replace(wd_data$TMIN, wd_data$TMIN < -128.6, NA)

The final cleaning step is a comparison check between TMAX and TMIN. Since we will be imputing the average daily temperature for observations in which TAVG is not recorded, we should check for conditions in which TMIN is recorded as being greater than TMAX.

Condition Check for Temperatures

nrow(filter(wd_data, TMIN > TMAX))
## [1] 259

Again, there are a small number of observations, 259, that meet this condition. We will invalidate these observations as before. For our purposes, we will consider TMIN as the invalid value. The same result can be accomplished by invalidating TMAX, or both values. This will be explained later in the Data Imputing section.

Replace Invalid Observations

wd_data$TMIN <- replace(wd_data$TMIN, wd_data$TMIN > wd_data$TMAX, NA)

summary(wd_data[3:5])
##       TAVG              TMAX             TMIN       
##  Min.   :-111.6    Min.   :-97.8    Min.   :-114.9  
##  1st Qu.:  36.5    1st Qu.: 48.4    1st Qu.:  30.0  
##  Median :  53.8    Median : 66.9    Median :  44.1  
##  Mean   :  51.8    Mean   : 63.9    Mean   :  43.0  
##  3rd Qu.:  70.5    3rd Qu.: 82.0    3rd Qu.:  57.9  
##  Max.   : 134.1    Max.   :134.1    Max.   : 134.1  
##  NA's   :2422243   NA's   :409425   NA's   :415738

Let’s review the updated number of observations compared to the number of complete observations.

Complete Observation Statistics

tot_obs <- nrow(wd_data)
com_obs <- nrow(subset(wd_data, !is.na(wd_data$TAVG)))

tot_sta <- length(unique(wd_data$ID))
pot_obs <- tot_sta * nbr_day

Report Complete Observation Statistics

##   Number of Observation Stations:      14,297
## Number of Potential Observations:   5,232,702
##    Number of Actual Observations:   4,587,659
##  Number of Complete Observations:   2,165,416

At this point, we have completed our preliminary cleaning of the observation data. We can move on to calculating an average daily temperature for observations in which TAVG is NA.


Data Imputing

Brief Review of the Data

kable(head(wd_data), caption = "wd_data")

<caption style="box-sizing: border-box; padding-top: 8px; padding-bottom: 8px; color: rgb(153, 153, 153); text-align: left;">wd_data</caption>

ID Date TAVG TMAX TMIN
AE000041196 20160101 73.22 NA 59.90
AE000041196 20160102 75.74 84.74 65.84
AE000041196 20160103 68.72 74.30 NA
AE000041196 20160104 66.56 72.32 NA
AE000041196 20160105 65.84 72.68 57.02
AE000041196 20160106 68.18 NA NA

Roughly half of the observations have a valid value for TAVG. For the remaining observations we will derive a daily average temperature based on TMAX and TMIN.

There are roughly 5 different methods for calculating average temperatures. Since we do not have temperature readings by hour or minute the most common method remaining is to take the mean of the daily high and daily low, TMAX and TMIN respectively.

There are some biases associated with this method, however. The National Research Center for Statistics and the Environment at the University of Washington produced a very good paper on the differing methods for median temperature collections.

http://www.nrcse.washington.edu/NordicNetwork/reports/temp.pdf

However, the calculation of the mean daily temperature is the same process that NOAA uses.

The most common processing steps used to turn daily weather data into climate data products involve straightforward mathematical operations such as averaging or adding measurements. For instance, a station’s daily average (mean) temperature is calculated by averaging the minimum and maximum temperatures for the day: daily average = (minimum + maximum) / 2.

Source: https://www.climate.gov/maps-data/primer/processing-climate-data

When the average temperature is created from TMAX and TMIN, and one or both of the values is NA, the result will be NA as well.

The first step will be to create a new variable for the derived average temperature, DAVG.

Create Derived Average Temperature

wd_data <- wd_data %>%
           mutate(DAVG = (TMAX + TMIN) / 2)

kable(head(wd_data, 10), caption = "wd_data")

<caption style="box-sizing: border-box; padding-top: 8px; padding-bottom: 8px; color: rgb(153, 153, 153); text-align: left;">wd_data</caption>

ID Date TAVG TMAX TMIN DAVG
AE000041196 20160101 73.22 NA 59.90 NA
AE000041196 20160102 75.74 84.74 65.84 75.29
AE000041196 20160103 68.72 74.30 NA NA
AE000041196 20160104 66.56 72.32 NA NA
AE000041196 20160105 65.84 72.68 57.02 64.85
AE000041196 20160106 68.18 NA NA NA
AE000041196 20160107 67.28 74.12 NA NA
AE000041196 20160108 66.74 77.00 58.46 67.73
AE000041196 20160109 67.28 80.24 56.48 68.36
AE000041196 20160110 69.26 79.34 57.92 68.63

Note: The derived average temperature, DAVG, does not need to be equal to TAVG. This is because TAVG is the 24 hour average temperature recorded for the period ending at 2400 UTC rather than local midnight.

The next step will replace the value of TAVG with the value of DAVG if the current value of TAVG is NA. If the values of both TAVG and DAVG are NA then the observation is invalid or incomplete with a value of NA as it’s observation.

Update Average Temperature

summary(wd_data$TAVG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -111.6    36.5    53.8    51.8    70.5   134.1 2422243
wd_data <- wd_data %>%
           transform(TAVG = ifelse(is.na(TAVG), DAVG, TAVG))

summary(wd_data$TAVG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -111.64   39.47   55.49   53.53   70.34  134.06   25579

Let’s review the results of the imputation on the number of observations compared to the number of complete observations.

Complete observation statistics

tot_obs <- nrow(wd_data)
com_obs <- nrow(subset(wd_data, !is.na(wd_data$TAVG)))

tot_sta <- length(unique(wd_data$ID))
pot_obs <- tot_sta * nbr_day

Report complete observation statistics

##   Number of Observation Stations:      14,297
## Number of Potential Observations:   5,232,702
##    Number of Actual Observations:   4,587,659
##  Number of Complete Observations:   4,562,080

The final steps remaining are to remove the variables in wd_data no longer needed (TMAX, TMIN, DAVG) and remove the remaining invalid observations. The removal of the invalid transactions will be accomplished by filtering the data based on complete.cases, meaning there are no NA values present in any of the observational variables. This is why it didn’t matter whether we replaced TMIN or TMAX values previously. The result of the average calculation would have resulted in NA either way and thus result in the observation removal during the complete.cases filtering. This is also why we need to remove the unneeded variables first because they may contain NA values and complete.cases would remove the entire observation even though we have a complete observation by our earlier definition.

Remove Unneeded Variables

wd_data <- wd_data %>%
           select(-TMAX, -TMIN, -DAVG)

Remove Invalid Observations

summary(wd_data$TAVG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -111.64   39.47   55.49   53.53   70.34  134.06   25579
wd_data <- wd_data[complete.cases(wd_data),]

summary(wd_data$TAVG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -111.64   39.47   55.49   53.53   70.34  134.06

Let’s review the final statistics for the cleaned data and include the overall percentage of completeness.

Complete Observation Statistics

tot_obs <- nrow(wd_data)
com_obs <- nrow(subset(wd_data, !is.na(wd_data$TAVG)))

tot_sta <- length(unique(wd_data$ID))
pot_obs <- tot_sta * nbr_day

pct_com <- round(com_obs/pot_obs * 100, 2)

Report Complete Observation Statistics

##   Number of Observation Stations:      14,285
## Number of Potential Observations:   5,228,310
##    Number of Actual Observations:   4,562,080
##  Number of Complete Observations:   4,562,080
##          Completeness Percentage:      87.26%
  • We lost 12 observation stations in the process of cleaning the data
  • The completeness percentage falls well below 95%

Data Completion

We will be summarizing the 2016 Daily Weather Data by:

  • Day of the year
  • Week of the year
  • Month of the year

The current Date variable is in integer format so we will use the package lubridate to perform some date coercions. The first step is to convert Date to date format. Next, we can get the Yday, Week, and Month values.

Manipulate Dates

wd_data$Date <- ymd(wd_data$Date)

wd_data$Month <- month(wd_data$Date)
wd_data$Week  <- week(wd_data$Date)
wd_data$Yday  <- yday(wd_data$Date)

kable(head(wd_data), caption = "wd_data")

<caption style="box-sizing: border-box; padding-top: 8px; padding-bottom: 8px; color: rgb(153, 153, 153); text-align: left;">wd_data</caption>

ID Date TAVG Month Week Yday
AE000041196 2016-01-01 73.22 1 1 1
AE000041196 2016-01-02 75.74 1 1 2
AE000041196 2016-01-03 68.72 1 1 3
AE000041196 2016-01-04 66.56 1 1 4
AE000041196 2016-01-05 65.84 1 1 5
AE000041196 2016-01-06 68.18 1 1 6

We now have the needed variables and format required for analyzing and summarizing the data. We can remove the variable Date and reorder the variables in wd_data for cleaner viewing. We can accomplish both steps with one command.

Remove Date and Reorder

wd_data <- wd_data %>%
           select(ID, Month, Week, Yday, TAVG)

kable(head(wd_data), caption = "wd_data")

<caption style="box-sizing: border-box; padding-top: 8px; padding-bottom: 8px; color: rgb(153, 153, 153); text-align: left;">wd_data</caption>

ID Month Week Yday TAVG
AE000041196 1 1 1 73.22
AE000041196 1 1 2 75.74
AE000041196 1 1 3 68.72
AE000041196 1 1 4 66.56
AE000041196 1 1 5 65.84
AE000041196 1 1 6 68.18

Saving Temperature Data

The completed yearly temperature data can now be saved in .Rds format to be used in the subsequent projects.

Save Temperature Data

saveRDS(wd_data, "~/NOAA Data/RDS WDT 1900-2016/ww_data_2016.rds")

Key Findings

The most important finding from processing the 2016 yearly weather data is the level of completeness for temperature observations. The final number of temperature stations in the 2016 data collection is 14,285 and the percentage of completion is 87.26%.

Using a very liberal and accomodating acceptance criteria, greater than 1 out of 10 observations is missing from the record. When comparing that statistic to the claim that NOAA and NASA believe that worldwide temperatures are increasing with a scientific certainty greater than 95%, one has to wonder at the validity of the assertion.

This doesn’t mean that the claim that temperatures are increasing is necessarily false. It simply means that the missing data would hinder a true scientific conclusion. Therefore, it has to be assumed that the missing data is ignored or that it is “filled in” using imputational means.

Based on the findings from the previous projects in this study, a key item to explore will be the completeness of observations across Hemispheres and Quadrants. Since each area has it’s own unique historical temperature patterns, the completeness of observations will be critical to determine the impact to global averages.

In addition, determining if there is any pattern related to observation completion from a seasonal perspective (fewer observations in winter periods) will be valuable as well.

These items will be covered in future projects as part of the study.


Next Steps

The next project will create a processing loop that will perform the cleaning and preparation of the RAW yearly weather data using the criteria from this project and apply it to the years 1900-2016.

In addition, the project will record pertinent statistics by year for overall reporting of temperature observation data.



image.png

sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 14393)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] reshape2_1.4.2  lubridate_1.6.0 knitr_1.15.1    dplyr_0.5.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10     digest_0.6.12    rprojroot_1.2    assertthat_0.2.0
##  [5] plyr_1.8.4       R6_2.2.0         DBI_0.6-1        backports_1.0.5 
##  [9] magrittr_1.5     evaluate_0.10    highr_0.6        stringi_1.1.5   
## [13] lazyeval_0.2.0   rmarkdown_1.5    tools_3.4.0      stringr_1.2.0   
## [17] yaml_2.1.14      compiler_3.4.0   htmltools_0.3.6  tibble_1.3.0

原文链接:https://rstudio-pubs-static.s3.amazonaws.com/275033_a87f8c52176d4aed8880d56dbf6c3ada.html

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 214,588评论 6 496
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 91,456评论 3 389
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 160,146评论 0 350
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 57,387评论 1 288
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 66,481评论 6 386
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,510评论 1 293
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,522评论 3 414
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,296评论 0 270
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,745评论 1 307
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,039评论 2 330
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,202评论 1 343
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,901评论 5 338
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,538评论 3 322
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,165评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,415评论 1 268
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 47,081评论 2 365
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,085评论 2 352