2016 Weather Data Exploratory Analysis - Part 5
NOAA Yearly Weather Data Processing
Processing 2016 Temperature Data
Project Synopsis
According to a report published on January 18, 2017 by the National Aeronautics and Space Administration (NASA) and the National Oceanic and Atmospheric Administration (NOAA):
…(the) Earth’s 2016 surface temperatures were the warmest since modern record keeping began in 1880.
Globally-averaged temperatures in 2016 were 1.78 degrees Fahrenheit (0.99 degrees Celsius) warmer than the mid-20th century mean. This makes 2016 the third year in a row to set a new record for global average surface temperatures.
Source: https://www.nasa.gov/press-release/nasa-noaa-data-show-2016-warmest-year-on-record-globally
The 2016 Weather Data Exploratory Analysis project was started to review the raw data from NOAA and identify areas of uncertainty and their potential impact on reaching a greater than 95% scientific certainty.
This is Part 5 of the 2016 Weather Data Exploratory Analysis.
This project is designed to review the NOAA yearly weather data files to be used in the temperature analysis portion of the overall project.
The goals of the project are:
- Review the data structure of the yearly weather data observation file.
- Filter for temperature observations.
- Process the data for future use.
- Perform data cleaning as needed.
- Create and save a tidy and clean data set.
This project uses the yearly weather data from 2016 as the input; the methodology detailed here will be integrated into the follow-up project, in which the yearly weather data from 1900-2016 will be processed.
The yearly weather data is located at the following:
The files range from the year 1763 through 2017. These files are gzip-compressed .csv files.
The specific file we will be using is:
The description of the file can be found at:
Dates and Times of Downloads
- 2016.csv.gz - 3/1/2017 2:51 PM
Libraries Required
library(dplyr) # Data manipulation
library(knitr) # Dynamic report generation
library(lubridate) # Date and time processing
library(reshape2) # Data transposition
Yearly Weather Data
The 2016 Daily Weather Data file contains the following:
FORMAT OF 2016.csv.gz
Variable | Length | Type |
---|---|---|
ID | 11 | Character |
YEAR/MONTH/DAY | 8 | Character |
ELEMENT | 4 | Character |
DATA VALUE | 5 | Character |
M-FLAG | 1 | Character |
Q-FLAG | 1 | Character |
S-FLAG | 1 | Character |
OBS-TIME | 4 | Character |
Read Weather Data File
The 2016 Daily Weather Data file is a gzip-compressed .csv file. This is a large file and will take a few moments to load.
Read 2016 Daily Weather Data File
wd_data <- read.csv("~/NOAA Data/RAW WD 1900-2016/2016.csv.gz", sep=",",
header=FALSE, stringsAsFactors = FALSE)
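As an aside, read.csv decompresses .gz files transparently, and on a file this size supplying colClasses up front can speed the read by skipping per-column type guessing. A minimal, self-contained sketch using a hypothetical temporary file (not the actual NOAA file):

```r
# Hypothetical demo: read.csv handles gzip transparently via a gzfile() connection,
# and an explicit colClasses avoids type guessing on large files.
tmp <- tempfile(fileext = ".csv.gz")
con <- gzfile(tmp, "w")
writeLines(c("US1FLSL0019,20160101,PRCP,3",
             "NOE00133566,20160101,TMAX,95"), con)
close(con)
demo <- read.csv(tmp, header = FALSE, stringsAsFactors = FALSE,
                 colClasses = c("character", "integer", "character", "integer"))
str(demo)  # V2 and V4 arrive as integers without guessing
```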
Data Exploration and Processing
Brief Exploration of the data
kable(head(wd_data, 10), caption = "wd_data")
V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 |
---|---|---|---|---|---|---|---|
US1FLSL0019 | 20160101 | PRCP | 3 | | | N | NA |
NOE00133566 | 20160101 | TMAX | 95 | | | E | NA |
NOE00133566 | 20160101 | TMIN | 23 | | | E | NA |
NOE00133566 | 20160101 | PRCP | 37 | | | E | NA |
USC00141761 | 20160101 | TMAX | 22 | | | 7 | 700 |
USC00141761 | 20160101 | TMIN | -89 | | | 7 | 700 |
USC00141761 | 20160101 | TOBS | -89 | | | 7 | 700 |
USC00141761 | 20160101 | PRCP | 0 | | | 7 | 700 |
USC00141761 | 20160101 | SNOW | 0 | | | 7 | NA |
USC00141761 | 20160101 | SNWD | 0 | | | 7 | 700 |
str(wd_data)
## 'data.frame': 32423277 obs. of 8 variables:
## $ V1: chr "US1FLSL0019" "NOE00133566" "NOE00133566" "NOE00133566" ...
## $ V2: int 20160101 20160101 20160101 20160101 20160101 20160101 20160101 20160101 20160101 20160101 ...
## $ V3: chr "PRCP" "TMAX" "TMIN" "PRCP" ...
## $ V4: int 3 95 23 37 22 -89 -89 0 0 0 ...
## $ V5: chr "" "" "" "" ...
## $ V6: chr "" "" "" "" ...
## $ V7: chr "N" "E" "E" "E" ...
## $ V8: int NA NA NA NA 700 700 700 700 NA 700 ...
read.csv has converted V2, V4, and V8 to integers.
The flag variables V5 (M-FLAG), V6 (Q-FLAG), and V7 (S-FLAG) record measurement, quality, and source information about each observation that is not needed for this analysis, so they can be removed from wd_data. We also do not need V8 (OBS-TIME), since we will be working with daily temperature readings.
Remove V5, V6, V7, V8
wd_data <- wd_data %>%
select(-V5,-V6,-V7,-V8)
kable(head(wd_data), caption = "wd_data")
V1 | V2 | V3 | V4 |
---|---|---|---|
US1FLSL0019 | 20160101 | PRCP | 3 |
NOE00133566 | 20160101 | TMAX | 95 |
NOE00133566 | 20160101 | TMIN | 23 |
NOE00133566 | 20160101 | PRCP | 37 |
USC00141761 | 20160101 | TMAX | 22 |
USC00141761 | 20160101 | TMIN | -89 |
Number of Observations and Number of Unique Stations
format(nrow(wd_data), big.mark = ",")
## [1] "32,423,277"
format(length(unique(wd_data$V1)), big.mark = ",")
## [1] "37,876"
Weather stations collect many different types of data including temperature, precipitation, wind speed, and so on. This project will focus on temperature-related observations only. The type of observation made is recorded in V3, the Element variable, and the value of the observation is in V4.
There are five core Elements:
- PRCP = Precipitation (tenths of mm)
- SNOW = Snowfall (mm)
- SNWD = Snow depth (mm)
- TMAX = Maximum temperature (tenths of degrees C)
- TMIN = Minimum temperature (tenths of degrees C)
Other Temperature related Elements are:
- TAVG = Average temperature (tenths of degrees C)
- TOBS = Temperature at the time of observation (tenths of degrees C)
The temperature at the time of observation is not of value to the study. Therefore, we can explore TMAX, TMIN, and TAVG.
Filter For Temperature Data Elements
temp_vals <- c("TMAX", "TMIN", "TAVG")
wd_data <- wd_data %>%
filter(V3 %in% temp_vals)
Number of Unique Temperature Stations and Element Counts
format(length(unique(wd_data$V1)), big.mark = ",")
## [1] "14,297"
format(table(wd_data$V3), big.mark = ",")
##
## TAVG TMAX TMIN
## "2,165,569" "4,178,373" "4,172,297"
The TAVG Element has approximately 2,000,000 fewer observations than TMAX or TMIN. It appears that TAVG is not consistently recorded by some of the observation stations. For that reason, we might want to focus on TMAX and TMIN only. However, since average temperature is the criterion we want to establish for a complete observation, and it is possible that TAVG is recorded by some stations when either TMAX or TMIN is not recorded, we need to set some rules for observation acceptance:
- If TAVG is recorded, keep the observation
- If TAVG is not recorded, use the average temperature from TMAX and TMIN
- If TAVG is not recorded and either TMAX or TMIN is not recorded, then the observation is removed
These rules will allow us to keep as many observations as possible while eliminating incomplete observations.
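As a quick illustration, the three acceptance rules can be sketched in base R on a toy data frame (hypothetical values, not the NOAA data); the actual imputation is performed later in the Data Imputing section:

```r
# Toy observations (hypothetical values) illustrating the acceptance rules
obs <- data.frame(TAVG = c(100, NA,  NA),
                  TMAX = c(NA, 200, 150),
                  TMIN = c(50, 100,  NA))
# Rules 1 and 2: keep TAVG when recorded, otherwise average TMAX and TMIN
obs$TAVG <- ifelse(is.na(obs$TAVG), (obs$TMAX + obs$TMIN) / 2, obs$TAVG)
# Rule 3: the third row has no TAVG and a missing TMIN, so it drops out
obs <- obs[!is.na(obs$TAVG), ]
obs$TAVG  # 100 (kept as recorded) and 150 (derived from TMAX and TMIN)
```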
Review Filtered Data Frame
kable(head(wd_data), caption = "wd_data")
V1 | V2 | V3 | V4 |
---|---|---|---|
NOE00133566 | 20160101 | TMAX | 95 |
NOE00133566 | 20160101 | TMIN | 23 |
USC00141761 | 20160101 | TMAX | 22 |
USC00141761 | 20160101 | TMIN | -89 |
USS0018D08S | 20160101 | TMAX | -25 |
USS0018D08S | 20160101 | TMIN | -177 |
The definition of Tidy Data by Hadley Wickham is:
Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.
Source: http://vita.had.co.nz/papers/tidy-data.html
In order for the weather data to be tidy, we need to make TAVG, TMAX, and TMIN individual variables with their own observation values.
We can accomplish this by reshaping the data so that TAVG, TMAX, and TMIN are variables and their corresponding values are observations, using dcast from the reshape2 package.
Reshape Data Frame
wd_data <- dcast(wd_data, V1 + V2 ~ V3, value.var = "V4")
kable(head(wd_data), caption = "wd_data")
V1 | V2 | TAVG | TMAX | TMIN |
---|---|---|---|---|
AE000041196 | 20160101 | 229 | NA | 155 |
AE000041196 | 20160102 | 243 | 293 | 188 |
AE000041196 | 20160103 | 204 | 235 | NA |
AE000041196 | 20160104 | 192 | 224 | NA |
AE000041196 | 20160105 | 188 | 226 | 139 |
AE000041196 | 20160106 | 201 | NA | NA |
Next we can rename the variables V1 and V2 to something meaningful.
Rename Variables
colnames(wd_data)[1] <- "ID"
colnames(wd_data)[2] <- "Date"
kable(head(wd_data), caption = "wd_data")
ID | Date | TAVG | TMAX | TMIN |
---|---|---|---|---|
AE000041196 | 20160101 | 229 | NA | 155 |
AE000041196 | 20160102 | 243 | 293 | 188 |
AE000041196 | 20160103 | 204 | 235 | NA |
AE000041196 | 20160104 | 192 | 224 | NA |
AE000041196 | 20160105 | 188 | 226 | 139 |
AE000041196 | 20160106 | 201 | NA | NA |
Review Completeness of Data
summary(wd_data[3:5])
## TAVG TMAX TMIN
## Min. :-999.0 Min. :-999.0 Min. :-999.0
## 1st Qu.: 25.0 1st Qu.: 91.0 1st Qu.: -11.0
## Median : 121.0 Median : 194.0 Median : 67.0
## Mean : 110.1 Mean : 177.5 Mean : 61.2
## 3rd Qu.: 214.0 3rd Qu.: 278.0 3rd Qu.: 144.0
## Max. :1356.0 Max. :1833.0 Max. :1356.0
## NA's :2422090 NA's :409286 NA's :415362
As expected, there are a large number of NA values in the TAVG, TMAX, and TMIN variables. NA refers to values that are Not Available. From an accuracy perspective, we want to have complete observations.
A complete observation, for the purposes of the project, is a valid TAVG value along with a station ID and observation Date.
Let’s start by looking at the number of observations compared to the number of complete observations we have right now.
Note: I have included a display of the code for reporting complete observation statistics using cat(paste()). For future reporting of information I will not be displaying the code, in order to remove unnecessary code chunks from the report.
Complete Observation Statistics
nbr_day <- 366 ## 2016 has 366 days
tot_obs <- nrow(wd_data)
com_obs <- nrow(subset(wd_data, !is.na(wd_data$TAVG)))
tot_sta <- length(unique(wd_data$ID))
pot_obs <- tot_sta * nbr_day
Report Complete Observation Statistics
cat(paste(" Number of Observation Stations: ", " ", format(tot_sta, big.mark = ","), "\n",
"Number of Potential Observations: ", format(pot_obs, big.mark = ","), "\n",
" Number of Actual Observations: ", format(tot_obs, big.mark = ","), "\n",
" Number of Complete Observations: ", format(com_obs, big.mark = ","), sep = ""))
## Number of Observation Stations: 14,297
## Number of Potential Observations: 5,232,702
## Number of Actual Observations: 4,587,659
## Number of Complete Observations: 2,165,569
The number of potential observations is the total number of observation stations times the potential number of observations for the year. In the case of 2016, the number of days is 366.
Note: The number of complete observations does not imply that they are accurate.
At this point, we will need to derive, or impute, an average temperature for observations in which TAVG is NA.
In statistics, imputation is the process of replacing missing data with substituted values.
Source: https://en.wikipedia.org/wiki/Imputation_(statistics)
Data Cleaning
The first step in cleaning the data is to remove values that are invalid. The first invalid value we will look for is -999. When a temperature observation cannot be made, the recorded value will either be missing (NA) or contain -999. To make the upcoming processing and imputing steps cleaner, we will substitute -999 with NA.
Replace Invalid Values
summary(wd_data[3:5])
## TAVG TMAX TMIN
## Min. :-999.0 Min. :-999.0 Min. :-999.0
## 1st Qu.: 25.0 1st Qu.: 91.0 1st Qu.: -11.0
## Median : 121.0 Median : 194.0 Median : 67.0
## Mean : 110.1 Mean : 177.5 Mean : 61.2
## 3rd Qu.: 214.0 3rd Qu.: 278.0 3rd Qu.: 144.0
## Max. :1356.0 Max. :1833.0 Max. :1356.0
## NA's :2422090 NA's :409286 NA's :415362
wd_data$TAVG <- replace(wd_data$TAVG, wd_data$TAVG == -999, NA)
wd_data$TMAX <- replace(wd_data$TMAX, wd_data$TMAX == -999, NA)
wd_data$TMIN <- replace(wd_data$TMIN, wd_data$TMIN == -999, NA)
summary(wd_data[3:5])
## TAVG TMAX TMIN
## Min. :-930.0 Min. :-721.0 Min. :-816.0
## 1st Qu.: 25.0 1st Qu.: 91.0 1st Qu.: -11.0
## Median : 121.0 Median : 194.0 Median : 67.0
## Mean : 110.2 Mean : 177.5 Mean : 61.2
## 3rd Qu.: 214.0 3rd Qu.: 278.0 3rd Qu.: 144.0
## Max. :1356.0 Max. :1833.0 Max. :1356.0
## NA's :2422210 NA's :409370 NA's :415451
The NA counts for the three temperature values increased slightly.
The next cleaning step is to check for temperatures at the extremes to determine if they are greater than established record highs or record lows.
The Elements TAVG, TMAX, and TMIN record observations in tenths of a degree Celsius. This project will convert these observations to Fahrenheit. This step can be modified to keep Celsius as the standard of measurement. If you wish to have the Celsius observations in the normal format, simply divide the values by 10 to get the full Celsius value. When checking the range of valid temperatures against all-time high and low values, make sure to adjust the temperatures detailed below to their Celsius equivalents.
Celsius Adjustment (not executed, code example only)
wd_data$TAVG <- wd_data$TAVG/10
wd_data$TMAX <- wd_data$TMAX/10
wd_data$TMIN <- wd_data$TMIN/10
Fahrenheit Adjustment (executed)
wd_data$TAVG <- ((wd_data$TAVG/10*1.8) + 32)
wd_data$TMAX <- ((wd_data$TMAX/10*1.8) + 32)
wd_data$TMIN <- ((wd_data$TMIN/10*1.8) + 32)
summary(wd_data[3:5])
## TAVG TMAX TMIN
## Min. :-135.4 Min. :-97.8 Min. :-114.9
## 1st Qu.: 36.5 1st Qu.: 48.4 1st Qu.: 30.0
## Median : 53.8 Median : 66.9 Median : 44.1
## Mean : 51.8 Mean : 63.9 Mean : 43.0
## 3rd Qu.: 70.5 3rd Qu.: 82.0 3rd Qu.: 57.9
## Max. : 276.1 Max. :361.9 Max. : 276.1
## NA's :2422210 NA's :409370 NA's :415451
The highest recorded temperature was 134.1 degrees Fahrenheit on July 10, 1913 in Death Valley, California, United States according to the World Meteorological Organisation.
The WMO has observed higher maximum temperatures, but they have been dismissed due to special circumstances such as asphalt surfaces near the station. The standard measuring conditions for temperature are in the air, 1.5 meters above the ground, and shielded from direct sunlight. Temperatures measured directly on the ground may exceed air temperatures by over 80 degrees Fahrenheit.
Source: http://www.wmo.int/
The lowest recorded temperature was -128.6 degrees Fahrenheit on July 21, 1983 at Vostok Station, Antarctica, according to the World Meteorological Organisation.
Let’s take a look at how many observations exceed the maximum and minimum records according to the WMO.
Check for Extreme Observations
nrow(filter(wd_data, TAVG > 134.1 | TMAX > 134.1 | TMIN > 134.1))
## [1] 55
nrow(filter(wd_data, TAVG < -128.6 | TMAX < -128.6 | TMIN < -128.6))
## [1] 2
A total of 55 observations have temperature readings greater than the record high and 2 have readings less than the record low, as observed by the WMO.
This is a very small number when compared against the total number of observations. We will consider these observations invalid and substitute them with NA.
Replace Extreme Observations
wd_data$TAVG <- replace(wd_data$TAVG, wd_data$TAVG > 134.1, NA)
wd_data$TMAX <- replace(wd_data$TMAX, wd_data$TMAX > 134.1, NA)
wd_data$TMIN <- replace(wd_data$TMIN, wd_data$TMIN > 134.1, NA)
wd_data$TAVG <- replace(wd_data$TAVG, wd_data$TAVG < -128.6, NA)
wd_data$TMAX <- replace(wd_data$TMAX, wd_data$TMAX < -128.6, NA)
wd_data$TMIN <- replace(wd_data$TMIN, wd_data$TMIN < -128.6, NA)
The final cleaning step is a comparison check between TMAX and TMIN. Since we will be imputing the average daily temperature for observations in which TAVG is not recorded, we should check for conditions in which TMIN is recorded as being greater than TMAX.
Condition Check for Temperatures
nrow(filter(wd_data, TMIN > TMAX))
## [1] 259
Again, there are a small number of observations, 259, that meet this condition. We will invalidate these observations as before. For our purposes, we will consider TMIN as the invalid value. The same result can be accomplished by invalidating TMAX, or both values. This will be explained later in the Data Imputing section.
Replace Invalid Observations
wd_data$TMIN <- replace(wd_data$TMIN, wd_data$TMIN > wd_data$TMAX, NA)
summary(wd_data[3:5])
## TAVG TMAX TMIN
## Min. :-111.6 Min. :-97.8 Min. :-114.9
## 1st Qu.: 36.5 1st Qu.: 48.4 1st Qu.: 30.0
## Median : 53.8 Median : 66.9 Median : 44.1
## Mean : 51.8 Mean : 63.9 Mean : 43.0
## 3rd Qu.: 70.5 3rd Qu.: 82.0 3rd Qu.: 57.9
## Max. : 134.1 Max. :134.1 Max. : 134.1
## NA's :2422243 NA's :409425 NA's :415738
Let’s review the updated number of observations compared to the number of complete observations.
Complete Observation Statistics
tot_obs <- nrow(wd_data)
com_obs <- nrow(subset(wd_data, !is.na(wd_data$TAVG)))
tot_sta <- length(unique(wd_data$ID))
pot_obs <- tot_sta * nbr_day
Report Complete Observation Statistics
## Number of Observation Stations: 14,297
## Number of Potential Observations: 5,232,702
## Number of Actual Observations: 4,587,659
## Number of Complete Observations: 2,165,416
At this point, we have completed our preliminary cleaning of the observation data. We can move on to calculating an average daily temperature for observations in which TAVG is NA.
Data Imputing
Brief Review of the Data
kable(head(wd_data), caption = "wd_data")
ID | Date | TAVG | TMAX | TMIN |
---|---|---|---|---|
AE000041196 | 20160101 | 73.22 | NA | 59.90 |
AE000041196 | 20160102 | 75.74 | 84.74 | 65.84 |
AE000041196 | 20160103 | 68.72 | 74.30 | NA |
AE000041196 | 20160104 | 66.56 | 72.32 | NA |
AE000041196 | 20160105 | 65.84 | 72.68 | 57.02 |
AE000041196 | 20160106 | 68.18 | NA | NA |
Roughly half of the observations have a valid value for TAVG. For the remaining observations we will derive a daily average temperature based on TMAX and TMIN.
There are roughly five different methods for calculating average temperatures. Since we do not have temperature readings by hour or minute, the most common method remaining is to take the mean of the daily high and daily low, TMAX and TMIN respectively.
There are some biases associated with this method, however. The National Research Center for Statistics and the Environment at the University of Washington produced a very good paper on the differing methods for computing mean daily temperature.
http://www.nrcse.washington.edu/NordicNetwork/reports/temp.pdf
However, calculating the mean daily temperature this way is the same process that NOAA uses.
The most common processing steps used to turn daily weather data into climate data products involve straightforward mathematical operations such as averaging or adding measurements. For instance, a station’s daily average (mean) temperature is calculated by averaging the minimum and maximum temperatures for the day: daily average = (minimum + maximum) / 2.
Source: https://www.climate.gov/maps-data/primer/processing-climate-data
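NOAA's formula is trivial to express in R. For example, with hypothetical minimum and maximum readings (already converted to Fahrenheit):

```r
# NOAA's daily mean: the average of the day's minimum and maximum temperatures
daily_avg <- function(tmin, tmax) (tmin + tmax) / 2
daily_avg(tmin = 56.48, tmax = 80.24)  # 68.36
```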
When the average temperature is created from TMAX and TMIN, and one or both of the values is NA, the result will be NA as well.
The first step will be to create a new variable for the derived average temperature, DAVG.
Create Derived Average Temperature
wd_data <- wd_data %>%
mutate(DAVG = (TMAX + TMIN) / 2)
kable(head(wd_data, 10), caption = "wd_data")
ID | Date | TAVG | TMAX | TMIN | DAVG |
---|---|---|---|---|---|
AE000041196 | 20160101 | 73.22 | NA | 59.90 | NA |
AE000041196 | 20160102 | 75.74 | 84.74 | 65.84 | 75.29 |
AE000041196 | 20160103 | 68.72 | 74.30 | NA | NA |
AE000041196 | 20160104 | 66.56 | 72.32 | NA | NA |
AE000041196 | 20160105 | 65.84 | 72.68 | 57.02 | 64.85 |
AE000041196 | 20160106 | 68.18 | NA | NA | NA |
AE000041196 | 20160107 | 67.28 | 74.12 | NA | NA |
AE000041196 | 20160108 | 66.74 | 77.00 | 58.46 | 67.73 |
AE000041196 | 20160109 | 67.28 | 80.24 | 56.48 | 68.36 |
AE000041196 | 20160110 | 69.26 | 79.34 | 57.92 | 68.63 |
Note: The derived average temperature, DAVG, does not need to be equal to TAVG. This is because TAVG is the 24-hour average temperature recorded for the period ending at 2400 UTC rather than local midnight.
The next step will replace the value of TAVG with the value of DAVG if the current value of TAVG is NA. If the values of both TAVG and DAVG are NA, then the observation is invalid or incomplete, with NA as its observation.
Update Average Temperature
summary(wd_data$TAVG)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -111.6 36.5 53.8 51.8 70.5 134.1 2422243
wd_data <- wd_data %>%
transform(TAVG = ifelse(is.na(TAVG), DAVG, TAVG))
summary(wd_data$TAVG)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -111.64 39.47 55.49 53.53 70.34 134.06 25579
Let’s review the results of the imputation on the number of observations compared to the number of complete observations.
Complete observation statistics
tot_obs <- nrow(wd_data)
com_obs <- nrow(subset(wd_data, !is.na(wd_data$TAVG)))
tot_sta <- length(unique(wd_data$ID))
pot_obs <- tot_sta * nbr_day
Report complete observation statistics
## Number of Observation Stations: 14,297
## Number of Potential Observations: 5,232,702
## Number of Actual Observations: 4,587,659
## Number of Complete Observations: 4,562,080
The final steps remaining are to remove the variables in wd_data no longer needed (TMAX, TMIN, DAVG) and to remove the remaining invalid observations. The removal of the invalid observations will be accomplished by filtering the data based on complete.cases, meaning there are no NA values present in any of the observational variables. This is why it didn’t matter whether we replaced TMIN or TMAX values previously: the average calculation would have resulted in NA either way, and thus the observation would be removed during the complete.cases filtering. This is also why we need to remove the unneeded variables first: they may contain NA values, and complete.cases would remove the entire observation even though we have a complete observation by our earlier definition.
Remove Unneeded Variables
wd_data <- wd_data %>%
select(-TMAX, -TMIN, -DAVG)
Remove Invalid Observations
summary(wd_data$TAVG)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -111.64 39.47 55.49 53.53 70.34 134.06 25579
wd_data <- wd_data[complete.cases(wd_data),]
summary(wd_data$TAVG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -111.64 39.47 55.49 53.53 70.34 134.06
Let’s review the final statistics for the cleaned data and include the overall percentage of completeness.
Complete Observation Statistics
tot_obs <- nrow(wd_data)
com_obs <- nrow(subset(wd_data, !is.na(wd_data$TAVG)))
tot_sta <- length(unique(wd_data$ID))
pot_obs <- tot_sta * nbr_day
pct_com <- round(com_obs/pot_obs * 100, 2)
Report Complete Observation Statistics
## Number of Observation Stations: 14,285
## Number of Potential Observations: 5,228,310
## Number of Actual Observations: 4,562,080
## Number of Complete Observations: 4,562,080
## Completeness Percentage: 87.26%
- We lost 12 observation stations in the process of cleaning the data
- The completeness percentage falls well below 95%
Data Completion
We will be summarizing the 2016 Daily Weather Data by:
- Day of the year
- Week of the year
- Month of the year
The current Date variable is in integer format, so we will use the lubridate package to perform some date coercions. The first step is to convert Date to date format. Next, we can get the Yday, Week, and Month values.
Manipulate Dates
wd_data$Date <- ymd(wd_data$Date)
wd_data$Month <- month(wd_data$Date)
wd_data$Week <- week(wd_data$Date)
wd_data$Yday <- yday(wd_data$Date)
kable(head(wd_data), caption = "wd_data")
ID | Date | TAVG | Month | Week | Yday |
---|---|---|---|---|---|
AE000041196 | 2016-01-01 | 73.22 | 1 | 1 | 1 |
AE000041196 | 2016-01-02 | 75.74 | 1 | 1 | 2 |
AE000041196 | 2016-01-03 | 68.72 | 1 | 1 | 3 |
AE000041196 | 2016-01-04 | 66.56 | 1 | 1 | 4 |
AE000041196 | 2016-01-05 | 65.84 | 1 | 1 | 5 |
AE000041196 | 2016-01-06 | 68.18 | 1 | 1 | 6 |
We now have the needed variables and format required for analyzing and summarizing the data. We can remove the variable Date and reorder the variables in wd_data for cleaner viewing. We can accomplish both steps with one command.
Remove Date and Reorder
wd_data <- wd_data %>%
select(ID, Month, Week, Yday, TAVG)
kable(head(wd_data), caption = "wd_data")
ID | Month | Week | Yday | TAVG |
---|---|---|---|---|
AE000041196 | 1 | 1 | 1 | 73.22 |
AE000041196 | 1 | 1 | 2 | 75.74 |
AE000041196 | 1 | 1 | 3 | 68.72 |
AE000041196 | 1 | 1 | 4 | 66.56 |
AE000041196 | 1 | 1 | 5 | 65.84 |
AE000041196 | 1 | 1 | 6 | 68.18 |
Saving Temperature Data
The completed yearly temperature data can now be saved in .Rds format to be used in subsequent projects.
Save Temperature Data
saveRDS(wd_data, "~/NOAA Data/RDS WDT 1900-2016/ww_data_2016.rds")
Key Findings
The most important finding from processing the 2016 yearly weather data is the level of completeness for temperature observations. The final number of temperature stations in the 2016 data collection is 14,285 and the percentage of completion is 87.26%.
Using very liberal and accommodating acceptance criteria, more than 1 out of 10 observations is missing from the record. When comparing that statistic to the claim that NOAA and NASA believe worldwide temperatures are increasing with a scientific certainty greater than 95%, one has to wonder about the validity of the assertion.
This doesn’t mean that the claim that temperatures are increasing is necessarily false. It simply means that the missing data would hinder a true scientific conclusion. Therefore, it has to be assumed that the missing data is ignored or that it is “filled in” using imputational means.
Based on the findings from the previous projects in this study, a key item to explore will be the completeness of observations across Hemispheres and Quadrants. Since each area has its own unique historical temperature patterns, the completeness of observations will be critical to determine the impact on global averages.
In addition, determining if there is any pattern related to observation completion from a seasonal perspective (fewer observations in winter periods) will be valuable as well.
These items will be covered in future projects as part of the study.
Next Steps
The next project will create a processing loop that will perform the cleaning and preparation of the RAW yearly weather data using the criteria from this project and apply it to the years 1900-2016.
In addition, the project will record pertinent statistics by year for overall reporting of temperature observation data.
sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 14393)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] reshape2_1.4.2 lubridate_1.6.0 knitr_1.15.1 dplyr_0.5.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.10 digest_0.6.12 rprojroot_1.2 assertthat_0.2.0
## [5] plyr_1.8.4 R6_2.2.0 DBI_0.6-1 backports_1.0.5
## [9] magrittr_1.5 evaluate_0.10 highr_0.6 stringi_1.1.5
## [13] lazyeval_0.2.0 rmarkdown_1.5 tools_3.4.0 stringr_1.2.0
## [17] yaml_2.1.14 compiler_3.4.0 htmltools_0.3.6 tibble_1.3.0
Original article: https://rstudio-pubs-static.s3.amazonaws.com/275033_a87f8c52176d4aed8880d56dbf6c3ada.html