Data frames (from STHDA)
A data frame is like a matrix but can have columns with different types (numeric, character, logical).
Rows are observations (individuals) and columns are variables.
Create a data frame
Use the function data.frame(), as follows:
# Create a data frame
friends_data <- data.frame(name = c("A", "B", "C", "D"),
                           age = c(25, 27, 26, 29),
                           height = c(180, 170, 185, 169),
                           married = c(TRUE, FALSE, FALSE, TRUE))
friends_data # Print
To check whether an object is a data frame, use the is.data.frame() function. It returns TRUE if the object is a data frame.
is.data.frame(friends_data)
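To see the mixed column types directly, here is a minimal sketch. It rebuilds friends_data so that it runs on its own, and it passes stringsAsFactors = FALSE (an addition not in the original code) so that the name column stays character in R versions older than 4.0 as well.

```r
# Rebuild the example data frame so this snippet is self-contained
friends_data <- data.frame(
  name = c("A", "B", "C", "D"),
  age = c(25, 27, 26, 29),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE  # assumption: keep 'name' as character in any R version
)

str(friends_data)                         # One line per column, with its type
col_types <- sapply(friends_data, class)  # Class of each column
col_types
dim(friends_data)                         # Number of rows, number of columns
```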
col1 <- c(5, 6, 7, 8, 9)  # Numeric vector
col2 <- c(2, 4, 5, 9, 8)  # Numeric vector
col3 <- c(7, 3, 4, 8, 7)  # Numeric vector
my_data <- cbind(col1, col2, col3)  # Combine the vectors by column
my_data
is.data.frame(my_data)
The object “friends_data” is a data frame, but the object “my_data” is not: cbind() applied to vectors returns a matrix. We can convert it to a data frame using the as.data.frame() function:
class(my_data)  # What is the class of my_data? --> "matrix" (R >= 4.0 prints "matrix" "array")
my_data2 <- as.data.frame(my_data)  # Convert it to a data frame
class(my_data2)  # Check the class of the result --> "data.frame"
As described in the matrix section, you can use the function t() to transpose a data frame:
t(friends_data)
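One point worth checking when transposing: t() on a data frame returns a matrix, and because a matrix holds a single type, mixed columns are converted to a common type (character here). A minimal sketch, rebuilding friends_data so that it runs on its own:

```r
# Rebuild the example data frame so this snippet is self-contained
friends_data <- data.frame(
  name = c("A", "B", "C", "D"),
  age = c(25, 27, 26, 29),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

transposed <- t(friends_data)
is.data.frame(transposed)  # The result is no longer a data frame
is.matrix(transposed)      # It is a matrix
typeof(transposed)         # All values were converted to one common type
dim(transposed)            # Former columns are now rows
```

If a data frame is needed afterwards, the result can be passed to as.data.frame(), but the numeric values will still be stored as character.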
Subset a data frame
To select just certain columns from a data frame, you can either refer to the columns by name or by their location (i.e., column 1, 2, 3, etc.).
- Positive indexing by name and by location
Select rows/columns by positive indexing, using row/column names or locations
# Access the data in the 'name' column
# using the dollar sign ($)
friends_data$name
# or use this
friends_data[, 'name']
# Subset columns 1 and 3
friends_data[ , c(1, 3)]
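Selecting a single column with [ , ] returns a plain vector rather than a data frame, which can be surprising. The sketch below shows a few base R variants side by side; the variable names are only illustrative, and friends_data is rebuilt so that the snippet runs on its own.

```r
# Rebuild the example data frame so this snippet is self-contained
friends_data <- data.frame(
  name = c("A", "B", "C", "D"),
  age = c(25, 27, 26, 29),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

name_vec  <- friends_data[, "name"]                # Single column: returns a vector
name_df   <- friends_data[, "name", drop = FALSE]  # drop = FALSE keeps a data frame
name_df2  <- friends_data["name"]                  # One index, no comma: also a data frame
name_vec2 <- friends_data[["name"]]                # Double brackets: the column itself
first_two <- friends_data[1:2, ]                   # Rows 1 and 2, all columns
```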
- Negative indexing
Exclude rows/columns by negative indexing
# Exclude column 1
friends_data[, -1]
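Negative indexing also works for rows and for several columns at once. It only accepts positions, so excluding a column by name needs a logical index instead. A minimal sketch with illustrative variable names, rebuilding friends_data so that it runs on its own:

```r
# Rebuild the example data frame so this snippet is self-contained
friends_data <- data.frame(
  name = c("A", "B", "C", "D"),
  age = c(25, 27, 26, 29),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

no_first_row <- friends_data[-1, ]        # Exclude row 1
no_cols_1_3  <- friends_data[, -c(1, 3)]  # Exclude columns 1 and 3
# Exclude a column by name: keep the columns whose name is not "height"
no_height    <- friends_data[, !(names(friends_data) %in% "height")]
```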
- Index by characteristics
Selection by logical vector (TRUE/FALSE)
# We want to select all friends with age >= 27.
# Identify the rows that meet the condition. This returns a logical vector:
# TRUE means the row has age >= 27, FALSE means it does not.
friends_data$age >= 27
# Select the rows that meet the condition
friends_data[friends_data$age >= 27, ]
# The R code above tells R to get all rows from friends_data where age >= 27, and then to return all the columns.
# If you don’t want all the columns for the selected rows and only need, for example, the names and ages of friends with age >= 27, you can use the following R code:
friends_data[friends_data$age >= 27, c(1, 2)]# Use column locations
# Or use column names
friends_data[friends_data$age >= 27, c("name", "age")]
If your selection statement is becoming inconvenient, you can put your row and column selections into variables first, and then select the rows and columns with those variables:
age27 <- friends_data$age >= 27
cols <- c("name", "age")
friends_data[age27, cols]
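Logical conditions can be combined with & (and) and | (or), and which() turns a logical vector into row positions. The sketch below is one way to write this; the conditions and variable names are chosen only for illustration, and friends_data is rebuilt so that the snippet runs on its own.

```r
# Rebuild the example data frame so this snippet is self-contained
friends_data <- data.frame(
  name = c("A", "B", "C", "D"),
  age = c(25, 27, 26, 29),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Friends with age >= 27 AND married
older_married <- friends_data[friends_data$age >= 27 & friends_data$married, ]
# Friends with age < 26 OR height > 180
young_or_tall <- friends_data[friends_data$age < 26 | friends_data$height > 180, ]
# Row positions (not TRUE/FALSE) of the friends with age >= 27
idx <- which(friends_data$age >= 27)
friends_data[idx, ]
```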
- Function: subset()
It’s also possible to use the function subset(), as follows:
# Select friends data with age >= 27
subset(friends_data, age >= 27)
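subset() can also pick columns through its select argument, which covers the "rows and columns" case in a single call. A minimal sketch with illustrative variable names, rebuilding friends_data so that it runs on its own:

```r
# Rebuild the example data frame so this snippet is self-contained
friends_data <- data.frame(
  name = c("A", "B", "C", "D"),
  age = c(25, 27, 26, 29),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Rows with age >= 27, keeping only the name and age columns
older <- subset(friends_data, age >= 27, select = c(name, age))
# Same rows, keeping every column except height
older_no_height <- subset(friends_data, age >= 27, select = -height)
```

The R help page for subset() describes it as a convenience function intended for interactive use, so bracket indexing may be the safer choice inside your own functions.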
- Functions: attach() and detach()
Another option is to use the functions attach() and detach().
The function attach() takes a data frame and makes its columns accessible by simply giving their names. The function detach() undoes this once you are done.
They are used as follows:
# Attach a data frame
attach(friends_data)
# === Data manipulation ====
friends_data[age>=27, ]
# === End of data manipulation ====
# Detach the data frame
detach(friends_data)
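As a possible alternative that is not covered in the text above, the base R function with() evaluates one expression with the columns visible by name, so there is nothing to detach afterwards. A minimal sketch with illustrative variable names, rebuilding friends_data so that it runs on its own:

```r
# Rebuild the example data frame so this snippet is self-contained
friends_data <- data.frame(
  name = c("A", "B", "C", "D"),
  age = c(25, 27, 26, 29),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Inside with(), 'age' refers to friends_data$age
older <- with(friends_data, friends_data[age >= 27, ])
mean_age <- with(friends_data, mean(age))
```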
Extend a data frame
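a. Use $ to add a new column to a data frame:
# friend_groups is not defined in this section; the values below are an assumed example (one value per row)
friend_groups <- factor(c("grp1", "grp2", "grp1", "grp2"))
# Add group column to friends_data
friends_data$group <- friend_groups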
friends_data
b. It’s also possible to use the functions cbind() and rbind() to extend a data frame.
cbind(friends_data, group = friend_groups)
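rbind() is mentioned above without an example. The sketch below adds one row; the new friend "E" and its values are invented purely for illustration. The row being added needs the same column names as the data frame it is appended to.

```r
# Rebuild the example data frame so this snippet is self-contained
friends_data <- data.frame(
  name = c("A", "B", "C", "D"),
  age = c(25, 27, 26, 29),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical new observation (values made up for this example)
new_friend <- data.frame(
  name = "E", age = 31, height = 175, married = FALSE,
  stringsAsFactors = FALSE
)

# Append the new row; column names must match
extended <- rbind(friends_data, new_friend)
extended
```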
Calculations with data frame or matrix
With a numeric data frame, you can use the functions rowSums(), colSums(), colMeans(), rowMeans() and apply(), as described in the matrix section.
rowSums() and colSums() compute the total of each row and the total of each column, respectively.
It’s also possible to perform simple operations on matrices. For example, my_data*2 multiplies each element of the matrix by 2.
Note that it’s also possible to use the function apply() to apply any statistical function to the rows/columns of a matrix.
These operations are used as follows:
my_data
my_data*2
log2(my_data)#compute the log2 values
rowSums(my_data)# Total of each row
colSums(my_data)# Total of each column
# apply(X, MARGIN, FUN)
#   X: your data matrix
#   MARGIN: possible values are 1 (for rows) and 2 (for columns)
#   FUN: the function to apply on rows/columns
apply(my_data, 1, mean) # Compute row means
apply(my_data, 1, median)# Compute row medians
apply(my_data, 2, mean)# Compute column means
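As a quick check on the calls above, apply() with mean should agree with rowMeans() and colMeans(), and apply() also accepts a function you write yourself. A minimal sketch: the range function below is only an illustration, and my_data is rebuilt so that the snippet runs on its own.

```r
# Rebuild the example matrix so this snippet is self-contained
col1 <- c(5, 6, 7, 8, 9)
col2 <- c(2, 4, 5, 9, 8)
col3 <- c(7, 3, 4, 8, 7)
my_data <- cbind(col1, col2, col3)

row_means_apply <- apply(my_data, 1, mean)  # Row means via apply()
row_means_short <- rowMeans(my_data)        # Same result with the dedicated function
all.equal(row_means_apply, row_means_short)

col_means <- colMeans(my_data)              # Column means

# apply() with a custom function: max minus min of each column
col_ranges <- apply(my_data, 2, function(x) max(x) - min(x))
col_ranges
```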