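R can import data from the keyboard, text files, Microsoft Excel and Access, popular statistical packages, specially formatted files, a variety of relational database management systems, specialty databases, and web sites and online services.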
image.png
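1. Entering data from the keyboard
Perhaps the simplest way to enter data is from the keyboard. There are two common approaches:
- Use R's built-in text editor. The edit() function opens an editor that lets you enter data by hand; the figure shows the result of calling edit() on Windows. Clicking a column header lets you change the variable's name and type (numeric or character), and clicking the header of an unused column adds a new variable. edit() works on a copy of the object, so the result must be assigned back: when the editor is closed, the result is saved to the object it is assigned to (mydata in this example). Calling mydata <- edit(mydata) again lets you edit the data already entered and add more. A shorthand equivalent of mydata <- edit(mydata) is fix(mydata).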
image.png
- Embed the dataset directly in your program.
image.png
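As a minimal sketch of the two keyboard approaches above: the block embeds a small dataset as a string and parses it with read.table(text = ...). The variable names and values (mydatatxt, age, gender, weight) are illustrative assumptions, not data from the original screenshots. The edit()/fix() calls are left commented out because they only work in an interactive session.

```r
# Embed a small dataset in the script as a string and parse it.
# The values below are illustrative only.
mydatatxt <- "
age gender weight
25 m 166
30 f 115
18 f 120
"

# The leading blank line is skipped (blank.lines.skip = TRUE by default).
mydata <- read.table(header = TRUE, text = mydatatxt, stringsAsFactors = FALSE)
str(mydata)

# Interactive alternative: create an empty data frame, then fill it in the editor.
# mydata <- data.frame(age = numeric(0), gender = character(0), weight = numeric(0))
# mydata <- edit(mydata)   # shorthand: fix(mydata)
```

Embedding the data keeps the script self-contained, which is convenient for small examples; for anything larger, reading from a file is usually easier to maintain.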
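2. Importing data from a delimited text file
You can import data from a delimited text file with read.table(), which reads a file in table format and saves it as a data frame. Each row of the table appears as one line in the file. The syntax is: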
read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)

read.csv2(file, header = TRUE, sep = ";", quote = "\"",
          dec = ",", fill = TRUE, comment.char = "", ...)

read.delim(file, header = TRUE, sep = "\t", quote = "\"",
           dec = ".", fill = TRUE, comment.char = "", ...)

read.delim2(file, header = TRUE, sep = "\t", quote = "\"",
            dec = ",", fill = TRUE, comment.char = "", ...)
Options for the read.table() function
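To make a few of these options concrete, here is a small sketch that writes an illustrative comma-delimited file to a temporary location and reads it back. The file contents and the names csv_path and grades are assumptions made up for the example. Note that the usage listing above reflects older R releases: since R 4.0.0 the default is stringsAsFactors = FALSE, so character columns are no longer converted to factors unless you ask for it.

```r
# Write a small comma-delimited file to a temporary location (illustrative data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,name,score",
  "1,Ann,90.5",
  "2,Bob,-",
  "3,Cid,78"
), csv_path)

# header = TRUE: the first line holds the variable names.
# sep = ",": fields are separated by commas.
# na.strings: treat "-" (as well as "NA") as a missing value.
# colClasses: fix each column's type instead of letting R guess.
grades <- read.table(
  csv_path,
  header = TRUE,
  sep = ",",
  na.strings = c("NA", "-"),
  colClasses = c("integer", "character", "numeric")
)
str(grades)

# read.csv() is read.table() with header = TRUE and sep = "," preset.
grades2 <- read.csv(csv_path, na.strings = c("NA", "-"))
```

Setting colClasses explicitly is optional, but it avoids surprises from type guessing and can speed up reading large files.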
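3. Importing data from Excel
A straightforward way to read an Excel file is to export it from Excel as a comma-delimited (csv) file and import it into R as described above. Alternatively, you can import Excel worksheets directly with the xlsx package.
The readxl package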
install.packages("readxl")
library(readxl)
Usage
read_excel(path, sheet = NULL, range = NULL, col_names = TRUE,
col_types = NULL, na = "", trim_ws = TRUE, skip = 0,
n_max = Inf, guess_max = min(1000, n_max),
progress = readxl_progress(), .name_repair = "unique")
read_xls(path, sheet = NULL, range = NULL, col_names = TRUE,
col_types = NULL, na = "", trim_ws = TRUE, skip = 0,
n_max = Inf, guess_max = min(1000, n_max),
progress = readxl_progress(), .name_repair = "unique")
read_xlsx(path, sheet = NULL, range = NULL, col_names = TRUE,
col_types = NULL, na = "", trim_ws = TRUE, skip = 0,
n_max = Inf, guess_max = min(1000, n_max),
progress = readxl_progress(), .name_repair = "unique")
Arguments
path
Path to the xls/xlsx file.
sheet
Sheet to read. Either a string (the name of a sheet), or an integer (the position of the sheet). Ignored if the sheet is specified via range. If neither argument specifies the sheet, defaults to the first sheet.
range
A cell range to read from, as described in cell-specification. Includes typical Excel ranges like "B3:D87", possibly including the sheet name like "Budget!B2:G14", and more. Interpreted strictly, even if the range forces the inclusion of leading or trailing empty rows or columns. Takes precedence over skip, n_max and sheet.
col_names
TRUE to use the first row as column names, FALSE to get default names, or a character vector giving a name for each column. If user provides col_types as a vector, col_names can have one entry per column, i.e. have the same length as col_types, or one entry per unskipped column.
col_types
Either NULL to guess all from the spreadsheet or a character vector containing one entry per column from these options: "skip", "guess", "logical", "numeric", "date", "text" or "list". If exactly one col_type is specified, it will be recycled. The content of a cell in a skipped column is never read and that column will not appear in the data frame output. A list cell loads a column as a list of length 1 vectors, which are typed using the type guessing logic from col_types = NULL, but on a cell-by-cell basis.
na
Character vector of strings to interpret as missing values. By default, readxl treats blank cells as missing data.
trim_ws
Should leading and trailing whitespace be trimmed?
skip
Minimum number of rows to skip before reading anything, be it column names or data. Leading empty rows are automatically skipped, so this is a lower bound. Ignored if range is given.
n_max
Maximum number of data rows to read. Trailing empty rows are automatically skipped, so this is an upper bound on the number of rows in the returned tibble. Ignored if range is given.
guess_max
Maximum number of data rows to use for guessing column types.
progress
Display a progress spinner? By default, the spinner appears only in an interactive session, outside the context of knitting a document, and when the call is likely to run for several seconds or more. See readxl_progress() for more details.
.name_repair
Handling of column names. By default, readxl ensures column names are not empty and are unique. If the tibble package version is recent enough, there is full support for .name_repair as documented in tibble::tibble(). If an older version of tibble is present, readxl falls back to name repair in the style of tibble v1.4.2.
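A short sketch of how some of these arguments are used in practice. It relies on the example workbook datasets.xlsx that readxl ships and exposes through readxl_example(); the object names (path, sheets, iris_head, mtcars_part) are chosen here for illustration. The code is guarded so that it does nothing if readxl is not installed.

```r
if (requireNamespace("readxl", quietly = TRUE)) {
  # Path to an example workbook bundled with the readxl package.
  path <- readxl::readxl_example("datasets.xlsx")

  # List the worksheet names in the workbook.
  sheets <- readxl::excel_sheets(path)
  print(sheets)

  # range: read only cells A1:C4 of the first sheet
  # (one header row plus three data rows, three columns).
  iris_head <- readxl::read_excel(path, range = "A1:C4")

  # sheet + n_max: read at most five data rows from the sheet named "mtcars".
  mtcars_part <- readxl::read_excel(path, sheet = "mtcars", n_max = 5)

  print(dim(iris_head))
  print(dim(mtcars_part))
}
```

read_excel() returns a tibble; wrap the result in as.data.frame() if a plain data frame is needed.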
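4. Importing data from SPSS
IBM SPSS datasets can be imported into R with the read.spss() function in the foreign package, or with the spss.get() function in the Hmisc package. spss.get() is a wrapper around read.spss() that sets many of its parameters for you, making the conversion simpler and more consistent and producing the result a data analyst would expect.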
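A minimal sketch of the read.spss() route. It assumes the example file electric.sav that is distributed with the foreign package (used in that package's own documentation); for your own data, replace the path with the location of your .sav file. The name spss_df is illustrative, and the Hmisc call is shown only as a comment because it requires Hmisc to be installed and a real file (mydata.sav is a hypothetical name).

```r
library(foreign)

# Locate the example SPSS file bundled with the foreign package.
# system.file() returns "" if the file is not found.
sav <- system.file("files", "electric.sav", package = "foreign")

if (nzchar(sav)) {
  # to.data.frame = TRUE returns a data frame instead of a list;
  # use.value.labels = TRUE converts labelled values to factors.
  spss_df <- read.spss(sav, to.data.frame = TRUE, use.value.labels = TRUE)
  str(spss_df)
}

# Equivalent call through the Hmisc wrapper (hypothetical file name):
# spss_df <- Hmisc::spss.get("mydata.sav", use.value.labels = TRUE)
```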
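5. Importing data from SAS
Several functions are available for importing SAS datasets into R, including read.ssd() in the foreign package, sas.get() in the Hmisc package, and read.sas7bdat() in the sas7bdat package. If you have SAS installed, sas.get() is a good choice.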
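If SAS is not installed, read.sas7bdat() reads a .sas7bdat file directly. The sketch below wraps it in a small helper; read_sas_file and the file name mydata.sas7bdat are hypothetical names introduced for illustration, and the helper simply returns NULL when the file or the sas7bdat package is missing.

```r
# Hypothetical helper: read a .sas7bdat file if both the file and the
# sas7bdat package are available; otherwise return NULL with a message.
read_sas_file <- function(path) {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(NULL)
  }
  if (!requireNamespace("sas7bdat", quietly = TRUE)) {
    message("Package 'sas7bdat' is not installed.")
    return(NULL)
  }
  sas7bdat::read.sas7bdat(path)
}

# "mydata.sas7bdat" is a placeholder; point this at a real dataset.
sas_df <- read_sas_file("mydata.sas7bdat")
```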
6. Importing data in RStudio
image.png