2020-2-23 更新:
import 输入其他格式的数据报错的问题,可用参数format 解决,例如format = "\t"。帮助文档中该参数解释为:
An optional character string code of file format, which can be used to override the format inferred from file. Shortcuts include: “,” (for comma-separated values), “;” (for semicolon-separated values), and “|” (for pipe-separated values).
此外,发现import可以使用fread的参数,例如skip="ID",表示从ID所在行开始读取。
所以,还是吹爆import吧~
read.delim替代read.table比较好用,只有一点点默认参数的差别。
花花写于2019.8.9
为了丰富全国巡讲的课件“文件读写”部分(讲义也写了3/4了),我从《高效R语言编程》中学了几个新的读取和导出文件的函数,特此记录。
原书英文电子版免费:https://csgillespie.github.io/efficientR/input-output.html#prerequisites-4
昨天推文忘记改题目,也是服了自己。
1.准备工作
需要用到几个包
if(!require("rio")) install.packages("rio")
if(!require("readr")) install.packages("readr")
if(!require("data.table")) install.packages("data.table")
if(!require("WDI")) install.packages("WDI")
library(rio)
library(readr)
library(data.table)
library(WDI)
然后是全部的示例数据,在生信星球
公众号后台回复“读取”即可获得。
2. 高效读写 - rio包
rio 包是一个名副其实的多功能数据读写包,支持多种格式:.csv
, .feather
, .json
, .dta
, .xls
, .xlsx
使用方法之简单,令人发指,不需要什么附加的参数,只要是支持的格式,一律一行代码读进来,想导出什么格式,直接给文件名加后缀就行了。
co2 = import("CO2.csv")
head(co2)
#> V1 time value
#> 1 1 1959.000 315.42
#> 2 2 1959.083 316.31
#> 3 3 1959.167 316.50
#> 4 4 1959.250 317.56
#> 5 5 1959.333 318.13
#> 6 6 1959.417 318.00
export(co2,"CO2.txt")
voyages = import("voc_voyages.tsv")
head(voyages)
#> number number_sup trip trip_sup boatname master
#> 1 1 1 AMSTERDAM Jan Jakobsz. Schellinger
#> 2 2 1 DUIFJE Simon Lambrechtsz. Mau
#> 3 3 1 HOLLANDIA Jan Dignumsz. van Kwadijk+
#> 4 4 1 MAURITIUS Jan Jansz. Molenaar
#> 5 5 1 LANGEBARK Hans Huibrechtsz. Tonneman
#> 6 6 1 MAAN
#> tonnage type_of_boat built bought hired yard chamber departure_date
#> 1 260 1594 A 1595-04-02
#> 2 50 pinas 1594 A 1595-04-02
#> 3 460 1594 A 1595-04-02
#> 4 460 1594 A 1595-04-02
#> 5 300 1598-03-25
#> 6 NA 1598-03-25
#> departure_harbour cape_arrival cape_departure cape_call arrival_date
#> 1 Texel TRUE 1596-06-06
#> 2 Texel TRUE 1596-06-06
#> 3 Texel TRUE 1596-06-06
#> 4 Texel TRUE 1596-06-06
#> 5 Zeeland TRUE 1599-03-01
#> 6 Zeeland TRUE
#> arrival_harbour next_voyage
#> 1 Engano NA
#> 2 Engano 5001
#> 3 Engano 5002
#> 4 Engano 5003
#> 5 Bantam 5010
#> 6 NA
#> particulars
#> 1 from 04-08 till 11-08 in the Mosselbaai; from 13-09 till 07-10 in the Ampalazabaai; from 09-10 till 13-12 in S. Augustins Bay, where before departure 127 of the 249 men were still alive; 11-01 till 21-01 at Ste. Marie I.; from 23-01 till 12-02 in the Bay of Antongil. The AMSTERDAM was set on fire near Bawean, 11-01-1597.
#> 2 HOLLANDIA on 26-10-1595; he was succeeded by Hendrik Jansz.
#> 3 Jan Dignumsz. died on 29-05-1595 and Mau was his successor.
#> 4 Jan Jansz. died on 25-12-1596 and Hendrik Jansz. was his successor.
#> 5 other.
#> 6
export(voyages, "voc_voyages.xlsx")
#> Note: zip::zip() is deprecated, please use zip::zipr() instead
#读取在线文件也木有问题
capitals = import("https://github.com/mledoze/countries/raw/master/countries.json")
3.其他方法
我刚看到import的时候是震惊的,如果它什么数据都能读,那还学什么read.table,read.csv呀,那不是画蛇添足么。。。实际上并不。
接着往下看,除了import意外的另外几种方法:
(1)read.table
及其同类的read.csv
和read.delim
(2)readr
包的read_table
、read_csv
等
(3)data.table
包的fread
函数
df_co2 = read.csv("CO2.csv")
df_co2_dt = readr::read_csv("CO2.csv")
#> Warning: Missing column names filled in: 'X1' [1]
#> Parsed with column specification:
#> cols(
#> X1 = col_double(),
#> time = col_double(),
#> value = col_double()
#> )
df_co2_readr = data.table::fread("CO2.csv")
voyages_base = read.delim("voc_voyages.tsv")
voyages_readr = readr::read_tsv("voc_voyages.tsv")
#> Parsed with column specification:
#> cols(
#> .default = col_character(),
#> number = col_double(),
#> number_sup = col_logical(),
#> trip = col_double(),
#> tonnage = col_double(),
#> hired = col_logical(),
#> departure_date = col_date(format = ""),
#> cape_arrival = col_date(format = ""),
#> cape_departure = col_date(format = ""),
#> cape_call = col_logical(),
#> arrival_date = col_date(format = ""),
#> next_voyage = col_double()
#> )
#> See spec(...) for full column specifications.
#> Warning: 77 parsing failures.
#> row col expected actual file
#> 1023 hired 1/0/T/F/TRUE/FALSE 1664 'voc_voyages.tsv'
#> 1025 hired 1/0/T/F/TRUE/FALSE 1664 'voc_voyages.tsv'
#> 1030 hired 1/0/T/F/TRUE/FALSE 1664 'voc_voyages.tsv'
#> 1034 hired 1/0/T/F/TRUE/FALSE 1664/5 'voc_voyages.tsv'
#> 1035 hired 1/0/T/F/TRUE/FALSE 1665 'voc_voyages.tsv'
#> .... ..... .................. ...... .................
#> See problems(...) for more details.
voyages_dt = data.table::fread("voc_voyages.tsv")
4.差异在哪
书中提到fread和read_*对有异常值的数值型变量进行的不同处理
For columns in which the first 1000 rows are of one type but which contain anomalies, such as ‘built’ and ‘departure_data’ in the shipping example, fread() coerces the result to characters. read_csv() and siblings, by contrast, keep the class that is correct for the first 1000 rows and sets the anomalous records to NA. This is illustrated in 5.1, where read_tsv() produces a numeric class for the ‘built’ variable, ignoring the non-numeric text in row 2841.
奇怪的是我跑代码发现书上所说的built一列并无区别,都是character。
class(voyages_dt$built)
#> [1] "character"
class(voyages_readr$built)
#> [1] "character"
这个并不重要,但我有个意外发现,我很好奇差异在哪,就做了一些尝试:
首先是维度
dim(voyages)
#> [1] 8131 22
dim(voyages_base)
#> [1] 8131 22
dim(voyages_dt)
#> [1] 8131 22
dim(voyages_readr)
#> [1] 8131 22
行列数是一致的。
然后是列名
#返回结果一致
colnames(voyages)
colnames(voyages_base)
colnames(voyages_dt)
colnames(voyages_readr)
#> [1] "number" "number_sup" "trip"
#> [4] "trip_sup" "boatname" "master"
#> [7] "tonnage" "type_of_boat" "built"
#> [10] "bought" "hired" "yard"
#> [13] "chamber" "departure_date" "departure_harbour"
#> [16] "cape_arrival" "cape_departure" "cape_call"
#> [19] "arrival_date" "arrival_harbour" "next_voyage"
#> [22] "particulars"
可以用identical来查看两两是否一致。。。。
第三是每列的数据类型
这个有难度,虽然str()函数是可以看的,但是对于这个例子来说,并不方便比较。我写了一个函数
dfc =function(kk,name=F){
kk=data.frame(kk)
x=vector()
n=0
for(i in colnames(kk)){
n=n+1
x[[n]]=class(kk[,i])
}
if(name)names(x)=colnames(kk)
x
}
#拼在一起做个对比
cn = cbind(colnames(voyages_dt),
dfc(voyages),
dfc(voyages_dt),
dfc(voyages_readr),
dfc(voyages_base));cn
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] "number" "integer" "integer" "numeric" "integer"
#> [2,] "number_sup" "character" "character" "logical" "character"
#> [3,] "trip" "integer" "integer" "numeric" "integer"
#> [4,] "trip_sup" "character" "character" "character" "character"
#> [5,] "boatname" "character" "character" "character" "character"
#> [6,] "master" "character" "character" "character" "character"
#> [7,] "tonnage" "integer" "integer" "numeric" "integer"
#> [8,] "type_of_boat" "character" "character" "character" "character"
#> [9,] "built" "character" "character" "character" "character"
#> [10,] "bought" "character" "character" "character" "character"
#> [11,] "hired" "character" "character" "logical" "character"
#> [12,] "yard" "character" "character" "character" "character"
#> [13,] "chamber" "character" "character" "character" "character"
#> [14,] "departure_date" "character" "character" "Date" "character"
#> [15,] "departure_harbour" "character" "character" "character" "character"
#> [16,] "cape_arrival" "character" "character" "Date" "character"
#> [17,] "cape_departure" "character" "character" "Date" "character"
#> [18,] "cape_call" "logical" "logical" "logical" "character"
#> [19,] "arrival_date" "character" "character" "Date" "character"
#> [20,] "arrival_harbour" "character" "character" "character" "character"
#> [21,] "next_voyage" "integer" "integer" "numeric" "integer"
#> [22,] "particulars" "character" "character" "character" "character"
第18列有差异!另外有几列智能地被readr识别为“Date”。
head(voyages[,18])
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE
head(voyages_dt[,18])
#> cape_call
#> 1: TRUE
#> 2: TRUE
#> 3: TRUE
#> 4: TRUE
#> 5: TRUE
#> 6: TRUE
head(voyages_base[,18])
#> [1] "true" "true" "true" "true" "true" "true"
head(voyages_readr[,18])
#> # A tibble: 6 x 1
#> cape_call
#> <lgl>
#> 1 TRUE
#> 2 TRUE
#> 3 TRUE
#> 4 TRUE
#> 5 TRUE
#> 6 TRUE
关于读取文件所需时间的差异,书中给了一张图:
5.生信数据实例
书中给的例子都是规则的数据框,所以都没有报错或读取错误的情况。我从果子师兄那里拿了几个示例数据,来看实际应用中是否都能读取成功。
(1)简单txt和csv
## .csv
#file.show("B cell receptor signaling pathway.csv")
csv1 = read.csv(file = "B cell receptor signaling pathway.csv");dim(csv1)
#> [1] 18 169
csv2 = data.table::fread(file = "B cell receptor signaling pathway.csv");dim(csv2)
#> [1] 18 169
csv3 = import("B cell receptor signaling pathway.csv");dim(csv3)
#> [1] 18 169
csv4 <- read_csv("B cell receptor signaling pathway.csv");dim(csv4)
#> Warning: Missing column names filled in: 'X1' [1]
#> Parsed with column specification:
#> cols(
#> .default = col_double(),
#> X1 = col_character()
#> )
#> See spec(...) for full column specifications.
#> [1] 18 169
## .txt
txt1 <- read.table("platformMap.txt",header = T,sep = "\t");dim(txt1)
#> [1] 74 6
txt2 <- data.table::fread("platformMap.txt");dim(txt2)
#> [1] 74 6
txt3 <- import("platformMap.txt");dim(txt3)
#> [1] 74 6
txt4 <- read_tsv("platformMap.txt");dim(txt4)
#> Parsed with column specification:
#> cols(
#> title = col_character(),
#> gpl = col_character(),
#> bioc_package = col_character(),
#> manufacturer = col_character(),
#> organism = col_character(),
#> data_row_count = col_double()
#> )
#> [1] 74 6
csv 都没有问题,txt则是3个直接读取成功,一个需要少量加参数。
(2)GPL注释表格
特点是前12行为井号开头的注释行:如图
geo1 <- read.table("GPL6244-17930.txt",header = T,sep="\t",fill = T);dim(geo1)
#> [1] 33297 12
geo2 <- data.table::fread("GPL6244-17930.txt");dim(geo2)
#> [1] 33297 12
geo3 <- import("GPL6244-17930.txt");dim(geo3)
#> [1] 33297 12
geo4 <- read_tsv("GPL6244-17930.txt",skip = 12)
#> Parsed with column specification:
#> cols(
#> ID = col_double(),
#> GB_LIST = col_character(),
#> SPOT_ID = col_character(),
#> seqname = col_character(),
#> RANGE_GB = col_character(),
#> RANGE_STRAND = col_character(),
#> RANGE_START = col_double(),
#> RANGE_STOP = col_double(),
#> total_probes = col_double(),
#> gene_assignment = col_character(),
#> mrna_assignment = col_character(),
#> category = col_character()
#> )
#> Warning: 8936 parsing failures.
#> row col expected actual file
#> 3169 RANGE_START a double --- 'GPL6244-17930.txt'
#> 3169 RANGE_STOP a double --- 'GPL6244-17930.txt'
#> 3752 RANGE_START a double --- 'GPL6244-17930.txt'
#> 3752 RANGE_STOP a double --- 'GPL6244-17930.txt'
#> 9529 RANGE_START a double --- 'GPL6244-17930.txt'
#> .... ........... ........ ...... ...................
#> See problems(...) for more details.
两个智能派还是可以读取正确,而另外两个则需要加参数才能读取正确:read.table需要fill=T,表示缺值的地方填充空字符串;read_table2则需要指定跳过前12行(也就是井号开头的注释行)。
(3)GEO表达矩阵
## 3.读取GEO数据(带有注释行的txt)
exp1 <- read.table("GSE42872_series_matrix.txt",comment.char="!",header=T);dim(exp1)
#> [1] 33297 7
exp2 <- data.table::fread("GSE42872_series_matrix.txt",skip = 59);dim(exp2)
#> Warning in data.table::fread("GSE42872_series_matrix.txt", skip = 59):
#> Discarded single-line footer: <<!series_matrix_table_end>>
#> [1] 33297 7
exp3 <- import("GSE42872_series_matrix.txt") ;dim(exp3)
#> Warning in fread(dec = ".", input = structure("GSE42872_series_matrix.txt",
#> class = c("rio_tsv", : Stopped early on line 26. Expected 2 fields but
#> found 0. Consider fill=TRUE and comment.char=. First discarded non-empty
#> line: <<!Sample_title "A375 cells 24h Control rep1" "A375 cells 24h Control
#> rep2" "A375 cells 24h Control rep3" "A375 cells 24h Vemurafenib rep1" "A375
#> cells 24h Vemurafenib rep2" "A375 cells 24h Vemurafenib rep3">>
#> [1] 24 2
exp4 <- read_tsv("GSE42872_series_matrix.txt",skip = 59);dim(exp4)
#> Parsed with column specification:
#> cols(
#> ID_REF = col_double(),
#> GSM1052615 = col_double(),
#> GSM1052616 = col_double(),
#> GSM1052617 = col_double(),
#> GSM1052618 = col_double(),
#> GSM1052619 = col_double(),
#> GSM1052620 = col_double()
#> )
#> Warning: 2 parsing failures.
#> row col expected actual file
#> 33298 ID_REF a double !series_matrix_table_end 'GSE42872_series_matrix.txt'
#> 33298 NA 7 columns 1 columns 'GSE42872_series_matrix.txt'
#> [1] 33298 7
此时需要加的参数更多咯,import无参数可加,虽然没报错,但是读取的并不对。
(4)读取TCGA甲基化文件
tcga1 <- read.table("jhu-usc.edu_BRCA.HumanMethylation450.9.lvl-3.TCGA-BH-A1EV-11A-24D-A138-05.gdc_hg38.txt",header=T,fill = T,sep = "\t")
tcga2 <- data.table::fread("jhu-usc.edu_BRCA.HumanMethylation450.9.lvl-3.TCGA-BH-A1EV-11A-24D-A138-05.gdc_hg38.txt")
tcga3 <- import("jhu-usc.edu_BRCA.HumanMethylation450.9.lvl-3.TCGA-BH-A1EV-11A-24D-A138-05.gdc_hg38.txt")
tcga4 <- read_tsv("jhu-usc.edu_BRCA.HumanMethylation450.9.lvl-3.TCGA-BH-A1EV-11A-24D-A138-05.gdc_hg38.txt")
#> Parsed with column specification:
#> cols(
#> `Composite Element REF` = col_character(),
#> Beta_value = col_double(),
#> Chromosome = col_character(),
#> Start = col_double(),
#> End = col_double(),
#> Gene_Symbol = col_character(),
#> Gene_Type = col_character(),
#> Transcript_ID = col_character(),
#> Position_to_TSS = col_character(),
#> CGI_Coordinate = col_character(),
#> Feature_Type = col_character()
#> )
除了read.table 需要加fill=T,其余都没什么问题。
(5)读取TCGA数据RNAseq data counts 文件
rna1 <- read.table("0e30bd18-8e8b-4c52-aace-b5587c6df51a.htseq.counts",header = F,sep = "\t")
rna2 <- data.table::fread("0e30bd18-8e8b-4c52-aace-b5587c6df51a.htseq.counts")
rna3 <- import("0e30bd18-8e8b-4c52-aace-b5587c6df51a.htseq.counts")#错误
#> Unrecognized file format. Try specifying with the format argument.
#> Error in .import.default(file = file, ...): Format not supported
rna4 <- read_tsv("0e30bd18-8e8b-4c52-aace-b5587c6df51a.htseq.counts",col_names =F)
#> Parsed with column specification:
#> cols(
#> X1 = col_character(),
#> X2 = col_double()
#> )
import只支持它认识的几种后缀!其他均报错格式不支持。
(6)终极大招:GEO的soft文件
这个文件很复杂,注释符号不统一,单独的platform表格应该是33297行,但后面又跟了十几万行的其他内容。
第一个特殊之处:
第二个特殊之处:
## 读取GEO平台注释信息soft文件
soft1 <-data.table::fread("GSE42872_family.soft",skip ="ID")
#> Warning in data.table::fread("GSE42872_family.soft", skip = "ID"): Stopped
#> early on line 33404. Expected 12 fields but found 1. Consider fill=TRUE and
#> comment.char=. First discarded non-empty line: <<!platform_table_end>>
soft2 <-import("GSE42872_family.soft") #直接报错
#> Unrecognized file format. Try specifying with the format argument.
#> Error in .import.default(file = file, ...): Format not supported
soft3 <-read.table("GSE42872_family.soft",skip = 105,sep = "\t",header = T,fill = T)
soft4 <- read_tsv("GSE42872_family.soft",skip = 105)
#> Parsed with column specification:
#> cols(
#> ID = col_double(),
#> GB_LIST = col_character(),
#> SPOT_ID = col_character(),
#> seqname = col_character(),
#> RANGE_GB = col_character(),
#> RANGE_STRAND = col_character(),
#> RANGE_START = col_double(),
#> RANGE_STOP = col_double(),
#> total_probes = col_double(),
#> gene_assignment = col_character(),
#> mrna_assignment = col_character(),
#> category = col_character()
#> )
#> Warning: 209188 parsing failures.
#> row col expected actual file
#> 3169 RANGE_START a double --- 'GSE42872_family.soft'
#> 3169 RANGE_STOP a double --- 'GSE42872_family.soft'
#> 3752 RANGE_START a double --- 'GSE42872_family.soft'
#> 3752 RANGE_STOP a double --- 'GSE42872_family.soft'
#> 9529 RANGE_START a double --- 'GSE42872_family.soft'
#> .... ........... ........ ...... ......................
#> See problems(...) for more details.
fread的参数skip="ID"表示的是从以“ID”开头的那一行开始读取。
帮助文档里说到:
If 0 (default) start on the first line and from there finds the first row with a consistent number of columns. This automatically avoids irregular header information before the column names row. **skip>0 means ignore the first skip rows manually. skip="string" searches for "string" in the file (e.g. a substring of the column names row) and starts on that line **(inspired by read.xls in package gdata).
fread很智能的取了前半部分几万行,而read.table和read.tsv则是全部读取了,在这个问题上可以看出,fread无往不利。
总结一下
果子师兄说:read.table是最常用的,fread则是最智能的。
果子师兄还说:fread加一个参数data.table = F,可以让数据读进来就是data.frame。
有了今天的学习成果,我认为他说的很对。对于常见格式,可以先尝试import导入(其实import是根据fread函数写的);
如果失败,再用fread读取,最多是加个参数,理论上就可以成功;
如果还是不行,哈德雷大神写的read_*系列也不是吃素的,拿来试试。
base包有点笨,但他参数多,更灵活,可以作为一个选择。毕竟我学R数据科学的时候,哪知道什么base包。。。
别走!还有个大招
readLines()这个函数也很棒,他是将每行
作为一个字符串
,读取的结果是一个大的字符串向量。
别管三七二十一,先读进来,转化为一个"一列的数据框",然后再分割也是可以的。
关于分割,推荐tidyr::separate()
举个栗子:
if(!require(tidyr)) install.packages("tidyr")
if(require(stringr))install.packages("stringr")
#> Error in install.packages : Updating loaded packages
library(tidyr)
library(stringr)
x = readLines("B cell receptor signaling pathway.csv")
n=str_count(x[1],",")
dfx=data.frame(x=x) %>%
separate(x,into = paste0("V",1: (n+1)),sep = ",");dim(dfx)
#> [1] 19 169
微信公众号生信星球同步更新我的文章,欢迎大家扫码关注!
我们有为生信初学者准备的学习小组,点击查看◀️
想要参加我的线上线下课程,也可加好友咨询🔼
如果需要提问,请先看生信星球答疑公告