SQL数据库的SELECT操作将是你学过的最为有用的数据操作技术之一。除了关系数据库,你能够在广泛的工具中找到它的影子。不同数据库管理系统的语言相当一致,你可以将这项技术应用于公共的关系数据库UCSC基因组浏览器与Ensembl数据库。
SELECT 基本操作
继续采用上节例子数据gwascat.db
,以下命令可以查找所有信息。
sqlite> SELECT * FROM gwascat;
1|2014-08-02|24388013|Ferreira MA|2013-12-30|...
2|2014-08-02|24388013|Ferreira MA|2013-12-30|...
...
基本格式为SELECT <colname> FROM <table>
, *
代表所有列。
SQLite
可以在终端而不进入交互式界面直接操作,方便我们直接查询结果,如
$ sqlite3 gwascat.db "SELECT * FROM gwascat" > result.txt
$ head result.txt -n 1
1|2014-08-02|24388013|Ferreira MA|2013-12-30|...
LIMIT 限制行数
以上语句返回了所有结果,可通过LIMIT
限制结果行数:
sqlite> SELECT * FROM gwascat LIMIT 2;
1|2014-08-02|24388013|Ferreira MA|2013-12-30|...
2|2014-08-02|24388013|Ferreira MA|2013-12-30|...
选择特定列
前面我们选择了所有列,可以返回指定列的结果:
sqlite> SELECT trait, chrom , position , strongest_risk_snp , pvalue FROM gwascat LIMIT 5;
Asthma and hay fever|6|32658824|rs9273373|4.0e-14
Asthma and hay fever|4|38798089|rs4833095|5.0e-12
Asthma and hay fever|5|111131801|rs1438673|3.0e-11
Asthma and hay fever|2|102350089|rs10197862|4.0e-11
Asthma and hay fever|17|39966427|rs7212938|4.0e-10
清楚起见,进行以下展示设置(使用column模式,展示表格的头部):
sqlite> .mode column
sqlite> .header on
再次查询结果显示为:
sqlite> SELECT trait, chrom , position , strongest_risk_snp , pvalue FROM gwascat LIMIT 5;
trait chrom position strongest_risk_snp pvalue
-------------------- ----- --------- ------------------ -------
Asthma and hay fever 6 32658824 rs9273373 4.0e-14
Asthma and hay fever 4 38798089 rs4833095 5.0e-12
Asthma and hay fever 5 111131801 rs1438673 3.0e-11
Asthma and hay fever 2 102350089 rs10197862 4.0e-11
Asthma and hay fever 17 39966427 rs7212938 4.0e-10
如果列很多的话可以恢复更紧凑的list布局(.mode list
)。
注意:即使是相同的操作,结果顺序也并非固定,如果需要固定顺序,采用ORDER BY命令(见下方)。
ORDER BY 排序结果
ORDER BY
接需要排序的列,例如根据P值进行排序:
sqlite> SELECT trait, chrom , position , strongest_risk_snp , pvalue FROM gwascat ORDER BY pvalue LIMIT 5;
trait chrom position strongest_risk_snp pvalue
------------- ----- -------- ------------------ ------
Brain imaging rs10932886
Brain imaging rs429358
Brain imaging rs7610017
Brain imaging rs6463843
Brain imaging rs2075650
注意,某列若包含多种数据类型,将按照 NULL(缺失) < INTEGER(整数) < REAL(实数) < TEXL(文本) < BLOB(二进制文件) 的顺序展示,这里首先展示的是缺失值。
默认升序,增加DESC
表示降序,如:
sqlite> SELECT trait , chrom, position , strongest_risk_snp , pvalue FROM gwascat ORDER BY pvalue DESC LIMIT 5;
trait chrom position strongest_risk_snp pvalue
------------------------------------------ ----- --------- ------------------ ----------
Serum protein levels (sST2) 18 10382886 rs206548 90000000.0
Periodontitis (Mean PAL) 14 68893914 rs12050161 4000000.0
Coronary artery disease or ischemic stroke 11 116778201 rs964184 2.0e-05
Lipid traits 1 62447248 rs10158897 9.0e-06
Rheumatoid arthritis 15 38541832 rs8032939 9.0e-06
这里前两行的P值非常可疑,有可能是科学记数法负号丢失造成的错误,再次说明不要盲目相信数据。
WHERE筛选结果
可以通过WHERE
命令设定条件筛选结果,比如说筛选名字为“rs206548”的snp所在行:
sqlite> SELECT trait, chrom, position , strongest_risk_snp , pvalue FROM gwascat WHERE strongest_risk_snp == "rs206548";
trait chrom position strongest_risk_snp pvalue
--------------------------- ----- -------- ------------------ ----------
Serum protein levels (sST2) 18 10382886 rs206548 90000000.0
这里使用 =
或者 ==
代表相等,其它常见的过滤操作包括:
因为NULL与任意值不等,匹配缺失值的不能使用 =
,而要采用IS
操作,如:
sqlite> SELECT trait, chrom, position , strongest_risk_snp , pvalue FROM gwascat WHERE pvalue IS NOT NULL ORDER BY pvalu
e LIMIT 5;
trait chrom position strongest_risk_snp pvalue
----------------------------------------- ----- --------- ------------------ ------
HDL cholesterol 16 56959412 rs3764261 0.0
Age-related macular degeneration 10 122454932 rs10490924 0.0
Age-related macular degeneration 1 196710325 rs10737680 0.0
Urate levels 4 9942428 rs12498742 0.0
Vascular endothelial growth factor levels 6 43957789 rs4513773 0.0
可以通过LIKE进行文本的模糊匹配(使用"%代表通配符"):
sqlite> SELECT author , pubmedid , journal FROM gwascat WHERE author LIKE "%Ferreira%" LIMIT 5;
author pubmedid journal
----------- -------- ----------------------
Ferreira MA 24388013 J Allergy Clin Immunol
Ferreira MA 24388013 J Allergy Clin Immunol
Ferreira MA 24388013 J Allergy Clin Immunol
Ferreira MA 24388013 J Allergy Clin Immunol
Ferreira MA 24388013 J Allergy Clin Immunol
WHERE
条件支持与,或,非逻辑操作,而多个逻辑之间也可以通过括号组合,如选择在染色体1,2,3上且P值小于10^-11结果:
sqlite> SELECT trait, chrom, position , strongest_risk_snp , pvalue FROM gwascat WHERE (chrom == '1' OR chrom =='2' OR chrom == '3') AND pvalue < 1e-11 ORDER BY pvalue LIMIT 5;
trait chrom position strongest_risk_snp pvalue
-------------------------------- ----- --------- ------------------ ---------------------
Age-related macular degeneration 1 196710325 rs10737680 0.0
Bilirubin levels 2 233763993 rs6742078 4.94065645841247e-324
Butyrylcholinesterase levels 3 165773492 rs1803274 6.0e-262
Age-related macular degeneration 1 196690107 rs1061170 1.0e-261
Metabolic traits 2 73591809 rs13391552 5.0e-252
这里的多个OR
操作可以改写为更简洁的IN
:
sqlite> SELECT trait, chrom, position , strongest_risk_snp , pvalue FROM gwascat WHERE (chrom IN ('1', '2', '3')) AND pv
alue < 1e-11 ORDER BY pvalue LIMIT 5;
trait chrom position strongest_risk_snp pvalue
-------------------------------- ----- --------- ------------------ ---------------------
Age-related macular degeneration 1 196710325 rs10737680 0.0
Bilirubin levels 2 233763993 rs6742078 4.94065645841247e-324
Butyrylcholinesterase levels 3 165773492 rs1803274 6.0e-262
Age-related macular degeneration 1 196690107 rs1061170 1.0e-261
Metabolic traits 2 73591809 rs13391552 5.0e-252