Reposted from:
https://biozx.top/gdc.html
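The GDC (https://gdc.cancer.gov/) Application Programming Interface, or API for short, is the interface that GDC exposes to external users. It offers many features, including data search, data submission, file download, metadata retrieval, annotations, and BAM slicing. This post focuses on the data search and download features.
The GDC website has a detailed user guide (https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/#authentication); what follows is my own understanding of it.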
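Parameters
A query request is typically built from the following parameters:
- filters: restricts which records the query matches;
- format: sets the format of the returned data: JSON, TSV, or XML;
- fields: selects which fields (columns) are included in the returned data;
- size: sets the maximum number of records returned.
A request can be sent with either HTTP GET or HTTP POST. GET limits the length of the URL, so I generally use POST.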
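As a rough sketch of the POST variant: the four parameters can go into one JSON body instead of the URL, so a long list of file IDs does not run into the URL length limit. The helper names `build_payload` and `query_files` are made up for illustration, and the layout (filters as a nested object, fields as a comma-separated string) is my reading of the GDC user guide, so check it against the guide before relying on it.

```python
FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_payload(file_ids, fields, fmt="TSV", size=100):
    """Assemble filters, format, fields and size into one JSON-serializable dict."""
    return {
        # With POST the filter stays a nested object; no json.dumps needed.
        "filters": {
            "op": "in",
            "content": {"field": "files.file_id", "value": list(file_ids)},
        },
        "format": fmt,
        "fields": ",".join(fields),
        "size": str(size),
    }


def query_files(payload, timeout=60):
    """POST the payload as a JSON body and return the response text."""
    # Imported here so build_payload can be used without requests installed.
    import requests

    response = requests.post(FILES_ENDPOINT, json=payload, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text
```

Calling `query_files(build_payload(ids, ["file_name", "cases.samples.sample_type"]))` would then send the request; passing `json=` makes requests serialize the body and set the Content-Type header to application/json.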
Example
Below are the first few lines of the LAML manifest gdc_manifest.2018-01-08T04_35_55.788687.txt. To find the sample ID and sample type (tumor or normal) for each htseq.counts.gz file, we can use the GDC API:
id  filename    md5 size    state
fdf76d41-8909-49be-83fb-d5ce8715b7e9    a3376c90-202c-42f6-a120-98d650e0765d.htseq.counts.gz    eaf6cb895b36d591d28e2e153176ca7e    259927  live
3e1f3a96-6d9a-47aa-b319-d0e3e3fd12f9    d398a330-6a57-4172-9f2a-d6187fa2c71d.htseq.counts.gz    309d1459a02bce16766f6fe24e611930    253383  live
5cf34517-66c6-4b12-b83b-4cb71309f68a    4fd2de09-663c-4452-8cc8-1733a78bf71f.htseq.counts.gz    3b7ac338c1d8bf4721c5a95e41832702    259239  live
47c9d58f-d7b4-41fe-b500-d95b537dc21e    ccb70fba-ed81-4b83-90e1-99375e0db559.htseq.counts.gz    1d1753b6af24831fd254ea031e33032e    258247  live
fe4f0b9a-46c8-4419-8315-0646418e3591    0d8702cc-d3db-4daf-94a8-b37103d771a6.htseq.counts.gz    e18c8cf1ab7cb3bd82196451ab94d5f2    257574  live
The Python script that does this is as follows:
import json

import requests

files_endpt = 'https://api.gdc.cancer.gov/files'

# file_id corresponds to the id column of the gdc_manifest.
filt = {
    "op": "in",
    "content": {
        "field": "files.file_id",
        "value": [
            "fdf76d41-8909-49be-83fb-d5ce8715b7e9",
            "3e1f3a96-6d9a-47aa-b319-d0e3e3fd12f9",
            "5cf34517-66c6-4b12-b83b-4cb71309f68a",
            "47c9d58f-d7b4-41fe-b500-d95b537dc21e",
            "fe4f0b9a-46c8-4419-8315-0646418e3591"
        ]
    }
}

params = {
    "filters": json.dumps(filt),  # URL parameters are strings, so serialize the filter
    "format": "tsv",
    "fields": "file_name,cases.samples.sample_type_id,cases.samples.sample_type,cases.samples.submitter_id",
    "size": "100"
}

# The filter is short enough to fit in a URL, so GET works here.
response = requests.get(files_endpt, params=params)
print(response.content)
# The output is shown below: the sample ID and sample type for each file.
b'file_name cases.0.samples.0.submitter_id  cases.0.samples.0.sample_type_id    cases.0.samples.0.sample_type   id                                          
a3376c90-202c-42f6-a120-98d650e0765d.htseq.counts.gz    TCGA-AB-2927-03A    3   Primary Blood Derived Cancer - Peripheral Blood fdf76d41-8909-49be-83fb-d5ce8715b7e9                                            
d398a330-6a57-4172-9f2a-d6187fa2c71d.htseq.counts.gz    TCGA-AB-2843-03A    3   Primary Blood Derived Cancer - Peripheral Blood 3e1f3a96-6d9a-47aa-b319-d0e3e3fd12f9                                            
4fd2de09-663c-4452-8cc8-1733a78bf71f.htseq.counts.gz    TCGA-AB-2859-03A    3   Primary Blood Derived Cancer - Peripheral Blood 5cf34517-66c6-4b12-b83b-4cb71309f68a                                            
ccb70fba-ed81-4b83-90e1-99375e0db559.htseq.counts.gz    TCGA-AB-2931-03A    3   Primary Blood Derived Cancer - Peripheral Blood 47c9d58f-d7b4-41fe-b500-d95b537dc21e                                            
0d8702cc-d3db-4daf-94a8-b37103d771a6.htseq.counts.gz    TCGA-AB-2897-03A    3   Primary Blood Derived Cancer - Peripheral Blood fe4f0b9a-46c8-4419-8315-0646418e3591
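
To go from this raw TSV to a tumor/normal label per file, something like the following might do. It is only a sketch: `parse_tsv` and `classify` are hypothetical helper names, the column names are copied from the header row above, and it assumes each file maps to exactly one sample with a non-empty sample_type_id. The grouping follows the TCGA sample type codes (01 to 09 tumor, 10 to 19 normal).

```python
import csv
import io


def classify(sample_type_id):
    """Map a TCGA sample type code to a coarse group."""
    code = int(sample_type_id)
    if 1 <= code <= 9:
        return "tumor"
    if 10 <= code <= 19:
        return "normal"
    return "other"  # e.g. control analytes (20 and above)


def parse_tsv(text):
    """Return {file_name: (sample submitter_id, sample_type, group)}."""
    result = {}
    # newline="" leaves line endings untouched so csv can handle \r\n itself.
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter="\t")
    for row in reader:
        result[row["file_name"]] = (
            row["cases.0.samples.0.submitter_id"],
            row["cases.0.samples.0.sample_type"],
            classify(row["cases.0.samples.0.sample_type_id"]),
        )
    return result
```

With the script above, `parse_tsv(response.text)` would take the decoded response body; every LAML file listed here has sample_type_id 3, so all of them would fall into the tumor group.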