YXF - Ontology and Biological Data

This has all felt fuzzy to me for a while, so let me first write down the parts I have figured out.

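  1. An ontology gives a thing or phenomenon one agreed-upon name, so that everyone describes it with the same word and nobody is left confused. In other words, it defines terms. The two ontologies covered here are SO and GO: SO (Sequence Ontology) names sequence features, and GO (Gene Ontology) names gene functions.
    Gene Ontology:
    Links a gene to one or more of its functions.
    It has three parts: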

  2. cellular component: where does the product exhibit its effect

  3. molecular function: how does it work

  4. biological process: what is the purpose of the gene product
    The Gene Ontology is a directed acyclic graph (DAG): one node can be related to several other nodes.
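
To make the graph structure concrete, here is a minimal sketch in Python. The term names are made-up placeholders rather than real GO terms, and `PARENTS`, `ancestors` and `is_acyclic` are names chosen only for this illustration. It shows one term with two parents, how to collect everything above a term, and a check that the graph contains no cycle.

```python
# Toy graph: each term maps to the list of its parent terms.
# All names are hypothetical placeholders, not real GO terms.
PARENTS = {
    "root": [],
    "child_a": ["root"],
    "child_b": ["root"],
    "grandchild": ["child_a", "child_b"],  # one term, two parents
    "leaf": ["grandchild"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_acyclic(parents=PARENTS):
    """True if no term is its own ancestor, i.e. the graph has no cycle."""
    return all(term not in ancestors(term, parents) for term in parents)


if __name__ == "__main__":
    print(sorted(ancestors("leaf")))
    print(is_acyclic())
```

Because a term can have several parents, a gene annotated to `leaf` is implicitly annotated to every term returned by `ancestors("leaf")`.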
    GO data:
    It consists of a gene ontology definition file and a gene association file.
    GO association file format: GAF
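
As a rough illustration of what an association file holds, the sketch below reads GAF-style lines into a gene-to-terms mapping. It assumes the GAF 2.x layout: tab-separated columns, comment lines starting with `!`, the gene symbol in column 3 and the GO ID in column 5. The records and GO IDs in the example are invented, and `parse_gaf` and `make_row` are hypothetical helper names.

```python
from collections import defaultdict


def parse_gaf(lines):
    """Map each gene symbol to the set of GO IDs annotated to it."""
    gene_to_terms = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):  # "!" marks header/comment lines
            continue
        cols = line.split("\t")
        symbol = cols[2]  # column 3: DB Object Symbol
        go_id = cols[4]   # column 5: GO ID
        gene_to_terms[symbol].add(go_id)
    return dict(gene_to_terms)


def make_row(*first_columns, width=17):
    """Pad a partial record with empty fields up to 17 columns."""
    cols = list(first_columns) + [""] * (width - len(first_columns))
    return "\t".join(cols)


# Invented records for illustration only; these are not real annotations.
EXAMPLE = [
    "!gaf-version: 2.2",
    make_row("DemoDB", "ID001", "geneA", "enables", "GO:9999991"),
    make_row("DemoDB", "ID001", "geneA", "involved_in", "GO:9999992"),
    make_row("DemoDB", "ID002", "geneB", "enables", "GO:9999991"),
]

if __name__ == "__main__":
    for gene, terms in sorted(parse_gaf(EXAMPLE).items()):
        print(gene, sorted(terms))
```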
    Functional analysis:
    ORA (Over-representation analysis): finds the representative functions of a list of genes
    FCS(Functional class scoring):
    Gene set enrichment:
    The process of discovering common characteristics potentially present in a list of genes.
    Tools: AgriGO, DAVID, Panther, goatools, ermineJ, GOrilla, ToppFun
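
One common way to score over-representation is a one-sided hypergeometric test: given a background of genes, how surprising is it that this many genes in my list carry a given term? The sketch below is only an illustration of that idea. It is not a description of how any tool listed above computes its statistics, and the function name and parameter names are my own.

```python
from math import comb


def ora_p_value(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes by chance.

    population: number of genes in the background
    annotated:  background genes that carry the term
    sample:     size of the gene list being tested
    hits:       genes in the list that carry the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the hypergeometric probabilities for k = hits .. upper.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 20000 background genes, 200 with the term,
    # a list of 100 genes of which 10 carry the term.
    print(ora_p_value(20000, 200, 100, 10))
```

In practice one such test is run per term, so the resulting p-values need a multiple-testing correction before they are interpreted.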

  5. Data format
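    Current biological data resources include GenBank, which is hosted by NCBI.
    The DNA sequence databases are coordinated by INSDC (International Nucleotide Sequence Database Collaboration), which comprises NCBI, EMBL, and DDBJ.
    The protein sequence database is UniProt (Universal Protein Resource).
    In addition, PDB (Protein Data Bank) is the repository of 3D structures of biological macromolecules.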
    Automate data access:
    Sequencing data formats: GenBank, FASTA, FASTQ
    FASTA format

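  6. A record starts with ">"

  7. The ">" is followed by a string of characters (the sequence ID)

  8. It may also include some descriptive text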
    Some rules:

  9. Sequence lines should not be too long

  10. The sequence lines should wrap at the same width

  11. Use upper-case letters
    Some FASTA headers carry structured information.
    Lower-case letters may be used to mark repetitive regions of a genome.
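
The FASTA conventions above can be sketched in a few lines of plain Python. This is a minimal reader for illustration, not a replacement for a real parser; `read_fasta`, `wrap` and `soft_masked_fraction` are hypothetical names, and the width of 60 is just an assumed default.

```python
def read_fasta(lines):
    """Yield (seq_id, description, sequence) for each FASTA record."""
    seq_id, description, chunks = None, "", []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, description, "".join(chunks)
            # The ID is the first word after ">"; the rest is free text.
            header = line[1:].split(None, 1)
            seq_id = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, description, "".join(chunks)


def wrap(sequence, width=60):
    """Split a sequence into lines that all wrap at the same width."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def soft_masked_fraction(sequence):
    """Fraction of lower-case letters, e.g. marked repetitive regions."""
    if not sequence:
        return 0.0
    return sum(ch.islower() for ch in sequence) / len(sequence)


if __name__ == "__main__":
    demo = [">seq1 demo record", "ACGTacgt", "GGCC", ">seq2", "TTTT"]
    for seq_id, description, sequence in read_fasta(demo):
        print(seq_id, description, wrap(sequence, 4), soft_masked_fraction(sequence))
```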
    FASTQ format
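    Each record has four parts:

  12. A header line starting with "@"

  13. The sequence itself

  14. A "+" symbol, optionally followed by the same ID as on the first line

  15. Characters that encode the quality of the sequence in part two; this line has the same length as the second line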
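
Here is a minimal sketch of reading the four-part FASTQ record described above. It assumes each record occupies exactly four lines and that qualities use the Phred+33 encoding (ASCII code minus 33), which is an assumption about the data rather than something the format itself guarantees. `read_fastq` is a hypothetical name; an incomplete trailing record is silently ignored.

```python
def read_fastq(lines):
    """Yield (seq_id, sequence, scores) from four-line FASTQ records."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, sequence, plus, quality = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        seq_id = (header[1:].split() or [""])[0]
        # Assumed Phred+33 encoding: score = ASCII code - 33.
        scores = [ord(ch) - 33 for ch in quality]
        yield seq_id, sequence, scores


if __name__ == "__main__":
    demo = ["@read1 demo\n", "ACGT\n", "+\n", "II5!\n"]
    for seq_id, sequence, scores in read_fastq(demo):
        print(seq_id, sequence, scores)
```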

  16. How to get data
    Where to get data: NCBI, ENSEMBL, BioMart, UCSC table browser
    FASTQ manipulation
    Overview data:
    seqkit stat *.gz
    There are too many FASTA/Q manipulations to cover here, so I only list what you can do with a FASTA/Q file; the answers are in Chapter 7 of the Biostar Handbook.
    How to get the GC content of every sequence in a FASTA/Q file?
    How to extract a subset of sequences from a FASTA/Q file with name/ID list file?
    How to find FASTA/Q sequences containing degenerate bases and locate them?
    How to remove FASTA/Q records with duplicated sequences?
    How to locate motif/subsequence/enzyme digest sites in FASTA/Q sequence?
    How to sort a huge number of FASTA sequences by length?
    How to split FASTA sequences according to information in the header?
    How to search and replace within a FASTA header using character strings from a text file?
    How to extract paired reads from two paired-end reads files?
    How to concatenate two FASTA sequences into one?
    If you want to do any of the above, you can follow the answers in the Biostar Handbook.
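
To get a feel for what a few of these questions involve, here is a plain-Python sketch of three of them: GC content per sequence, removing records with duplicated sequences, and sorting by length. These are not the handbook's answers; they are only an illustration on in-memory `(id, sequence)` pairs, and all function names are my own.

```python
def gc_content(sequence):
    """Fraction of G and C bases, ignoring case; 0.0 for an empty sequence."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def drop_duplicate_sequences(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    unique = []
    for seq_id, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            unique.append((seq_id, sequence))
    return unique


def sort_by_length(records):
    """Return records ordered from shortest to longest sequence."""
    return sorted(records, key=lambda record: len(record[1]))


if __name__ == "__main__":
    # Made-up records for illustration.
    records = [("s1", "GGCCAATT"), ("s2", "ggccaatt"), ("s3", "AT")]
    for seq_id, sequence in records:
        print(seq_id, gc_content(sequence))
    print(drop_duplicate_sequences(records))
    print(sort_by_length(records))
```

Holding everything in memory is fine for small files; for very large ones a streaming tool is the better choice.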
