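This topic has always felt fuzzy to me, so let me first write down the parts I have figured out.
An ontology gives a thing or phenomenon one fixed name so that everyone describes it with the same word and nobody is confused; in other words, it defines terms. There are two ontologies here, SO and GO: SO (Sequence Ontology) names sequence features, and GO (Gene Ontology) names gene functions.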
Gene Ontology:
Links a gene to one or more of its functions.
It has three parts: cellular component: where does the product exhibit its effect
molecular function: how does it work
biological process: what is the purpose of the gene product
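The Gene Ontology is a directed acyclic graph (DAG), not a simple tree: one term can be related to several other terms, so a term may have more than one parent.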
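Because a term can have more than one parent, finding everything a term belongs to means walking upward through every parent link. Here is a minimal sketch of that walk, assuming a tiny made-up hierarchy (the names `root`, `A`, `B`, `C` are placeholders, not real GO terms):

```python
# Toy ontology stored as child -> list of parents.
# All term names are hypothetical and only illustrate the graph shape.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],  # one term with two parents
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("C")))
```

A gene annotated with `C` is then implicitly annotated with every ancestor of `C` as well, which is why the graph structure matters for functional analysis.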
GO data:
It contains a gene ontology definition file and a gene association file
GO association file format: GAF format
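To make the association file less abstract, here is a rough sketch of reading one. It assumes the GAF 2.x layout as I understand it: tab-separated, 17 columns, header/comment lines starting with "!", the gene symbol in column 3, the GO ID in column 5 and the aspect (P, F or C) in column 9. The rows, the helper `make_row` and all identifiers below are made up for illustration.

```python
from collections import defaultdict


def make_row(symbol, go_id, aspect):
    """Build one made-up 17-column GAF row; every value is a placeholder."""
    cols = ["DB", "ID_" + symbol, symbol, "", go_id, "REF:0", "IEA", "",
            aspect, "", "", "protein", "taxon:0", "20200101", "DB", "", ""]
    return "\t".join(cols)


SAMPLE_GAF = "\n".join([
    "!gaf-version: 2.2",  # header/comment lines start with "!"
    make_row("geneA", "GO:9999901", "F"),
    make_row("geneA", "GO:9999902", "P"),
    make_row("geneB", "GO:9999901", "F"),
])


def gene_to_terms(gaf_text):
    """Map each gene symbol to the set of (GO ID, aspect) pairs annotated to it."""
    mapping = defaultdict(set)
    for line in gaf_text.splitlines():
        if not line or line.startswith("!"):
            continue  # skip blank lines and headers
        cols = line.split("\t")
        symbol, go_id, aspect = cols[2], cols[4], cols[8]
        mapping[symbol].add((go_id, aspect))
    return dict(mapping)


for gene, terms in sorted(gene_to_terms(SAMPLE_GAF).items()):
    print(gene, sorted(terms))
```

For real work a dedicated library (goatools, listed below among the tools, is one option) is probably a better choice than hand-rolled parsing.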
Functional analysis:
ORA (Over-representation analysis): To find representative functions of a list of genes
FCS (Functional class scoring):
Gene set enrichment:
The process of discovering the common characteristics potentially present in a list of genes.
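One common way to make "over-represented" concrete is a hypergeometric tail test: given how many genes in the whole population carry a term, how surprising is it to see at least this many in my list? The sketch below assumes that statistic; the tools listed next may use other tests and will usually add multiple-testing correction, which is left out here. The function name `ora_p_value` and the numbers are invented for illustration.

```python
from math import comb


def ora_p_value(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the term of interest."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 1000 genes overall, 50 carry the term,
# and 5 of the 20 genes in my list carry it.
p = ora_p_value(population=1000, annotated=50, sample=20, hits=5)
print(f"p-value: {p:.3g}")
```

A small p-value suggests the term shows up in the list more often than chance alone would explain.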
Tools: AgriGO, DAVID, Panther, goatools, ermineJ, GOrilla, ToppFun
Data format
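Major biological data resources today include GenBank and NCBI (which hosts GenBank).
DNA sequence data is held by the INSDC (International Nucleotide Sequence Database Collaboration), which comprises NCBI, EMBL, and DDBJ.
Protein sequence data is held in UniProt (Universal Protein Resource).
In addition, PDB (Protein Data Bank) is the repository of 3D structural information for biological macromolecules.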
Automate data access:
Sequencing data formats: GenBank, FASTA, FASTQ
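A FASTA record starts with a header line that begins with ">"
The ">" is followed by a string of characters (the sequence ID)
The header may also include some descriptive text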
Some rules: Sequence lines should not be too long
The sequence lines should wrap at the same width
Use upper-case letters
Some FASTA headers include structured information.
Lower-case letters might be used to indicate repetitive regions in a genome.
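The header and wrapping rules above are easy to see in a small parser. This is a minimal sketch, assuming the ID is everything up to the first space in the header and the rest is free-text description; the example records are made up.

```python
def parse_fasta(text):
    """Yield (seq_id, description, sequence) for each record in FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # wrapped sequence lines are joined later
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header, chunks):
    """Split the header into ID and description and join the sequence lines."""
    seq_id, _, description = header.partition(" ")
    return seq_id, description, "".join(chunks)


# Made-up example; the lower-case run stands in for a soft-masked repeat.
EXAMPLE = ">seq1 a made-up example\nACGTACGT\nacgtAC\n>seq2\nGGCC\n"

for seq_id, description, sequence in parse_fasta(EXAMPLE):
    print(seq_id, repr(description), len(sequence))
```

Letter case is preserved on purpose, so lower-case (repeat) regions stay visible to whatever reads the records next.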
FASTQ format
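Each record has four parts: a header line starting with "@"
the sequence itself
a "+" line, optionally followed by the same ID as on the first line
a string of characters encoding the quality of the second line, with the same length as that line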
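The four-part layout can be checked mechanically. The sketch below assumes every record takes exactly four lines (no wrapped sequences) and, for decoding, that quality characters use the Phred+33 encoding; that offset is my assumption, since older data may use a different one. The reads are made up.

```python
def parse_fastq(text):
    """Yield (read_id, sequence, quality_string) from four-line FASTQ records.
    A trailing incomplete record is ignored."""
    lines = text.splitlines()
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        yield header[1:], seq, qual


def phred_scores(quality, offset=33):
    """Convert quality characters to integer scores (offset 33 is assumed)."""
    return [ord(ch) - offset for ch in quality]


# Made-up reads; the second one repeats its ID after the "+".
EXAMPLE = "@read1\nACGT\n+\nII5!\n@read2\nGG\n+read2\n#I\n"

for read_id, seq, qual in parse_fastq(EXAMPLE):
    print(read_id, seq, phred_scores(qual))
```

The length check mirrors the rule that the fourth part must be exactly as long as the second.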
How to get data
Where to get data: NCBI, ENSEMBL, BioMart, UCSC table browser
FASTQ manipulation
Overview data:
seqkit stat *.gz
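For intuition about what such an overview contains (number of sequences, total, minimum, average and maximum length), the same kind of summary can be computed by hand. This is a rough sketch, not a reimplementation of seqkit: it assumes gzip-compressed FASTA input, the function names are my own, and the path passed to `summarize_gz` would be whatever file you have.

```python
import gzip


def fasta_lengths(handle):
    """Yield the length of each sequence from an iterable of FASTA lines."""
    length = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length


def length_summary(lengths):
    """Summarise sequence lengths: count, total, min, average and max."""
    lengths = list(lengths)
    if not lengths:
        return {"num_seqs": 0, "sum_len": 0, "min_len": 0, "avg_len": 0.0, "max_len": 0}
    return {
        "num_seqs": len(lengths),
        "sum_len": sum(lengths),
        "min_len": min(lengths),
        "avg_len": sum(lengths) / len(lengths),
        "max_len": max(lengths),
    }


def summarize_gz(path):
    """Summarise a gzip-compressed FASTA file at `path`."""
    with gzip.open(path, "rt") as handle:
        return length_summary(fasta_lengths(handle))


# In-memory demonstration with made-up records.
print(length_summary(fasta_lengths([">a\n", "ACGT\n", "AC\n", ">b\n", "GG\n"])))
```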
There are too many FASTA/Q manipulations to cover here, so I only list what you can do with a FASTA/Q file; the answers are in Chapter 7 of the Biostar Handbook.
How to get the GC content of every sequence in a FASTA/Q file?
How to extract a subset of sequences from a FASTA/Q file with name/ID list file?
How to find FASTA/Q sequences containing degenerate bases and locate them?
How to remove FASTA/Q records with duplicated sequences?
How to locate motif/subsequence/enzyme digest sites in FASTA/Q sequence?
How to sort a huge number of FASTA sequences by length?
How to split FASTA sequences according to information in the header?
How to search and replace within a FASTA header using character strings from a text file?
How to extract paired reads from two paired-end reads files?
How to concatenate two FASTA sequences into one?
You can follow the answers in the Biostar Handbook if you want to do any of the above.
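To give a flavour of two of these questions (GC content and removing duplicated sequences), here is a plain-Python sketch. These are not the handbook's answers; the function names are my own, and treating sequences as duplicates regardless of letter case is an assumption.

```python
def gc_content(sequence):
    """Return the percentage of G and C bases in a sequence (case-insensitive)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return 100.0 * (upper.count("G") + upper.count("C")) / len(upper)


def remove_duplicate_sequences(records):
    """Keep the first (name, sequence) pair for each distinct sequence.
    Case is ignored when comparing, which is a design assumption."""
    seen = set()
    unique = []
    for name, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((name, seq))
    return unique


# Made-up records: "b" duplicates "a" apart from letter case.
records = [("a", "ACGT"), ("b", "acgt"), ("c", "GGCC")]
for name, seq in remove_duplicate_sequences(records):
    print(name, f"{gc_content(seq):.1f}%")
```

For large files a dedicated tool will be much faster; the point here is only to show what each operation computes.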