相似的基因在不同物种中,其功能往往是保守的。通过GO注释可获得跨物种的同源基因及其基因产物的功能描述。对于非模式生物或者无参考基因组的项目,经常需要进行背景基因的GO注释,以便后期差异基因功能富集或者整体背景基因功能了解。
这里提供一种本地无参物种背景基因GO注释,依赖Uniprot2GO脚本。仅做记录。
参考:
https://www.jianshu.com/p/ea2da43b52b9
https://www.thinbug.com/q/51173860
https://blog.csdn.net/ygyxl/article/details/79742751
准备文件:
(1)swissprot数据库文件
(2)idmapping.tb.gz文件和Uniprot2GO_annotated.py文件
准备工具:
diamond, TBtools
一、diamond blast比对
-
前提通过Trinity组装得到的Unigene文件,如下所示
- swissprot数据库比对
(1)下载swissprot数据库
ascp下载(推荐)
~/.aspera/connect/bin/ascp -QT \
-i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh \
-k1 -l 300m \
anonftp@ftp.ncbi.nlm.nih.gov:/blast/db/FASTA/swissprot.gz ./
wget下载
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/swissprot.gz
(2)diamond建库
diamond makedb --in swissprot.gz -d diamond_swiss -p 24
(3)blast比对:e-value设置为1e-5,输出最好hits
diamond blastx --max-hsps 1 --max-target-seqs 1 \
-d ~database/swissport/diamond_swiss -q ~/Unigene.fa \
-o swiss_match_out6 --sensitive --outfmt 6 --evalue 1e-5 -p 20
得到的swiss_match_out6文件,如下所示,第二列为Uniprot ID:
二、GO注释
- 获取idmapping.tb.gz文件(~10G)
(1)Uniprot官网获取:(最新)
https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/idmapping/idmapping_selected.tab.gz
(2)TBtools分享
https://tbtools.cowtransfer.com/s/566e88227a0045
本地下载后上传服务器,得到idmapping文件,以Tab分隔,第1列为Uniprot ID,第8列为GOID
- 修改swissprot比对得到文件swiss_match_out6
GO注释只需要前2列,即Uniprot ID,并且第2列只取“.”前的字符,通过以下得到:
awk -F '.' '$2=$2' OFS="\t" swiss_match_out6 > swiss_out6_tmp.txt
cut -f 1,2 swiss_out6_tmp.txt > swiss_out6.txt
- 修改Uniprot2GO_annotated.py文件
(1)第7行修改为:
print (USAGE)
(2)由idmapping.tb.gz文件格式可知:
为swissprot比对结果做GO注释时,第16行应改为:
UniProt_GO[lsplit[0]] =lsplit[7]
为NR比对结果作GO注释时,第16行应改为:
UniProt_GO[lsplit[3]] =lsplit[7]
(3)如果报lsplit = line.rstrip().split("\t") TypeError: a bytes-like object is require错误,第14行需要加上decode()函数:
lsplit =line.decode().rstrip().split("\t")
- 运行Uniprot2GO_annotated.py文件进行GO注释:
python Uniprot2GO_annotated.py idmapping.tb.gz swiss_out6.txt swiss_go.out
得到GO注释文件,一行对应多行,如下所示:
如果要改成一行对应一行,可以通过下面的方式:
(1)
while read line;do
left=$(echo $line|grep -oE '^[^ ]+ +') #the left part + a blank
echo $line |
grep -oE '[^ ]+$' | #take the right part
sed -r "s/([^,]+),?/$left\1\n/g" |
#prefix every GO::, with the left part and go back to line
grep 'c' | #remove the empty line added by the very last group
tee -a swiss_go_single.out #输出文件名
done< swiss_go.out
(2)
while read line
do
var1=$(echo $line | awk '{print $1}') # assign first field to var1
Arrayvals=($(echo $line | awk '{print $2}' | sed -e 's/,/ /g'))
# create an array from second filed
for (( i=0; i < ${#Arrayvals[@]} ; i++ )) # iterate the array using a for loop , ${#Arrayvals[@]} -> gives the length of array
do
echo "${var1} ${Arrayvals[${i}]}" # echo in desired format
done
done < swiss_go.out > swiss_go_single.out
附:Uniprot2GO_annotated.py修改后
import sys
import gzip
USAGE = "\nusage: python %s idmapping.tb.gz blastout outputfile\n" % sys.argv[0]
if len(sys.argv) != 4:
print (USAGE)
sys.exit()
def parseIDmapping(filename):
UniProt_GO = {}
with gzip.open(filename, 'r') as f:
for line in f:
lsplit = line.decode().rstrip().split("\t")
if lsplit[7]:
UniProt_GO[lsplit[0]] = lsplit[7]
return UniProt_GO
def parseBlastOut(filename):
tab_res = {}
with open(filename, 'r') as f:
for line in f:
lsplit = line.split()
tab_res.setdefault(lsplit[0], set()).add(lsplit[1])
return tab_res
UniProtKB_GO = parseIDmapping(sys.argv[1])
BlastOut = parseBlastOut(sys.argv[2])
OUT = open(sys.argv[3], 'w')
for i in BlastOut:
temp = []
for j in BlastOut[i]:
if j in UniProtKB_GO:
go = UniProtKB_GO[j].split("; ")
temp = temp + go
else:
continue
if temp:
OUT.write(i + "\t" + ",".join(set(temp)) + "\n")
OUT.close()