「蛋白注释」- 批量InterProScan注释

Mar 20,2020 更新！

什么是InterProScan数据库

InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites. To classify proteins in this way, InterPro uses predictive models, known as signatures, provided by several different databases (referred to as member databases) that make up the InterPro consortium. We combine protein signatures from these member databases into a single searchable resource, capitalising on their individual strengths to produce a powerful integrated database and diagnostic tool.

可以看出InterPro数据库主要通过蛋白序列信息将蛋白分类，分为一个个家族。是一个很强大的数据库，具体的算法和实现逻辑不阐述，主要是我也不懂😂🤣，这里主要介绍这个数据库的使用方法。
主要有三种方式：

在线网站注释，「InterPro 」根据网站提示一步一步就会得到结果，这个也没啥好讲的，唯一的缺点就是没有办法批量做，每次只能输入一个蛋白质序列。
本地InterProScan：参考「Kai大神的blog」这个blog中写的很清楚，但是太耗存储了，所以我也放弃了
利用EMBL-EBI，（最好搭梯子访问）这个官方的黑科技很赞，下面主要说一下怎么用这玩意（Macos下，linux下应该也一样，win不太清楚）

重新看了脚本发现：

# General options
parser.add_option('-h', '--help', action='store_true', help='Show this help message and exit.')
parser.add_option('--email', help='E-mail address.')
parser.add_option('--title', help='Job title.')
parser.add_option('--outfile', help='File name for results.')
parser.add_option('--outformat', help='Output format for results.')
parser.add_option('--asyncjob', action='store_true', help='Asynchronous mode.')
parser.add_option('--jobid', help='Job identifier.')
parser.add_option('--polljob', action="store_true", help='Get job result.')
parser.add_option('--pollFreq', type='int', default=3, help='Poll frequency in seconds (default 3s).')
parser.add_option('--status', action="store_true", help='Get job status.')
parser.add_option('--resultTypes', action='store_true', help='Get result types.')
parser.add_option('--params', action='store_true', help='List input parameters.')
parser.add_option('--paramDetail', help='Get details for parameter.')
parser.add_option('--multifasta', action='store_true', help='Treat input as a set of fasta formatted sequences.')
parser.add_option('--useSeqId', action='store_true', help='Use sequence identifiers for output filenames.'
                                                          'Only available in multi-fasta and multi-identifier modes.')
parser.add_option('--maxJobs', type='int', help='Maximum number of concurrent jobs. '
                                                'Only available in multifasta or list file modes.')

parser.add_option('--quiet', action='store_true', help='Decrease output level.')
parser.add_option('--verbose', action='store_true', help='Increase output level.')
parser.add_option('--version', action='store_true', help='Prints out the version of the Client and exit.')
parser.add_option('--debugLevel', type='int', default=debugLevel, help='Debugging level.')
parser.add_option('--baseUrl', default=baseUrl, help='Base URL for service.')

(options, args) = parser.parse_args()

存在--multifasta的参数，也就是说没必要把fasta成一个一个的。果然在后续的代码中看到：

def multipleServiceRun(email, title, params, useSeqId, maxJobs, outputLevel):
    seqs = params['sequence']
    seqs = seqs.split(">")[1:]
    i = 0
    j = maxJobs
    done = 0
    jobs = []
    while done < len(seqs):
        c = 0
        for seq in seqs[i:j]:
            c += 1
            params['sequence'] = ">" + seq
            if c <= int(maxJobs):
                jobId = serviceRun(options.email, options.title, params)
                jobs.append(jobId)
                if outputLevel > 0:
                    if useSeqId:
                        print("Submitting job for: %s" % str(seq.split()[0]))
                    else:
                        print("JobId: " + jobId, file=sys.stderr)
        for k, jobId in enumerate(jobs[:]):
            if outputLevel > 0:
                print("JobId: " + jobId, file=sys.stderr)
            else:
                print(jobId)
            if useSeqId:
                options.outfile = str(seqs[i + k].split()[0])
            getResult(jobId)
            done += 1
            jobs.remove(jobId)
        i += maxJobs
        j += maxJobs
        time.sleep(pollFreq)

其实需要这几个email, title, params, useSeqId, maxJobs, multifasta参数就可以了。

email 没啥好解释的,填写自己的email就可以了。【必写】
useSeqId 参数解释说是用序列标识（fasta文件">"后面的内容，通俗说就是基因名字）来作为输出文件的文件名【选写】【推荐】
maxJobs 每次提交jobs数量，不建议太大，我尝试的25没有问题，这个数量和你要做interproscan的基因总数无关，可以不拆分fa文件，他会每次25个的给你提交。【推荐写上】
multifasta 这个是批量做最主要的参数，如果做一个就是sequence批量就选multifasta后面直接跟你要做InterProScan的蛋白序列。序列格式自己看脚本文件，开头有说的。

🔥综上所述我个人建议💢快速批量💯InterProScan的代码就是🔥：

python ~/gitee/MyScript/iprscan5.py --multifasta itb.fa --maxJobs 25 --useSeqId --email myemail@126.com --outformat tsv &

-outformat tsv 表示只输出tsv文件，这样速度最快，如果输出html等文件可能出现报错..还有一点tsv文件可以通过cat命令直接合并，最后统一处理。

---------------------------------❌<下面步骤不推荐!>❌-------------------------------------

步骤❌

这里直接搬运过来了

The Representational State Transfer (REST) sample clients are provided for a number of programming languages. For details of how to use these clients, download the client and run the program without any arguments.

Language	Download	Requirements
Perl	iprscan5.pl	LWP and XML::Simple
Python	iprscan5.py	xmltramp2

For details see Environment setup for REST Web Services and Examples for Perl REST Web Services Clients pages.

Perl用不明白，所以打算用Python的脚本，首先就是先下载上面的iprscan5.py，但是想用这个脚本需要先安装依赖：xmltramp2

pip install xmltramp2==3.0.10

其实这个脚本也是一个一个提交到网站做InterProScan的，但是你可以自己写个python或者shell script实现批量化操作，但是需要提前把fasta文件分割成一条序列一个文件，这里拿来主义用的别人现成的脚本：参考blog中有下载链接

## 首先看下一共有几条序列
grep '>' 01.lp.sss.candidate.fa | wc -l
# 279 ## 共有279条序列，那么就切成279个文件
perl fasta-splitter.pl --n-parts 279 01.lp.sss.candidate.fa
# 01.lp.sss.candidate.fa: 279 sequences, 122269 bp => dividing into 279 parts ...................................................................................................................................................................................................................................................................................... OK
# All done, 0 seconds elapsed
# 01.lp.sss.candidate.part-128.fa 01.lp.sss.candidate.part-269.fa
# 01.lp.sss.candidate.part-129.fa 01.lp.sss.candidate.part-270.fa
# 这里有点问题一共最后分成了278个，少一个，打算揪出这个家伙。
for i in `ls *.part*`; do grep '>' $i > $i.checkWrong.xls; done
wc -l *.checkW*
## 2 01.lp.sss.candidate.part-270.fa.checkWrong.xls 
## 270这个有两个
rm *.checkW*
less *270*
# >g35074 
# seq...
# >g47610
# seq...
# 手动分开吧，没办法
vim *270
# 复制第二个序列粘贴到新的01.lp.sss.candidate.part-270.2.fa里面
vim 01.lp.sss.candidate.part-270.2.fa
mv 01.lp.sss.candidate.part-270.fa 01.lp.sss.candidate.part-270.1.fa
## 这样顺序不会出错

## 强迫症，想把文件改成序列ID写个脚本搞定他
## 1. 做个配置文件
ls *.part* > filenameOld ##先把旧文件的文件名保存到filenameOld中，
grep '>' 01.lp.sss.candidate.fa >seqID ## 再将序列名字保存到seqID中，vim打开用 :1,$s/>//，删掉>
## double check 序列和配置文件顺序
for i in `ls *.part*`; do grep '>' $i >> checkID; done
paste checkID seqID > doublecheckSeqOrder
## 可以手动检查也可以用awk检查
awk '{if ($1 == $2) print "True"; else print "False"}' doublecheckSeqOrder > checkresult
cat checkresult | sort |uniq -c
#  279 True
paste checkID filenameOld > ChangeName.config
# 删掉 > 还是喜欢在vim下修改： :1,$s/>//g
vim ChangeName.sh
## 脚本内容如下：
while read id
do
    sample=$(echo $id |cut -d" " -f 1 ) 
    oldfile=$(echo $id |cut -d" " -f 2 ) 
    echo $sample
    echo $oldfile
    
mv ${oldfile} ${sample}.fa
done <$1
## 加执行权限
chmod +x ChangeName.sh
sh ChangeName.sh ChangeName.config
## done

单个序列测试

# 先跑一个测试下
python ~/10.InterproScan/iprscan5.py --sequence ../01.SeqSplit/g10038.fa --email myemail@126.com
JobId: iprscan5-R20200128-131432-0055-48082224-p2m
#RUNNING
#RUNNING
#RUNNING
#RUNNING
#RUNNING
#RUNNING
#RUNNING
#RUNNING
#RUNNING
#FINISHED
#Creating result file: iprscan5-R20200128-131432-0055-48082224-p2m.out.txt
#Creating result file: iprscan5-R20200128-131432-0055-48082224-p2m.log.txt
#Creating result file: iprscan5-R20200128-131432-0055-48082224-p2m.tsv.txt
#Creating result file: iprscan5-R20200128-131432-0055-48082224-p2m.xml.xml
#Creating result file: iprscan5-R20200128-131432-0055-48082224-p2m.htmltarball.html.tar.gz
#Creating result file: iprscan5-R20200128-131432-0055-48082224-p2m.gff.txt
#Creating result file: iprscan5-R20200128-131432-0055-48082224-p2m.svg.svg
#Creating result file: iprscan5-R20200128-131432-0055-48082224-p2m.sequence.txt
#Creating result file: iprscan5-R20200128-131432-0055-48082224-p2m.json.txt

批量进行

## 配置文件
ls ../02.SeqSplit/*.fa > config
## 脚本
while read id
do
    file=$(basename $id ) 
    sample=${file%%.*} 
    echo $id $sample 
    python  ~/10.InterproScan/iprscan5.py --sequence $id --email shawnwang2016@126.com --outfile ${sample}
done <$1
## 执行
chmod +x BatchInterProScan.sh
nohup sh BatchInterProScan.sh config &
## done

当然也可以限定输出形式，--outfmt具体vim 一下iprscan5.py都有，然后就是漫长的等待结果了。

❀❀完了，碎一觉明天看结果❀❀

「蛋白注释」- 批量InterProScan注释