目的:为了分析基因转录后5‘端和3’端的修饰或添加序列的情况。采用的斑马鱼线粒体基因组为标准序列
- 将去掉接头的序列文件fq.gz 变为fasta文件格式
2.由于fasta文件过大,可以切割为多个文件 - 下载blast软件,构建标准数据库
4.对多个文件进行本地blast(构建的标准数据库),生成对应的多个excel表
5.运行python程序对多个excel表进行数据处理分析,并添加5和3端序列
1.将去掉接头的序列文件fq.gz 变为fasta文件格式
import time
start = time.time()
import glob
import gzip
path = r'K:\test_clean_data\MCK_FKDL210000063-1a_1.clean.fq.gz'
for i in glob.glob(path):
with gzip.open(i,'rt') as fp:
output_fasta = open("%s.fa"%i[:-9],'w')
i = 0
for line in fp:
i += 1
if i % 4 == 1:
line_new = line[1:]
output_fasta.write('>'+line_new)
elif i % 4 == 2:
output_fasta.write(line)
output_fasta.close()
end = time.time()
print("used %s s" % str(end - start))
2. fasta文件切割为多个文件
2.1 将一个文件按每多少行切割成若干个文件
首先判断文件多少行数
wc -l MCK_FKDL210000063-1a_1.cl.fa
6521488 MCK_FKDL210000063-1a_1.cl.fa
MCK_FKDL210000063-1a_1.cl.fa 该文件有600多万行,因此选择1000000为一个文件
$ split -l 1000000 MCK_FKDL210000063-1a_1.cl.fa ./SPLIT/SPLIT.fa -d -a 4
1000000 按1000000行大小分割,可根据需要需要修改
MCK_FKDL210000063-1a_1.cl.fa 需要切割的大文件
./SPLIT/SPLIT.fa 保存文件的位置以及文件名
-d 后缀位数字 如果不加则是字母形式的后缀
-a -4 后缀数字为4位数
批量操作, ./SPLIT/SPLIT.fa 后面不能接文件,应改为文件夹
ls
*.fa ## 没有结果
ls *.fa ### 输出路径下所有的文件数
ls *.fa|while read id ; do split -l 1000000 $id ./SPLIT/ -d -a 4;done
###在单个文件夹下生成
ls *.fa|while read id ; do split -l 1000000 $id ./SPLIT/$id/ -d -a 4;done
###在不同文件夹下生成
ls *.fa|while read id ; do mkdir ./SPLIT/$id ;done
###先自动生成子文件夹
ls *.fa|while read id ; do split -l 1000000 $id ./SPLIT/$id/ -d -a 4;done
ls *.fa|while read id ; do split -l 1000000 $id ./SPLIT/$id/$id -d -a 4;done
##用这个更好
ls *.fa|while read id; do blastn -task blastn-short -query $id -out
./zebrafish_mito_blastn_excels/$id.xls -db ./zebrafish_mito_blast_databases/zebra_mito_blast -outfmt 7 -evalue 1e-5; done
2.2将一个文件按每多少kb切割成若干个文件
$ split -b 300k hibernate.log bbb
2.3 合并 -- cat命令
$ cat part_* > merge_file.txt
例子
split -l 1000000 MCK_FKDL210000063-1a_1.cl.fa ./SPLIT/SPLIT.fa -d -a 4
3.下载blast软件,构建标准数据库
3.1.本地安装blast软件
https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
mkdir BLAST
wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.12.0+-x64-linux.tar.gz
tar -zxvf ncbi-blast-2.12.0+-x64-linux.tar.gz
cd ~
vi ~/.bashrc
小写i进入编辑模式,或者按两次i。
按ESC,保存退出:wq ,不保存退出:q!
#写入PATH
source ~/.bashrc
echo $PATH
3.2. 构建数据库
首先去ncbi下载斑马鱼线粒体基因组fasta文件,命名为zebra_mito.fasta
$ makeblastdb -in zebra_mito.fasta -dbtype nucl -parse_seqids -out zebra_mito
makeblastdb -in input_file -dbtype molecule_type -title database_title -parse_seqids -out database_name -logfile File_Name
-in 后接输入文件,你要格式化的fasta序列
-dbtype 后接序列类型,nucl为核酸,prot为蛋白
-parse_seqids 推荐加上,
-out 后接数据库名
-logfile 日志文件,如果没有默认输出到屏幕
执行上述代码后将生成后缀为nin,nhr,nsq,nsi,nsd,nog等9个文件,至此,自定义搜索文件生成完成
4.对多个分割文件进行本地blast(构建的标准数据库),生成对应的多个excel表
4.1 blastn 使用库文件进行搜索了,搜索代码如下:
blastn -task blastn-short -query SPLIT.fa0000 -out seq0000.xls -db ./mito_dna_data/zebra_mito -outfmt 7 -evalue 1e-5
也可以是其他代码
$ blastn -query part.fasta -out seq.blast -db cox -outfmt 0
blastn -query part.fasta -out seq.blast -db cox -outfmt 7
$ blastp -query seq.fasta -out seq.blast -db dbname -outfmt 6 -evalue 1e-5 -num_descriptions 10 -num_threads 8
$ blastn -query seq.fasta -out seq.blast -db dbname -outfmt 6 -evalue 1e-5 -num_descriptions 10 -num_threads 8
$ blastx -query seq.fasta -out seq.blast -db dbname -outfmt 6 -evalue 1e-5 -num_descriptions 10 -num_threads 8
-query: 输入文件路径及文件名
-out:输出文件路径及文件名
-db:格式化了的数据库路径及数据库名
-outfmt:输出文件格式,总共有12种格式,6是tabular格式对应之前BLAST的m8格式
-evalue:设置输出结果的e-value值, 越少越严格
-num_alignments 显示比对数Default = 250
-num_descriptions:单行描述的最大数目 default=50
-num_threads:线程
重点是-outfmt 6,也就是之前版本的m 8格式 不同数字代表输出格式不同
-outfmt <String>
alignment view options: #####常用0,6,7
0 = Pairwise, 简单数据输出
1 = Query-anchored showing identities, 可视化图形
2 = Query-anchored no identities, 图形化比配,与2差不多
3 = Flat query-anchored showing identities, 可视化图形
4 = Flat query-anchored no identities, 可视化图形
5 = BLAST XML, 输出为XML,不建议用
6 = Tabular, 输出简单明了
7 = Tabular with comment lines,
8 = Seqalign (Text ASN.1), 不懂
9 = Seqalign (Binary ASN.1),
10 = Comma-separated values,
11 = BLAST archive (ASN.1),
12 = Seqalign (JSON),
13 = Multiple-file BLAST JSON,
14 = Multiple-file BLAST XML2,
15 = Single-file BLAST JSON,
16 = Single-file BLAST XML2,
17 = Sequence Alignment/Map (SAM),
18 = Organism Report
6的输出结果
结果中从左到右每一列的意义分别是:
[00] Query id
[01] Subject id
[02] % identity
[03] alignment length
[04] mismatches
[05] gap openings
[06] q. start
[07] q. end
[08] s. start
[09] s. end
[10] e-value
[11] bit score
4.2 极短序列比对
$ blastn -task blastn-short -query part.fasta -out seq.blast -db cox -outfmt 7 -evalue 1e-5
对于特别短的序列例如20bp(<30)左右的序列,在查找的时候我们需要添加三个参数,如下代表所示:
还是输出为txt格式,读取速度比excel快,后面的程序进行更换
-task blastn-short
-word_size 4 \
-evalue 1 \
-task参数指定为短序列的匹配算法
-word_size该参数只能指定4以上的值,这里写了最小的4,指定用于匹配算法的最短序列
-evalue 该值越小,得到的结果将更加严苛,这里设定为1即可。
$ blastn -help 可以获取帮助
4.3 批量完成程序
为了方便后续对excel的操作,生成格式改为.xlsx,,###好像还是不行,只能改为xls
for i in `seq 0 6`;do blastn -task blastn-short -query SPLIT.fa000${i} -out seq000${i}.xls -db ./mito_dna_data/zebra_mito -outfmt 7 -evalue 1e-5;done
同时也可以生成可视化文件作为对比,选择1 作为输出格式
for i in `seq 0 6`;do blastn -task blastn-short -query SPLIT.fa000${i} -out seq000${i}.blast -db ./mito_dna_data/zebra_mito -outfmt 1 -evalue 1e-5;done
5. 对多个excel表进行python分析
首先需要将xls格式的表通过另存为改为xlsx格式,否则读取报错
变为xlsx格式
import os
import os.path
import win32com.client as win32
## 根目录
rootdir = u'K:\\test_clean_data\seq3'
# 三个参数:父目录;所有文件夹名(不含路径);所有文件名
for parent, dirnames, filenames in os.walk(rootdir):
for fn in filenames:
filedir = os.path.join(parent, fn)
print(filedir)
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Open(filedir)
# xlsx: FileFormat=51
# xls: FileFormat=56,
# 后缀名的大小写不通配,需按实际修改:xls,或XLS
wb.SaveAs(filedir.replace('xls', 'xlsx'), FileFormat=51)
wb.Close()
excel.Application.Quit()
重新生成xlsx文件。接着对xlsx文件进行处理
from xlrd import open_workbook
import requests
import re
import json
import time
import os
import xlwt
import xlwt
import xlrd
path = r'K:\test_clean_data\seq'
path1 = r'K:\test_clean_data\MCK_FKDL210000063-1a_1.cl.fa'
seq_id=[]
seq=[]
with open(path1, 'rb') as f:
for line in f:
Line = line.decode('utf-8', 'replace')
if Line.startswith('>'):
Line1= Line.split('1:N')[0]
Line2=Line1[1:].strip()
seq_id.append(Line2)
#print(Line2)
else:
seq.append(Line)
#seq_dict=dict(zip(seq_id,seq))
try:
for root, dirs, files in os.walk(path):
style = xlwt.XFStyle()
font = xlwt.Font()
font.name = 'SimSun'
style.font = font
w = xlwt.Workbook(encoding='utf-8')
# 添加个sheet
ws = w.add_sheet('sheet 1', cell_overwrite_ok=True)
# 当前写入表格到第 row行
ws.write(0, 0, 'ID')
ws.write(0, 1, 'refer_ID0')
ws.write(0, 2, 'identity')
ws.write(0, 3, 'length')
ws.write(0, 4, 'mismatch')
ws.write(0, 5, 'gaps')
ws.write(0, 6, 'q. start')
ws.write(0, 7, 'q. end')
ws.write(0, 8, 's. start')
ws.write(0, 9, 's. end')
ws.write(0, 10, 'evalue')
ws.write(0, 11, 'score')
ws.write(0, 12, '5-seq')
ws.write(0, 13, '3-seq')
###把数据汇入一个表中
for file in files:####可以获取所有文件的路径
path = os.path.join(root, file)####单个文件的绝对路径
print(path)
workbook = xlrd.open_workbook(path)
sheet2 = workbook.sheet_by_index(0)
print(sheet2 )
cols_0 = sheet2.col_values(0)
cols_1 = sheet2.col_values(1)
cols_2=sheet2.col_values(2)
cols_3 = sheet2.col_values(3)
cols_4 = sheet2.col_values(4)
cols_5 = sheet2.col_values(5)
cols_8 = sheet2.col_values(8)
cols_9 = sheet2.col_values(9)
cols_10 = sheet2.col_values(10)
cols_11 = sheet2.col_values(11)
cols_start=sheet2.col_values(6)
cols_end = sheet2.col_values(7)
row = 1
i=0
for col in cols_1:
if col == 'NC_002333.2':
#print(col)
#ws.write(row,i,col)
print(i+1)
if cols_0[i].strip() in seq_id:
ws.write(row, 0, cols_0[i])
ws.write(row, 1, cols_1[i])
ws.write(row, 2, cols_2[i])
ws.write(row, 3, cols_3[i])
ws.write(row, 4, cols_4[i])
ws.write(row, 5, cols_5[i])
ws.write(row, 6, cols_start[i])
ws.write(row, 7, cols_end[i])
ws.write(row, 8, cols_8[i])
ws.write(row, 9, cols_9[i])
ws.write(row, 10, cols_10[i])
ws.write(row, 11, cols_11[i])
index=seq_id.index(cols_0[i])
start=int(cols_start[i])
print(start)
end=int(cols_end[i])
print(end)
tagart_seq=seq[index]
short_seq=tagart_seq[start-1:end]
short_seq_5=tagart_seq.split(short_seq)[0]
short_seq_3 = tagart_seq.split(short_seq)[1]
#print(index)
print(seq[index])
print(short_seq_5)
print(len(tagart_seq))
print(len(short_seq))
print(len(short_seq_5))
print(short_seq_3)
print(len(short_seq_3))
ws.write(row, 12, short_seq_5)
ws.write(row, 13, short_seq_3)
row=row+1
#if cols_0[i] in total_seq[m]:
#print(i+1)
#print(total_seq[m+1])
#print(total_seq[m])
##ws.write(row,i,cols_0[i+1])
i=i+1
w.save(file+'zebra_mito.xls')
except BaseException as e: ##设置避免出差错,运行不了
print('Error:', e)
生成的结果如下
后续就可以按照位置排序,把polya或polyt挑出来
##################################################################################################################
如果还想分析基因之间的相交情况
from xlrd import open_workbook
import requests
import re
import json
import time
import os
import xlwt
import xlwt
import xlrd3
path_excel = r'I:\MTO1_mito_rna_seq\MTO1_whole_adult_mito_mRNA_seq\zebrafish_mito_blastn_excels'
path2 = r'I:\MTO1_mito_rna_seq\MTO1_whole_adult_mito_mRNA_seq\zebrafish_mito_gene_id\zebrafish_mito_gene_id.xlsx'
path_fasta = r'I:\MTO1_mito_rna_seq\MTO1_whole_adult_mito_mRNA_seq\clean_data_fq_to_fa'
file_names = []
for root, dirs, files in os.walk(path_excel):
file_names.append(dirs)
print(file_names[0])
for name in file_names[0]:
path_xls = os.path.join(path_excel,name)
path_fa = os.path.join(path_fasta,name)
###########################################################
print(path_fa)
print(path_xls)
# 获取基因对应的序列区间。先构建列表
mito_plus_gene_position = []
mito_minus_gene_position = []
mito_plus_gene = []
mito_minus_gene_direction = []
workbook = xlrd3.open_workbook(path2)
sheet1 = workbook.sheet_by_index(0)
col_1 = sheet1.col_values(0)
col_2 = sheet1.col_values(1)
col_3 = sheet1.col_values(2)
col_4 = sheet1.col_values(3)
for x in col_2:
mito_plus_gene_position.append(int(x))
for x in col_1:
mito_plus_gene.append(x)
for x in col_3:
if x is not None:
mito_minus_gene_position.append(int(x))
for x in col_4:
if x is not None:
mito_minus_gene_direction.append(x)
print(mito_plus_gene)
print(mito_plus_gene_position)
print(mito_minus_gene_position)
print(mito_minus_gene_direction)
#################################################
# 构建测序文件所有测序Reads的集合,同时建立相对应的ID全集合
seq_id = []
seq = []
path_total=[]
for root, dirs, files in os.walk(path_fa):
for file in files: ####可以获取所有文件的路径
path_11 = os.path.join(root, file) ####单个文件的绝对路径#######
print(path_11)
path_total.append(path_11)
with open(path_11, 'rb') as f:
for line in f:
#print(line)
Line = line.decode('utf-8', 'replace')
#print(Line)
if Line.startswith('>'):
Line1 = Line.split('1:N')[0]
Line2 = Line1[1:].strip()
seq_id.append(Line2)
#print(Line2)
else:
seq.append(Line)
f.close()
print(len(seq_id))
#print(seq[1:100])
# seq_dict=dict(zip(seq_id,seq))
######################################################################################################################
# 遍历所有路径下的Excel表,
if len(seq_id)>0:
try:
for root, dirs, files in os.walk(path_xls):
style = xlwt.XFStyle()
font = xlwt.Font()
font.name = 'SimSun'
style.font = font
w = xlwt.Workbook(encoding='utf-8')
# 添加个sheet
ws = w.add_sheet('sheet 1', cell_overwrite_ok=True)
# 当前写入表格到第 row行
ws.write(0, 0, 'ID')
ws.write(0, 1, 'refer_ID0')
ws.write(0, 2, 'identity')
ws.write(0, 3, 'length')
ws.write(0, 4, 'mismatch')
ws.write(0, 5, 'gaps')
ws.write(0, 6, 'q. start')
ws.write(0, 7, 'q. end')
ws.write(0, 8, 's. start')
ws.write(0, 9, 's. end')
ws.write(0, 10, 'evalue')
ws.write(0, 11, 'score')
ws.write(0, 12, '5-seq')
ws.write(0, 13, '3-seq')
###把数据汇入一个表中
#print(files)
for file in files: ####可以获取所有文件的路径
row = 1
file_name=file[:-4]
print(file)
print(file_name)
path = os.path.join(root, file) ####单个文件的绝对路径
print(path)
with open(path, 'r') as file_to_read:
for line in file_to_read.readlines():
if 'MT' in line:
b = line.split()
if int(b[3]) <= 146: ##########针对750万reads数,设置147大小仍然有点大,会有几个文件写入excel表写不进去
cols_0 = b[0].strip()
cols_1 = b[1].strip()
cols_2 = b[2].strip()
cols_3 = b[3].strip()
cols_4 = b[4].strip()
cols_5 = b[5].strip()
cols_8 = b[8].strip()
cols_9 = b[9].strip()
cols_10 = b[10].strip()
cols_11 = b[11].strip()
cols_start = b[6].strip()
cols_end = b[7].strip()
if cols_0 in seq_id:
ws.write(row, 0, cols_0)
ws.write(row, 1, cols_1)
ws.write(row, 2, cols_2)
ws.write(row, 3, cols_3)
ws.write(row, 4, cols_4)
ws.write(row, 5, cols_5)
ws.write(row, 6, cols_start)
ws.write(row, 7, cols_end)
ws.write(row, 8, cols_8)
ws.write(row, 9, cols_9)
ws.write(row, 10, cols_10)
ws.write(row, 11, cols_11)
index = seq_id.index(cols_0)
# print(index)
tagart_seq = seq[index]
start = int(cols_start)
end = int(cols_end)
short_seq = tagart_seq[start - 1:end] ########匹配到的序列, 进行截取
short_seq_5 = tagart_seq.split(short_seq)[0]
short_seq_3 = tagart_seq.split(short_seq)[1]
ws.write(row, 12, short_seq_5)
ws.write(row, 13, short_seq_3)
#########################################################################
q_start = int(cols_8)
q_end = int(cols_9)
#print(q_start, q_end)
if q_start > q_end:
for pos in range(len(mito_plus_gene_position)):
# print(len(mito_plus_gene_position))
if mito_plus_gene_position[pos] <= q_start <= mito_minus_gene_position[pos]:
# print(pos)
gene_name_1 = mito_plus_gene[pos]
#print(gene_name_1)
ws.write(row, 14, gene_name_1)
break
for pos in range(len(mito_plus_gene_position)):
if mito_plus_gene_position[pos] <= q_end <= mito_minus_gene_position[pos]:
gene_name_2 = mito_plus_gene[pos]
ws.write(row, 15, gene_name_2)
ws.write(row, 16, mito_minus_gene_direction[pos])
break
if q_start < q_end:
for pos in range(len(mito_plus_gene_position)):
# print(len(mito_plus_gene_position))
if mito_plus_gene_position[pos] < q_start < mito_minus_gene_position[pos]:
# print(pos)
gene_name_1 = mito_plus_gene[pos]
# print(gene_name_1)
ws.write(row, 14, gene_name_1)
break
for pos in range(len(mito_plus_gene_position)):
if mito_plus_gene_position[pos] < q_end < mito_minus_gene_position[pos]:
gene_name_2 = mito_plus_gene[pos]
ws.write(row, 15, gene_name_2)
ws.write(row, 16, mito_minus_gene_direction[pos])
break
row = row + 1
print(row)
w.save(file_name + 'zebra_mito.xls')
except BaseException as e: ##设置避免出差错,运行不了
print('Error:', e)
else:
continue