前言
需求:N个fasta文件,里面的序列可能有大量的重复,但是其header并不一定相同,需要把他们合并并去重。
方法一
首先用cat命令将他们合并
cat *.fasta > merge.fasta
perl合并
perl rmfastadup.pl merge.fasta nodup.fasta
rmfastadup.pl 代码如下:
#! /usr/bin/perl
# Downloaded from http://www.bioinformatics-made-simple.com
# 去除fasta文件中重复的序列
# 用法:perl rmfastadup.pl input,fasta output.fasta
use Bio::Perl;
my $infile1 = $ARGV[0];
my %FIRSTSEQ;
my $total = 0;
my $total1=0;
my $dup = 0;
open (OUTIE, ">$ARGV[1]");
$infile1 = Bio::SeqIO -> new('-format'=>'Fasta',-file => $infile1);
#read in all fasta sequences
while ((my $seqobj = $infile1->next_seq()))
{
$rawid = $seqobj -> display_id;
$seq = $seqobj -> seq;
#print "ID is $rawid\n";
#print "Sequence is $seq\n";
if(defined($FIRSTSEQ{$seq}))
{
print "Key match with $FIRSTSEQ{$seq}\n";
$dup++;
}
else
{
$FIRSTSEQ{$seq} = $rawid;
$total++;
}
}
while ( ($key, $value) = each(%FIRSTSEQ) )
{
print OUTIE ">$value\n";
print OUTIE "$key\n\n";
$total1++;
}
print "\n$total unduplicated sequences in the file\n";
print "$dup duplicated sequences in the file\n";
print "$total1 unique sequences printed out.\n";
close(OUTIE);
这个方法在fasta文件数量少、序列少的情况下比较简单好用,但是序列数多了以后就会贼慢,让人受不了,亲测10G的fasta整整跑了四个多小时,吐血。
方法二
大量的序列,必须通过多线程的编程来提高效率,下面提供了一个perl代码,是上面的多线程版本。
用法
perl p-rmfastadup.pl file1.fasta file2.fasta <...>
最后会生成nodup.fasta文件。
另外,核心数可以自行修改,具体看代码中的注释吧
p-rmfastadup.pl的代码如下
#! /usr/bin/perl
#同时处理X个fasta文件,最终去除这些fasta文件中序列相同的(header可以不同,一概被认为是同一序列)序列,最终输出无重复的fasta文件:nodup.fasta
#usage: perl p-rmfastadup.pl file1.fasta file2.fasta <...>
use threads;
use threads::shared;
use Bio::Perl;
my %FIRSTSEQ : shared;
my $total : shared;
my $total1 : shared;
my $dup : shared;
my @files : shared;
@files = @ARGV;
# 创建线程,可以自行增减
my $t1 = threads->create(\&checkfasta);
sleep(0.1);
my $t2 = threads->create(\&checkfasta);
my $t3 = threads->create(\&checkfasta);
my $t4 = threads->create(\&checkfasta);
my $t5 = threads->create(\&checkfasta);
my $t6 = threads->create(\&checkfasta);
my $t7 = threads->create(\&checkfasta);
my $t8 = threads->create(\&checkfasta);
my $t9 = threads->create(\&checkfasta);
my $t10 = threads->create(\&checkfasta);
# 收割线程,需要配合上面的线程
$t1->join;
$t2->join;
$t3->join;
$t4->join;
$t5->join;
$t6->join;
$t7->join;
$t8->join;
$t9->join;
$t10->join;
sub checkfasta {
while (@files) {
my $infile1 = shift @files;
print $infile1."\n";
$infile1 = Bio::SeqIO -> new('-format'=>'Fasta',-file => $infile1);
#read in all fasta sequences
while ((my $seqobj = $infile1->next_seq()))
{
$rawid = $seqobj -> display_id;
$seq = $seqobj -> seq;
#print "ID is $rawid\n";
#print "Sequence is $seq\n";
if(defined($FIRSTSEQ{$seq}))
{
print "Key match with $FIRSTSEQ{$seq}\n";
$dup++;
}
else
{
$FIRSTSEQ{$seq} = $rawid;
$total++;
}
}
}
}
#checkfasta();
open (OUTIE, ">nodup.fasta");
while ( ($key, $value) = each(%FIRSTSEQ) )
{
print OUTIE ">$value\n";
print OUTIE "$key\n\n";
$total1++;
}
print "$total1 unique sequences printed out.\n";
close(OUTIE);
2020.6.26更新
没事干造什么轮子,已经有人写好啦!用seqkit完美解决需求!