提取fasta文件特定ID序列,及其计算序列长度

参考:

https://www.biostars.org/p/127141/
快速计算fasta序列长度的方法


Tips:

有时候需要通过具体的ID,获取序列信息,及其计算序列长度.
(下面主要通过linux 相关命令来实现)

### 测试数据:
$ cat >test.fa
>blah_C1
ACTGTCTGTC
ACTGTGTTGTG
ATGTTGTGTGTG
>blah_C2
ACTTTATATATT
>blah_C3
ACTTATATATATATA
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT

$ cat >IDs.txt
blah_C4
blah_C5

Part1: 提取特定ID的序列

  1. You can do that with BBTools: 里面好多java 软件包,都被shell 封装好了.
$ ./filterbyname.sh 

Written by Brian Bushnell
Last modified September 1, 2016

Description:  Filters reads by name.

Usage:  filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>

in2 and out2 are for paired reads and are optional.
If input is paired and there is only one output file, it will be written interleaved.
Important!  Leading > and @ symbols are NOT part of sequence names;  they are part of
the fasta, fastq, and sam specifications.  Therefore, this is correct:
names=e.coli_K12
And these are incorrect:
names=>e.coli_K12
names=@e.coli_K12

Parameters:
include=f       Set to 'true' to include the filtered names rather than excluding them.
substring=f     Allow one name to be a substring of the other, rather than a full match.
                   f: No substring matching.
                   t: Bidirectional substring matching.
                   header: Allow input read headers to be substrings of names in list.
                   name: Allow names in list to be substrings of input read headers.
prefix=f        Allow names to match read header prefixes.
case=t          (casesensitive) Match case also.
ow=t            (overwrite) Overwrites files that already exist.
app=f           (append) Append to files that already exist.
zl=4            (ziplevel) Set compression level, 1 (low) to 9 (max).
int=f           (interleaved) Determines whether INPUT file is considered interleaved.
names=          A list of strings or files.  The files can have one name per line, or
                be a standard read file (fasta, fastq, or sam).
minlen=0        Do not output reads shorter than this.
ths=f           (truncateheadersymbol) Ignore a leading @ or > symbol in the names file.
tws=f           (truncatewhitespace) Ignore leading or trailing whitespace in the names file.
truncate=f      Set both ths and tws at the same time.

Positional parameters:
These optionally allow you to output only a portion of a sequence.  Zero-based, inclusive.
Intended for a single sequence and include=t mode.
from=-1         Only print bases starting at this position.
to=-1           Only print bases up to this position.
range=          Set from and to with a single flag.


Java Parameters:
-Xmx            This will set Java's memory usage, overriding autodetection.
                -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.
                    The max is typically 85% of physical memory.
-eoom           This flag will cause the process to exit if an out-of-memory
                exception occurs.  Requires Java 8u92+.
-da             Disable assertions.

To read from stdin, set 'in=stdin'.  The format should be specified with an extension, like 'in=stdin.fq.gz'
To write to stdout, set 'out=stdout'.  The format should be specified with an extension, like 'out=stdout.fasta'

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

太长了,直接看例子:

### 程序运行:
$ ./filterbyname.sh  in=test.fa names=IDs.txt out=out.fa include
/public/home/kcao/tools/bbmap//calcmem.sh: line 75: [: -v: unary operator expected
java -ea -Xmx13206m -cp /public/home/kcao/tools/bbmap/current/ driver.FilterReadsByName in=test.fa names=IDs.txt out=out.fa include
Executing driver.FilterReadsByName [in=test.fa, names=IDs.txt, out=out.fa, include]

Input is being processed as unpaired
Time:                           0.098 seconds.
Reads Processed:           5    0.05k reads/sec
Bases Processed:          87    0.00m bases/sec
Reads Out:          2
Bases Out:          27
### 查看结果:
[01:21:30] kcao@login:~/tools/bbmap
$ cat out.fa 
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT

2.1 awk 也可以搞定

$ awk 'NR==1{printf $0"\t";next}{printf /^>/ ? "\n"$0"\t" : $0}' test.fa | awk -F"\t" 'BEGIN{while((getline k < "IDs.txt")>0)i[k]=1}{gsub("^>","",$0); if(i[$1]){print ">"$1"\n"$2}}'
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT

2.2 不完善版本:

$ awk -v RS='>' 'match($0, "blah_C1"){print ">"$0}' test.fa 
>blah_C1
ACTGTCTGTC
ACTGTGTTGTG
ATGTTGTGTGTG

3.Bioawk 也可以

$ bioawk -cfastx 'BEGIN{while((getline k <"IDs.txt")>0)i[k]=1}{if(i[$name])print ">"$name"\n"$seq}' test.fa
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT

4.UCGS utilities

$ ./faSomeRecords main.fasta id.txt output.fa

option -exclude will output sequences not present in "main.fasta"

  1. python(不完善:每次提取一个ID)
$ cat >python_extract_ID.py
import sys
name = sys.argv[2]
flag = 0
with open(sys.argv[1]) as f:
    for line in f:
        line = line.strip()
        if line[0] == '>':
            if line.split()[0][1:] == name:
                flag = 1
                print(line)
            else :
                flag = 0
        else:
            if flag == 0:
                pass
            else:
                print(line)

$ python python_extract_ID.py test.fa "blah_C1"
>blah_C1
ACTGTCTGTC
ACTGTGTTGTG
ATGTTGTGTGTG
小结:个人推荐awk 那个代码,修改两个参数就可以跑。

Part2 : 计算序列长度

  • awk
$ awk '/^>/&&NR>1{print "";}{ printf "%s",/^>/ ? $0" ":$0 }'  test.fa 
>blah_C1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
>blah_C2 ACTTTATATATT
>blah_C3 ACTTATATATATATA
>blah_C4 ACTTATATATATATA
>blah_C5 ACTTTATATATT

awk 里面length 函数.

$ awk '/^>/&&NR>1{print "";}{ printf "%s",/^>/ ? $0" ":$0 }'  test.fa | awk '{print $1"\t"length($2)}'
>blah_C1    33
>blah_C2    12
>blah_C3    15
>blah_C4    15
>blah_C5    12

欢迎评论补充~

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 218,546评论 6 507
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 93,224评论 3 395
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 164,911评论 0 354
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 58,737评论 1 294
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 67,753评论 6 392
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 51,598评论 1 305
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 40,338评论 3 418
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 39,249评论 0 276
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 45,696评论 1 314
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,888评论 3 336
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 40,013评论 1 348
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,731评论 5 346
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 41,348评论 3 330
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,929评论 0 22
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 33,048评论 1 270
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 48,203评论 3 370
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,960评论 2 355

推荐阅读更多精彩内容