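This comes up in many situations. For example, suppose I want to pull out the gene ID recorded under previous_id in the annotation attributes. At first glance it looks as if the cut command could slice it out piece by piece, but previous_id does not always sit right next to the ID field, and the ID recorded under derived_from has to be extracted as well. At that point regular-expression matching is the only real option.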
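To make the problem concrete, here is a minimal Python sketch. The attribute strings are made up for illustration (the real file's fields may differ), and the helper names `second_field` and `old_ids` are hypothetical. It contrasts a position-based, cut-style split with a pattern match that does not care where the old ID appears.

```python
import re

# Made-up attribute strings (GFF3 column 9); the real file's fields may differ.
attributes = [
    "ID=TraesCS1A02G000100;previous_id=TraesCS1A01G000100;primconf=HC",
    "ID=TraesCS1A02G000200;primconf=HC;previous_id=TraesCS1A01G000200",
    "ID=TraesCS1A02G000300.1;Parent=TraesCS1A02G000300;derived_from=TraesCS1A01G000300",
]

# Old-style (01G) gene IDs only.
OLD_ID = re.compile(r"TraesCS[1-7][ABD]01G[0-9]{6}[CHL]{0,2}")


def second_field(attribute_field):
    """What a cut-style approach grabs: whatever sits in the second slot."""
    return attribute_field.split(";")[1]


def old_ids(attribute_field):
    """Every old-style ID in the field, regardless of its position or key."""
    return OLD_ID.findall(attribute_field)


for field in attributes:
    print(second_field(field), old_ids(field), sep="\t")
```

Only the first string has previous_id in the second slot, so the positional split returns unrelated fields for the other two, while the pattern match finds the old ID in all three.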
Perl
$ cat test.pl
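#!/usr/bin/perl
use warnings;
use strict;

my $file_name = $ARGV[0];
open my $in_fh, "<", $file_name or die "Cannot open $file_name: $!";
while (<$in_fh>) {
    chomp;
    # Column 9 (index 8) of a GFF3 line holds the attributes.
    my @one_line = split /\t/, $_;
    # In list context with /g, the match returns every matched substring.
    my @old_name = ( $one_line[8] =~ /TraesCS[1-7][A-D]01G[0-9]{6}[CHL]{0,2}/g );
    print "@old_name\n";
}
close $in_fh;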
$ perl test.pl test.gff3 | head -n 5
TraesCS1A01G000100
TraesCS1A01G000200
TraesCS1A01G000300
TraesCS1A01G000400
TraesCS1A01G000500
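The line to pay attention to is: my @old_name = ( $one_line[8] =~ /TraesCS[1-7][A-D]01G[0-9]{6}[CHL]{0,2}/g );
My current understanding is that the result has to be received into an array and the /g modifier has to be present, whether the pattern matches once or several times. The reason is that in list context a match with /g and no capture groups returns all matched substrings, whereas without /g and without capturing parentheses it only returns 1 on success.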
R
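In R this is straightforward with the stringr package.
library(tidyverse)
a <- read.table("test.gff3",header = F,sep = "\t")
b <- as.character(a$V9)
# Extract the first match only
c <- str_extract(b,"TraesCS[1-7][ABD]01G[0-9]{6}[CHL]{0,2}")
# Extract all matches
# Return every match as a matrix, padding each row to the same length with "" (empty strings) where there are fewer matches
d <- str_extract_all(b,"TraesCS[1-7][ABD]0[12]G[0-9]{6}[CHL]{0,2}",simplify = T) # the regex is tweaked slightly here to demonstrate a line matching more than once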
> head(c)
[1] "TraesCS1A01G000100" "TraesCS1A01G000200" "TraesCS1A01G000300" "TraesCS1A01G000400" "TraesCS1A01G000500" "TraesCS1A01G000600"
> head(d)
     [,1]                 [,2]                 [,3]
[1,] "TraesCS1A02G000100" "TraesCS1A01G000100" "TraesCS1A02G000100"
[2,] "TraesCS1A02G000200" "TraesCS1A01G000200" "TraesCS1A02G000200"
[3,] "TraesCS1A02G000300" "TraesCS1A01G000300" "TraesCS1A02G000300"
[4,] "TraesCS1A02G000400" "TraesCS1A01G000400" "TraesCS1A02G000400"
[5,] "TraesCS1A02G000500" "TraesCS1A01G000500" "TraesCS1A02G000500"
[6,] "TraesCS1A02G000600" "TraesCS1A01G000600" "TraesCS1A02G000600"
Python
$ cat test.py
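import re

pattern = re.compile(r"TraesCS[1-7][A-D]0[12]G[0-9]{6}[CHL]{0,2}")

with open("./test.gff3") as gff:
    for line in gff:
        # findall returns every non-overlapping match on the line
        matches = pattern.findall(line)
        for gene_id in matches:
            print(gene_id, end="\t")
        print()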
$ python3 test.py
TraesCS1A02G000100 TraesCS1A01G000100 TraesCS1A01G000200 TraesCS1A02G000100
TraesCS1A02G000200 TraesCS1A01G000200 TraesCS1A02G000200
TraesCS1A02G000300 TraesCS1A01G000300 TraesCS1A02G000300
TraesCS1A02G000400 TraesCS1A01G000400 TraesCS1A02G000400
TraesCS1A02G000500 TraesCS1A01G000500 TraesCS1A02G000500
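For comparison with the two R calls, here is a rough Python sketch of the same pair of operations: a first-match lookup with re.search and an all-matches table padded with empty strings, similar in spirit to simplify = T. The function names `extract_first` and `extract_all_padded` and the sample strings are made up for illustration. One difference from stringr is that a failed lookup returns None here rather than NA.

```python
import re

GENE_ID = re.compile(r"TraesCS[1-7][ABD]0[12]G[0-9]{6}[CHL]{0,2}")


def extract_first(text):
    """Return the first match, or None when nothing matches."""
    match = GENE_ID.search(text)
    return match.group(0) if match else None


def extract_all_padded(texts, fill=""):
    """Return one row of matches per input, padded with `fill` to equal length."""
    rows = [GENE_ID.findall(text) for text in texts]
    width = max((len(row) for row in rows), default=0)
    return [row + [fill] * (width - len(row)) for row in rows]


# Made-up attribute strings, for illustration only.
samples = [
    "ID=TraesCS1A02G000100;previous_id=TraesCS1A01G000100",
    "ID=TraesCS1A02G000200",
]
print(extract_first(samples[0]))
print(extract_all_padded(samples))
```

Padding every row to the same width keeps the result rectangular, which makes it easy to write out as a tab-separated table.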