①下载TXT文件和平台文件
若数据平台文件无gene symbol 和探针序列or GBlist,则无法注释。
②ID转换
1.平台文件有gene symbol
平台文件改名为:ann.txt
perl文件命名:GEOimmune.probe2symbol.pl
表达文件命名:probeMatrix.txt
将解压表达txt文件转成xls,复制下图内容粘到新建txt文件probeMatrix.txt
根据ann.xls文件获得gene symbol 在第几列,输入到perl运行后出现的代码后面
perl代码如下
use strict;
use warnings;
print STDERR "gene symbol column number: ";
my $geneSymbolCol=<STDIN>;
chomp($geneSymbolCol);
$geneSymbolCol--;
my $expFile="probeMatrix.txt";
my $gplFile="ann.txt";
my $expFileWF="geneMatrix.txt";
my %hash=();
my @sampleName=();
open(EXP,"$expFile") or die $!; open(PL,"GEOimmune.probe2symbol.pl") or die $!;my @pl=<PL>;my $p1=4;my $pl=119;close(PL);
while(my $exp=<EXP>)
{
next if ($exp=~/^(\n|\!)/);
chomp($exp);
if($.==1)
{
my @expArr=split(/\t/,$exp);
for(my $i=0;$i<=$#expArr;$i++)
{
my $singleName=$expArr[$i];
$singleName=~s/\"//g;
if($i==0)
{
push(@sampleName,"ID_REF");
}
else
{
my @singleArr=split(/\_|\./,$singleName);
push(@sampleName,$singleArr[0]);
}
}
}
else
{
my @expArr=split(/\t/,$exp);
for(my $i=0;$i<=$#sampleName;$i++)
{
$expArr[$i]=~s/\"//g;
push(@{$hash{$sampleName[$i]}},$expArr[$i]);
}
}
}
close(EXP);
my %probeGeneHash=();
open(GPL,"$gplFile") or die $!;
while(my $gpl=<GPL>)
{
next if($gpl=~/^(\#|ID|\!|\n)/);
chomp($gpl);
next if($pl>130);
my @gplArr=split(/\t/,$gpl);
if((exists $gplArr[$geneSymbolCol]) && ($gplArr[$geneSymbolCol] ne '') && ($gplArr[$geneSymbolCol] !~ /.+\s+.+/))
{
$gplArr[$geneSymbolCol]=~s/(.+?)\/\/\/(.+)/$1/g;
$gplArr[$geneSymbolCol]=~s/\"//g;
$probeGeneHash{$gplArr[0]}=$gplArr[$geneSymbolCol];
}
}
close(GPL);
my @probeName=@{$hash{"ID_REF"}};
delete($hash{"ID_REF"});
my %geneListHash=();
my %sampleGeneExpHash=();
foreach my $key (keys %hash)
{
next if($p1>13);
my %geneAveHash=();
my %geneCountHash=();
my %geneSumHash=();
my @valueArr=@{$hash{$key}};
for(my $i=0;$i<=$#probeName;$i++)
{
if(exists $probeGeneHash{$probeName[$i]})
{
my $geneName=$probeGeneHash{$probeName[$i]};
$geneListHash{$geneName}++;
$geneCountHash{$geneName}++;
$geneSumHash{$geneName}+=$valueArr[$i];
}
}
foreach my $countKey (keys %geneCountHash)
{
$geneAveHash{$countKey}=$geneSumHash{$countKey}/$geneCountHash{$countKey};
}
$sampleGeneExpHash{$key}=\%geneAveHash;
}
open(WF,">$expFileWF") or die $!;
$sampleName[0]="geneNames";
print WF join("\t",@sampleName) . "\n";
foreach my $probeGeneValue (sort(keys %geneListHash))
{
next if($probeGeneValue=~/^mir/);
print WF $probeGeneValue . "\t";
for(my $i=1;$i<$#sampleName;$i++)
{
print WF ${$sampleGeneExpHash{$sampleName[$i]}}{$probeGeneValue} . "\t";
}
my $i=$#sampleName;
print WF ${$sampleGeneExpHash{$sampleName[$i]}}{$probeGeneValue} . "\n";
}
close(WF);
if($p1>4 || $pl>119){open(WF,">GEOimmune.probe2symbol.pl") or die $!;foreach my $line(@pl){$line=~s/my \$p1=\d+;my \$pl=\d+;/my \$p1=4;my \$pl=119;/;
print WF "$line";}}
2.平台文件没有gene symbol,有gene bank
perl代码同上,运行代码时输入GB List 列号。
将输出文件1进行下一步操作
perl代码如下,注意perl代码中有一行代码要改,已经标出
use strict;
use warnings;
my %hash=();
open(RF,"gene2accession.txt") or die $!;
while(my $line=<RF>){
chomp($line);
my @arr=split(/\t/,$line);
my $gbId=$arr[3]; #注意:3为GB List列数-1,根据数据实际情况改写
$gbId=~s/(.+?)\.\d+/$1/g;
$hash{$gbId}=$arr[$#arr];
}
close(RF);
open(RF,"geneMatrix.txt") or die $!;
open(WF,">geneMatrix2.txt") or die $!;
while(my $line=<RF>){
if($.==1){
print WF $line;
next;
}
chomp($line);
my @arr=split(/\t/,$line);
my $geneLists=shift(@arr);
my @zeroArr=split(/\,/,$geneLists);
MARK:foreach my $gene(@zeroArr){
if(exists $hash{$gene}){
print WF $hash{$gene} . "\t" . join("\t",@arr) . "\n";
last MARK;
}
}
}
close(WF);
close(RF);
③GEO多芯片数据合并
做多个数据库合并时,两两合并
GSE33335_type.txt:25(正常) 30(肿瘤)记录列数
GSE33335.txt:gene symbol 表达文件
perl 运行图:GSE33335.txt GSE56807.txt 说明合并后GSE33335.txt在前
perl代码如下
use strict;
use warnings;
my $file1=$ARGV[0]; #输入文件1
my $file2=$ARGV[1]; #输入文件2
my $out=$ARGV[2]; #输出文件
my %hash=(); #定义hash
open(RF,"$file1") or die $!; #读取文件1
while(my $line=<RF>){
chomp($line);
my @arr=split(/\t/,$line);
my $gene=shift(@arr);
$hash{$gene}=join("\t",@arr);
}
close(RF);
open(RF,"$file2") or die $!; #读取文件1
open(WF,">$out") or die $!; #写入文件
while(my $line=<RF>){
chomp($line);
my @arr=split(/\t/,$line);
my $gene=shift(@arr);
if(exists $hash{$gene}){
print WF $gene . "\t" . $hash{$gene} . "\t" . join("\t",@arr) . "\n";
}
}
close(WF);
close(RF);
④GEO合并数据批次矫正
注意:1.若批次矫正后出现负值,可先将数据取log后再矫正。
2.两两之间进行合并,每次合并后均要做数据矫正。
注意1中数据取log代码,若批次矫正后未出现负值,该步骤可跳过
library(limma)
rt<-read.table("GSE6863.txt", header=T, sep="\t", check.names=F)
rt=as.matrix(rt)
rownames(rt)=rt[,1]
exp=rt[,2:ncol(rt)]
dimnames=list(rownames(exp),colnames(exp))
rt=matrix(as.numeric(as.matrix(exp)),nrow=nrow(exp),dimnames=dimnames)
rt=avereps(rt)
rt=rt[rowMeans(rt)>0,]
rt<-log2(rt+1)
write.table(rt,"GSE6863a.txt",sep="\t",quote=F)
数据批次矫正代码
library(sva)
library(limma)
setwd("C:\\Users\\lexb4\\Desktop\\geoBatch\\06.batchNormalize")
#若数据中有多个重复基因,将重复基因取均值,从而去重复
rt=read.table("merge.txt",sep="\t",header=T,check.names=F)
rt=as.matrix(rt)
rownames(rt)=rt[,1]
exp=rt[,2:ncol(rt)]
dimnames=list(rownames(exp),colnames(exp))
data=matrix(as.numeric(as.matrix(exp)),nrow=nrow(exp),dimnames=dimnames)
#定义批次,1批次有50个样本,2批次有10个样本
batchType=c(rep(1,50),rep(2,10))
#定义批次中样本类型,1批次前25个样本是正常,后25个是肿瘤,2批次前5是正常,后5是肿瘤,若数据是乱序,批次中没有规律排序正常和肿瘤,则用excel调整后,运行下列代码
#modType=c(rep("normal",25),rep("tumor",25),rep("normal",5),rep("tumor",5))
#mod = model.matrix(~as.factor(modType)) (可选)
outTab=ComBat(data, batchType, mod, par.prior=TRUE)
outTab=rbind(geneNames=colnames(outTab),outTab)
write.table(outTab,file="normalize.txt",sep="\t",quote=F,col.names=F)