软件——scisorseqr:https://github.com/noush-joglekar/scisorseqr
前期工作准备:
1.二代数据跑完cellranger,得到barcode、feature、matrix文件。输入到seurat进行数据质控、细胞分群、细胞类型注释
2.如果没有二代数据,只有三代数据,也可以跑sicelore、Isosceles等软件。得到[isoform x cell] matrix后输入到seurat。
安装
软件依赖:
- STARlong for PacBio reads
- Minimap2 for Oxford Nanopore (or PacBio) reads
- samtools
- bedtools
- python version >=3.7
- r: devtools,Seurat,tidyr,dplyr
devtools::install_github('noush-joglekar/scisorseqr',build_vignettes = FALSE)
# 教程教的是build_vignettes = TRUE,但是一般都是在linux下跑这个软件,build_vignettes需要打开浏览器。就很麻烦。github里面自带教程,可以直接下,没必要再生成
# https://github.com/noush-joglekar/scisorseqr/blob/master/vignettes/scisorseqr-StandardWorkflow.pdf
# 如果你在windows的R下载的话,可以选择build_vignettes = TRUE
# 但有可能会报错:pdflatex: command not found
报错的话安装以下:
install.packages("tinytex")
tinytex::install_tinytex()
虽然有教程,但是还是有点简略。建立起自己的认识,还是写一写。预先告知,我跑到第四步就跑不下去了,后面几步我也不知道会遇到什么报错结果orz
第一步: Barcode deconvolution
library(Seurat)
library(tidyr)
library(dplyr)
library(scisorseqr)
# 读入前期细胞注释好的seurat object
sce.all<-readRDS("sce.all.final.cellname.rds")
# 获取barcode 和细胞类型对应的表
bc <- data.frame("Barcode"=rownames(sce.all@meta.data),
"Celltype"=sce.all@meta.data$seurat_annotations) %>%
separate(Barcode,into=c("BC","suffix"),sep="-") %>%
select(-suffix) %>% as.data.frame()
# 保存一下
write.table(bc, file = "bc_celltype_assignments.txt", sep="\t", row.names = FALSE,
col.names = FALSE, quote = FALSE)
head(bc)
> BC Celltype
> AAACCTGAGAATGTGT Pro-B_cell_CD34+
> AAACCTGAGATGGGTC Pro-B_cell_CD34+
> AAACCTGAGCAGGTCA Pro-B_cell_CD34+
> AAACCTGAGGCTCATT Pro-B_cell_CD34+
> AAACCTGAGTCAAGCG Monocytes
> AAACCTGCAAAGGAAG Pro-B_cell_CD34+
bc_clust <- c("scisorseqr/bc_celltype_assignments.txt")
fastqFolder <- c("scisorseqr/fastqFile/OC_SRR25278608/")
########### 非常多的坑啊啊啊啊啊 ############
有可能有些数据的barcode是普通单细胞数据,用的是v3,也有的是单细胞ATAC-seq,用的是ARC-v1。
1. 有哪些barcode?如下
- https://kb.10xgenomics.com/hc/en-us/articles/115004506263-What-is-a-barcode-whitelist
2. 我下的是公共数据,我不知道它用的是什么barcode。
- cellranger识别barcode的时候默认是“auto”自动识别。如果成功跑下来没报错,那么会输出一个sample/out/web_summary.html的文件,里面就有说这个文件用的是什么barcode集
3. fqFolder:fastq.gz所在的“文件夹”路径(最后面要加上“/”,不然也会报错!),而不是文件夹里的fastq.gz文件。
- 文件夹里只能有一个fastq,且和文件夹同名。e.g.文件夹:ABC;文件:ABC.fastq.gz
- 否则就会报错“cat: 'getBarcodeOutput/OutputRaw//DeconvBC*.csv': No such file or directory”
4. bc_clust:上一步得到的bc_celltype_assignments.txt所在路径,而不是读入文件的data.frame
- 否则就会报错
“cat: './OutputRaw//tmp*': No such file or directory
rm: cannot remove './OutputRaw//tmp*': No such file or directory
awk: cmd. line:7: fatal: division by zero attempted
Warning message:
In system(run_bc_deConv) : error in running command”
5. 如果你有多个文件(replicate)只能一个一个处理,所以可以写个for循环....
################# 非常多的坑啊啊啊啊啊 ############
## run command
GetBarcodes(fqFolder = fastqFolder, BCClustAssignFile = bc_clust,
chemistry = "v3", concatenate = FALSE, filterReads = FALSE)
GetBarcodes(fqFolder = fastqFolder, BCClustAssignFile = bc_clust,
chemistry = "ARC-v1", concatenate = FALSE, filterReads = FALSE)
第二步: Align your fastq files to the reference
这个我自己跑minimap2了,因为还有其他东西要跑,所以之前就跑好了这个文件。
软件里跑的minimap2参数是-ax splice --secondary=no ,供参考
软件跑minimap2的结果放在MMoutput的文件夹下,里面有bam文件和REPORT.minimap2
并且软件这步还会创一个Misc的文件夹,放了fastqGuide和bamGuide两个文件。
- bamGuide:bam文件绝对路径
- fastqGuide:两列,第一列为fastq名,\t,第二列为fastq.gz绝对路径。
如果用自己之前就跑好的bam文件的话需要自行创好以上文件
MMalign(fqFolder = fastqFolder, mmProgPath = 'miniconda3/envs/scisorseqr/bin/minimap2',refGenome='genomes/hg38.fa',numThreads = 16)
第三步: Map and filter for full-length, spliced reads
这里需要一个按染色体分解参考基因组的路径seqDir,可以用faidx解决
pip install pyfaidx
mkdir chr_fa
cd chr_fa
faidx -x PATH/TO/hg38.fa
gzip ./*
MapAndFilter('LRoutput',annoGZ='database/hg38.gtf.gz',numThreads=24,seqDir="chr_fa",genomeVersion="hg38")
报错:
gzip: LRoutput/mapping.bestperRead.RNAdirection.withConsensIntrons.introns.gff.gz: No such file or directory
gzip: LRoutput/mapping.bestperRead.RNAdirection.withConsensIntrons.gff.gz: No such file or directory
gzip: LRoutput/newIsoforms_vs_Anno_ignoreAnno/stretches: No such file or directory
mv: cannot stat 'tmpDir/mapping.bestperRead.bam': No such file or directory
原因:
所在文件夹和各文件夹位置错了。脚本是按相对路径寻找文件,所以各文件的结构非常重要。需要调整文件夹成如下结构,bamGuide内容也需要改动:
并且这一步需要在mmalign目录下跑
|-- test_scisorseqr
| |-- bc_celltype_assignments
| |-- test_scisorseqr.fastq.gz
| |-- output
| | | |-- OutputFiltered
| | | | |-- FilteredDeconvBC_AllFiles.csv
| | | |-- OutputRaw
| | | | |-- AllFiles_Summary
| | | | |-- DeconvBC_TestHipp.csv
| | | | |-- DeconvBC_TestHipp_summary
| | | | |-- TooShort.csv
| | | | |-- DeconvBC_TestPFC.csv
| | | | |-- DeconvBC_TestPFC_summary
| |-- mmalign
| | |-- MMoutput
| | | |-- REPORT.minimap
| | | |-- TestHipp.bam
| | | |-- TestPFC.bam
| | |-- Misc
| | | |-- bamGuide
| | | |-- fastqGuide
以下是我的输出,供参考(这步跑好之后最好把LRoutput备份一份,免得后面又报错,会把里面的文件给删掉........又得重新跑)
bedtools found in path
samtools found in path
#################################
# RUNNING [miniconda3/envs/scisorseqr/lib/R/library/scisorseqr/bash/mapAndAlignReads.sh]
# Current date: Fri Nov 15 11:12:58 AM CST 2024
# Current dir: mmalign
+++++++++++++++++++++++++ 1. arguments
fastqGuide=Misc/fastqGuide
outdir=LRoutput
tmpDir=tmpDir
numThreads=24
seqDirectory=database/chr_fa
annoGZ=database/hg38.gtf.gz
mappingBAMGuide=Misc/bamGuide
scriptDir=miniconda3/envs/scisorseqr/lib/R/library/scisorseqr/bash
genome + version=hg38
++++++++++++++++++ 1b. deduced from arguments
++++++++++++++++++ 1c. input check
[bam_merge] File 'tmpDir/mapping.bam' exists. Please apply '-f' to overwrite. Abort.
+++++++++++++++++++++ 2b. sorting annotation and projection
+++++++++++++++++++++ 2b1 sorting annotation
real 1m6.919s
user 0m32.154s
sys 0m4.844s
++++++++++++++++++ 2c. how many were well mapped
+++++++++++++++ 2c1 finding best hits
wellMapped: mmalign/MMoutput/Hipp_SN_f1_ONT_Run1.bam 20993299 867037 274931 2532303 0.886197
+++++++++++++++ 2c2 sam to bam conversion
+++++++++++++++++++++ 2d. removing ribosomal RNA hits
++++++++++++++++++ 2d1. finding ribosomal RNA hits
++++++++++++++++++ 2d2. removing ribosomal RNA hits
++++++++++++++++++++++++ 3. further analysis
+++++++++++++++++++++ 3a. checking consensus
++++++++++++++++++ 3a1. getting introns and exons in gff format
real 14m32.392s
user 11m23.011s
sys 0m35.452s
real 17m56.608s
user 10m38.717s
sys 0m34.499s
++++++++++++++++++ 3a2. getting dinucleotides at intron borders
++++++++++++++++++ 3a2.a. preparing commands for parallel execution
database/chr_fa/chr1.fa.gz
database/chr_fa/chr10.fa.gz
......
database/chr_fa/chrY.fa.gz
+++++++++++++++ 3a2b execution
## 0. params
# 0a. read params
# 0b. print params
commandFile=LRoutput/parallel.comp.anno.guide.3.siteSeq
n=24
## 1. reading organizational data
## 2. execution
real 3m8.427s
user 45m43.905s
sys 7m17.967s
+++++++++++++++ 3a2c collecting results and removing temporary files
cat tmpDir/site_sequence.chr1 tmpDir/site_sequence.chr10 tmpDir/site_sequence.chr10_GL383545v1_alt tmpDir/site_sequence.chr10_KI270824v1_alt tmpDir/site_sequence.chr10_KQ090021v1_fix ......
tmpDir/site_sequence.chr10_ML143354v1_fix tmpDir/site_sequence.chr10_ML143355v1_fix
tmpDir/site_sequence.chrY
real 6m40.000s
user 6m20.356s
sys 0m4.443s
+++++++++++++++++++++ 3b. strand correction and follow up analysis
++++++++++++++++++ 3b1. getting mapping.bestperRead.noRiboRNA.bam in gff format
real 12m29.409s
user 11m48.126s
sys 0m27.614s
real 4m15.684s
user 3m26.646s
sys 0m3.599s
awk: cmd. line:1: warning: regexp escape sequence `\"' is not a known regexp operator
++++++++++++++++++ 3b3. getting introns
awk: cmd. line:2: warning: regexp escape sequence `\@' is not a known regexp operator
real 7m58.682s
user 6m28.702s
sys 0m23.712s
++++++++++++++++++ 3b4. getting genes
### executing pB.getGenes.awk
## A. parsing file2=LRoutput/mapping.bestperRead.RNAdirection.withConsensIntrons.introns.gff
## B. parsing annotation:
## C. going over all annotated transcripts:
## D. counting the number of splice sites per gene:
## E. checking whether a read has equal numbers of splice sites with multiple genes:
real 85m17.427s
user 84m36.266s
sys 1m16.844s
+++++++++++++++++++++++++ 4. zipping what has not be zipped before
+++++++++++++++++++++++++ 5. new isoforms
++++++++++++++++++++++ 5a. annotation
real 1m35.860s
user 1m39.012s
sys 0m1.997s
++++++++++++++++++++++ 5b. RNAseq
real 52m13.298s
user 56m2.148s
sys 0m50.638s
real 5m57.071s
user 5m20.826s
sys 0m14.671s
++++++++++++++++++++++ 5c. comparison
## 0. preliminaries
## 1. annotation: assuming non-zipped, genomically ordered gff in the format Nicholas provided
## 1a. parsing
## 1b. getting all the stretches
## 2. prediction: assuming non-zipped, genomically ordered gff in the format Nicholas provided
## 2a. parsing
## 2b. are the RNAseq stretches annotated ?
+++++++++++++++++++++++++ 6. done
第四步:Concatenate all the information collected so far
目前还是在mmalign目录下
# 整理所有结果
InfoPerLongRead(barcodeOutputFile =bc_clust,
mapAndFilterOut = 'LRoutput/',
minTimesIsoObserve = 5, rmTmpFolder = FALSE)
如果这里的rmTmpFolder设置成TRUE会报错...我也不知道为什么
在这步我没有输出任何信息。但是也没有报错。有可能是因为我输入的fastq.gz是sce.obj的一部分细胞数据,而不是全部。这个我实在解决不了了,只能终止到这步了。
第五步:Quantify the various isoforms observed in your data
IsoQuant(AllInfoFile = 'LongReadInfo/AllInfo',Iso = TRUE)
第六步: Differential isoform analysis
contig文件内容是想比较的细胞类型,如下:
Comparison1Celltypes | Comparison2Celltypes | ||
---|---|---|---|
HippNeuron | Hipp_ExciteNeuron,Hipp_InhibNeuron,Hipp_GranuleNB | PFCNeuron | PFC_ExciteNeuron,PFC_InhibNeuron |
HippNonNeuron | Hipp_Astro,Hipp_ChorPlex,Hipp_Ependymal,Hipp_Myeloid,Hipp_NIPCs,Hipp_Oligo,Hipp_RGL,Hipp_Vasc | PFCNonNeuron | PFC_Astro,PFC_Oligo,PFC_Myeloid,PFC_Vasc |
HippExcite | Hipp_ExciteNeuron | PFCExcite | PFC_ExciteNeuron |
HippInhib | Hipp_InhibNeuron | PFCInhib | PFC_InhibNeuron |
# 差异分析
config <- c("userInput/config") # 仍然是文件路径
DiffSplicingAnalysis(configFile = config,typeOfTest = 'Iso')
# Iso表示 Full-length isoform
# 这里typeOfTest还可以选择TSS、PolyA、Exon(这个需要跑5b的步骤,具体看说明书)
第七步: Visualizations
该函数绘制了一个三角形热图,表示一个目录中两两比较的DIE基因百分比。condensedCellTypes 是细胞类型list,如下所示:
P7Hipp_Astro
P7Hipp_ChorPlex
P7Hipp_DivOligo
P7Hipp_Ependymal
P7Hipp_ExcitNeuron
P7Hipp_GranuleNB
P7Hipp_InhibNeuron
P7Hipp_Myeloid
P7Hipp_NIPCs
P7Hipp_Oligo
P7Hipp_RGL
P7Hipp_Vasc
以上的cell type要对应于AllInfo文件里的 barcode-celltype list。如果输入的cell type类型没有计算成对DIE,则它将输出全零(比如这里的DivOligo)
# 可视化
condensedCellTypes =c("userInput/condensedCellTypes") #文件路径
triHeatmap(treeDir = 'TreeTraversal_Iso/',
comparisonList = condensedCellTypes,
typeOfTest = "Iso",
outName = condensedCellTypes)