单细胞长读长转录本异构体差异表达分析DIE

软件——scisorseqr：https://github.com/noush-joglekar/scisorseqr
前期工作准备：
1.二代数据跑完cellranger，得到barcode、feature、matrix文件。输入到seurat进行数据质控、细胞分群、细胞类型注释
2.如果没有二代数据，只有三代数据，也可以跑sicelore、Isosceles等软件。得到[isoform x cell] matrix后输入到seurat。

安装

软件依赖：

STARlong for PacBio reads
Minimap2 for Oxford Nanopore (or PacBio) reads
samtools
bedtools
python version >=3.7
r: devtools，Seurat，tidyr，dplyr

devtools::install_github('noush-joglekar/scisorseqr',build_vignettes = FALSE)
# 教程教的是build_vignettes = TRUE，但是一般都是在linux下跑这个软件，build_vignettes需要打开浏览器。就很麻烦。github里面自带教程，可以直接下，没必要再生成
# https://github.com/noush-joglekar/scisorseqr/blob/master/vignettes/scisorseqr-StandardWorkflow.pdf
# 如果你在windows的R下载的话，可以选择build_vignettes = TRUE
# 但有可能会报错：pdflatex: command not found
报错的话安装以下：
install.packages("tinytex")
tinytex::install_tinytex()

虽然有教程，但是还是有点简略。建立起自己的认识，还是写一写。预先告知，我跑到第四步就跑不下去了，后面几步我也不知道会遇到什么报错结果orz

第一步： Barcode deconvolution

library(Seurat)
library(tidyr)
library(dplyr)
library(scisorseqr)


# 读入前期细胞注释好的seurat object
sce.all<-readRDS("sce.all.final.cellname.rds")

# 获取barcode 和细胞类型对应的表
bc <- data.frame("Barcode"=rownames(sce.all@meta.data),
                 "Celltype"=sce.all@meta.data$seurat_annotations) %>%
                  separate(Barcode,into=c("BC","suffix"),sep="-") %>%
                  select(-suffix) %>% as.data.frame()
# 保存一下
write.table(bc, file = "bc_celltype_assignments.txt", sep="\t", row.names = FALSE,
            col.names = FALSE, quote = FALSE)
head(bc)
> BC         Celltype
> AAACCTGAGAATGTGT Pro-B_cell_CD34+
> AAACCTGAGATGGGTC Pro-B_cell_CD34+
> AAACCTGAGCAGGTCA Pro-B_cell_CD34+
> AAACCTGAGGCTCATT Pro-B_cell_CD34+
> AAACCTGAGTCAAGCG        Monocytes
> AAACCTGCAAAGGAAG Pro-B_cell_CD34+
   

bc_clust <- c("scisorseqr/bc_celltype_assignments.txt")
fastqFolder <- c("scisorseqr/fastqFile/OC_SRR25278608/")

########### 非常多的坑啊啊啊啊啊 ############
有可能有些数据的barcode是普通单细胞数据，用的是v3，也有的是单细胞ATAC-seq，用的是ARC-v1。
1. 有哪些barcode？如下
- https://kb.10xgenomics.com/hc/en-us/articles/115004506263-What-is-a-barcode-whitelist
2. 我下的是公共数据，我不知道它用的是什么barcode。
- cellranger识别barcode的时候默认是“auto”自动识别。如果成功跑下来没报错，那么会输出一个sample/out/web_summary.html的文件，里面就有说这个文件用的是什么barcode集
3. fqFolder：fastq.gz所在的“文件夹”路径(最后面要加上“/”，不然也会报错！)，而不是文件夹里的fastq.gz文件。
- 文件夹里只能有一个fastq，且和文件夹同名。e.g.文件夹：ABC；文件：ABC.fastq.gz
- 否则就会报错“cat: 'getBarcodeOutput/OutputRaw//DeconvBC*.csv': No such file or directory”
4. bc_clust：上一步得到的bc_celltype_assignments.txt所在路径，而不是读入文件的data.frame
- 否则就会报错
“cat: './OutputRaw//tmp*': No such file or directory
rm: cannot remove './OutputRaw//tmp*': No such file or directory
awk: cmd. line:7: fatal: division by zero attempted
Warning message:
In system(run_bc_deConv) : error in running command”
5. 如果你有多个文件（replicate）只能一个一个处理，所以可以写个for循环....
################# 非常多的坑啊啊啊啊啊 ############


## run command
GetBarcodes(fqFolder = fastqFolder, BCClustAssignFile = bc_clust,
            chemistry = "v3", concatenate = FALSE, filterReads = FALSE)
GetBarcodes(fqFolder = fastqFolder, BCClustAssignFile = bc_clust,
            chemistry = "ARC-v1", concatenate = FALSE, filterReads = FALSE)

第二步: Align your fastq files to the reference

这个我自己跑minimap2了，因为还有其他东西要跑，所以之前就跑好了这个文件。
软件里跑的minimap2参数是-ax splice --secondary=no ，供参考
软件跑minimap2的结果放在MMoutput的文件夹下，里面有bam文件和REPORT.minimap2
并且软件这步还会创一个Misc的文件夹，放了fastqGuide和bamGuide两个文件。

bamGuide：bam文件绝对路径
fastqGuide：两列，第一列为fastq名，\t，第二列为fastq.gz绝对路径。
如果用自己之前就跑好的bam文件的话需要自行创好以上文件

MMalign(fqFolder = fastqFolder, mmProgPath = 'miniconda3/envs/scisorseqr/bin/minimap2',refGenome='genomes/hg38.fa',numThreads = 16)

第三步： Map and filter for full-length, spliced reads

这里需要一个按染色体分解参考基因组的路径seqDir，可以用faidx解决

pip install pyfaidx
mkdir chr_fa
cd chr_fa
faidx -x PATH/TO/hg38.fa
gzip ./*

MapAndFilter('LRoutput',annoGZ='database/hg38.gtf.gz',numThreads=24,seqDir="chr_fa",genomeVersion="hg38")

报错：
gzip: LRoutput/mapping.bestperRead.RNAdirection.withConsensIntrons.introns.gff.gz: No such file or directory
gzip: LRoutput/mapping.bestperRead.RNAdirection.withConsensIntrons.gff.gz: No such file or directory
gzip: LRoutput/newIsoforms_vs_Anno_ignoreAnno/stretches: No such file or directory 
mv: cannot stat 'tmpDir/mapping.bestperRead.bam': No such file or directory
原因：
所在文件夹和各文件夹位置错了。脚本是按相对路径寻找文件，所以各文件的结构非常重要。需要调整文件夹成如下结构，bamGuide内容也需要改动：
并且这一步需要在mmalign目录下跑
|-- test_scisorseqr
|   |-- bc_celltype_assignments
|   |-- test_scisorseqr.fastq.gz
|   |-- output
|   |   |   |-- OutputFiltered
|   |   |   |   |-- FilteredDeconvBC_AllFiles.csv
|   |   |   |-- OutputRaw
|   |   |   |   |-- AllFiles_Summary
|   |   |   |   |-- DeconvBC_TestHipp.csv
|   |   |   |   |-- DeconvBC_TestHipp_summary
|   |   |   |   |-- TooShort.csv
|   |   |   |   |-- DeconvBC_TestPFC.csv
|   |   |   |   |-- DeconvBC_TestPFC_summary
|   |-- mmalign
|   |   |-- MMoutput
|   |   |   |-- REPORT.minimap
|   |   |   |-- TestHipp.bam
|   |   |   |-- TestPFC.bam
|   |   |-- Misc
|   |   |   |-- bamGuide
|   |   |   |-- fastqGuide

以下是我的输出，供参考（这步跑好之后最好把LRoutput备份一份，免得后面又报错，会把里面的文件给删掉........又得重新跑）

bedtools found in path
samtools found in path
#################################
# RUNNING [miniconda3/envs/scisorseqr/lib/R/library/scisorseqr/bash/mapAndAlignReads.sh]
# Current date: Fri Nov 15 11:12:58 AM CST 2024
# Current dir: mmalign
+++++++++++++++++++++++++ 1. arguments
fastqGuide=Misc/fastqGuide
outdir=LRoutput
tmpDir=tmpDir
numThreads=24
seqDirectory=database/chr_fa
annoGZ=database/hg38.gtf.gz
mappingBAMGuide=Misc/bamGuide
scriptDir=miniconda3/envs/scisorseqr/lib/R/library/scisorseqr/bash
genome + version=hg38
++++++++++++++++++ 1b. deduced from arguments
++++++++++++++++++ 1c. input check
[bam_merge] File 'tmpDir/mapping.bam' exists. Please apply '-f' to overwrite. Abort.
+++++++++++++++++++++ 2b. sorting annotation and projection
+++++++++++++++++++++ 2b1 sorting annotation
 
real    1m6.919s
user    0m32.154s
sys 0m4.844s
++++++++++++++++++ 2c. how many were well mapped
+++++++++++++++ 2c1 finding best hits
wellMapped: mmalign/MMoutput/Hipp_SN_f1_ONT_Run1.bam    20993299    867037  274931  2532303 0.886197
+++++++++++++++ 2c2 sam to bam conversion
+++++++++++++++++++++ 2d. removing ribosomal RNA hits
++++++++++++++++++ 2d1. finding ribosomal RNA hits
++++++++++++++++++ 2d2. removing ribosomal RNA hits
++++++++++++++++++++++++ 3. further analysis
+++++++++++++++++++++ 3a. checking consensus
++++++++++++++++++ 3a1. getting introns and exons in gff format
               
real    14m32.392s
user    11m23.011s
sys 0m35.452s
                  
real    17m56.608s
user    10m38.717s
sys 0m34.499s
++++++++++++++++++ 3a2. getting dinucleotides at intron borders
++++++++++++++++++ 3a2.a. preparing commands for parallel execution
 database/chr_fa/chr1.fa.gz
database/chr_fa/chr10.fa.gz
......
database/chr_fa/chrY.fa.gz
+++++++++++++++ 3a2b execution 
## 0. params
# 0a. read params
# 0b. print params
commandFile=LRoutput/parallel.comp.anno.guide.3.siteSeq
n=24
## 1. reading organizational data
## 2. execution
   
real    3m8.427s
user    45m43.905s
sys 7m17.967s
+++++++++++++++ 3a2c collecting results and removing temporary files
cat tmpDir/site_sequence.chr1 tmpDir/site_sequence.chr10 tmpDir/site_sequence.chr10_GL383545v1_alt tmpDir/site_sequence.chr10_KI270824v1_alt tmpDir/site_sequence.chr10_KQ090021v1_fix ......
tmpDir/site_sequence.chr10_ML143354v1_fix tmpDir/site_sequence.chr10_ML143355v1_fix 
tmpDir/site_sequence.chrY
       
real    6m40.000s
user    6m20.356s
sys 0m4.443s
+++++++++++++++++++++ 3b. strand correction and follow up analysis
++++++++++++++++++ 3b1. getting mapping.bestperRead.noRiboRNA.bam in gff format 
            
real    12m29.409s
user    11m48.126s
sys 0m27.614s
    
real    4m15.684s
user    3m26.646s
sys 0m3.599s
awk: cmd. line:1: warning: regexp escape sequence `\"' is not a known regexp operator
++++++++++++++++++ 3b3. getting introns
awk: cmd. line:2: warning: regexp escape sequence `\@' is not a known regexp operator
        
real    7m58.682s
user    6m28.702s
sys 0m23.712s
++++++++++++++++++ 3b4. getting genes
### executing pB.getGenes.awk
## A. parsing file2=LRoutput/mapping.bestperRead.RNAdirection.withConsensIntrons.introns.gff
## B. parsing annotation: 
## C. going over all annotated transcripts: 
## D. counting the number of splice sites per gene:
## E. checking whether a read has equal numbers of splice sites with multiple genes: 
   
real    85m17.427s
user    84m36.266s
sys 1m16.844s
+++++++++++++++++++++++++ 4. zipping what has not be zipped before
+++++++++++++++++++++++++ 5. new isoforms
++++++++++++++++++++++ 5a. annotation
 
real    1m35.860s
user    1m39.012s
sys 0m1.997s
++++++++++++++++++++++ 5b. RNAseq
                                                    
real    52m13.298s
user    56m2.148s
sys 0m50.638s
      
real    5m57.071s
user    5m20.826s
sys 0m14.671s
++++++++++++++++++++++ 5c. comparison
## 0. preliminaries
## 1. annotation: assuming non-zipped, genomically ordered gff in the format Nicholas provided
## 1a. parsing
## 1b. getting all the stretches
## 2. prediction: assuming non-zipped, genomically ordered gff in the format Nicholas provided
## 2a. parsing
## 2b. are the RNAseq stretches annotated ?
+++++++++++++++++++++++++ 6. done

第四步：Concatenate all the information collected so far

目前还是在mmalign目录下
# 整理所有结果
InfoPerLongRead(barcodeOutputFile =bc_clust,
                mapAndFilterOut = 'LRoutput/',
                minTimesIsoObserve = 5, rmTmpFolder = FALSE)
如果这里的rmTmpFolder设置成TRUE会报错...我也不知道为什么

在这步我没有输出任何信息。但是也没有报错。有可能是因为我输入的fastq.gz是sce.obj的一部分细胞数据，而不是全部。这个我实在解决不了了，只能终止到这步了。

第五步：Quantify the various isoforms observed in your data

IsoQuant(AllInfoFile = 'LongReadInfo/AllInfo',Iso = TRUE)

第六步： Differential isoform analysis

contig文件内容是想比较的细胞类型，如下：

Comparison1Celltypes		Comparison2Celltypes
HippNeuron	Hipp_ExciteNeuron,Hipp_InhibNeuron,Hipp_GranuleNB	PFCNeuron	PFC_ExciteNeuron,PFC_InhibNeuron
HippNonNeuron	Hipp_Astro,Hipp_ChorPlex,Hipp_Ependymal,Hipp_Myeloid,Hipp_NIPCs,Hipp_Oligo,Hipp_RGL,Hipp_Vasc	PFCNonNeuron	PFC_Astro,PFC_Oligo,PFC_Myeloid,PFC_Vasc
HippExcite	Hipp_ExciteNeuron	PFCExcite	PFC_ExciteNeuron
HippInhib	Hipp_InhibNeuron	PFCInhib	PFC_InhibNeuron

# 差异分析
config <- c("userInput/config") # 仍然是文件路径
DiffSplicingAnalysis(configFile = config,typeOfTest = 'Iso') 
# Iso表示 Full-length isoform
# 这里typeOfTest还可以选择TSS、PolyA、Exon(这个需要跑5b的步骤，具体看说明书)

第七步： Visualizations

该函数绘制了一个三角形热图，表示一个目录中两两比较的DIE基因百分比。condensedCellTypes 是细胞类型list，如下所示：

P7Hipp_Astro
P7Hipp_ChorPlex
P7Hipp_DivOligo
P7Hipp_Ependymal
P7Hipp_ExcitNeuron
P7Hipp_GranuleNB
P7Hipp_InhibNeuron
P7Hipp_Myeloid
P7Hipp_NIPCs
P7Hipp_Oligo
P7Hipp_RGL
P7Hipp_Vasc

以上的cell type要对应于AllInfo文件里的 barcode-celltype list。如果输入的cell type类型没有计算成对DIE，则它将输出全零（比如这里的DivOligo）

# 可视化
condensedCellTypes =c("userInput/condensedCellTypes") #文件路径
triHeatmap(treeDir = 'TreeTraversal_Iso/',
           comparisonList = condensedCellTypes,
           typeOfTest = "Iso",
           outName = condensedCellTypes)

热图

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 214,951评论 6赞 497
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 91,606评论 3赞 389
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 160,601评论 0赞 350
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 57,478评论 1赞 288
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 66,565评论 6赞 386
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 50,587评论 1赞 293
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 39,590评论 3赞 414
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 38,337评论 0赞 270
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 44,785评论 1赞 307
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 37,096评论 2赞 330
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 39,273评论 1赞 344
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 34,935评论 5赞 339
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 40,578评论 3赞 322
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 31,199评论 0赞 21
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 32,440评论 1赞 268
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 47,163评论 2赞 366
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 44,133评论 2赞 352