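I. What does sepal do?
Original paper: sepal: identifying transcript profiles with spatial patterns by diffusion-based modeling
It identifies genes with spatial patterns in spatial transcriptomics data and ranks them by how pronounced the pattern is.
It clusters the spatially patterned genes (the top n of the ranking) into pattern families, so that genes in the same family share a similar spatial pattern; each family can then be interpreted biologically (biological processes).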
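II. How sepal works
Other methods (e.g., Trendsceek, SpatialDE, SPARK) typically assume that the data follow a particular distribution and rely on hypothesis testing. When the data do not match the assumed distribution, the results suffer.
sepal takes a different approach. The paper treats a gene's expression across the tissue as something like a diffusing substance: using Fick's second law and the gene's spatial expression data, it computes a diffusion time for each gene, i.e., how long the simulated expression takes to spread out into a uniform state. A longer diffusion time indicates a stronger spatial pattern; a shorter one indicates a more random distribution. Ranking genes by diffusion time therefore orders them from most to least spatially patterned.
See the original paper for the formulas and details.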
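To make the idea concrete, here is a minimal sketch (not sepal's implementation) that integrates Fick's second law, ∂u/∂t = D∇²u, on a small square grid with explicit Euler steps and reports how long a pattern takes to flatten out. The function names, the no-flux boundary, the step size, and the stopping rule (largest deviation from the mean below `tol`) are all assumptions chosen for illustration; sepal's own normalization, boundary handling, and convergence criterion are described in the paper and may differ.

```python
def laplacian(u):
    """Discrete Laplacian on a square grid with no-flux (reflecting) edges."""
    n_rows, n_cols = len(u), len(u[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                # Neighbors outside the grid are skipped, so nothing leaks out.
                if 0 <= ni < n_rows and 0 <= nj < n_cols:
                    out[i][j] += u[ni][nj] - u[i][j]
    return out


def diffusion_time(u0, rate=1.0, dt=0.1, tol=1e-4, max_steps=100_000):
    """Simulated time until the field is uniform to within tol (illustrative rule)."""
    u = [row[:] for row in u0]
    n_cells = len(u) * len(u[0])
    mean = sum(map(sum, u)) / n_cells  # conserved by the no-flux update
    for step in range(max_steps):
        if max(abs(v - mean) for row in u for v in row) < tol:
            return step * dt
        lap = laplacian(u)
        # Explicit Euler step of du/dt = D * laplacian(u)
        u = [[v + rate * dt * l for v, l in zip(row, lap_row)]
             for row, lap_row in zip(u, lap)]
    return max_steps * dt


n = 8
# Hypothetical "structured" gene: expressed only in the left half of the tissue.
structured = [[1.0 if j < n // 2 else 0.0 for j in range(n)] for _ in range(n)]
# Hypothetical "scattered" gene: same total expression, alternating spot by spot.
scattered = [[float((i + j) % 2) for j in range(n)] for i in range(n)]

t_structured = diffusion_time(structured)
t_scattered = diffusion_time(scattered)
print(f"structured: {t_structured:.1f}, scattered: {t_scattered:.1f}")
```

Both toy patterns contain the same total expression, yet the half-tissue pattern needs far longer to even out than the spot-by-spot one, which is the intuition behind ranking genes by diffusion time.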
III. Running sepal
1. Compute the diffusion-time table
sepal run -c counts.csv -mo 10 -mc 10 -o . -ar 1k
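- -c: input file in .csv, .tsv, or .h5ad (from scanpy) format. The matrix should be laid out as n_locations x n_genes; otherwise transpose it with -t (or --transpose).
- -ar: array type of the spatial transcriptomics data, e.g. visium, 2k, or 1k (the help output in section IV also lists unstructured). visium is 10X data and 1k is ST data; it is unclear to me what 2k refers to.
- -mo, -mc, -ks, etc.: gene filtering. -mo is the minimum number of spots a gene must occur in, -mc is the minimum total counts for a gene, and -ks keeps the RP and MT profiles, which are excluded by default.
- -o: output directory.
- In the output table, the average column is the diffusion time, scaled to the interval [0,1].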
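As a rough illustration of how the resulting table can be used, the sketch below sorts genes by the average column. The table layout (gene names in the first column, tab-separated, one average column) and the gene names and values are assumptions made for this example; check them against the actual .tsv that sepal run writes.

```python
import csv
import io


def rank_genes(tsv_text, column="average"):
    """Return (gene, diffusion_time) pairs, longest diffusion time first."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    gene_field = reader.fieldnames[0]  # assumed: first column holds gene names
    rows = [(row[gene_field], float(row[column])) for row in reader]
    return sorted(rows, key=lambda pair: pair[1], reverse=True)


# Hypothetical table contents, standing in for a real *-top-diffusion-times.tsv file.
example = "gene\taverage\nGeneA\t0.12\nGeneB\t1.00\nGeneC\t0.57\n"

ranking = rank_genes(example)
for gene, time in ranking:
    print(gene, time)
```

For a real file, the text could be read with `open(path).read()` and passed to `rank_genes` unchanged, provided the layout assumption holds.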
2. Plot the top-ranked genes
sepal analyze -c counts.csv -r *-top-diffusion-times.tsv -ar 1k -o . inspect -ng 20 -nc 5
- -r: the .tsv file produced by sepal run.
- -ng: number of genes to plot.
- -nc: number of genes drawn per row (columns in the plot).
3. Cluster the top-ranked genes into pattern families
sepal analyze -c ./counts.csv -r *-top-diffusion-times.tsv -ar 1k -o . family -ng 100 -nbg 100 -eps 0.85 --plot -nc 10
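- -nbg: number of top-ranked genes used as the basis for the PCA.
- -ng: number of top-ranked genes to cluster.
- -eps: threshold on the PCA explained-variance ratio. The number of families equals the number of principal components retained, so a larger -eps yields more families.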
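The effect of -eps can be illustrated with a small sketch: given the variance explained by each principal component, keep the smallest number of components whose cumulative share reaches the threshold. The variance values are invented, and the exact rule sepal applies internally is an assumption here; the point is only that raising the threshold cannot reduce the number of components, and hence families.

```python
from itertools import accumulate


def n_components(variances, eps):
    """Smallest number of leading components whose cumulative variance share >= eps."""
    total = sum(variances)
    for k, cumulative in enumerate(accumulate(variances), start=1):
        if cumulative / total >= eps:
            return k
    return len(variances)


# Hypothetical per-component variances, sorted from largest to smallest.
variances = [5.0, 2.5, 1.2, 0.8, 0.3, 0.2]

for eps in (0.85, 0.995):
    print(eps, n_components(variances, eps))
```

With these made-up numbers, the threshold 0.85 used in the command above keeps fewer components than the default 0.995 shown in the help output.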
4. Run enrichment analysis on each family
sepal analyze -c counts.csv -r *-top-diffusion-times.tsv -ar 1k -o . fea -fl *-family-index.tsv
- -fl: the file written by sepal analyze family, which records the family each gene belongs to.
- -dbs: reference databases to query; the default is GO:BP.
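Enrichment is run per family, so the gene-to-family labels first have to be turned into one gene list per family. The sketch below shows that grouping step only. The two-column layout (gene, family index), the header row, and the gene names are assumptions for illustration and may not match the real *-family-index.tsv; the query to g:Profiler itself is left to sepal analyze fea.

```python
import csv
import io
from collections import defaultdict


def genes_by_family(tsv_text):
    """Group gene names by family index (assumed layout: gene<TAB>family)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader)  # skip the assumed header row
    families = defaultdict(list)
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        gene, family = row[0], int(row[1])
        families[family].append(gene)
    return dict(families)


# Hypothetical file contents.
example_index = "gene\tfamily\nGeneA\t0\nGeneB\t1\nGeneC\t0\n"

groups = genes_by_family(example_index)
for family, genes in sorted(groups.items()):
    print(family, genes)
```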
IV. Detailed parameters
- sepal run -h
.\ /.
< ~O~ >
┌─┐┌─┐┌─┐┌─┐┬ '/_\'
└─┐├┤ ├─┘├─┤│ \ | /
└─┘└─┘┴ ┴ ┴┴─┘ \|/
Version 1.0.0 | see https://github.com/almaan/sepal
usage: sepal run [-h] -c COUNT_FILES [COUNT_FILES ...] -o OUT_DIR [-t]
[-mo MIN_OCCURANCE] [-mc MIN_COUNTS] [-mzp MAX_ZERO_FRACTION]
[-ks] [-dt TIME_STEP] [-eps THRESHOLD] [-dr DIFFUSION_RATE]
[-nw NUM_WORKERS] -ar {visium,2k,1k,unstructured} [-z]
[-ps PSEUDOCOUNT]
optional arguments:
-h, --help show this help message and exit
-c COUNT_FILES [COUNT_FILES ...], --count_files COUNT_FILES [COUNT_FILES ...]
count files (default: None)
-o OUT_DIR, --out_dir OUT_DIR
output directory (default: None)
-t, --transpose transpose count matrix (default: False)
-mo MIN_OCCURANCE, --min_occurance MIN_OCCURANCE
minimum number of spot that gene has to occur within
(default: 5)
-mc MIN_COUNTS, --min_counts MIN_COUNTS
minimum number of total counts for a gene (default:
20)
-mzp MAX_ZERO_FRACTION, --max_zero_fraction MAX_ZERO_FRACTION
max fraction of spots with zero counts allowed for
gene (default: 1.0)
-ks, --keep_spurious include RP and MT profiles (default: False)
-dt TIME_STEP, --time_step TIME_STEP
minimum number of total counts for a gene (default:
0.001)
-eps THRESHOLD, --threshold THRESHOLD
threshold (eps) to use when assessing convergence
(default: 1e-08)
-dr DIFFUSION_RATE, --diffusion_rate DIFFUSION_RATE
Diffusion rate (D) to use in simulations (default: 1)
-nw NUM_WORKERS, --num_workers NUM_WORKERS
number of workers to use. If no number is provided,
the maximum number of available workers will be used.
(default: None)
-ar {visium,2k,1k,unstructured}, --array {visium,2k,1k,unstructured}
array type (default: None)
-z, --timeit time analysis (default: False)
-ps PSEUDOCOUNT, --pseudocount PSEUDOCOUNT
pseudocount in normalization (default: 2.0)
- sepal analyze -h
usage: sepal analyze [-h] [-c COUNT_DATA] [-r RESULTS] -o OUT_DIR
[-ar {visium,2k,1k,unstructured}] [-tr] [-rt]
[-ss SIDE_SIZE] [-nc N_COLS] [-qs QUANTILE_SCALING]
[-st SPLIT_TITLE SPLIT_TITLE] [-ps PSEUDOCOUNT]
[-sig SIGMA]
{inspect,family,fea} ...
positional arguments:
{inspect,family,fea}
optional arguments:
-h, --help show this help message and exit
-c COUNT_DATA, --count_data COUNT_DATA
count files (default: None)
-r RESULTS, --results RESULTS
output directory (default: None)
-o OUT_DIR, --out_dir OUT_DIR
output directory (default: None)
-ar {visium,2k,1k,unstructured}, --array {visium,2k,1k,unstructured}
array type (default: None)
-tr, --transpose transpose count matrix (default: False)
-rt, --rotate
-ss SIDE_SIZE, --side_size SIDE_SIZE
side length in plot (default: 350)
-nc N_COLS, --n_cols N_COLS
number f columns in plot (default: 5)
-qs QUANTILE_SCALING, --quantile_scaling QUANTILE_SCALING
quantile to use for quantile scaling (default: None)
-st SPLIT_TITLE SPLIT_TITLE, --split_title SPLIT_TITLE SPLIT_TITLE
split title (default: None)
-ps PSEUDOCOUNT, --pseudocount PSEUDOCOUNT
pseudocount in normalization (default: 2.0)
-sig SIGMA, --sigma SIGMA
sensitivity for selection of top genes (default: 1.5)
- sepal analyze inspect -h
usage: sepal analyze inspect [-h] [-sd STYLE_DICT] [-nc N_COLS] [-pv]
[-ng N_GENES]
optional arguments:
-h, --help show this help message and exit
-sd STYLE_DICT, --style_dict STYLE_DICT
plot style as dict (default: None)
-nc N_COLS, --n_cols N_COLS
number f columns in plot (default: 5)
-pv, --pval values are pvals (default: False)
-ng N_GENES, --n_genes N_GENES
number of genes to visualize (default: None)
- sepal analyze family -h
usage: sepal analyze family [-h] [-ng N_GENES] [-nbg N_BASE_GENES]
[-eps THRESHOLD] [-p] [-sd STYLE_DICT]
[-nc N_COLS]
optional arguments:
-h, --help show this help message and exit
-ng N_GENES, --n_genes N_GENES
included genes (default: 100)
-nbg N_BASE_GENES, --n_base_genes N_BASE_GENES
basis genes (default: None)
-eps THRESHOLD, --threshold THRESHOLD
threshold in clustering (default: 0.995)
-p, --plot threshold in clustering (default: False)
-sd STYLE_DICT, --style_dict STYLE_DICT
plot style as dict (default: None)
-nc N_COLS, --n_cols N_COLS
number f columns in plot (default: 5)
- sepal analyze fea -h
usage: sepal analyze fea [-h] -fl FAMILY_INDEX [-or ORGANISM]
[-dbs DATABASES [DATABASES ...]] [-ltx] [-md]
[-sa START_AT]
optional arguments:
-h, --help show this help message and exit
-fl FAMILY_INDEX, --family_index FAMILY_INDEX
path to family indices (default: None)
-or ORGANISM, --organism ORGANISM
organism to query against. See g:Profiler
documentation for supported organisms (default:
hsapiens)
-dbs DATABASES [DATABASES ...], --databases DATABASES [DATABASES ...]
database to use in enrichment analysis (default:
['GO:BP'])
-ltx, --latex save latex formatted table (default: False)
-md, --markdown save markdown formatted table (default: False)
-sa START_AT, --start_at START_AT
start family enumeration at (default: 0)