I. What does sepal do?

Paper: sepal: identifying transcript profiles with spatial patterns by diffusion-based modeling

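It identifies genes with spatial patterns in spatial transcriptomics data and ranks them by how pronounced the pattern is.
It then clusters the spatially patterned genes (the top n of the ranking) into pattern families, so that genes in the same family share a spatial pattern and each family can be interpreted biologically (biological processes).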

II. How sepal works

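Other methods (e.g. Trendsceek, SpatialDE, SPARK) typically assume that the data follow a particular distribution and rely on hypothesis testing. When the data do not match the assumed distribution, the results suffer.

Sepal takes a different approach. The paper treats the distribution of a gene across the tissue as analogous to the diffusion of a substance: from Fick's second law and the gene's spatial expression data, a diffusion time can be computed for every gene. A longer diffusion time indicates a more pronounced spatial pattern; a shorter one indicates a more random distribution. Ranking genes by diffusion time therefore orders them from the strongest to the weakest spatial pattern.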
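To make the intuition concrete, here is a minimal sketch of the idea, not sepal's actual implementation. It assumes a regular square grid, zero-flux boundaries, explicit Euler steps of Fick's second law (du/dt = D * laplacian(u)), and a convergence rule of my own choosing (the largest per-step change falls below eps). The function names, the toy grids, and the values of dt and eps are all illustrative assumptions, picked so the example runs quickly.

```python
def laplacian(u, i, j):
    """Discrete 5-point Laplacian at (i, j) with zero-flux boundaries."""
    total = 0.0
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        # Neighbours outside the grid are skipped, so nothing leaks out.
        if 0 <= ni < len(u) and 0 <= nj < len(u[0]):
            total += u[ni][nj] - u[i][j]
    return total


def diffusion_time(grid, D=1.0, dt=0.1, eps=1e-4, max_steps=100_000):
    """Simulated time until the largest per-step change drops below eps."""
    u = [[float(v) for v in row] for row in grid]
    for step in range(1, max_steps + 1):
        # Explicit Euler step of du/dt = D * laplacian(u).
        delta = [[D * dt * laplacian(u, i, j) for j in range(len(u[0]))]
                 for i in range(len(u))]
        u = [[u[i][j] + delta[i][j] for j in range(len(u[0]))]
             for i in range(len(u))]
        if max(abs(d) for row in delta for d in row) < eps:
            return step * dt
    return max_steps * dt  # assumed fallback: report the time budget


# Two toy 6x6 "expression maps" with the same total expression.
structured = [[1.0] * 3 + [0.0] * 3 for _ in range(6)]  # left half expressed
checkerboard = [[float((i + j) % 2) for j in range(6)] for i in range(6)]

t_structured = diffusion_time(structured)
t_checkerboard = diffusion_time(checkerboard)
print(t_structured, t_checkerboard)
```

The large-scale pattern takes longer to flatten out than the fine-grained checkerboard, which is the property the ranking relies on.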

See the paper for the exact formulas and their explanation.

III. Running sepal

The authors' code is on GitHub (Python): https://github.com/almaan/sepal

1. Compute the diffusion-time table

sepal run -c counts.csv  -mo 10 -mc 10 -o . -ar 1k

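-c input file; .csv, .tsv and .h5ad (from scanpy) are accepted. The matrix must be laid out as n_locations x n_genes; otherwise pass -t (or --transpose) to transpose it.
-ar type of spatial transcriptomics array: visium, 2k or 1k. visium is 10x data and 1k is ST data; it is unclear what 2k refers to. The help output in section IV also lists unstructured.
-mo, -mc, -ks and related options filter genes.
-o output directory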
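Going by the help text in section IV, -mo is the minimum number of spots a gene must occur in (default 5) and -mc is the minimum total count for a gene (default 20). The sketch below shows what such a filter amounts to. It is an illustration rather than sepal's code: the function name, the dict-based input, the toy counts, and the use of inclusive (>=) comparisons are assumptions, and the RP/MT exclusion controlled by -ks is not covered.

```python
def filter_genes(counts, min_occurrence=5, min_counts=20):
    """Keep genes seen in enough spots and with enough total counts.

    counts maps a gene name to its list of counts, one entry per spot.
    """
    kept = {}
    for gene, values in counts.items():
        n_spots = sum(1 for v in values if v > 0)  # spots where the gene occurs
        total = sum(values)                        # total counts over all spots
        if n_spots >= min_occurrence and total >= min_counts:
            kept[gene] = values
    return kept


# Hypothetical toy data: 3 genes measured at 12 spots.
counts = {
    "GeneA": [5, 3, 0, 8, 4, 2, 6, 1, 0, 7, 3, 2],     # 10 spots, 41 counts
    "GeneB": [30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12],   # 2 spots, 42 counts
    "GeneC": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],     # 12 spots, 12 counts
}

kept = filter_genes(counts)
print(list(kept))
```

With the default thresholds only GeneA survives: GeneB occurs in too few spots and GeneC has too few counts overall.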

[Figure 1.PNG: the resulting diffusion-time table]

The average column holds the diffusion time, scaled to the interval [0,1].
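A small sketch of how such a table could be read and ranked outside of sepal. Only the average column name comes from the text above; the gene column name, the toy values, and the helper name are assumptions made for illustration, so check the header of your own .tsv file before reusing it.

```python
import csv
import io

# Hypothetical stand-in for a *-top-diffusion-times.tsv file.
toy_tsv = "gene\taverage\nGeneA\t1.0\nGeneB\t0.12\nGeneC\t0.57\nGeneD\t0.0\n"


def top_genes(handle, n, score_column="average", gene_column="gene"):
    """Return the n genes with the largest scaled diffusion time."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    rows.sort(key=lambda row: float(row[score_column]), reverse=True)
    return [row[gene_column] for row in rows[:n]]


ranking = top_genes(io.StringIO(toy_tsv), n=2)
print(ranking)
```

For a real file, pass an open file handle instead of the io.StringIO wrapper.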

2. Plot the top-ranked genes

sepal analyze -c counts.csv -r *-top-diffusion-times.tsv -ar 1k -o . inspect -ng 20 -nc 5

-r the .tsv file produced by sepal run
-ng number of genes to plot
-nc number of genes per row

3. Cluster the top-ranked genes into pattern families

sepal analyze -c ./counts.csv -r *-top-diffusion-times.tsv -ar 1k -o . family -ng 100 -nbg 100 -eps 0.85 --plot -nc 10

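-nbg number of top-ranked genes used for the PCA
-ng number of top-ranked genes to cluster
-eps threshold on the fraction of variance explained by the PCA. The number of clusters equals the number of PCs, so a larger -eps gives more families.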
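The relation between -eps and the number of families can be illustrated with a few lines of Python. This is a sketch under assumptions, not sepal's code: it assumes that PCs are added in order of decreasing variance until their cumulative explained-variance fraction reaches eps, and the function name and toy variances are made up.

```python
def n_components_for_threshold(variances, eps):
    """Smallest number of PCs whose cumulative variance fraction reaches eps."""
    total = sum(variances)
    cumulative = 0.0
    for k, var in enumerate(sorted(variances, reverse=True), start=1):
        cumulative += var / total
        if cumulative >= eps:
            return k
    return len(variances)


# Hypothetical per-PC variances; cumulative fractions are 0.5, 0.8, 0.9, 0.96, 1.0.
toy_variances = [5.0, 3.0, 1.0, 0.6, 0.4]

n_loose = n_components_for_threshold(toy_variances, eps=0.85)
n_strict = n_components_for_threshold(toy_variances, eps=0.995)
print(n_loose, n_strict)
```

Raising the threshold from 0.85 to 0.995 (the default shown in the help output in section IV) retains more PCs, and therefore produces more families.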

4. Run enrichment analysis on each family

sepal analyze  -c counts.csv  -r *-top-diffusion-times.tsv  -ar 1k -o . fea -fl *-family-index.tsv

-fl the file written by sepal analyze family, which records the family each gene belongs to.
-dbs reference databases to query; the default is GO:BP.

IV. Full parameter reference

sepal run -h

                  .\ /.
                 < ~O~ >
┌─┐┌─┐┌─┐┌─┐┬     '/_\'
└─┐├┤ ├─┘├─┤│     \ | /
└─┘└─┘┴  ┴ ┴┴─┘    \|/
Version 1.0.0 |  see https://github.com/almaan/sepal
usage: sepal run [-h] -c COUNT_FILES [COUNT_FILES ...] -o OUT_DIR [-t]
                 [-mo MIN_OCCURANCE] [-mc MIN_COUNTS] [-mzp MAX_ZERO_FRACTION]
                 [-ks] [-dt TIME_STEP] [-eps THRESHOLD] [-dr DIFFUSION_RATE]
                 [-nw NUM_WORKERS] -ar {visium,2k,1k,unstructured} [-z]
                 [-ps PSEUDOCOUNT]

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT_FILES [COUNT_FILES ...], --count_files COUNT_FILES [COUNT_FILES ...]
                        count files (default: None)
  -o OUT_DIR, --out_dir OUT_DIR
                        output directory (default: None)
  -t, --transpose       transpose count matrix (default: False)
  -mo MIN_OCCURANCE, --min_occurance MIN_OCCURANCE
                        minimum number of spot that gene has to occur within
                        (default: 5)
  -mc MIN_COUNTS, --min_counts MIN_COUNTS
                        minimum number of total counts for a gene (default:
                        20)
  -mzp MAX_ZERO_FRACTION, --max_zero_fraction MAX_ZERO_FRACTION
                        max fraction of spots with zero counts allowed for
                        gene (default: 1.0)
  -ks, --keep_spurious  include RP and MT profiles (default: False)
  -dt TIME_STEP, --time_step TIME_STEP
                        minimum number of total counts for a gene (default:
                        0.001)
  -eps THRESHOLD, --threshold THRESHOLD
                        threshold (eps) to use when assessing convergence
                        (default: 1e-08)
  -dr DIFFUSION_RATE, --diffusion_rate DIFFUSION_RATE
                        Diffusion rate (D) to use in simulations (default: 1)
  -nw NUM_WORKERS, --num_workers NUM_WORKERS
                        number of workers to use. If no number is provided,
                        the maximum number of available workers will be used.
                        (default: None)
  -ar {visium,2k,1k,unstructured}, --array {visium,2k,1k,unstructured}
                        array type (default: None)
  -z, --timeit          time analysis (default: False)
  -ps PSEUDOCOUNT, --pseudocount PSEUDOCOUNT
                        pseudocount in normalization (default: 2.0)

sepal analyze -h

                    _
                  .\ /.
                 < ~O~ >
┌─┐┌─┐┌─┐┌─┐┬     '/_\'
└─┐├┤ ├─┘├─┤│     \ | /
└─┘└─┘┴  ┴ ┴┴─┘    \|/
Version 1.0.0 |  see https://github.com/almaan/sepal
usage: sepal analyze [-h] [-c COUNT_DATA] [-r RESULTS] -o OUT_DIR
                     [-ar {visium,2k,1k,unstructured}] [-tr] [-rt]
                     [-ss SIDE_SIZE] [-nc N_COLS] [-qs QUANTILE_SCALING]
                     [-st SPLIT_TITLE SPLIT_TITLE] [-ps PSEUDOCOUNT]
                     [-sig SIGMA]
                     {inspect,family,fea} ...

positional arguments:
  {inspect,family,fea}

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT_DATA, --count_data COUNT_DATA
                        count files (default: None)
  -r RESULTS, --results RESULTS
                        output directory (default: None)
  -o OUT_DIR, --out_dir OUT_DIR
                        output directory (default: None)
  -ar {visium,2k,1k,unstructured}, --array {visium,2k,1k,unstructured}
                        array type (default: None)
  -tr, --transpose      transpose count matrix (default: False)
  -rt, --rotate
  -ss SIDE_SIZE, --side_size SIDE_SIZE
                        side length in plot (default: 350)
  -nc N_COLS, --n_cols N_COLS
                        number f columns in plot (default: 5)
  -qs QUANTILE_SCALING, --quantile_scaling QUANTILE_SCALING
                        quantile to use for quantile scaling (default: None)
  -st SPLIT_TITLE SPLIT_TITLE, --split_title SPLIT_TITLE SPLIT_TITLE
                        split title (default: None)
  -ps PSEUDOCOUNT, --pseudocount PSEUDOCOUNT
                        pseudocount in normalization (default: 2.0)
  -sig SIGMA, --sigma SIGMA
                        sensitivity for selection of top genes (default: 1.5)

sepal analyze inspect -h

                    _
                  .\ /.
                 < ~O~ >
┌─┐┌─┐┌─┐┌─┐┬     '/_\'
└─┐├┤ ├─┘├─┤│     \ | /
└─┘└─┘┴  ┴ ┴┴─┘    \|/
Version 1.0.0 |  see https://github.com/almaan/sepal
usage: sepal analyze inspect [-h] [-sd STYLE_DICT] [-nc N_COLS] [-pv]
                             [-ng N_GENES]

optional arguments:
  -h, --help            show this help message and exit
  -sd STYLE_DICT, --style_dict STYLE_DICT
                        plot style as dict (default: None)
  -nc N_COLS, --n_cols N_COLS
                        number f columns in plot (default: 5)
  -pv, --pval           values are pvals (default: False)
  -ng N_GENES, --n_genes N_GENES
                        number of genes to visualize (default: None)

sepal analyze family -h

                    _
                  .\ /.
                 < ~O~ >
┌─┐┌─┐┌─┐┌─┐┬     '/_\'
└─┐├┤ ├─┘├─┤│     \ | /
└─┘└─┘┴  ┴ ┴┴─┘    \|/
Version 1.0.0 |  see https://github.com/almaan/sepal
usage: sepal analyze family [-h] [-ng N_GENES] [-nbg N_BASE_GENES]
                            [-eps THRESHOLD] [-p] [-sd STYLE_DICT]
                            [-nc N_COLS]

optional arguments:
  -h, --help            show this help message and exit
  -ng N_GENES, --n_genes N_GENES
                        included genes (default: 100)
  -nbg N_BASE_GENES, --n_base_genes N_BASE_GENES
                        basis genes (default: None)
  -eps THRESHOLD, --threshold THRESHOLD
                        threshold in clustering (default: 0.995)
  -p, --plot            threshold in clustering (default: False)
  -sd STYLE_DICT, --style_dict STYLE_DICT
                        plot style as dict (default: None)
  -nc N_COLS, --n_cols N_COLS
                        number f columns in plot (default: 5)

sepal analyze fea -h

                    _
                  .\ /.
                 < ~O~ >
┌─┐┌─┐┌─┐┌─┐┬     '/_\'
└─┐├┤ ├─┘├─┤│     \ | /
└─┘└─┘┴  ┴ ┴┴─┘    \|/
Version 1.0.0 |  see https://github.com/almaan/sepal
usage: sepal analyze fea [-h] -fl FAMILY_INDEX [-or ORGANISM]
                         [-dbs DATABASES [DATABASES ...]] [-ltx] [-md]
                         [-sa START_AT]

optional arguments:
  -h, --help            show this help message and exit
  -fl FAMILY_INDEX, --family_index FAMILY_INDEX
                        path to family indices (default: None)
  -or ORGANISM, --organism ORGANISM
                        organism to query against. See g:Profiler
                        documentation for supported organisms (default:
                        hsapiens)
  -dbs DATABASES [DATABASES ...], --databases DATABASES [DATABASES ...]
                        database to use in enrichment analysis (default:
                        ['GO:BP'])
  -ltx, --latex         save latex formatted table (default: False)
  -md, --markdown       save markdown formatted table (default: False)
  -sa START_AT, --start_at START_AT
                        start family enumeration at (default: 0)

[Spatial transcriptomics] Sepal: identifying genes with spatial patterns