用scanpy
做差异分析的时候,发现一个奇怪的现象,如果指定了reference
参数,即使设置了pts
参数来计算基因的在细胞群里面的表达比例,当提取差异结果的时候会发现并没有参考细胞群的表达比例。一开始还以为是代码哪里出现了问题,但是经过各种确认好像并不是,应该是软件本身的实现过程造成这样的结果。
sc.tl.rank_genes_groups(adata, groupby='leiden', method='wilcoxon', groups=['0'], reference='1', pts=True)
sc.get.rank_genes_groups_df(adata, group='0')
names scores logfoldchanges pvals pvals_adj pct_nz_group
0 DNAJA3 5.441774 NaN 5.275249e-08 8.079924e-06 1.0
1 ARRDC3 5.357692 NaN 8.429181e-08 1.098348e-05 1.0
2 MDS2 5.346570 2.715877 8.963667e-08 1.098348e-05 1.0
3 SELL 5.176715 NaN 2.258272e-07 2.305947e-05 1.0
4 PDIK1L 5.148776 NaN 2.621915e-07 2.409540e-05 1.0
... ... ... ... ... ... ...
1833 LTB -6.935504 -0.974345 4.047756e-12 1.487955e-09 1.0
1834 S100A6 -7.112744 1.895015 1.137580e-12 5.227180e-10 1.0
1835 IL32 -7.934794 -1.297552 2.108454e-15 1.291779e-12 1.0
1836 LGALS1 -8.337410 0.869385 7.593519e-17 6.978444e-14 1.0
1837 S100A11 -9.815036 3.125621 9.700399e-23 1.782933e-19 1.0
[1838 rows x 6 columns]
当不指定reference
参数时,结果看起来就是正常的,在比较组和参考组里面表达比例都有:
sc.tl.rank_genes_groups(adata, groupby='leiden', method='wilcoxon', groups=['0'], pts=True)
sc.get.rank_genes_groups_df(adata, group='0')
names scores logfoldchanges pvals pvals_adj pct_nz_group pct_nz_reference
0 LTB 10.773931 NaN 4.570611e-27 2.470818e-25 1.0 1.0
1 IL32 8.496833 NaN 1.948341e-17 5.968417e-16 1.0 1.0
2 HINT1 8.042960 NaN 8.769394e-16 2.238631e-14 1.0 1.0
3 SELL 7.390066 NaN 1.467558e-13 3.249846e-12 1.0 1.0
4 RP11-291B21.2 7.175574 NaN 7.200434e-13 1.454330e-11 1.0 1.0
... ... ... ... ... ... ... ...
1833 LGALS1 -19.655756 NaN 5.161368e-86 1.897319e-83 1.0 1.0
1834 HLA-DPB1 -19.723131 NaN 1.365123e-86 6.272742e-84 1.0 1.0
1835 HLA-DPA1 -20.134163 NaN 3.705156e-90 2.270025e-87 1.0 1.0
1836 HLA-DRB1 -21.678505 NaN 3.273373e-104 3.008230e-101 1.0 1.0
1837 CYBA -21.769262 NaN 4.538768e-105 8.342256e-102 1.0 1.0
[1838 rows x 7 columns]
虽然是一场虚惊,还是觉得有些别扭,毕竟结果看起来有些反常,不知所以时为了确认是否有问题还要花费额外的时间。说起来,软件内部应该自动屏蔽这样的参数改动,返回相似的结果。用过Seurat
的都知道,这样的想象是不存在的。
在这种情况下,scanpy
其实已经计算了参考组里面的基因表达比例,只是在提取的时候却忽略了。既然是丢了,如果需要想要找回的话,也是挺简单的:
import pandas as pd
dge = sc.get.rank_genes_groups_df(adata, group='0')
refpts = pd.DataFrame({'pct_nz_reference':adata.uns['rank_genes_groups']['pts'].loc[:,'1']})
refpts['names'] = refpts.index
res = pd.merge(dge, refpts, how='left', on='names')
res
names scores logfoldchanges pvals pvals_adj pct_nz_group pct_nz_reference
0 DNAJA3 5.441774 NaN 5.275249e-08 8.079924e-06 1.0 1.0
1 ARRDC3 5.357692 NaN 8.429181e-08 1.098348e-05 1.0 1.0
2 MDS2 5.346570 2.715877 8.963667e-08 1.098348e-05 1.0 1.0
3 SELL 5.176715 NaN 2.258272e-07 2.305947e-05 1.0 1.0
4 PDIK1L 5.148776 NaN 2.621915e-07 2.409540e-05 1.0 1.0
... ... ... ... ... ... ... ...
1833 LTB -6.935504 -0.974345 4.047756e-12 1.487955e-09 1.0 1.0
1834 S100A6 -7.112744 1.895015 1.137580e-12 5.227180e-10 1.0 1.0
1835 IL32 -7.934794 -1.297552 2.108454e-15 1.291779e-12 1.0 1.0
1836 LGALS1 -8.337410 0.869385 7.593519e-17 6.978444e-14 1.0 1.0
1837 S100A11 -9.815036 3.125621 9.700399e-23 1.782933e-19 1.0 1.0