Mapping Distinct Bone Marrow Niche Populations and Their Differentiation Paths
骨髓龛中不同细胞群体的关联及其分化途径
文章信息介绍的文章利用单细胞转录组分析骨髓龛中不同细胞类群间的关联及其分化轨迹中不同的转录调控因子的功能。文章于2019年3月7日发表在Cell 杂志。于2019年7月7日发表在Cell reports 杂志。文章标题是:Mapping Distinct Bone Marrow Niche Populations and Their Differentiation Paths;DOI:https://doi.org/10.1016/j.celrep.2019.06.031。文章连接
Finally, we validated our findings using lineage-specific reporter strains and targeted knockdowns. Our analysis reveals differentiation hierarchies for maturing stromal cells, determines key transcription factors along these trajectories, and provides an understanding of the complexity of the bone marrow microenvironment.
摘要
骨髓微环境是由复杂表型和细胞成熟轨迹未知的异质性的非造血细胞群体组成。在这些非造血细胞群体中,间充质细胞维持产生间质细胞,骨细胞、脂肪细胞和软骨细胞。阐明骨髓内这些独特的细胞亚群仍然具有挑战性。本研究中,我们使用非造血骨髓细胞的单细胞RNA测序来定义特定的细胞亚群。此外,通过结合细胞状态层次的计算预测和已知的关键转录因子的表达,我们绘制了针对骨细胞,软骨细胞和脂肪细胞谱系的分化途径。 最后,我们使用谱系特异性报告菌株和靶向敲除来验证我们发现的基因。 我们的分析揭示了成熟基质细胞的分化层次,确定了伴随这些分化轨迹的关键转录因子,并增加了对骨髓微环境复杂性的理解。
测序数据介绍
For scRNA-seq, we used inDrops (Klein et al., 2015) following a previously described protocol (Zilionis et al., 2017) with the following modifications: the sequence of the primer on the hydrogel beads was 5′-CGATGACGTAATACGACTCACTATAGGGTGTCGGGTGCAG[bc1,8nt]GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG[bc2,8nt]NNNNNNTTTTTTTTTTTTTTTTTTTV-3′; the sequence of the PE2-N6 primer (step 151 in (Zilionis et al., 2017) was 5′-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGNNNNNN-3′; and the sequences of the PCR primers (steps 157 and 160 in (Zilionis et al., 2017)) were 5′-AATGATACGGCGACCACCGAGATCTACACXXXXXXXXTCGTCGGCAGCGTC-3′ and 5′-CAAGCAGAAGACGGCATACGAGATGGGTGTCGGGTGCAG-3′. Following droplet barcoding reverse transcription, emulsions were split into aliquots of approximately 1,000 (in vivo samples) or 3,000 (cultured samples) single-cell transcriptomes and frozen at −80°C. For the in vivo samples, two libraries (n = 1,533 cells total) were prepared for mouse 1 and three libraries (n = 3,574 cells total) were prepared for mouse 2. For the three cultured samples, one library per sample was prepared (n = 2,837, n = 2,164, and n = 2,520 total cells, respectively). These cell numbers correspond to the final number of transcriptomes detected upon removal of background barcodes and stressed or dying cells (see section below).
数据分析情况
数据过滤:统计每个文库数据所有独特的12 bp的cell barcodes(CBs),排除出现次数小于100次的CBs,并过滤掉CBs包含八个相同核苷酸的片段。然后,排除CBs相关的随机UMI(随机相关定义为任何一个碱基的UMI不超过90%)
比对: hg38 genome使用STAR(2.5.3)软件
基因定量:对转录本使用“-‘‘–quantMode TranscriptomeSAM”进行定量。生成包含每个细胞和每个基因的UMI数量的表达矩阵。
细胞过滤:下游分析的要求细胞至少具有1,000个UMIs比对到至少500个独特基因。我们还排除了含有超过20%线粒体或者核糖体基因的细胞。
聚类分析:质控后获得7698个正常的骨髓细胞(normal BM)。随机过滤掉BM5 CD38+CD38-的1590个细胞中的783个细胞,来降低CD38+CD38-细胞对细胞群体的代表性。剩余的6915个正常的骨髓细胞被用来聚类形成细胞类型使用BackSPIN无监督进行细胞类型聚类。BackSPIN clustering (优点:可以克服细胞和基因同时聚类的困难).
可视化:使用KNN和t-SNE两种方法对细胞进行可视化分析。
表达矩阵可以下载:https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE132151
Sequencing and read mapping
All libraries were sequenced on two runs of a NextSeq 500 (one for the in vivo samples and one for cultured samples). Raw sequencing data (FASTQ files) were processed using the previously described (
) inDrops.py bioinformatics pipeline (available at https://github.com/indrops/indrops), with a few modifications: Bowtie v.1.1.1 was used with parameter -e 80; all ambiguously mapped reads were excluded from analysis; and reads were aligned to the Ensembl release 85 Mus musculus GRCm38 reference.
Quantification and Statistical Analysis of scRNA-seq
Cell filtering and normalization
Each library was initially filtered to include only abundant barcodes (> 500 total counts for all in vivo libraries; > 800 counts for cultured sample 1; > 1,000 counts for cultured samples 2 and 3), based on visual inspection of the histograms of total transcript counts per cell barcode. Next, we excluded putatively stressed or dying cells with > 30% (in vivo samples) or > 20% (cultured samples) of their transcripts coming from mitochondrial genes.
The gene expression counts of each cell were then normalized using a variant of total-count normalization that avoids distortion from very highly expressed genes. Specifically, we calculated, the normalized transcript counts for gene j in cell i, from the raw counts
and [图片上传失败...(image-cd9a1-1564118858268)]
is the average of [图片上传失败...(image-e1fc15-1564118858268)]
over all cells. To prevent very highly expressed genes from correspondingly decreasing the relative expression of other genes, we excluded genes comprising > 5% of the total counts of any cell when calculatingIn order to focus on heterogeneity within the stromal cell population, we first clustered the data and excluded hematopoietic and endothelial cell clusters based on their expression of previously known marker genes, as well as putative cell doublets. Specifically, we identified genes that were highly variable (top 25% by v-score (
), a measure of above-Poisson noise) and expressed at reasonably high levels (at least 3 counts in at least 5 cells). The counts for these genes were z-score normalized and used to perform principal components analysis (PCA), keeping the top 35 dimensions. After PCA, a k-nearest-neighbor (kNN) graph (k = 4) was constructed by connecting each cell to its four nearest neighbors, using Euclidean distance in the principal component space. Finally, we applied spectral clustering (scikit-learn SpectralClustering function with assign_labels = ’discretize’) to the kNN graph and visualized the clustering by projecting the graph into two dimensions using a force-directed graph layout (SPRING(
)).
We then identified enriched genes in each cluster and assigned cell type labels based on well-characterized cell type-specific marker genes (Figure S1D). Using this approach, we excluded putative endothelial cells, granulocytes, lymphoid progenitors, megakaryocytes, and erythroid progenitors.
For the in vivo samples, we also used Scrublet(
) to identify two clusters of cell doublets that co-expressed marker genes of distinct cell types. 142 putative doublets were excluded.
Clustering and visualization of stromal cells
We repeated cell clustering and visualization using only the non-hematopoietic, non-endothelial clusters. Gene filtering, PCA, and kNN graph construction were performed as above, except only the top 25 principal components were used, and only seven spectral clusters were generated.
Permutation test for gene enrichment
To find significantly enriched genes in each cell cluster, we used a parameter-free permutation-based test to calculate p values, with the difference in means as the test statistic (
). We accounted for multiple hypotheses testing with a false discovery rate of 5% using the Benjamini-Hochberg procedure (
). To be considered for differential gene expression analysis, genes had to be expressed by at least 5 of the cells in the cluster of interest.
Gene set enrichment analysis (GSEA)
We used the online GSEA tool (http://software.broadinstitute.org/gsea/login.jsp)(
,
) to find terms enriched in cluster-specific genes. As input, we used significantly enriched genes with > 2-fold higher average expression (adding a pseudocount of 0.1 transcript counts) in the cluster of interest compared to the remaining cells. The following gene set collections were tested: H (hallmark), C2 (curated), and C5 (Gene Ontology).
Population balance analysis (PBA)
The PBA algorithm calculates a scalar “potential” for each cell that is analogous to a distance, or pseudotime, from an undifferentiated source, and a vector of fate probabilities that indicate the distance to fate branch points. These fate probabilities and temporal ordering were computed using the Python implementation of PBA (available online https://github.com/AllonKleinLab/PBA), as described(
).
The inputs to the PBA scripts are a set of comma-separated value (.csv) files encoding: the edge list of a kNN graph (k = 50) of the cell transcriptomes (A.csv); a vector assigning a net source/sink rate to each graph node (R.csv); and a lineage-specific binary matrix identifying the subset of graph nodes that reside at the tips of branches (S.csv). These files are provided online at http://kleintools.hms.harvard.edu/paper_websites/bone_marrow_stroma/. PBA is then run according to the following steps:
-
(1)
Apply the script ‘compute_Linv.py -e A.csv’, here inputting edges (flag ‘-e’) from the SPRING kNN graph (see above). This step outputs the random-walk graph Laplacian, Linv.npy.
-
(2)
Apply the script ‘compute_potential.py -L Linv.npy -R R.csv’, here inputting the inverse graph Laplacian (flag ‘-L’) computed in step (1) and the net source/sink rate to each graph node (flag ‘-R’). This step outputs a potential vector (V.npy) that is used for temporal ordering (cells ordered from high to low potential).
-
(3)
Apply the script ‘compute_fate_probabilities.py -S S.csv -V V.npy -e A.csv -D 1’, here inputting the lineage-specific exit rate matrix (flag ‘-S’), the potential (flag ‘-V’) computed in step (2), the same edges (flag ‘-e’) used in step (1) and a diffusion constant (flat ‘-D’) of 1. This step outputs fate probabilities for each cell.
Estimation of net source/sink rate vector R
A complete definition of the vector R in terms of biophysical quantities has been published previously(
). We assigned negative values to R for the five cells with the highest expression of marker genes for each of the three terminal lineages. Specifically, for each lineage, we identified genes enriched in the most mature cell cluster (cluster 1 for adipocytes, cluster 6 for osteoblasts, and cluster 7 for chondrocytes), keeping genes expressed in > 25% of cells with an average expression level of > 0.5 transcript counts and > 2-fold higher average expression within the cluster than in the rest of the cells. We then identified the five cells with the highest average z-score normalized expression of these marker genes. We used the same procedure to identify ten starting cells (cells with highest score of cluster 3 [MSC] genes). We assigned different exit rates to each of the three lineages using a fitting procedure that ensured that cells identified as the putative starting MSCs would have a uniform probability to become each fate. We assigned a single positive value to all remaining cells, with the value chosen to enforce the steady-state condition
. In the fitting procedure, all exit were initially set to one and iteratively incremented or decremented until the average fate probabilities of the putative starting MSCs were within 1% of uniform. The separate lineage exit rates were then used to form the lineage-specific exit rate matrix S.
Extracting and ordering cells for each lineage
To isolate the differentiation trajectory for each lineage (adipocyte, osteoblast, and chondrocyte), we ordered cells on the basis of their graph distance from the earliest predicted MSC progenitors, keeping only cells for which the probability of the given fate increased or remained constant with graph distance. Graph distance was measured by PBA potential, and starting with the cell closest to the MSC origin, we added the cell with next highest potential to the trajectory if the PBA-predicted lineage probability for cell i was at least 99.5% of the average lineage probability of the cell(s) already in the trajectory.
More formally, the procedure is as follows: order all N cells in the experiment from highest to lowest PBA potential V, with decreasing potential corresponding to increasing distance from MSCs. Let Ei be an indicator variable for the membership of ordered cell i in the erythroid trajectory (Ei = 1 if cell i is in the trajectory; otherwise, Ei = 0). If Pi is the PBA-predicted lineage probability for ordered cell i, then Ei = 1 if
Cells on a given lineage’s trajectory were then ordered by decreasing potential. Defining tj as the index of the jth cell on a given trajectory,
Throughout this paper, we report this cell order (akin to the “pseudotime” in other publications) as a percentage of ordered cells, with the first, least differentiated cell at 0% and the most mature cell at 100%.
Significant dynamic genes
To find genes with significant changes in expression across each lineage’s cell ordering, we used a modified version of permutation test described above (see “Permutation test for gene enrichment”). Specifically, we applied a sliding window (n = 50 cells) to the cell ordering and used the difference in means between the windows with the highest and lowest expression as the test statistic, comparing the observed difference to the differences obtained after permuting the cell ordering. To be considered for analysis, genes were required to have a mean expression of at least 0.01 transcript counts in the input cells.
Dynamic gene clustering
To find groups of genes with similar expression patterns along each lineage’s differentiation ordering, we clustered the smoothed expression traces for all significantly variable genes (see previous section) with at least two-fold change between the windows with minimum and maximum expression. In detail, we smoothed the gene expression traces using a Gaussian kernel (σ = 10% of cell ordering), z-score normalized the smoothed traces, and clustered the traces using k-means clustering.
Mapping cultured cell transcriptomes to freshly isolated cell data
For Figures 5A and S5B, cells from the cultured samples were projected into the same principal component space as the in vivo data, then mapped to their most similar in vivo neighbors. In detail, counts were first converted to TPM for all samples. Then, using only the in vivo cells, the top 25% most variable genes (measured by v-score) with at least three transcript counts in at least five cells were z-score normalized and used to find the top 35 principal components. Next, the cultured cells were z-score normalized using the gene expression means and s.d. from the in vivo data and transformed into the in vivo principal component space. Lastly, each cultured cell was mapped to its closest in vivo neighbor in principal component space (Euclidean distance). In the visualization in Figure 5A, the number of cultured cells mapping to each in vivo cell was smoothed over the kNN graph (see section “Smoothing over the kNN graph”). For Figure S5B, we compared cells in the in vivo MSC cluster to the cultured cells mapping to them.
Smoothing over the kNN graph
We smoothed data over the kNN graph for visualization of the density of cultured cells mapping to in vivo cells (Figure 5A). Smoothing was performed by diffusing the number of mapped cells over the graph, as described. In brief, if G is the kNN graph, then the smoothing operator S is
, where L is the Laplacian matrix of G,
, and expm is the matrix exponential. Then the smoothed vector
of a vector of raw values X (number of mapped cells) is
RNA velocity
In order to generate the input for Velocyto (v0.17.13) (
), which requires annotation of exons and introns for read alignments, the raw reads were reprocessed using dropEst (v0.8.5) (
). We first ran droptag with the default parameters, then aligned reads to the mouse genome (mm10) using STAR (v2.7.0a) (
), allowing unique alignments only (‘–outSAMmultNmax 1’). Then dropEst was run with default settings, aside from the following: ‘-m -V -b -F -L eiEIBA’. Cell barcodes were error-corrected using the Velocyto ‘dropest-bc-correct’ command, followed by generation of Velocyto loom files using ‘run-dropest’.
Velocyto.py was run following an example notebook (https://github.com/velocyto-team/velocyto-notebooks/blob/master/python/DentateGyrus.ipynb). Briefly, the loom files generated by dropEst were merged and then filtered to include cell barcodes used in this paper’s other analyses (see ‘Cell filtering and normalization’ section). Following gene filtering (3000 most variable genes with a minimum of 3 counts and detection in 3 cells), spliced and unspliced counts were normalized separately based on total counts per cell, with a target size of the mean total counts across cells. PCA was run using 33 components, followed by KNN imputation with 66 neighbors. Gamma fitting, RNA velocity calculations, and Markov process simulations were conducted as in the Dentate Gyrus example (code for the full analysis is available at kleintools.hms.harvard.edu/paper_websites/bone_marrow_stroma).
Monocle
Monocle (v2.10.1) (
) was run on our normalized counts matrix. Using the same gene filter as in the other analyses (see ‘Cell filtering and normalization’ section), we generated an embedding using the reduceDimension() function (‘max_components = 2, method = ”DDRTree” norm_method = ”none,” pseudo_expr = 0, relative_expr = FALSE, scaling = TRUE’). After generating an initial ordering with the orderCells() function, we identified the state corresponding to MSCs and re-ran orderCells() using this state as the root state. To compare the Monocle osteoblast cell ordering to that of PBA, we selected cells in the MSC and osteoblast states of the Monocle embedding and then ordered cells by Monocle pseudotime.
主要分群情况
识别正常骨髓样品种的细胞类群
AML肿瘤生态系统的单细胞分析
临床意义
在本报告中,我们提供了转录事件的关键见解,这些转录事件指导成骨细胞,软骨细胞和脂肪细胞从基质细胞分化。这里产生的scRNA-seq基因表达谱允许实时描述与骨髓微环境内的命运选择相关的动态过程。该研究的主要结果是三种不同分化途径的详细表征。我们确定了每条路径中的中间体,这些中间体可以通过前瞻性选择来测试其谱系潜力。最后,我们发现这些种群与命运标记的谱系一致,并且它们预测的分化潜能在培养中被概括。
近年来在稳态和疾病期间表征基质群体方面取得了重大进展,我们的研究为更好地理解调节骨髓微环境细胞分化的转录网络提供了一个景观(Hoggatt等,2016,Méndez- Ferrer等,2010,Mercier等,2011,Morrison和Scadden,2014,Tikhonova等,2019,Baryawno等,2019)。基质细胞的失调与几种病理生理过程有关,例如肥胖,骨质减少,骨质疏松症,癌症,牙齿脱落和衰老(Engblom等,2017,Medyouf等,2014,Mendelson和Frenette,2014,Raaijmakers等)。 al。,2010,Zambetti et al。,2016)。因此,理解调节基质细胞分化的机制可以提高对这些疾病的发病机理的理解,并最终导致新的治疗方法。
我们的伪时间结果以及转录本验证支持亚群之间的关系,并允许我们探索基质细胞表型的转录层次。即使scRNA-seq工具及其许多应用的扩展,转录组快照也不能提供细胞状态的完整图像(Cieślik和Chinnaiyan,2018,Kumar等,2017)。因此,采用多模态分析将有助于增强理解,因为这些技术具有测量多种分子表型的能力。此外,理解MSCs命运决定的进展将需要使用炎症和疾病模型以及基于我们初步发现的患者样本进行更深入的研究。我们在这里显示了使用命运映射和报告菌株验证scRNA-seq数据的重要性。可以进一步利用相同的模型来发现特定转录因子在扰动期间谱系承诺决策中的重要性。
总之,该研究提供了对骨髓微环境的细胞组成和沿成骨细胞,软骨细胞和脂肪细胞命运的分化途径的转录中间体的重要见解。此外,尽管体内和培养的基质细胞之间存在差异,但体外分化实验证明可用于测定不同基质命运的转录因子相关性。我们的数据集和分析(kleintools.hms.harvard.edu/paper_websites/bone_marrow_stroma)将作为研究基质细胞分化的未来研究的资源。
(注:本文小编仅仅关注单细胞转录组,原文的单细胞基因分型也很有意义)
在本报告中,我们提供了关键的洞察转录事件,直接成骨细胞,软骨细胞,脂肪细胞分化为基质细胞。这里生成的scrna-seq基因表达谱可以实时描述骨髓微环境中与命运选择相关的动态过程。这项研究的主要结果是详细描述了三种不同的微分路径。我们确定了沿着每一条路径的中间产物,这些中间产物可以被前瞻性地选择来测试它们的谱系潜能。最后,我们发现这些群体与命运标记的谱系是一致的,并且他们预测的分化潜能在文化中得到了概括。
近年来,在稳定状态和疾病期间基质种群的特征化方面取得了重大进展,我们的研究为更好地理解调节骨髓微环境细胞分化的转录网络提供了一个前景(Hoggatt等人2016年,M_ndez-Ferrer等人,2010年,Mercier等人,2011年,Morrison and Scadden,2014年,Tikhonova等人,2019年,Baryawno等人,2019年)。基质细胞的失调与一些病理生理过程有关,如肥胖、骨质减少、骨质疏松、癌症、牙齿脱落和衰老(Engblom等人,2017年,Medyouf等人,2014年,Mendelson和Frenette,2014年,Raaijmacers等人,2010年,Zambetti等人,2016年)。因此,了解调节间质细胞分化的机制可能导致对这些疾病的发病机理的进一步了解,并最终获得新的治疗方法。
我们的假时间结果以及转录验证支持亚群之间的关系,并允许我们探索基质细胞表型的转录层次。即使scrna-seq工具及其许多应用程序的扩展,转录快照也无法提供细胞状态的完整图片(Cie_lik和Chinnaiyan,2018年,Kumar等人,2017年)。因此,采用多模分析将有助于获得更深入的理解,因为这类技术具有测量多分子表型的能力。此外,了解MSC命运决定的进展需要更深入的研究,利用炎症和疾病模型以及基于我们最初发现的患者样本。我们在这里展示了用命运图和报告菌株验证scrna-seq数据的重要性。同样的模型也可以在微扰过程中进一步研究特定转录因子在谱系承诺决策中的意义。
总之,本研究对骨髓微环境的细胞组成和向成骨细胞、软骨细胞和脂肪细胞脂肪的分化途径的转录中间产物提供了重要的见解。此外,尽管体内和培养的基质细胞之间存在差异,但体外分化实验证明有助于分析转录因子与不同基质命运的相关性。我们的数据集和分析(kleintools.hms.harvard.edu/paper-websites/bone-mallow-stroma)将作为未来研究基质细胞分化的资源。