fastMNN and Harmony

In scRNA-seq analysis, batch effects are a major source of unwanted variability. They arise from a range of factors, including the timing of cell capture, the personnel handling the samples, reagent lots, equipment, and even the technology platform used, and they can cause substantial discrepancies between datasets. Numerous algorithms have been developed to mitigate batch effects. Among them, Harmony and fastMNN are the two I use most often in my analyses.

1. fastMNN

FastMNN is built upon its more complex predecessor, the Mutual Nearest Neighbors (MNN) algorithm. We'll first discuss MNN (1).
a. First and foremost, MNN assumes that data from two batches lie on parallel hyperplanes in a high-dimensional gene expression space (Figure 1a). In this context, the batch effect can be pictured as vectors between the batches that are almost orthogonal to those hyperplanes.
b. To find the vectors representing the batch effect, the MNN algorithm identifies MNN pairs of cells. Consider two batches. For cell i in batch 1, we identify its k nearest neighbors among the cells in batch 2. Likewise, for cell j in batch 2, we locate its k nearest neighbors among the cells in batch 1. If cell j is one of cell i's nearest neighbors and cell i is one of cell j's nearest neighbors, the two cells are deemed an MNN pair. Under the assumptions of the MNN algorithm, these two cells are of the same type. The distances between cells are calculated as cosine distances.
Note: A crucial assumption here is that MNN pairs should share the same cell type. Hence, if two datasets do not include equivalent cell types, suitable MNNs cannot be found, which may lead to inappropriate integrations.
c. Once the MNN pairs are identified, the difference between the two cells of each pair, which can be represented as a vector, is treated as the batch effect. These vectors are then used to adjust the data in batch 2. The adjustment itself is straightforward: each data point in batch 2, in high-dimensional gene expression space, is adjusted by subtracting its correction vector. Because there are many MNN pairs, there are many batch-effect correction vectors. The MNN algorithm does not simply average them. Instead, it computes a cell-specific batch-correction vector as a weighted average of these vectors using a Gaussian kernel. Put simply, for a given cell in batch 2, the closer the batch 2 member of an MNN pair is to that cell, the higher the weight given to that pair's vector.
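
Steps b and c can be sketched in a few lines of plain Python. This is only a toy illustration under my own assumptions, not the authors' implementation: the function names (cosine_distance, find_mnn_pairs, correct_batch2), the bandwidth parameter sigma, and the use of squared Euclidean distance inside the Gaussian kernel are all choices made for this example, and the brute-force neighbor search would be far too slow for real data.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two expression vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def k_nearest(query, targets, k):
    """Indices of the k cells in `targets` closest to `query` (brute force)."""
    order = sorted(range(len(targets)), key=lambda t: cosine_distance(query, targets[t]))
    return set(order[:k])

def find_mnn_pairs(batch1, batch2, k):
    """Pairs (i, j): j is among i's k neighbors in batch 2 AND i is among j's in batch 1."""
    nn_of_1 = [k_nearest(cell, batch2, k) for cell in batch1]
    nn_of_2 = [k_nearest(cell, batch1, k) for cell in batch2]
    return sorted((i, j) for i in range(len(batch1)) for j in nn_of_1[i] if i in nn_of_2[j])

def correct_batch2(batch1, batch2, pairs, sigma=1.0):
    """Subtract a cell-specific, Gaussian-weighted average of the pair vectors from batch 2."""
    # One batch-effect vector per MNN pair: batch 2 member minus batch 1 member.
    vectors = [[b - a for a, b in zip(batch1[i], batch2[j])] for i, j in pairs]
    corrected = []
    for cell in batch2:
        # Weight each pair by how close its batch 2 member is to this cell
        # (assumption: squared Euclidean distance inside the kernel).
        weights = []
        for _, j in pairs:
            sq = sum((x - y) ** 2 for x, y in zip(cell, batch2[j]))
            weights.append(math.exp(-sq / (2 * sigma ** 2)))
        total = sum(weights)  # can underflow to 0 if every pair is very far away
        shift = [sum(w * v[d] for w, v in zip(weights, vectors)) / total
                 for d in range(len(cell))]
        corrected.append([x - s for x, s in zip(cell, shift)])
    return corrected

# Toy data: two cell types in two "genes"; batch 2 is shifted relative to batch 1.
batch1 = [[10.0, 1.0], [1.0, 10.0]]
batch2 = [[12.0, 3.0], [2.0, 13.0], [11.0, 3.0]]  # the third cell has no MNN partner

pairs = find_mnn_pairs(batch1, batch2, k=1)
print(pairs)  # [(0, 0), (1, 1)]
print(correct_batch2(batch1, batch2, pairs, sigma=1.0))
```

The third cell in batch 2 is not part of any MNN pair, yet it is still corrected: its shift is dominated by the pair whose batch 2 member sits closest to it, which is the point of the cell-specific weighting.
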
As outlined above, MNN corrects for batch effects using some important assumptions that should not be violated.
First, it assumes the presence of at least one cell population common to both batches. As discussed earlier, this assumption is key to the functioning of the MNN algorithm.
Second, it supposes that the batch effect is nearly orthogonal to the biological subspace. The authors posit that in high-dimensional space, this is usually the case.
Third, the variation in batch effects across cells is assumed to be much smaller than the biological variation between different cell types. If this assumption is violated, MNN pairs can no longer be relied on to match cells of the same type across the two batches.
Because MNN uses the expression data of all genes to construct the high-dimensional space, identifying MNNs can be quite resource-intensive and time-consuming. To get around this, fastMNN first applies Principal Component Analysis for dimension reduction and builds the data space from the resulting principal components, which significantly improves efficiency.

In our analysis, when we attempted to merge epithelial cells from different samples, their considerable heterogeneity violated the first assumption and produced unsatisfactory UMAP plots.
Harmony, discussed next, is conceptually appealing, but its implementation is complex and requires considerable mathematical background to grasp fully (2).


Figure 1

2. Harmony

Figure 2

Harmony uses a method known as soft clustering to maintain high batch diversity within each cluster (3), thereby mitigating the batch effect. This differs from traditional clustering techniques such as k-means, which assign every cell definitively to a single cluster.

In Harmony's soft clustering approach, each cell belongs to several clusters with certain probabilities. For instance, in conventional k-means clustering, if cell i belongs to cluster 1, we write Ri1 = 1, and the same cell cannot also belong to cluster 2, so Ri2 = 0. In soft clustering, the assignments might instead be Ri1 = 0.4 and Ri2 = 0.2, with the remaining probability spread over the other clusters.
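
The difference between hard and soft assignment can be shown with a small sketch. It is a generic soft k-means style assignment written for illustration, not Harmony's actual update rule, which also includes a batch-diversity term. The function names, the Gaussian-style kernel on squared Euclidean distance, and the parameter sigma are my own assumptions.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between a cell and a centroid."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def hard_assign(cell, centroids):
    """k-means style: 1 for the nearest centroid, 0 for every other cluster."""
    dists = [sq_dist(cell, c) for c in centroids]
    nearest = dists.index(min(dists))  # ties go to the lowest cluster index
    return [1 if k == nearest else 0 for k in range(len(centroids))]

def soft_assign(cell, centroids, sigma=1.0):
    """Soft clustering: membership probabilities over all clusters, summing to 1."""
    scores = [math.exp(-sq_dist(cell, c) / sigma) for c in centroids]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical centroids; the cell sits exactly halfway between clusters 1 and 2.
centroids = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
cell = [1.0, 0.0]

print(hard_assign(cell, centroids))  # [1, 0, 0]
print(soft_assign(cell, centroids))  # equal weight on the first two clusters, little on the third
```

The hard assignment has to pick one of the two equally close clusters, while the soft assignment keeps the ambiguity as roughly equal probabilities. A larger sigma makes the memberships flatter, and a smaller one pushes them toward the hard assignment.
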

In essence, Harmony's goal is to maximize the diversity of batches within each cluster, so that every cluster contains a mixture of cells from different batches. A full understanding of the algorithm requires a solid mathematical foundation, which makes it hard to appreciate in its entirety without prior exposure.
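
To make "batch diversity within a cluster" more concrete, here is a small diagnostic I find helpful. It sums the soft membership probabilities per batch in each cluster and reports the Shannon entropy of the resulting batch proportions. This is an illustration only: the entropy score and the function names are my own choices, and this is not the penalty term Harmony optimizes.

```python
import math

def batch_composition(memberships, batches, n_clusters):
    """Soft count of each batch per cluster: sum of membership probabilities."""
    labels = sorted(set(batches))
    counts = [{b: 0.0 for b in labels} for _ in range(n_clusters)]
    for probs, batch in zip(memberships, batches):
        for k in range(n_clusters):
            counts[k][batch] += probs[k]
    return counts

def batch_entropy(cluster_counts):
    """Shannon entropy (bits) of one cluster's batch proportions; higher means better mixed."""
    total = sum(cluster_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in cluster_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

# Hypothetical soft memberships for four cells over two clusters.
memberships = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]
batches = ["b1", "b1", "b2", "b1"]

for k, counts in enumerate(batch_composition(memberships, batches, n_clusters=2)):
    print(k, counts, round(batch_entropy(counts), 3))
```

In this toy case the first cluster draws almost equally from both batches and scores close to the two-batch maximum of 1 bit, while the second cluster is dominated by batch b1 and scores much lower.
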

In summary, both MNN and Harmony rely on the presence of shared cell populations across batches. When dealing with highly heterogeneous cell types, these methods may yield unpredictable results. In such cases, LIGER, which has shown robust performance with non-identical cell types, may be a more effective choice for integrating epithelial cells (4).

References

  1. Haghverdi L, Lun ATL, Morgan MD, Marioni JC. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nature Biotechnology. 2018;36(5):421-7.

  2. Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods. 2019;16(12):1289-96.

  3. Mao Q, Wang L, Goodison S, Sun Y. Dimensionality Reduction Via Graph Structure Learning. 2015. p. 765-74.

  4. Tran HTN, Ang KS, Chevrier M, Zhang X, Lee NYS, Goh M, et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biology. 2020;21(1):12.
