2018-08-29 Analyzing Proteomics Data with R (2): Data Filtering and Missing Value Imputation

Original article: https://datascienceplus.com/proteomics-data-analysis-2-3-data-filtering-and-missing-value-imputation/

Proteomics Data Analysis (2/3): Data Filtering and Missing Value Imputation

Published on August 28, 2018 at 8:36 am
Welcome to Part Two of the three-part tutorial series on proteomics data analysis. The ultimate goal of this exercise is to identify proteins whose abundance is different between the drug-resistant cells and the control. In other words, we are looking for a list of differentially regulated proteins that may shed light on how cells escape the cancer-killing action of a drug. In Part One, I demonstrated the steps for acquiring a proteomics data set and performing data pre-processing. We will pick up from the cleaned data set and confront the missing value problem in proteomics.

Again, the outline for this tutorial series is as follows:

  • Data acquisition and cleaning
  • Data filtering and missing value imputation
  • Statistical testing and data interpretation

Missing Value Problem

Although mass spectrometry-based proteomics has the advantage of detecting thousands of proteins from a single experiment, it faces certain challenges. One problem is the presence of missing values in proteomics data. To illustrate this, let's examine the first few rows of the log2-transformed and raw protein abundance values.

library(dplyr)
head(select(df, Gene, starts_with("LOG2")))
##    Gene LOG2.Parental_bR1 LOG2.Parental_bR2 LOG2.Parental_bR3
## 1 RBM47              -Inf              -Inf          21.87748
## 2 ESYT2           25.6019          25.56180          25.68763
## 3 ILVBL              -Inf          20.76474              -Inf
## 4 KLRG2              -Inf              -Inf          22.31786
## 5 CNOT1              -Inf              -Inf              -Inf
## 6   PGP              -Inf              -Inf              -Inf
##   LOG2.Resistant_bR1 LOG2.Resistant_bR2 LOG2.Resistant_bR3
## 1               -Inf               -Inf               -Inf
## 2               -Inf               -Inf               -Inf
## 3               -Inf               -Inf               -Inf
## 4               -Inf               -Inf               -Inf
## 5               -Inf               -Inf           29.14207
## 6               -Inf               -Inf           22.46269
head(select(df, Gene, starts_with("LFQ")))
##    Gene LFQ.intensity.Parental_bR1 LFQ.intensity.Parental_bR2
## 1 RBM47                          0                          0
## 2 ESYT2                   50926000                   49530000
## 3 ILVBL                          0                    1781600
## 4 KLRG2                          0                          0
## 5 CNOT1                          0                          0
## 6   PGP                          0                          0
##   LFQ.intensity.Parental_bR3 LFQ.intensity.Resistant_bR1
## 1                    3852800                           0
## 2                   54044000                           0
## 3                          0                           0
## 4                    5228100                           0
## 5                          0                           0
## 6                          0                           0
##   LFQ.intensity.Resistant_bR2 LFQ.intensity.Resistant_bR3
## 1                           0                           0
## 2                           0                           0
## 3                           0                           0
## 4                           0                           0
## 5                           0                   592430000
## 6                           0                     5780200

It is hard to miss the -Inf values, which represent protein intensity measurements of 0 in the raw data set. We treat these data points as missing values, that is, a lack of quantification in the indicated samples. This is a common issue in proteomics experiments, and it arises from sample complexity and randomness (or stochasticity) in sampling.

For example, imagine pouring out a bowl of Lucky Charms cereal containing a thousand different marshmallows. Let's say there is only one coveted rainbow marshmallow for every one thousand pieces. The likelihood of your bowl containing the rare shape is disappointingly low. In our situation, there are approximately 20,000 proteins expressed in a given cell, many of them in low quantities. Hence, the probability of consistently capturing proteins with low expression across all experiments is small.
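To get a sense of how widespread the problem is, we can count the non-finite values in each sample. Here is a minimal sketch, assuming df is the cleaned data frame from Part One:

## Count missing values (-Inf or NA) in each LOG2 column
log2.names = grep("^LOG2", names(df), value = TRUE)
sapply(df[log2.names], function(x) sum(!is.finite(x)))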

Data Filtering

To overcome the missing value problem, we need to remove proteins that are sparsely quantified. The hypothesis is that a protein quantified in only one out of six samples offers insufficient grounds for comparison. In addition, the protein could have been mis-assigned.

One of many filtering schemes is to keep proteins that are quantified in at least two out of three replicates in one condition. To jog your memory, we have two conditions, one drug-resistant cell line and a control, with three replicates each. The significance of replicates will be discussed in Part Three of the tutorial. For now, we will briefly clean the data frame and apply filtering.

## Data cleaning: Extract columns of interest
df = select(df, Protein, Gene, Protein.name, starts_with("LFQ"), starts_with("LOG2"))
glimpse(df)
## Observations: 1,747
## Variables: 15
## $ Protein                     <chr> "A0AV96", "A0FGR8", "A1L0T0", "A4D...
## $ Gene                        <chr> "RBM47", "ESYT2", "ILVBL", "KLRG2"...
## $ Protein.name                <chr> "RNA-binding protein 47", "Extende...
## $ LFQ.intensity.Parental_bR1  <dbl> 0, 50926000, 0, 0, 0, 0, 0, 0, 0, ...
## $ LFQ.intensity.Parental_bR2  <dbl> 0, 49530000, 1781600, 0, 0, 0, 0, ...
## $ LFQ.intensity.Parental_bR3  <dbl> 3852800, 54044000, 0, 5228100, 0, ...
## $ LFQ.intensity.Resistant_bR1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 8213400...
## $ LFQ.intensity.Resistant_bR2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 6903700...
## $ LFQ.intensity.Resistant_bR3 <dbl> 0, 0, 0, 0, 592430000, 5780200, 0,...
## $ LOG2.Parental_bR1           <dbl> -Inf, 25.60190, -Inf, -Inf, -Inf, ...
## $ LOG2.Parental_bR2           <dbl> -Inf, 25.56180, 20.76474, -Inf, -I...
## $ LOG2.Parental_bR3           <dbl> 21.87748, 25.68763, -Inf, 22.31786...
## $ LOG2.Resistant_bR1          <dbl> -Inf, -Inf, -Inf, -Inf, -Inf, -Inf...
## $ LOG2.Resistant_bR2          <dbl> -Inf, -Inf, -Inf, -Inf, -Inf, -Inf...
## $ LOG2.Resistant_bR3          <dbl> -Inf, -Inf, -Inf, -Inf, 29.14207, ...
## Data filtering function
filter_valids = function(df, conditions, min_count, at_least_one = FALSE) {
  # df = data frame containing LOG2 data for filtering and organized by data type
  # conditions = a character vector dictating the grouping
  # min_count = a numeric vector of the same length as "conditions" indicating the minimum 
  #     number of valid values for each condition for retention
  # at_least_one = TRUE means to keep the row if min_count is met for at least one condition
  #     FALSE means min_count must be met across all conditions for retention

  log2.names = grep("^LOG2", names(df), value = TRUE)   # Extract LOG2 column names
  cond.names = lapply(conditions, # Group column names by conditions
                      function(x) grep(x, log2.names, value = TRUE, perl = TRUE))

  cond.filter = sapply(1:length(cond.names), function(i) {
    df2 = df[cond.names[[i]]]   # Extract columns of interest
    df2 = as.matrix(df2)   # Cast as matrix for the following command
    sums = rowSums(is.finite(df2)) # count the number of valid values for each condition
    sums >= min_count[i]   # Calculates whether min_count requirement is met
  })

  if (at_least_one) {
    df$KEEP = apply(cond.filter, 1, any)
  } else {
    df$KEEP = apply(cond.filter, 1, all)
  }

  return(df)  # No rows are omitted, filter rules are listed in the KEEP column
}


## Apply filtering
df.F = filter_valids(df,
                     conditions = c("Parental", "Resistant"),
                     min_count = c(2, 2),
                     at_least_one = TRUE)

The output data frame df.F is a copy of df with an additional KEEP column indicating the rows to retain. We will complete the filtering using the following operation and then check out the first couple of rows.

df.F = filter(df.F, KEEP)
head(select(df.F, Gene, starts_with("LOG2")))
##    Gene LOG2.Parental_bR1 LOG2.Parental_bR2 LOG2.Parental_bR3
## 1 ESYT2          25.60190          25.56180          25.68763
## 2 EIF3C          26.93022          27.11644          26.83231
## 3 NACAM          27.71299          27.66756          27.53527
## 4 DX39A          25.90933          25.69806          25.93283
## 5  BACH          25.07153          25.39110          25.06027
## 6 MYO1C          27.16471          27.48416          27.43841
##   LOG2.Resistant_bR1 LOG2.Resistant_bR2 LOG2.Resistant_bR3
## 1               -Inf               -Inf               -Inf
## 2           26.29148           26.04087           26.46083
## 3           29.03570           28.68295           28.89753
## 4           26.39331           26.54022               -Inf
## 5               -Inf               -Inf               -Inf
## 6           27.61400           27.02263           27.44530

Notice that the protein in the first row is quantified in the Parental line but not the Resistant one. Proteins like this are of great interest to us as they are likely implicated in the mechanism of drug resistance. In addition, note that the final number of proteins after filtering (1,031) is roughly 60% of the original number (1,747). Filtering reduces our list of proteins to ones quantified in a reasonably consistent manner.
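These counts can be confirmed directly:

## Sanity check on the filtering results
nrow(df)    # 1747 proteins before filtering
nrow(df.F)  # 1031 proteins retained (~60%)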

Data Normalization

Before we proceed to imputation, we need to account for technical variability in the amount of sample analyzed by the mass spectrometer from one run to another. This is an issue parallel to the variation in sequencing depth in RNA-seq experiments. To normalize out these technical differences, we perform a global median normalization: for each sample, the median of the log2-transformed distribution is subtracted from all the values.

## Data normalization function
median_centering = function(df) {
  # df = data frame containing LOG2 columns for normalization
  LOG2.names = grep("^LOG2", names(df), value = TRUE)

  df[, LOG2.names] = lapply(LOG2.names, 
                            function(x) {
                              LOG2 = df[[x]]
                              LOG2[!is.finite(LOG2)] = NA   # Exclude missing values from median calculation
                              gMedian = median(LOG2, na.rm = TRUE)
                              LOG2 - gMedian
                            }
  )

  return(df)
}


## Normalize data
df.FN = median_centering(df.F)

The result is that each sample is centered at a log2(intensity) of 0.
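To verify the centering, we can recompute the per-sample medians, which should now all be (approximately) zero:

## Check that every LOG2 column is centered at zero after normalization
LOG2.names = grep("^LOG2", names(df.FN), value = TRUE)
sapply(df.FN[LOG2.names], median, na.rm = TRUE)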

Data Imputation

After filtering and normalization, some missing values remain. How do we deal with them from here? The statistical approach designed to answer such a question is called imputation. For a thorough discussion of imputation on proteomics data sets, I highly recommend this article in the Journal of Proteome Research.

Since missing values are associated with proteins with low levels of expression, we can substitute the missing values with numbers that are considered “small” in each sample. We can define this statistically by drawing from a normal distribution with a mean that is down-shifted from the sample mean and a standard deviation that is a fraction of the standard deviation of the sample distribution. Here's a function that implements this approach:

## Data imputation function
impute_data = function(df, width = 0.3, downshift = 1.8) {
  # df = data frame containing filtered, normalized LOG2 data
  # width = scaling factor for the standard deviation of the imputed distribution (relative to the sample sd)
  # downshift = number of sample standard deviations by which the mean of the imputed distribution is shifted down
  # Assumes missing data (in df) follows a narrowed and downshifted normal distribution

  LOG2.names = grep("^LOG2", names(df), value = TRUE)
  impute.names = sub("^LOG2", "impute", LOG2.names)

  # Create new columns indicating whether the values are imputed
  df[impute.names] = lapply(LOG2.names, function(x) !is.finite(df[, x]))

  # Imputation
  set.seed(1)
  df[LOG2.names] = lapply(LOG2.names,
                          function(x) {
                            temp = df[[x]]
                            temp[!is.finite(temp)] = NA

                            temp.sd = width * sd(temp[df$KEEP], na.rm = TRUE)   # shrink sd width
                            temp.mean = mean(temp[df$KEEP], na.rm = TRUE) - 
                              downshift * sd(temp[df$KEEP], na.rm = TRUE)   # shift mean of imputed values

                            n.missing = sum(is.na(temp))
                            temp[is.na(temp)] = rnorm(n.missing, mean = temp.mean, sd = temp.sd)                          
                            return(temp)
                          })
  return(df)
}


## Apply imputation
df.FNI = impute_data(df.FN)

Let's graphically evaluate the results by overlaying the distribution of the imputed values over the original distribution. In doing so, we observe that the number of missing values is greater in the resistant condition compared to the control. Furthermore, the missing values take on a narrow spread at the lower end of the sample distribution, which reflects our notion that low levels of protein expression produce missing data.
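A minimal sketch of one way to produce such an overlay, assuming the ggplot2 package is installed (plot_df is a helper data frame built here purely for illustration):

## Sketch: overlay imputed values on the observed distributions, one panel per sample
library(ggplot2)

LOG2.names = grep("^LOG2", names(df.FNI), value = TRUE)

# Reshape to long format: one row per protein/sample with its value and imputation flag
plot_df = do.call(rbind, lapply(LOG2.names, function(x) {
  data.frame(sample  = sub("^LOG2\\.", "", x),
             LOG2    = df.FNI[[x]],
             imputed = df.FNI[[sub("^LOG2", "impute", x)]])
}))

ggplot(plot_df, aes(x = LOG2, fill = imputed)) +
  geom_histogram(position = "identity", alpha = 0.6, bins = 40) +
  facet_wrap(~ sample) +
  theme_bw()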

Summary

This is the second of three tutorials on proteomics data analysis. I have described the approach to handling the missing value problem in proteomics.

In the final tutorial, we will compare protein expression between the drug-resistant and control lines. This involves performing a two-sample Welch's t-test on our data to extract proteins that are differentially expressed. Moreover, we will discuss ways to interpret the final output of a high-throughput proteomics experiment. Stay tuned for the revelation of proteins that may play a role in driving the resistance of tumor cells.
