April, week 3 literature reading: Platform-independent approach for cancer detection from gene expression profiles of peripheral blood cells
从外周血细胞基因表达谱检测癌症的与平台无关的方法
Abstract
-
Peripheral blood gene expression intensity-based methods for distinguishing healthy individuals from cancer patients are limited by sensitivity to batch effects and data normalization and variability between expression profiling assays.
基于外周血基因表达强度来区分健康个体和癌症患者的方法受限于对批次效应的敏感性以及表达谱分析之间的数据标准化和变异性。
-
To improve the robustness and precision of blood gene expression-based tumour detection, it is necessary to perform molecular diagnostic tests using a more stable approach.
为了提高基于血液基因表达的肿瘤检测的鲁棒性和准确性,有必要采用更稳定的方法进行分子诊断试验。
-
Taking breast cancer as an example, we propose a machine learning–based framework that distinguishes breast cancer patients from healthy subjects by pairwise rank transformation of gene expression intensity in each sample.
以乳腺癌为例,我们提出了一个基于机器学习的框架,通过对每个样本中基因表达强度的成对秩变换来区分乳腺癌患者和健康受试者。
-
We showed the diagnostic potential of the method by performing RNA-seq for 37 peripheral blood samples from breast cancer patients and by collecting RNA-seq data from healthy donors in Genotype-Tissue Expression project and microarray mRNA expression datasets in Gene Expression Omnibus.
通过在基因型组织表达项目中收集健康献血者的RNA-seq数据,在基因表达综合项目中收集微阵列mRNA表达数据集,对37例乳腺癌患者外周血标本进行RNA-seq分析,显示了该方法的诊断潜力。
-
The framework was insensitive to experimental batch effects and data normalization, and it can be simultaneously applied to new sample prediction.
该框架对实验批处理效果和数据归一化不敏感,可同时应用于新样本预测。
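(Note: data pre-processing before machine learning is a pairwise rank transformation of gene expression intensity. The resulting matrix is used for machine learning, which avoids sensitivity to batch effects, data normalization and variability between expression profiling assays. The framework is insensitive to experimental batch effects and data normalization and can also be applied to new sample prediction.)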
Introduction
-
Cancer is a systemic disease associated with the perturbation of blood homeostasis resulting in detectable alterations in gene expression in erythrocytes [13], circulating leukocytes [14, 15] and tumour-educated platelets [16] that have potential applicability to cancer diagnostics.
癌症是一种系统性疾病,与血液稳态的扰动有关,可检测到红细胞[13]、循环白细胞[14,15]和受肿瘤影响到的血小板[16]基因表达的改变,这些改变可能适用于癌症诊断。
-
Previous studies have compared gene expression profiles in blood cells from cancer patients and healthy controls either by the direct detection of differentially expressed genes or by a machine learning approach, such as the support vector machine (SVM).
以往的研究比较了癌症患者和健康对照者血细胞中的基因表达谱,方法要么是直接检测差异表达的基因,要么是采用机器学习方法,如支持向量机(SVM)。
-
However, current normalization approaches have limited capacity for batch-effect correction and may even distort biological signals [17, 18].
然而,目前的归一化方法对批量效应校正能力有限,甚至可能扭曲生物信号[17,18]。
Therefore, signature genes or models established in a specific study cannot be directly transferred to other datasets, which hinders the applicability of public data as well as model cross-validation.
因此,在特定研究中建立的签名基因或模型不能直接转移到其他数据集,阻碍了公共数据的适用性以及模型的交叉验证。
To overcome this problem, the use of gene expression order has been proposed as an alternative to signal intensity since it is more stable against outliers, batch effects and different normalization algorithms [19–21] for tissue in a particular state.
为了克服这一问题,人们提出用基因表达顺序来代替信号强度,因为它对异常值、批次效应和不同的归一化算法对特定状态的组织更稳定[19-21]。
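(Note: existing machine learning approaches identify differentially expressed genes; their normalization methods are limited, and so are the signature genes they establish.)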
Given its superiority to absolute quantification methods, we propose a rank-based machine learning method to distinguish breast cancer and healthy donor blood samples and to investigate its potential for blood-based companion diagnostics (Figure 1).
鉴于绝对量化方法的优越性,我们提出了一种基于秩的机器学习方法来区分乳腺癌和健康供体血液样本,并研究其作为基于血液的辅助诊断的潜力(图1)
-
Figure 1.
图1所示。
-
Schematic overview of the relative expression-based model for liquid biopsies.
基于相对表达的液体活检模型概述。
-
First, expression intensity from either microarray or RNA-seq is pre-processed for pairwise comparison between any of the two genes.
首先,从微阵列或RNA-seq的表达强度进行预处理,以便对两个基因中的任何一个进行两两比较。
-
Then, the relative expression values (0/1) are recorded as new variables, which are presented for dimensionality reduction.
然后,将相对表达式值(0/1)记录为表示降维的新变量。
-
Lastly, a prediction model is constructed based on the relative expression values.
最后,建立了基于相对表达式值的预测模型
Methods
Sample collection
Whole transcriptome sequencing and analysis
全转录组测序与分析
-
After total RNA extraction, mRNA libraries were constructed using the Illumina mRNA-Seq library preparation kit according to the manufacturer’s protocol and 2 × 150 bp paired-end runs were performed on a Novaseq System.
总RNA提取完成后,按照制造商协议使用Illumina mRNA- seq文库制备试剂盒构建mRNA文库,并在Novaseq系统上进行2×150 bp双端运行。
-
Sequencing quality analysis of the raw data was performed using FASTQC software (http://www.bioinformatics.babraham.ac.uk/projects/fastqc).
使用FASTQC软件对原始数据进行测序质量分析(http://www.bioinformatics.babraham.ac.uk/projects/fastqc)。
-
The human GRCh37 reference genome was downloaded from iGenome (http://ccb.jhu.edu/software/tophat/igenomes.shtml), and the associated .GTF files were downloaded from the Ensembl website (http://asia.ensembl.org/index.html).
人类GRCh37参考基因组从iGenome下载(http://ccb.jhu.edu/software/tophat/igenomes.shtml),相关的.GTF文件从Ensembl网站下载(http://asia.ensembl.org/index.html)。
-
Reads were then aligned to the reference genome using HISAT2 with default parameters, and the aligned reads were assembled and quantified by StringTie software [22].
然后使用默认参数HISAT2将读序列对齐到参考基因组,并使用StringTie软件[22]组装和量化对齐的读序列。
-
The gene expression was represented by the fragments per kilobase of exon model per million mapped reads value of each sample.
基因表达量以每千位外显子模型每百万图谱读值的片段数表示。
-
The RNA-seq raw data are deposited in the Genome Sequence Archive [8] with accession number PRJCA001108 (private link for review: http://bigd.big.ac.cn/gsa/s/qYELd91n).
RNA-seq原始数据保存在基因组序列存档[8]中,加入号为PRJCA001108(私有链接:http://bigd.big.ac.cn/gsa/s/qYELd91n)。
Data and pre-processing
-
Multiple gene expression datasets generated from the ABI (ABI Human Genome Survey Microarray v.2) and Affymetrix (Affymetrix Human Exon 1.0 ST Array) platforms, Illumina HiSeq 2500 and Broad Institute Human L1000 epsilon were downloaded from Gene Expression Omnibus (GEO) [23] (Table 1).
从gene expression Omnibus (GEO)[23]下载ABI (ABI Human Genome Survey Microarray v.2)和Affymetrix (Affymetrix Human Exon 1.0 ST Array)平台、Illumina HiSeq 2500和Broad Institute Human L1000 epsilon生成的多个基因表达数据集(表1)。
-
For microarray data, we used processed data from GEO.
对于微阵列数据,我们使用来自GEO的处理数据。
-
For RNA-seq data, GSE68086 read counts were normalized to reads per kilobase of transcript per million mapped reads (RPKM) values and GSE92743 RPKM values were downloaded from the GTEx website.
对于RNA-seq数据,GSE68086读计数被规范化为每千位副本每百万映射读(RPKM)值的读,GSE92743 RPKM值从GTEx网站下载。
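The RPKM normalization mentioned above can be illustrated with a minimal sketch. It uses the standard RPKM definition (reads × 10^9 / (transcript length in bp × total mapped reads)); the function name `rpkm` and the toy numbers are illustrative assumptions, not values from the paper.

```python
import numpy as np

def rpkm(counts, gene_lengths_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(gene_lengths_bp, dtype=float)
    # 1e9 = 1e3 (per kilobase) * 1e6 (per million mapped reads)
    return counts * 1e9 / (lengths * total_mapped_reads)

# Toy example (hypothetical counts and lengths)
values = rpkm([500, 1500], [1000, 3000], total_mapped_reads=1_000_000)
```

A gene three times as long with three times as many reads gets the same RPKM, which is the point of the length normalization.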
Transformation of gene expression intensity to rank information
基因表达强度向等级信息的转化
-
For each sample, we carried out pairwise comparisons of expression values of all genes.
对于每个样本,我们对所有基因的表达值进行两两比较。
-
For each gene pair (Gi, Gj), the rank comparison, denoted as Gij, is 1 (Gi > Gj) or 0 (Gi ≤ Gj).
对于每个基因对(Gi, Gj),秩比为1 (Gi > Gj)或0(Gi≤Gj),表示为Gij。
The transformation therefore yields n(n − 1)/2 binary features per sample, where n is the total number of genes expressed in a sample.
其中n为样本中表达的基因总数.
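The pairwise rank transformation can be sketched in a few lines of NumPy. This is a minimal illustration, assuming an expression matrix with samples in rows and genes in columns; the function name `pairwise_rank_features` is hypothetical and this is not the authors' implementation.

```python
import numpy as np

def pairwise_rank_features(expr):
    """Map an (n_samples, n_genes) matrix to 0/1 features, one per gene pair (i < j)."""
    expr = np.asarray(expr, dtype=float)
    i_idx, j_idx = np.triu_indices(expr.shape[1], k=1)  # all pairs with i < j
    # G_ij = 1 if G_i > G_j, otherwise 0 (ties map to 0)
    features = (expr[:, i_idx] > expr[:, j_idx]).astype(np.uint8)
    return features, list(zip(i_idx.tolist(), j_idx.tolist()))

sample = np.array([[5.0, 2.0, 2.0, 9.0]])
features, pairs = pairwise_rank_features(sample)
# A monotone rescaling of the intensities leaves the rank features unchanged
rescaled, _ = pairwise_rank_features(sample * 100 + 7)
```

Because only the within-sample ordering matters, any monotone change in scale (as in the `rescaled` example) produces identical features, which is the property that makes the representation robust to normalization.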
Dimension reduction
We first treated the rank values (Gij) as univariate features and selected the highest scoring percentage of features according to the one-way analysis of variance F-value.
我们首先将秩值(Gij)作为单变量特征,根据单因素方差分析f值,选取特征得分最高的百分比
Univariate feature selection with the F-test examines each feature individually to determine the strength of the relationship of the feature with the target class.
使用F-test进行单变量特征选择,分别检查每个特征,以确定特征与目标类之间的关系的强度。
The advantages of this feature selection method are that it easily scales to very high-dimensional datasets and that it is computationally simple and fast.
这种特征选择方法的优点是易于扩展到非常高维的数据集,并且计算简单和快速。
However, in this method, each feature is considered separately, thereby ignoring feature dependencies, which may lead to worse classification performance.
但是,在这种方法中,每个特征都是单独考虑的,因此忽略了特征之间的依赖关系,这可能会导致更差的分类性能。
Therefore, we used it to pre-reduce the search space and then applied more complicated feature selection methods in the next two steps to select stronger features.
因此,在接下来的两个步骤中,我们使用它来预先减少搜索空间,然后应用更复杂的特征选择方法来选择更强的特征。
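The univariate F-test filter can be sketched with scikit-learn's `SelectPercentile` and `f_classif`. The synthetic data and the 5% cut-off are assumptions chosen for illustration; the notes do not state the percentage the authors used.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 20)
X = rng.integers(0, 2, size=(40, 200))  # synthetic 0/1 rank features
X[:, 0] = y                             # make feature 0 informative...
X[0, 0] = 1                             # ...but not perfectly so

# Keep the top 5% of features by ANOVA F-value (percentile is an assumption)
selector = SelectPercentile(score_func=f_classif, percentile=5)
X_reduced = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)
```

Each feature is scored on its own here, so this step is cheap even with millions of gene pairs, at the cost of ignoring dependencies between features.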
We then used ElasticNet to further select important features by taking advantage of L1 and L2 regularization.
然后利用L1和L2正则化,利用ElasticNet进一步选择重要的特征。
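The objective function is
$$\min_{\beta_0,\,\beta}\ \frac{1}{2N}\sum_{i=1}^{N}\left(y_i-\beta_0-x_i^{\top}\beta\right)^2+\lambda P_\alpha(\beta),$$
where (x_i, y_i), i = 1, ..., N, are the training samples and β_0 is the intercept.
目标函数是
(Note: univariate feature selection with the F-test examines each feature individually to determine the strength of its relationship with the target class, which reduces the model search space. An objective function is then defined for model selection, and more complicated feature selection methods are used to pick stronger features.)
The function
$$P_\alpha(\beta)=(1-\alpha)\,\frac{1}{2}\lVert\beta\rVert_2^2+\alpha\lVert\beta\rVert_1$$
is the elastic net penalty, which is a convex combination of the lasso and ridge penalties.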
函数是弹性网罚,它是 lasso 和ridge的凸组合。
For all α ∈ (0, 1], the elastic net penalty function is singular (without first derivative) at 0, and it is strictly convex for all α < 1; for 0 < α < 1 it therefore has the characteristics of both lasso and ridge regression.
对于所有α∈[0,1],elastic net penalty 函数在0是单数(没有一阶导数)在0和它是严格凸 对α> 0,因此 具有lasso 和ridge 的特点。
For α = 1 and α = 0, the penalty is L1 and L2, respectively.
α= 1和α= 0,分别处罚是L1和L2。
β is the coefficient vector, which is estimated by minimizing the objective function.
β系数是一个向量估计通过最小化目标函数。
The parameter λ is a fixed non-negative constant that multiplies the penalty terms.
参数λ是一个固定的非负常数,繁殖的惩罚条款。
We optimized parameters λ and α by using 10-fold cross-validation on the training data.
我们优化参数λ和α通过使用10倍交叉验证训练数据。
Then, we collected features whose coefficients were not zero.
然后,我们收集系数不为零的特征。
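The ElasticNet step can be sketched with scikit-learn's `ElasticNetCV`. This is an illustration on synthetic data under stated assumptions: in scikit-learn's naming, `l1_ratio` plays the role of α in the text and `alpha` plays the role of λ, and the candidate `l1_ratio` grid below is hypothetical.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
y = np.array([0, 1] * 30)
X = rng.integers(0, 2, size=(60, 30)).astype(float)  # synthetic 0/1 rank features
X[:, 0] = y                # informative feature
X[:, 1] = 1 - y            # informative feature with the opposite sign
X[:5, 0] = 1 - X[:5, 0]    # add a little label noise to feature 0

# 10-fold cross-validation over the penalty strength and the L1/L2 mix
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=10, random_state=0)
model.fit(X, y)

# Keep only the features whose coefficients are not zero
selected = np.flatnonzero(model.coef_)
```

The L1 part of the penalty drives uninformative coefficients to exactly zero, so `selected` is typically much shorter than the input feature list.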
Finally, we used a randomized logistic regression method for the selection of stable features, which is suitable for classification tasks, especially in a case where feature selection or model selection is unstable due to a high dimensionality.
最后,我们使用随机逻辑回归方法来选择稳定的特征,这种方法适用于分类任务,特别是在特征选择或模型选择由于高维数而不稳定的情况下。
The method is a stability selection technique, which works by fitting the L1-penalized logistic regression model hundreds of times with perturbed data (75% subsampling and randomized regularization coefficient for each variable).
该方法是一种稳定性选择技术,其工作原理是用扰动数据(75%的子抽样和每个变量的随机正则化系数)对 L1-penalized logistic 回归模型进行数百次拟合。
Consider data (Xi, Yi), i = 1, ..., N, with univariate response variable Y and p-dimensional covariates X. The model is defined
考虑数据(Xi,Yi), i = 1,…, N,以单变量响应变量Y和p维协变量x定义模型
f(z) = log(1 + exp(−z)) is the logistic regression model, where b ∈ {−1, 1}, w is the coefficient vector and v is the intercept.
f(z) = log(1 + exp(-z))为logistic回归模型。其中,b∈{- 1,1},w为系数向量,v为截距。
The parameter λ is a regularization parameter, and β is estimated by minimizing this objective function.
参数λ是一个正则化参数,β估计通过最小化目标函数。
Randomized logistic regression assigns high scores to features that are repeatedly selected across randomizations.
随机逻辑回归给在随机化过程中反复选择的特征打分。
The more times a feature is selected, the more likely it is to be a stable variable.
一个特性被选择的次数越多,它就越有可能是一个稳定的变量。
The method requires a number of fits to subsamples of the data set and is, as such, much more computationally demanding.
该方法需要对数据集的子样本进行多次拟合,因此对计算的要求要高得多。
However, there were few features left after the previous two steps of filtering;
然而,在前两步过滤之后,剩下的特性很少;
therefore, it did not take much time for the stable
因此,它没有花太多的时间为稳定
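(Note: the dimension-reduction process relies on minimizing objective functions. It goes through three steps: feature selection? stable feature selection? The result is a ranked set of features.)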
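The stability selection idea (repeated L1-penalized logistic fits on 75% subsamples with randomized per-feature penalties) can be sketched by hand. scikit-learn 0.18 shipped a `RandomizedLogisticRegression` class for this, but it was removed in later releases, so the loop below is a hand-rolled approximation and not the authors' code; the function name `stability_scores` and the defaults for `C` and `scaling` are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stability_scores(X, y, n_rounds=100, sample_fraction=0.75,
                     C=1.0, scaling=0.5, seed=0):
    """Fraction of perturbed L1-logistic fits in which each feature is selected."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    counts = np.zeros(n_features)
    for _ in range(n_rounds):
        # Perturbation 1: draw a 75% subsample without replacement
        rows = rng.choice(n_samples, size=int(sample_fraction * n_samples),
                          replace=False)
        # Perturbation 2: randomly shrink some features, which acts like a
        # randomized per-feature L1 penalty
        weights = rng.choice([scaling, 1.0], size=n_features)
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(X[rows] * weights, y[rows])
        counts += clf.coef_.ravel() != 0
    return counts / n_rounds

rng = np.random.default_rng(1)
y = np.array([0, 1] * 30)
X = rng.integers(0, 2, size=(60, 8)).astype(float)
X[:, 0] = y
X[:5, 0] = 1 - X[:5, 0]  # slightly noisy informative feature
scores = stability_scores(X, y, n_rounds=50)
```

Features with scores close to 1 are selected in nearly every perturbed fit and can be treated as stable; a cut-off on `scores` then gives the final feature set.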
Model selection
Rank-based features were input to predict the status of cases whose value was 1 (cancer) or 0 (healthy).
输入基于等级的特征来预测值为1(癌症)或0(健康)的病例的状态。
We used stochastic gradient descent (SGD), random forest (RF), SVM, logistic regression (LR) and Gaussian Naive Bayes algorithms in the scikit-learn package (0.18.1) to construct classifiers, GridSearchCV to adjust the hyper-parameters and 10-fold cross-validation to construct models.
我们在scikit-learn包中使用了stoch梯度下降(SGD)、随机森林(RF)、SVM、逻辑回归(LR)和高斯朴素贝叶斯算法(0.18.1)来构造分类器;GridSearchCV包可以调整超参数;,并进行10次交叉验证来构建模型。
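The model selection step can be sketched as a loop over candidate classifiers, each tuned with `GridSearchCV` and 10-fold cross-validation. The data are synthetic and the hyper-parameter grids are illustrative assumptions; the notes do not list the grids the authors searched.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
y = np.array([0, 1] * 30)               # 1 = cancer, 0 = healthy
X = rng.integers(0, 2, size=(60, 6))    # synthetic 0/1 rank features
X[:, 0] = y                             # one informative feature

# Candidate models with hypothetical hyper-parameter grids
candidates = {
    "LR": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "SVM": (SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
    "RF": (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=10, scoring="accuracy")
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)
```

Comparing `results` across models mirrors the way the paper reports which algorithm performed best on each training set.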
Performance evaluation
-
To classify cancer and normal samples, the sensitivity, specificity and area under the receiver operating characteristic curve (AUC) of the classifier were estimated at different false discovery rate control levels.
为了对癌症和正常样本进行分类,在不同的错误发现率控制水平下,估计分类器在接收机工作特性曲线(AUC)下的灵敏度、特异性和面积
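The evaluation metrics can be computed from a confusion matrix and the predicted scores. The labels and scores below are toy values invented for illustration, and the 0.5 decision threshold is an assumption.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 1 = cancer, 0 = healthy
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1]
y_pred = [int(s >= 0.5) for s in y_score]  # threshold is an assumption

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
sensitivity = tp / (tp + fn)               # fraction of cancer samples detected
specificity = tn / (tn + fp)               # fraction of healthy samples cleared
accuracy = (tp + tn) / (tp + tn + fp + fn)
auc = roc_auc_score(y_true, y_score)       # threshold-free ranking quality
```

Sensitivity, specificity and accuracy depend on the chosen threshold, whereas the AUC summarizes performance over all thresholds.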
Results
The schematic of breast cancer liquid biopsies using PBC gene expression data
使用PBC基因表达数据的乳腺癌液体活检示意图
-
We first used two types of single datasets (GSE68086 and GSE16443, which are RNA-seq and microarray data, respectively) to evaluate whether the rank-based model has the power to distinguish cancer and healthy subjects.
我们首先使用了两种类型的单数据集(GSE68086和GSE16443,分别是RNA-seq和微阵列数据)来评估基于排名的模型是否具有区分癌症和健康受试者的能力。
-
Then, we validated the strategy on intra- and inter-platform datasets.
然后,我们在平台内和平台间数据集上验证了该策略。
-
Finally, we integrated much more microarray data in the model construction process and tested the applicability of the model in both microarray and RNA-seq data.
最后,我们在模型构建过程中集成了更多的微阵列数据,并测试了该模型在微阵列和RNA-seq数据中的适用性。
-
For the validation process, we performed peripheral blood RNA-seq for 37 breast cancer patients and collected other public data (normal peripheral blood RNA-seq data from GTEx, normal and cancer peripheral blood microarray data from GSE11545 and GSE47862) (Figure 2).
在验证过程中,我们对37例乳腺癌患者进行了外周血RNA-seq检测,并收集了其他公共数据(GTEx正常外周血RNA-seq数据,GSE11545和GSE47862正常和癌症外周血微阵列数据)(图2)
-
Figure 2.
图2。
-
Overview of the study design.
研究设计概述。
-
We first used single datasets to confirm that a rank-based model has the power to distinguish cancer and healthy subjects.
我们首先使用单一数据集来确认基于排名的模型具有区分癌症和健康受试者的能力。
-
Then, we validated the strategy on intra- and inter-platform datasets.
然后,我们在平台内和平台间数据集上验证了该策略。
-
Finally, we concluded that the prediction performance was improved by including independent datasets and that a microarray-originated model can be applied to RNA-seq data.
最后,我们得出结论,包含独立的数据集可以提高预测性能,一个微阵列模型可以适用于RNA-seq数据.
Rank-based features do not reduce the classification power of cancer patients versus healthy control subjects
基于等级的特征并不会降低癌症患者与健康对照组的分类能力
-
To facilitate the comparison of datasets from different platforms and/or batches, we transformed gene expression intensities to relative information according to any two of the expressed genes.
为了便于比较来自不同平台和/或批次的数据集,我们根据任何两个表达的基因将基因表达强度转换为相关信息。
Since there was undoubtedly some information loss during the transformation, we selected two datasets to test the classification power of the transformed data, including gene expression profiles from PBCs in healthy donors and breast cancer patients detected by microarray (GSE16443) and platelet cell expression profiles from breast cancer patients and healthy subjects detected by RNA-seq (GSE68086).
自无疑是有信息丢失在转换的过程中,我们选择了两个数据集测试转换后的数据的分类能力,包括基因表达谱从健康的捐赠者的外周血和乳腺癌患者检测到微阵列(GSE16443)和血小板细胞表达谱从乳腺癌患者和健康受试者被RNA-seq (GSE68086)。
-
For each dataset, 80% of samples were used for training and the remaining 20% were used for validation.
对于每个数据集,80%的样本用于培训,其余20%用于验证。
After transforming the expression intensity of microarray data to pairwise gene order information, we obtained 22 885 995 features, of which 6 were retained after a three-step feature reduction approach.
将微阵列数据的表达强度转化为两两基因序列信息,得到22 885 995个特征,其中6个特征经过三步特征约简后保留。
For RNA-seq data, we selected genes with RPKM > 1 in more than 95% of the samples.
对于RNA-seq数据,我们在95%以上的样本中选择了RPKM > 1基因。
The RPKM value was then transformed to obtain pairwise gene order information.
然后将RPKM值转化为成对的基因序列信息。
After the transformation, there were 1 103 355 features;
转化后,共有1103 355个特征;
among them, 16 were retained after a three-step feature reduction.
其中16个是经过三步特征约简后保留下来的。
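The two feature counts are consistent with one feature per unordered gene pair, n(n − 1)/2. The short check below solves for n; the gene counts it recovers (6766 and 1486) are derived here from the reported feature counts and are not stated in the text, and the helper name `n_genes_from_pairs` is hypothetical.

```python
import math

def n_genes_from_pairs(n_pairs):
    """Solve n(n-1)/2 = n_pairs for a whole number of genes n, else return None."""
    n = (1 + math.isqrt(1 + 8 * n_pairs)) // 2
    return n if n * (n - 1) // 2 == n_pairs else None

microarray_genes = n_genes_from_pairs(22_885_995)  # GSE16443 feature count
rnaseq_genes = n_genes_from_pairs(1_103_355)       # GSE68086 feature count
```

Both counts resolve to whole numbers of genes, which supports reading the features as all pairwise comparisons among the retained genes.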
A comparison of different models showed that the RF model had better predictive power in these two datasets (Supplementary Figure S1).
不同模型的比较表明,RF模型在这两个数据集中具有更好的预测能力(补充图S1)。随机森林(RF)
The model distinguished cancer and healthy samples with an accuracy of 80.77% and 94.74% in GSE16443 and GSE68086, respectively (Figure 3A and C).
该模型对GSE16443和GSE68086中癌症和健康样本的区分准确率分别为80.77%和94.74%(图3A和c)。
-
The AUC reached 0.87 and 0.97 (Figure 3B and D), which is better than the intensity-based classification method [16, 24] and is significantly higher than the value for the randomly selected 6 and 16 features (Supplementary Figure S2).
AUC分别达到0.87和0.97(图2B和D),优于基于强度的分类方法[16,24],显著高于随机选取的6和16个特征值(补充图S2)。
The performance of the prediction models remains stable when changing the rank-based features (Supplementary Figure S3).
当改变基于排名的特性时,预测模型的性能保持稳定(补充图S3)。
We also tried to use fold change (FC) to describe the difference between genes, and we found that the FC follows an approximate negative binomial distribution in RNA-seq and a normal distribution in microarray data (Supplementary Figure S4), which is not suited for integration and training.
我们还尝试使用fold change (FC)来描述基因之间的差异,我们发现FC在RNA-seq中近似负二项分布,在微阵列数据中服从正态分布(补充图S4),不适合整合和训练。
Performance of the model in cross-validations between intra-platform datasets from different batches
模型在来自不同批次的平台内数据集之间的交叉验证中的性能
-
To test whether the rank-based model was sensitive to differences in laboratory conditions, reagent lots and personnel, we used the GSE16443 dataset as a training cohort (n = 130) and GSE11545 as an independent validation cohort.
为了检验基于秩的模型是否对实验室条件、试剂批次和人员的差异敏感,我们使用GSE16443数据集作为训练队列(n = 130), GSE11545作为独立验证队列。
-
The LR and SVM models produced the same performance.
LR和SVMmodel产生了相同的性能。
-
Both of them outperformed the RF, linear classifiers with SGD training and Gaussian Naive Bayes models, and they completely discriminated cancer from healthy samples.
这两种方法都优于RF、经过SGD训练的线性分类器、高斯朴素贝叶斯模型,并且完全从健康样本中描述了癌症。
-
They performed well when applied to an independent validation cohort (GSE11545), with a sensitivity of 81.82%, specificity of 66.67%, accuracy of 75.00% (Figure 4A) and an AUC of 0.80 (Figure 4B).
当应用于独立验证队列(GSE11545)时,它们表现良好,灵敏度为81.82%,特异性为66.67%,准确度为75.00%(图4A), AUC 为 0.80(图4B)。
-
To test whether there are batch effects between these two datasets, we randomly selected normal and cancer samples from them, normalized them using z-scores and performed correlation analysis.
为了检验这两个数据集之间是否存在批量效应,我们随机选取了正常样本和癌症样本,使用z分数进行归一化,并进行相关分析。
-
The result showed that the expression consistency between the two datasets was very poor (Supplementary Figure S5).
结果表明,两个数据集之间的表达式一致性非常差(补充图S5)。
Furthermore, we compared the performance of our method with the two most frequently used rank-based methods, the top scoring pair (TSP) [20] and k-top scoring pairs (k-TSP) [21], in selecting top gene pairs and distinguishing cancer and healthy subjects in an independent study.
此外,在一项独立研究中,我们还将我们的方法与两种最常用的基于排名的方法(top score pair (TSP)[20]和k-top score pair (k-TSP)[21])的性能进行了比较,这两种方法在选择顶级基因对以及区分癌症和健康受试者方面的表现最为显著。
We used the GSE16443 data as a training cohort to build classifiers and GSE11545 data as an independent validation cohort.
我们使用GSE16443数据作为训练队列构建分类器,GSE11545数据作为独立的验证队列。
The optimal value of k, determined by 5-fold cross-validation, was 5, representing the five gene pairs that achieved the top scores.
k的最优值是通过5倍交叉验证确定的,代表5对基因获得最高分。
Two of the five gene pairs were replaced with the median value because LRRC37B and SRSF2 were not detected in the GSE11545 data.
GSE11545数据中未检测到LRRC37B和SRSF2, 5对基因中有2对被中值替代。
However, the k-TSP-based model showed poor performance in predicting GSE11545 subjects, with a sensitivity of 100% and a specificity of 0%.
然而,基于k- tsp的模型在预测GSE11545受试者时表现较差,敏感性为100%,特异性为0%。
When k = 1, the k-TSP algorithm is referred to simply as TSP.
当k = 1时,k-TSP算法简称为TSP。
We obtained the same performance as k-TSP when we used any one of the three gene pairs to validate the model in the GSE11545 data.
当我们在GSE11545数据中使用这三对基因中的任何一对来验证模型时,我们得到了与k- TSP相同的性能。
Looking into the gene pairs in detail, we found that the expression of ACBD6 is higher than that of RPL37, MARS is higher than NGLY1 and NGLY1 is higher than GMFG in healthy subjects in the training cohort (GSE16443) (Supplementary Figure S6A, C and E);
为了更详细的研究基因对,我们发现在训练队列中健康受试者ACBD6的表达高于RPL37, MARS高于NGLY1, NGLY1高于GMFG (GSE16443)(补充图S6A, C, E);
however, the tendency did not exist in the validation cohort (GSE11545) (Supplementary Figure S6B, D and F).
然而,在验证队列中不存在这种趋势(GSE11545)(补充图S6B, D和F)。
(Note: performance comparison with other methods.)
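For reference, the top scoring pair (TSP) idea that the method is compared against can be sketched as follows: each gene pair is scored by how much the frequency of "gene i < gene j" differs between the two classes, and the pair with the largest difference is kept. This is a simplified illustration of the published TSP score without tie-breaking rules; the function name and the toy data are assumptions.

```python
from itertools import combinations

import numpy as np

def top_scoring_pair(expr, labels):
    """Return the gene pair (i, j) maximizing |P(Gi < Gj | class 1) - P(Gi < Gj | class 0)|."""
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    best_pair, best_score = None, -1.0
    for i, j in combinations(range(expr.shape[1]), 2):
        less = expr[:, i] < expr[:, j]
        score = abs(less[labels == 1].mean() - less[labels == 0].mean())
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair, best_score

# Toy data: genes 0 and 1 swap order between healthy (0) and cancer (1) samples
expr = [[5, 1, 3], [6, 2, 0], [1, 5, 3], [2, 6, 0]]
labels = [0, 0, 1, 1]
pair, score = top_scoring_pair(expr, labels)
```

A classifier built on a single such pair depends entirely on that ordering holding in new data, which is the failure mode described above for the validation cohort.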
Performance of the model in cross-validations between inter-platform datasets
模型在跨平台数据集之间的交叉验证中的性能
To determine whether the rank-based model could be used to predict datasets from different expression quantification platforms, we combined datasets from ABI Human Genome Survey Microarray v.2 (GSE16443) and Affymetrix Human Exon 1.0 ST Array (GSE47862).
为了确定基于排名的模型是否可以用于预测不同表达量化平台的数据集,我们结合了ABI人类基因组调查微阵列v.2 (GSE16443)和Affymetrix Human Exon 1.0 ST Array (GSE47862)。
To equalize sample numbers across platforms, we randomly selected 50% (n = 160) of GSE47862 samples and 80% (n = 104) of GSE16443 samples as the training set, with the remaining samples and an independent dataset from ABI microarray platform (GSE11545) constituting the validation set. A total of 108 features were retained after the dimension-reduction step.
平衡样本数据跨平台,我们随机选择GSE47862样本的50% (n = 160)和80% (n = 104) GSE16443样本作为训练集,其余的样品和一个独立的数据集从ABI微阵列平台(GSE11545)组成验证集。108特征被保留在降维后的一步。
The SVM model outperformed the others based on the training set. Validation was subsequently performed with independent validation cohorts that were not involved in feature selection or model training.
SVM模型的性能优于基于训练集的其他模型。随后,使用不涉及特征选择或模型训练的独立验证队列进行验证。
In GSE16443 (n = 26 samples), sensitivity, specificity and accuracy were 80.00%, 81.82% and 80.77%, respectively (Figure 5A), with an AUC of 0.88 (Figure 5B), which is better than the model from the single ABI microarray platform (Figure 3A and B). In GSE47862 (n = 161 samples), sensitivity, specificity and accuracy were 70.59%, 80.26% and 75.16%, respectively (Figure 5C), with an AUC of 0.84 (Figure 5D).
GSE16443 (n = 26)样品,敏感性、特异性和准确性分别为80.00%、81.82%和80.77%,分别为(图5),AUC为0.88 B(图5),这比单一ABI微阵列平台的模型好(图3 a和B)。在GSE47862样本(n = 161)、敏感性、特异性和准确性分别为70.59%、80.26%和75.16%,(图5 c), AUC为0.84(图5 d)。
-
In contrast, random classifiers generated from multiple rounds of random selection of gene pairs during the SVM training process had no predictive power.
相比之下,SVM训练过程中多轮随机选择基因对生成的随机分类器没有预测能力。
-
In GSE11545 (n = 20 samples), sensitivity, specificity and accuracy were 81.82%, 77.78% and 80.00%, respectively (Figure 5E), with an AUC of 0.84 (Figure 5F).
在GSE11545 (n = 20个样本)中,敏感性为81.82%,特异性为77.78%,准确性为80.00%(图5E), AUC为0.84(图5F)。
(Note: training and validation on cross-platform datasets.)
Improvement in the predictive performance of the model by integrating multicentre data
通过集成多中心数据提高模型的预测性能
-
We integrated a larger number of PBC gene expression datasets from different platforms to assess whether the algorithm could improve the predictive performance of an independent dataset in cancer detection.
我们整合了大量来自不同平台的PBC基因表达数据集,以评估该算法能否提高独立数据集在癌症检测中的预测性能。
-
Due to the differences in sample size between ABI Human Genome Survey Microarray and Affymetrix Human Exon Array training sets, we selected 100% of GSE16443 samples (n = 130) and 50% of GSE47862 samples (n = 160) to avoid bias introduced by platforms.
由于ABI人类基因组调查微阵列与Affymetrix人类外显子阵列训练集样本量存在差异,为了避免平台引入偏差,我们选择了100%的GSE16443样本(n = 130)和50%的GSE47862样本(n = 160)。
-
A total of 52 features (Supplementary Table S1) were selected after a three-step dimension reduction.
经过三步降维,共选择52个特征(补充表S1)。
-
The RF model outperformed other algorithms.
RF模型优于其他算法。
-
The model performed well when applied to an independent validation cohort (GSE11545) with a sensitivity of 90.91%, specificity of 100%, accuracy of 95% (Figure 6A) and AUC of 0.93 (Figure 6B), which is better than the models trained with less data in Figures 4 and 5.
将该模型应用于独立验证队列(GSE11545)时,其敏感性为90.91%,特异性为100%,准确率为95%(图6A), AUC为0.93(图6B),优于训练数据较少的模型(图4和图5)。
-
In the remaining 50% of GSE47862 (n = 161), sensitivity, specificity and accuracy were 76.47%, 80.26% and 78.26%, respectively (Figure 6C), with an AUC of 0.84 (Figure 6D).
GSE47862其余50% (n = 161)的灵敏度、特异性和准确性分别为76.47%、80.26%和78.26%(图6C), AUC为0.84(图6D)。
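(Note: the comparison proceeds step by step, from within a platform, to between platforms, to integration across platforms, removing discrepant data and finally improving the accuracy of the algorithm.)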
Application of the model generated from multiple microarray platforms to RNA-seq samples
将多微阵列平台生成的模型应用于RNA-seq样本
-
To further test the generalizability of the model generated from ABI Human Genome Survey Microarray and Affymetrix Human Exon Array expression data, we applied the RF model to RNA-seq data.
为了进一步验证ABI人类基因组调查微阵列和Affymetrix人类外显子阵列表达数据生成的模型的通用性,我们将RF模型应用于RNA-seq数据。
-
From the GTEx project, we selected the whole blood expression dataset from females whose RNA samples were prepared with a PAXgene kit and obtained 137 healthy samples.
在GTEx项目中,我们选取了使用PAXgene试剂盒制备RNA样本的女性的全血表达数据集,得到137个健康样本。
-
For the breast cancer samples, we performed RNA-seq for peripheral blood from 37 patients.
对于乳腺癌样本,我们对37例患者的外周血进行了RNA-seq检测。
-
In the validation process, we used the mode to represent the value of LOC100128076 MT1X because it exists in neither the GTEx data nor our RNA-seq data.
在验证过程中,我们使用mode来表示LOC100128076 MT1X的值,因为它既不存在于GTEx中,也不存在于我们的RNA-seq数据中。
-
In the combined RNA-seq data, sensitivity, specificity and accuracy were 83.78%, 67.15% and 70.69%, respectively (Figure 7A), with an AUC of 0.80 (Figure 7B).
结合RNA-seq数据,敏感性为83.78%,特异性为67.15%,准确性为70.69%(图7A), AUC为0.80(图7B).
(Note: problems were found when analysing the specificity results, so a separate, better-suited model was used for prediction.)
Detection of prognostic markers in tumour tissue based on surrogate blood mRNA profiles
基于代用血mRNA谱检测肿瘤组织的预后标志物
-
For the 52 gene pairs (92 unique genes) identified with the RF model, we obtained the expression profiles of individual genes from The Cancer Genome Atlas (TCGA) breast cancer tissue samples (1091 cases).
对于RF模型鉴定的52对基因(92个独特基因),我们从TCGA乳腺癌组织样本(1091例)中获得了单个基因的表达谱。
-
A multivariate analysis showed that 18 gene pairs were significantly associated with breast cancer patient survival (Figure 8). We randomly selected 92 genes and generated 52 gene pairs, repeated the process 60 times and found that the number of gene pairs that were significantly associated with patient survival approximately followed a Poisson distribution (λ = 2), under which the probability is <<0.001 for k ≥ 18.
多元分析表明,18基因对与乳腺癌患者存活率显著相关(图8)。我们随机选择92个基因和生成的52个基因对,我们把这一过程重复60倍,发现基因对的数量明显与患者生存大约遵循泊松分布(λ= 2),和概率是< < 0.001 k≥18
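The tail probability quoted above can be checked directly from the Poisson probability mass function. This is a small arithmetic sketch that uses only the stated λ = 2 and k ≥ 18; the function name and the truncation of the infinite sum at 100 terms are assumptions that do not affect the result at this precision.

```python
import math

def poisson_tail(k_min, lam, n_terms=100):
    """P(K >= k_min) for K ~ Poisson(lam), summing the first n_terms of the tail."""
    return sum(
        math.exp(-lam) * lam**k / math.factorial(k)
        for k in range(k_min, k_min + n_terms)
    )

p_tail = poisson_tail(18, 2.0)
```

The result is many orders of magnitude below 0.001, so observing 18 survival-associated pairs is very unlikely if the pairs behaved like randomly chosen genes.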
Web service implementation
-
To facilitate the implementation of the prediction model, we constructed a web interface that can be accessed at http://bigd.big.ac.cn/rankDetect.
为了方便预测模型的实现,我们构建了一个web接口,该接口可以通过http://bigd.big.ac.cn/rankDetect访问。
-
Users can simply upload their expression matrix file, and the server will give the predicted status (healthy/cancer) of each sample.
用户只需上传他们的表达式矩阵文件,服务器就会给出每个样本的预测状态(健康/癌症)。
-
In addition, we hope the users will help refine the model by submitting the actual labels (if known) of their samples on the results page.
此外,我们希望用户通过在结果页面上提交他们的示例的实际标签(如果已知)来帮助改进模型。
Discussion
-
Blood-based liquid biopsies are a non-invasive and accessible method for cancer diagnosis, therapeutic decision-making, prognostic determination and monitoring of clinical progression and treatment response [25, 26].
基于血液的液体活检是一种非侵入性和可获得的方法,用于癌症诊断、治疗决策、预后判断和监测临床进展和治疗反应[25,26]。
-
To date, there are no expression-based methods that can extract multicentre data for cancer diagnostics [27, 28], hindering early cancer detection.
迄今为止,还没有一种基于表达的方法可以提取多中心数据用于癌症诊断[27,28],阻碍了早期癌症检测。
-
In this study, we propose a machine learning–based method that distinguishes breast cancer patients from healthy individuals based on pairwise rank transformation of gene expression intensity in each sample.
在本研究中,我们提出了一种基于机器学习的方法,基于每个样本中基因表达强度的两两秩变换来区分乳腺癌患者和健康个体。
-
The rank-based self-learning model can offer valuable information for breast cancer diagnosis and is insensitive to batch effects and data normalization.
基于秩的自学习模型可以为乳腺癌的诊断提供有价值的信息,对批量效应和数据归一化不敏感。
-
Our results suggest that a relative ordering-based method can make direct use of microarray data from different sources, thereby expediting research on human diseases.
我们的研究结果表明,基于相对排序的方法可以直接利用来自不同来源的微阵列数据,从而加快对人类疾病的研究。
-
Furthermore, the model trained from microarray data can also be applied to RNA-seq data, highlighting the clinical relevance of relative gene expression levels in blood.
此外,从微阵列数据训练的模型也可以应用于RNA-seq数据,突出了血液中相对基因表达水平的临床相关性。
-
Gene regulatory networks—and consequently, gene expression profiles [19]—differ between normal and disease states [29].
基因调控网络——因此,基因表达谱[19]——在正常状态和疾病状态之间存在差异。
-
Cancer can alter the gene expression in blood.
癌症可以改变血液中的基因表达。
-
PBCs comprise erythrocyte, white blood cell and platelet populations that are dynamic throughout cancer initiation and progression [30].
PBCs包括红细胞、白细胞和血小板,它们在整个癌症的发生和发展过程中都是动态的。
-
Gene expression changes in circulating leukocytes can serve as an indicator of infection or diseases such as cancer [9].
循环白细胞的基因表达变化可以作为感染或癌症[9]等疾病的指标。
-
Additionally, the number of monocytes with a typical myeloid-derived suppressor cell surface phenotype was increased during breast cancer progression and was correlated with metastasis to lymph nodes and visceral organs [31].
此外,具有典型骨髓来源抑制细胞表面表型的单核细胞数量在乳腺癌进展过程中增加,并与淋巴结和内脏器官[31]转移相关。
-
A blood transcriptome-based diagnosis may also be applicable to different species.
基于血液转录组的诊断也可能适用于不同的物种。
-
One study established a classifier from genes identified as differentially expressed in mouse PBCs that showed good accuracy and high stability when applied to human breast tumour prediction based on gene expression in peripheral blood mononuclear cells [32].
一项研究从小鼠外周血单核细胞[32]基因表达的差异表达基因中建立了一个分类器,该分类器应用于基于基因表达的人类乳腺肿瘤预测中,准确率高,稳定性好.
-
We propose a human breast cancer predictor based on PBC mRNA signatures.
我们提出了一种基于PBC mRNA信号的人类乳腺癌预测器。
-
Our approach overcomes many of the constraints of previous models by using the relative expression orders of genes to reduce noise and to identify genes associated with breast cancer over other potentially confounding factors.
我们的方法克服了以往模型的许多限制,通过使用基因的相对表达顺序来降低噪音,并在其他潜在混杂因素的基础上识别与乳腺癌相关的基因。
-
By integrating datasets from different platforms and/or batches, we identified 52 stable gene pairs that could be important markers representing different patterns between breast cancer patients and healthy controls.
通过整合来自不同平台和/或批次的数据集,我们确定了52对稳定的基因对,它们可能是代表乳腺癌患者和健康对照组之间不同模式的重要标记。
-
In addition, based on these markers, a refined model was constructed, which was applicable in both sequencing and microarray platforms.
在此基础上,构建了一个适用于测序平台和微阵列平台的精细模型。
-
However, the results presented here have limited application in clinical settings for various reasons.
然而,由于种种原因,本文的研究结果在临床应用中受到限制。
-
First, the model with the best predictive power is not definitive;
首先,具有最佳预测能力的模型是不确定的;
-
we cannot exclude the possibility that a sufficient number of expression datasets can lead to the generation of a convergent model.
我们不能排除这样一种可能性,即足够数量的表达式数据集可以生成收敛模型。
-
Second, due to the scarcity of expression data from cancer blood samples, our analysis used only microarray data for model training, although the model from microarray data also performed well in normal blood RNA-seq data.
其次,由于癌症血液样本表达数据的缺乏,我们的分析仅使用微阵列数据进行模型训练,虽然微阵列数据的模型在正常血液RNA-seq数据中也表现良好。
-
These results serve as a proof-of-concept for using a rank-based model to develop a liquid biopsy and provide a statistical framework for constructing predictors in the context of not only breast cancer but also other malignancies.
这些结果为使用基于排名的模型开发液体活检提供了概念验证,并为构建预测因子提供了一个统计框架,不仅适用于乳腺癌,还适用于其他恶性肿瘤。
-
The rank-based normalization methods could also be expanded to other intensity-based features like miRNA expression and DNA methylation levels to remove batch effect from either different platforms or laboratories.
基于秩的归一化方法也可以扩展到其他基于强度的特征,如miRNA表达和DNA甲基化水平,以消除来自不同平台或实验室的批次效应。
-
Based on the principle of relative information, we plan to integrate more omics signatures such as blood DNA methylation to improve the algorithm.
基于相关信息原理,我们计划集成更多的组学特征,如血液DNA甲基化,以改进算法。
-
In addition to potential early cancer detection, our methods of normalization, feature selection and model construction also suggest new application scenarios, such as determining subtypes across multiple cancers or even diagnosing other diseases in the future.
除了潜在的早期癌症检测,我们的标准化、特征选择和模型构建方法也预示了未来在多癌亚型确定甚至其他疾病诊断中的新应用场景。
(Note: prediction relies on a single data type (mRNA) only, and sequencing sample data are scarce.)