TOP是2022年发表在GB上的工具，基于rrBLUP包进行预测的工具
原文地址以目标为导向的优先级：通过育种中的预测分析整合生物体和分子性状的靶向选择策略 |基因组生物学
TOP GitHub 肖英杰 TOP

“TOP_Demo.r”是一个用户友好的R代码，只需要安装用于基因组预测的rrBLUP包，一个用于构建TOP模型和进行评估的自定义R脚本。
作者供三个 R 脚本：TOP_Demo.r、GP.r 和 TOP.r。
所有需要做的就是运行“TOP_Demo.r”，其中，程序将从“GP.r”和“TOP.r”获取函数，并导入两个示例数据“demo_geno.csv”和“demo_pheno.txt”。deom genetyic 数据包含 210 个个体的 1619 个bin，演示表型数据包含 210 个个体的 4 个表型性状。
GP.r的函数可以基于GBLUP模型进行基因组预测，需要将其作为TOP算法的输入来学习多性状的内在相关性。然后，来自 TOP.r 的函数将通过拟合每个人的成对预测值和观测值来构建 TOP 模型，并参数化多性状权重。TOP模型的优化将学习最佳权重，该权重将用于从测试池中选择候选个体，该个体在多个特征上显示出最高的相似性。

准备TOP需要的数据

表型数据格式

表型数据

表型数据第一列是样本ID,后面每列是1个表型

基因型数据格式是（0,1,2）矩阵

基因型数据

基因型格式是0，1，2格式。
第1列是SNP编号，后面每列是1个样本的基因型。
注意：基因型的样本ID和表型的样本ID必须完全一致

代谢组数据

样本的分组信息

样本分为2个组，一个是测试组，另一个是训练组

1.读取文件。代码示例，加载依赖的脚本引入函数

rm(list=ls())
library(rrBLUP)
# load the script to obtain the trait predictions of training set and test set
source("GP.r")
# load the script to build the top model and obtain the optimal trait weights for calculate the genome-phenome similarity between testing pool and a given target
source("TOP.r")
library(ggplot2)
# read genotype 
Genotype<-read.csv("./data/demo_geno.csv",
                       header=TRUE,
                       sep=",",
                       stringsAsFactors=0,
                       check.names=FALSE
                       )
head(Genotype) #210个样本（列），1619个SNP
# Bin R001 R002 R003 R004 R005 R006 R007 R008 R009 R010 R012 R013 R014 R015 R016 R017 R018 R019 R020 R021 R022 R023 R025 R027 R028 R029 R030
# 1 Bin1    0    1    1    1    0    1    0    0    1    1    0    0    0    0    0    0    0    0    0    0    1    1    0    1    1    0    0
# 2 Bin2    0    1    1    1    0    1    0    0    1    1    0    0    0    0    0    0    0    0    0    0    1    1    0    1    1    0    0

# read agronomic trait  210个样本，4个性状
Agro<-read.table("./data/demo_agro.txt",
                       header=TRUE,
                       sep="",
                       stringsAsFactors=0,
                       check.names=FALSE
                       )
head(Agro)
# RIL   Trait1    Trait2    Trait3   Trait4
# 1   R001 26.55000 11.575000  98.60000 24.23250
# 2   R002 18.65000 12.975000  61.80000 23.39500
# 3   R003 21.36250 12.250000  74.10000 24.02000

# read Metabolites 代谢组数据
Metabolite<-read.table("./data/demo_Metabolites.csv",
                       header=TRUE,
                       sep=",",
                       stringsAsFactors=0,
                       check.names=FALSE
                       )
head(Metabolite) #群体的代谢数据
# Trait     R001     R002     R003     R004     R005     R006     R007     R008     R009     R010     R012     R013     R014     R015
# 1 m0001-L 20.55729 20.66623 20.45463 20.77765 20.77013 21.10745 20.21756 20.91552 20.50470 19.90856 20.69053 20.14871 20.35235 20.55584
# 2 m0003-L 23.84497 23.71574 23.69052 24.10140 24.02722 24.15544 24.09703 24.33595 23.91331 24.10946 23.79013 23.76296 24.28952 24.09059
# 3 m0004-L 21.69904 21.78545 21.42181 22.28423 22.43024 22.39986 22.04627 22.32674 21.65157 21.53415 21.46748 22.19768 21.96240 21.49728

# divide a population into training set and test set 
id<-read.table("./data/demo_training_test_id.txt",
                      header=TRUE,
                      sep="",
                      stringsAsFactors=0,
                      check.names=FALSE)
head(id)  #数据分为测试集和训练集
# id  set
# 1 R057 test
# 2 R100 test
# 3 R086 test
# 4 R055 test
# 5 R031 test
# 6 R087 test
library(tidyverse)
summary(id)
sum(id$set=="test") #105
sum(id$set=="training") #105

这里示例时测试集和训练集是1：1，其实一般情况可以是3：7或2：8或1：9

2. 开始分析

2.1 获取所有性状的预测值

基于流行的 GBLUP 模型，在‘rrBLUP’包中实现，首先从基因组数据中获得性状预测。
在TOP方法中，我们需要先获得训练集和测试集个体的预测值。
在训练集中，特征预测是针对 TOP 模型学习的。使用10倍交叉验证获取预测值的策略。
在测试集中，特征预测用于评估 TOP 准确度。我们使用所有训练数据来预测测试集个体的准确性。

Pre<-Prediction_trait(Genotype,#基因型
                      Agro,#表型
                      Metabolite,#代谢组
                      id,#训练和测试集
                      pca=TRUE, #如果设置为TRUE,则降低代谢组数据的维度
                      CEVR=0.8 #主成分的累积解释方差
                      )

prediction_train<-Pre$prediction_train #训练集的预测值
trait_train<-Pre$trait_train #训练集观察的特征值（真实值）

# obtain the predicted trait values of the test set using GBLUP 
prediction_test<-Pre$prediction_test #测试集中的预测值
trait_test<-Pre$trait_test #测试集的特征值（真实值）

2.2 构建TOP模型并获得最优权重

names_test<-Pre$names_test                   #测试集中的样本名称
names_trait<-Pre$names_trait                 #特征的名称 表型和PC
#target<-trait_train[80,]                     #查看训练集第80个样本的真实特征值

Optimal_Weight<-Weight_res(prediction_train,
                           trait_train,
                           names_trait,
                           b=0.25)

prediction_train: 训练集的预测的特征值，最后一行存储十轮的平均预测精度。
trait_train: 训练集的真实特征值
names_trait: 特征的名称。
b: 进入TOP模型的预测精度阈值,默认是0.25.

2.3 检测TOP模型的精确度

使用从训练集中的 TOP 模型学到的最佳权重，用户可以在独立测试集中测试匹配基因组预测和测量表型的准确性。
从技术上讲，从测试集中的一组个体（例如，N=5）中，测试所有 5 个个体的特征预测与其特征观察的全局相似性。如果一个个体的预测与特征观察本身最匹配，则被认为是成功的识别。池中识别成功的比例定义为识别率和TOP准确率。

Weight<-Optimal_Weight$W_matrix #获取根据训练集计算的特征值的权重指数
TOP_acc<-Test_top_acc(prediction_test,#测试集的预测的特征值
            trait_test,##测试集的真实的特征值
                      Weight,#从测试集得到的特征值的权重指数
                      names_trait,#特征值的名称
                      N=c(2,5,10),#检测池的样本的大小，这个池的样本越大，计算出的准确率越低
                      m=200) #从测试集中选择个体池时的随机排列数

prediction_test 测试集的预测的特征值
trait_test 测试集的真实的特征值
Weight 从测试集得到的特征值的权重指数
names_trait 特征值的名称
N=c(2,5,10) 检测池的样本的大小，这个池的样本越大，计算出的准确率越低
m=200 从测试集中选择个体池时的随机排列数
简单说就是把测试集的样本分为不同的数量的小组，例如2个一组或5个一组，分别去检测组内的样本的预测值和真实值的相关性，最后计算所有小组的准确率。
TOP_acc$Ide_rate这个命令会返回计算的不同组的准确率。和N是一一对应的关系。

2.4 选择优于现有目标的优异个体

为了辅助育种，有必要提供一份在一个或几个主要性状（即早熟、疾病、抗性和高产）优异的方案。此处就通过Top_Target提供了一种可管理的解决方案，可以选择同时具有优越性状和相对稳定的剩余性状的个体。
这种方法可能有助于提高选择混合动力的收益优于广泛的商业品种。

# 根据相似度函数选择测试集中的个体
pre_train_mean<-Optimal_Weight$pre_mean 
pre_train_sd<-Optimal_Weight$pre_sd
obs_train_mean<-Optimal_Weight$obs_mean
obs_train_sd<-Optimal_Weight$obs_sd

Top_target_sel<-Top_target(target,#现有的品种的特征值向量
                           prediction_test, #测试集的预测的特征值
                           names_test,#测试集的样本名称
                           names_trait,#特征值
                           pre_train_mean,#训练集预测的特征值的均值
                           pre_train_sd,##训练集预测的特征值的标准差
                           obs_train_mean,#训练集的特征值的真实值的均值
                           obs_train_sd,#训练集的特征值的真实值的标准差
                           selection_ratio=0.1, #选择比率，测试集中与目标相似度最大的个体比例
                           improve_ratio=0.05, #改善的比率，某个性状值与目标值提高的比例
                           improve_trait=4,#如果为NA则不会修改目标的特征值，如果是4，会修改所有的样本的特征4的值。
                           Weight)#根据训练集计算的特征值的权重指数

target现有的品种的特征值向量
prediction_test测试集的预测的特征值
names_test#测试集的样本名称
names_trait特征值
pre_train_mean训练集预测的特征值的均值
pre_train_sd训练集预测的特征值的标准差
obs_train_mean训练集的特征值的真实值的均值
obs_train_sd训练集的特征值的真实值的标准差
selection_ratio=0.1选择比率，测试集中与目标相似度最大的个体比例
improve_ratio=0.05 改善的比率，某个性状值与目标值提高的比例
improve_trait=4如果为NA则是所有的特征值，如果是4，重点是特征4的值。指定需要优化的特征值的位置
Weight根据训练集计算的特征值的权重指数
Top_target_sel$names_select返回挑选的测试集中最优秀的10个样本的名称

3.输出数据，可视化

write.table(Weight,file="./res/demo_optimal_weight.txt",sep='\t',row.names=F,quote=F)
# In the similarity matrix of pool size of 5, each column means a given target individual from the pool, the 5 values for this column indicate the genome-phenome similarity between all individuals from the pool and this target individual. If the target individual itself has the highest similarity degree, we regard it as a successful identification. 
write.table(TOP_acc$Demo_p,file="./res/demo_similarity_matrix_poolsize5.txt",sep='\t',quote=F) 

# At the pool size of 2,5 and 10, we output a demo result of identification rate, which means the proportion of successfull identification by using each individal of a pool as a target.
write.table(data.frame(pool_size=c(2,5,10),identification_rate=TOP_acc$Ide_rate),file="./res/demo_identification_rate.txt",sep='\t',row.names=F,quote=F) 

# Let the first individual in the training set be a target, we select individuals in the test set according to the similarity function. 
Top_sel<-cbind(Top_target_sel$names_select,Top_target_sel$values_select)
colnames(Top_sel)<- c("RIL",as.character(Weight[,1]))
write.table(Top_sel,file="./res/demo_target_select.txt",sep='\t',row.names=F,quote=F) 

library(ggplot2)
Top_target_sel_name<-Top_target_sel$names_select
plot_TOP(target,
         trait_test,
         Top_target_sel_name,
         improve_trait=4,
         group_random=5)
plot_TOP(target,
         trait_test,
         Top_target_sel_name,
         improve_trait=1,
         group_random=5)

plot_TOP(target,
         trait_test,
         Top_target_sel_name,
         improve_trait=2,
         group_random=5)

plot_TOP(target,
         trait_test,
         Top_target_sel_name,
         improve_trait=3,
         group_random=5)

进行可视化的时候，参数说明
target就是上面选择的商业品种的特征值
trait_test测试集的特征值
Top_target_sel_name挑选出的特征最优的样本的编号列表
improve_trait=4要优化的特征所在的列编号
group_random=5随机挑选的次数

image.png

从图中可以看到，即使是选择优化特征4选出来的样本，他的特征1的值也比随机选的要优秀。

TOP用于表型预测

TOP用于表型预测

准备TOP需要的数据

表型数据格式

基因型数据格式是（0,1,2）矩阵

代谢组数据

1.读取文件。代码示例，加载依赖的脚本引入函数

2. 开始分析

2.1 获取所有性状的预测值

2.2 构建TOP模型并获得最优权重

2.3 检测TOP模型的精确度

2.4 选择优于现有目标的优异个体

3.输出数据，可视化

推荐阅读更多精彩内容

TOP用于表型预测

准备TOP需要的数据

表型数据格式

基因型数据格式是（0,1,2）矩阵

代谢组数据

1.读取文件。 代码示例，加载依赖的脚本引入函数

2. 开始分析

2.1 获取所有性状的预测值

2.2 构建TOP模型并获得最优权重

2.3 检测TOP模型的精确度

2.4 选择优于现有目标的优异个体

3.输出数据，可视化

推荐阅读更多精彩内容

1.读取文件。代码示例，加载依赖的脚本引入函数