Picrust2|通过KO丰度获得ko丰度的计算原理

背景

image.png

Picrust2 的 pathway_pipeline.py 脚本是一个复杂的程序，旨在从输入的 KO 层级相对丰度表和 KO 与 KO 的映射关系文件中计算得到 KO 的丰度表。这个程序不仅仅是简单地将 KO 的丰度相加，而是通过多个步骤和计算来完成的。以下是脚本的主要逻辑和计算步骤：

整体流程

1.读取输入文件：

read_metagenome_input() 函数读取输入的 KO 层级相对丰度表，并判断它是非分层表（unstratified）、分层表（stratified）还是贡献表（contributional）。
PathwaysDatabase 类读取 KO 与 KO 的映射关系文件，并将其解析为路径结构和关键反应集合。

2.预处理：

如果设置了重新分组（regroup），则使用 regroup_func_ids() 函数将输入表中的功能 ID 重新分组为新的 ID。
将输入表中没有在路径数据库中的反应去掉。

3.计算路径水平：

根据输入表的格式（非分层、分层或贡献格式），调用不同的函数来计算路径水平。
对于每个样本，调用 unstrat_pathway_levels()、basic_strat_pathway_levels() 或 contrib_format_pathway_levels() 函数来计算路径丰度和覆盖率。

4.计算路径丰度和覆盖率：

pathway_abun_and_coverage() 函数根据路径结构和反应丰度来计算每条路径的丰度和覆盖率。
如果路径是有结构的，则使用 compute_structured_pathway_abundance_or_coverage() 函数计算结构化路径的丰度或覆盖率。
如果路径是无结构的，则简单地取第二部分反应的平均丰度作为路径丰度。

5.输出结果：

根据计算结果，生成非分层和分层的路径丰度表，并写入输出目录。

关键计算步骤

1. 计算路径丰度和覆盖率

实际的路径丰度和覆盖率计算发生在 pathway_abun_and_coverage() 函数中。该函数首先检查路径是否有结构：

结构化路径：
使用 compute_structured_pathway_abundance_or_coverage() 函数计算。这是基于 HUMAnN2 的计算方法，该方法使用路径结构和关键反应来计算丰度或覆盖率。
如果需要进行 gap 填充，则使用 gap_fill() 函数进行填充。
无结构路径：
首先对所有反应丰度进行排序，取后半部分反应的平均丰度作为路径丰度。
如果需要计算覆盖率，则计算高于中位数丰度的反应所占比例。

2. 基于结构的路径丰度计算

在 compute_structured_pathway_abundance_or_coverage() 函数中：

递归遍历路径结构，计算每个子结构的丰度或覆盖率。
根据连接符（, 或 +），使用不同的方法来组合子结构的丰度或覆盖率：
- , 意味着 OR 连接，取最大值。
- + 意味着 AND 连接，取所有反应的调和平均值。

3. gap 填充

在 gap_fill() 函数中：

如果最多只有一个关键反应的丰度为 0，则将丰度最低的关键反应的丰度设置为第二低的值，以填补缺失值。

总结

Picrust2 的 pathway_pipeline.py 脚本通过复杂的逻辑和计算步骤来计算 KO 丰度，而不仅仅是简单的加和。它考虑了路径结构、关键反应、gap 填充和覆盖率计算等多个因素，以提供更准确的结果。

细节理解

1.判断输入表的格式

在 Picrust2 的 pathway_pipeline.py 脚本中，判断输入表的格式是通过 read_metagenome_input()函数完成的。这个函数读取输入的 KO 层级相对丰度表，并根据特定的列名来判断表格的格式。具体来说，该函数会检查输入表中是否存在某些关键列，通过这些列来确定输入表的格式是非分层（unstratified）、分层（stratified）还是贡献（contributional）。以下是具体的判断逻辑：

def read_metagenome_input(filename):
    '''Reads in gene abundance table which can be either unstratified,
    stratified, or contributional format (outputs of metagenome_pipeline.py).
    Will return the Pandas dataframe of this table and one of "unstrat",
    "strat" or "contrib" to indicate the format.'''

    # Read in input file as pandas dataframe.
    input_df = pd.read_csv(filename, sep="\t", dtype={'function': str})

    # Check that 'function' column is present, which is in all three input
    # formats.
    if 'function' not in input_df.columns:
        sys.exit("Error: required column \"function\" not found in input " +
                 "metagenome table")

    # Determine format based on presence of specific columns.
    if 'genome_function_count' in input_df.columns:
        table_type = "contrib"
        input_df['sample'] = input_df['sample'].astype('str')
        input_df['taxon'] = input_df['taxon'].astype('str')
    elif 'sequence' in input_df.columns:
        table_type = "strat"
        input_df['sequence'] = input_df['sequence'].astype('str')
    else:
        table_type = "unstrat"

    return(input_df, table_type)

判断逻辑
1.读取输入文件：
a.使用 Pandas 的 read_csv 函数读取输入文件，并将其存储为一个 DataFrame。
2.检查关键列：
a.首先检查输入表中是否包含 function 列，这一列在所有三种格式中都是存在的。如果没有找到这一列，则抛出错误。
3.判断表格格式：
a.贡献格式（contributional）：如果表中包含 genome_function_count 列，则表明这是一个贡献格式的表格。
i.其他相关列有 sample 和 taxon，这些列表示样本和物种信息。
b.分层格式（stratified）：如果表中包含 sequence 列，则表明这是一个分层格式的表格。
i.在这种格式下，sequence 列表示基因或功能的具体序列信息。
c.非分层格式（unstratified）：如果表中既不包含 genome_function_count 列，也不包含 sequence 列，则表明这是一个非分层格式的表格。
格式说明
●非分层格式（unstratified）：
○这种格式的表格中，每一行表示一个功能（function）的丰度，没有进一步的分层信息。
○例如：

function	sample1	sample2
KO1	0.1	0.2
KO2	0.3	0.4

●分层格式（stratified）：
○这种格式的表格中，每一行表示一个功能和其对应的序列（sequence）的丰度。
○例如：

function	sequence	sample1	sample2
KO1	Seq1	0.05	0.1
KO1	Seq2	0.05	0.1
KO2	Seq3	0.3	0.4

●贡献格式（contributional）：
○这种格式的表格中，每一行表示一个功能在特定样本和物种中的贡献丰度。
○例如：

sample	function	taxon	taxon_abun	taxon_rel_abun	genome_function_count	taxon_function_abun	taxon_rel_function_abun
S1	KO1	Taxon1	0.5	0.25	2	0.1	0.05
S1	KO2	Taxon2	0.5	0.25	3	0.15	0.075

小结
read_metagenome_input() 函数通过检查输入表中的特定列来判断表格的格式。这使得脚本能够根据不同的表格格式采用不同的处理方法来计算 KO 丰度。具体来说，贡献格式的表格包含详细的样本和物种信息，而分层格式的表格包含具体的序列信息，非分层格式的表格则只包含功能的整体丰度信息。

2.判断路径是否是有结构

在 Picrust2 的 pathway_pipeline.py 脚本中，判断路径是否是有结构的，是通过 PathwaysDatabase 类来实现的。具体来说，该类的 is_structured() 方法用于检查路径数据库是否包含结构化信息。
PathwaysDatabase 类
PathwaysDatabase 类负责读取和存储路径数据库的所有信息，包括路径与反应的映射关系、路径结构、关键反应等。以下是该类的简要说明：

class PathwaysDatabase:
    def __init__(self, database=None, reaction_names=[]):
        '''Load in the pathways data from the database file.'''
        # 初始化各种字典来存储路径和反应的信息
        self.__pathways_to_reactions = {}
        self.__reactions_to_pathways = {}
        self.__pathways_structure = {}
        self.__key_reactions = {}

        if database is not None:
            # 检查数据库文件是否存在
            check_files_exist([database])
            file_handle = open(database, "rt")
            line = file_handle.readline()

            reactions = defaultdict(list)
            structured_pathway = False
            while line:
                data = line.strip().split("\t")
                if len(data) > 1:
                    pathway = data.pop(0)
                    reactions[pathway] += data
                    if "(" in data[0]:
                        structured_pathway = True
                line = file_handle.readline()
            file_handle.close()

            if structured_pathway:
                reactions = self._set_pathways_structure(reactions, reaction_names)
            self._store_pathways(reactions)

    def is_structured(self):
        '''Return True if this is a set of structured pathways.'''
        if self.__pathways_structure:
            return True
        else:
            return False

判断路径是否有结构
is_structured() 方法的实现非常简单，它只检查 __pathways_structure 字典是否为空：
●如果 __pathways_structure 字典不为空，说明路径数据库包含结构化信息，返回 True。
●如果 __pathways_structure 字典为空，说明路径数据库不包含结构化信息，返回 False。
计算路径丰度和覆盖率时的判断
在 pathway_abun_and_coverage() 函数中，会调用 is_structured() 方法来判断路径是否是有结构的：

def pathway_abun_and_coverage(pathway, pathway_db, reaction_abun, median_value, calc_coverage, gap_fill_on):
    '''Determine pathway abundance and coverage for either structured or
    unstructured pathway. Calculating coverage is off by default.'''

    pathway_cov = None

    # Check if pathway database is structured. If so then get structure and
    # use it and the key reactions to get pathway abundance and coverage.
    if pathway_db.is_structured():
        structure = pathway_db.get_structure_for_pathway(pathway)
        key_reactions = pathway_db.get_key_reactions_for_pathway(pathway)

        # Run gap filling if set.
        if gap_fill_on:
            reaction_abun = gap_fill(key_reactions, reaction_abun)

        # Calculate pathway abundance.
        pathway_abun = compute_structured_pathway_abundance_or_coverage(structure,
                                                                        key_reactions,
                                                                        reaction_abun,
                                                                        False,
                                                                        median_value)
        if calc_coverage:
            # Calculate pathway coverage.
            pathway_cov = compute_structured_pathway_abundance_or_coverage(structure,
                                                                           key_reactions,
                                                                           reaction_abun,
                                                                           True,
                                                                           median_value)

    else:
        # Otherwise, sort enzyme reactions, take second half, and get their
        # mean abundance.
        reaction_abun_only = list(reaction_abun.values())
        sorted_index = list(np.argsort(reaction_abun_only))
        sorted_reaction_abun = [reaction_abun_only[i] for i in sorted_index]
        half_i = int(len(sorted_reaction_abun) / 2)
        subset_reaction_abun = sorted_reaction_abun[half_i:]
        pathway_abun = sum(subset_reaction_abun) / len(subset_reaction_abun)

        if calc_coverage:
            # Get coverage of unstructured pathway by getting the proportion
            # of reactions with scores greater than the median.
            count_higher_than_median = 0
            for abun in reaction_abun_only:
                if abun > median_value:
                    count_higher_than_median += 1
            pathway_cov = count_higher_than_median / len(reaction_abun_only)

    return(pathway_abun, pathway_cov)

计算逻辑
1.检查路径是否有结构：
a.使用 pathway_db.is_structured() 方法来判断路径数据库是否包含结构化信息。
2.如果路径有结构：
a.获取路径的结构和关键反应（structure 和 key_reactions）。
b.如果设置了 gap 填充，则调用 gap_fill() 函数进行填充。
c.调用 compute_structured_pathway_abundance_or_coverage() 函数计算路径丰度和覆盖率。
3.如果路径没有结构：
a.对所有反应丰度进行排序，取后半部分反应的平均值作为路径丰度。
b.如果需要计算覆盖率，则计算高于中位数丰度的反应所占比例。
小结
判断路径是否有结构是通过 PathwaysDatabase 类的 is_structured() 方法实现的。在计算路径丰度和覆盖率时，如果路径是有结构的，会使用更复杂的结构化计算方法；如果路径是无结构的，则使用简单的平均值计算方法。

傻瓜版教程

如果输入的表为正常的非分层表，输入的映射关系为无结构路径。那么的计算公式为：截断平均值 (Truncated Mean)法，即将ko涉及的所有KO丰度由大到小排序后，截取上半部分进行取平均值。
举个🌰
映射文件：

ko00311 K12743  K10852  K04128  K12746  K12745  K00273  K01060  K04127  K12748  K01434  K01467  K12744  K12747  K04126

KO相对丰度表：

function	HSM1
K00273	313.42
K01060	1583.43
K01434	3010.35
K01467	10563.01
K04127	143
K04128	0

ko的计算公式为：
（10563.01+3010.35+1583.43+313.42+143）/ 7 = 2230.4586
完结撒花🎉🎉

看没看懂都点个赞呗~