forked: WGCNA FAQ

WGCNA is a widely used package for gene co-expression network analysis, valued for its intuitive and robust algorithm. In practice, differences in input data and analysis goals raise many questions, most of which earlier users have already run into. The authors, Peter Langfelder and Steve Horvath, have summarized these; this post reprints the most common questions. The full FAQ is available at https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html

WGCNA package FAQ

Peter Langfelder and Steve Horvath

Dept. of Human Genetics, UC Los Angeles (PL, SH), Dept. of Biostatistics, UC Los Angeles (SH)

This page provides a list of Frequently Asked Questions and our frequently given answers. Please read these before emailing us about a problem. This FAQ was last updated on December 24, 2017.

Data analysis questions

  1. How many samples do I need?

    We do not recommend attempting WGCNA on a data set consisting of fewer than 15 samples. In a typical high-throughput setting, correlations on fewer than 15 samples will simply be too noisy for the network to be biologically meaningful. If at all possible, one should have at least 20 samples; as with any analysis methods, more samples usually lead to more robust and refined results.

  2. Should I filter probesets or genes?

    Probesets or genes may be filtered by mean expression or variance (or their robust analogs such as median and median absolute deviation, MAD) since low-expressed or non-varying genes usually represent noise. Whether it is better to filter by mean expression or variance is a matter of debate; both have advantages and disadvantages, but more importantly, they tend to filter out similar sets of genes since mean and variance are usually related.

    We <u>do not</u> recommend filtering genes by differential expression. WGCNA is designed to be an unsupervised analysis method that clusters genes based on their expression profiles. Filtering genes by differential expression will lead to a set of correlated genes that will essentially form a single (or a few highly correlated) modules. It also completely invalidates the scale-free topology assumption, so choosing soft thresholding power by scale-free topology fit will fail.
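    A minimal sketch of MAD-based filtering in base R (the matrix name datExpr, the synthetic data, and the cutoff of 500 genes are illustrative assumptions, not recommendations from the FAQ; WGCNA expects samples in rows and genes in columns):

    ```r
    # datExpr: numeric matrix, samples in rows, genes in columns (WGCNA's convention).
    # Synthetic example data; replace with your own expression matrix.
    set.seed(1)
    datExpr <- matrix(rnorm(20 * 1000), nrow = 20, ncol = 1000)

    # Keep the 500 genes with the highest median absolute deviation (MAD).
    geneMAD <- apply(datExpr, 2, mad)
    keep <- rank(-geneMAD, ties.method = "first") <= 500
    datExprFiltered <- datExpr[, keep]
    ```

    Filtering on MAD rather than variance makes the cutoff robust to single outlying samples; because the criterion never looks at a sample trait, it does not bias the network toward any particular comparison.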

  3. What argument (option) settings are recommended?

    In general, we attempt to select suitable defaults that work well in multiple applications. However, in certain cases we keep 'simple' or historical default settings for backward compatibility and reproducibility, while for new calculations we recommend settings that differ from the defaults. Some of the settings are listed below.

    • Signed networks. The choice of signed vs. unsigned networks is complex, but in general we prefer signed (or "signed hybrid") networks to unsigned networks. To construct signed networks, use the argument type = "signed" or type = "signed hybrid" in functions such as accuracyMeasures, adjacency, chooseOneHubInEachModule, chooseTopHubInEachModule, nearestNeighborConnectivity, nearestNeighborConnectivityMS, orderBranchesUsingHubGenes, softConnectivity and possibly others (please see the help file for each function if in doubt). Some functions use the argument networkType to select the network type; notable examples are blockwiseModules, blockwiseConsensusModules, blockwiseIndividualTOMs, consensusTOM, intramodularConnectivity, modulePreservation, pickSoftThreshold, TOMsimilarityFromExpr and vectorTOM, but there are others as well. Again, please read the help file if in doubt.

    • Robust correlation. The default correlation method in all WGCNA functions is the standard Pearson correlation. In general, unless there is good reason to believe that there are no outlier measurements, we recommend (and use ourselves) the biweight mid-correlation as a robust alternative. This is implemented in the WGCNA function bicor. Many WGCNA functions take the argument corFnc that allows one to specify a correlation function other than the standard cor; bicor is one option. Additional arguments to the correlation function can be specified using the argument corOptions (depending on the function, this argument may require one of two alternate forms; please see the help for each function for details). In certain functions, notably those of the blockwise family, the correlation function cannot be specified directly as a function; rather, one must use the argument corType to specify either Pearson or biweight mid-correlation.

      Important cautionary notes regarding the use of bicor. The biweight mid-correlation works very well in a variety of settings but in some situations it will produce unwanted results.

      • Restricting the number of excluded outliers: argument maxPOutliers. The default version of the biweight mid-correlation, described in Langfelder and Horvath (2011) (link to article), can produce unwanted results when the data have a bi-modal distribution (e.g., when a gene expression depends heavily on a binary variable such as disease status or genotype) or when one of the variables entering the correlation is itself binary (or ordinal). For this reason, we strongly recommend using the argument maxPOutliers = 0.05 or 0.10 whenever the biweight midcorrelation is used. This argument essentially forces bicor to never regard more than the specified proportion of samples as outliers.
      • Dealing with binary data. When relating high-throughput data x to a binary variable y such as a sample trait, one can use the argument robustY = FALSE to turn off the robust treatment for the y argument of bicor. This results in a hybrid robust-Pearson correlation as described in Langfelder and Horvath (2011). The hybrid correlation can also be used when one of the inputs is numeric but known to not have any outliers.
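    Putting the recommendations above together, a hedged sketch of the relevant function calls (datExpr is an assumed samples-by-genes matrix; the maxPOutliers value follows the FAQ's recommendation, other settings are illustrative):

    ```r
    library(WGCNA)

    # Signed hybrid network built with the robust biweight mid-correlation.
    # maxPOutliers = 0.05 caps the proportion of samples bicor may treat as
    # outliers, as recommended above.
    sft <- pickSoftThreshold(datExpr,
                             networkType = "signed hybrid",
                             corFnc = "bicor",
                             corOptions = list(maxPOutliers = 0.05))

    # Blockwise functions select the correlation via corType, not corFnc.
    net <- blockwiseModules(datExpr,
                            power = sft$powerEstimate,  # may be NA; see item 6
                            networkType = "signed hybrid",
                            corType = "bicor",
                            maxPOutliers = 0.05)
    ```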
  4. Can WGCNA be used to analyze RNA-Seq data?

    Yes. As far as WGCNA is concerned, working with (properly normalized) RNA-seq data isn't really any different from working with (properly normalized) microarray data.

    We suggest removing features whose counts are consistently low (for example, removing all features that have a count of less than say 10 in more than 90% of the samples) because such low-expressed features tend to reflect noise and correlations based on counts that are mostly zero aren't really meaningful. The actual thresholds should be based on experimental design, sequencing depth and sample counts.
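    Such a count filter can be sketched in base R as follows (the matrix is synthetic and the thresholds are the FAQ's example values, not universal settings):

    ```r
    # counts: raw count matrix, genes in rows, samples in columns (synthetic here).
    set.seed(1)
    counts <- matrix(rnbinom(1000 * 12, mu = 50, size = 1), nrow = 1000, ncol = 12)

    # Drop features with a count below 10 in more than 90% of samples.
    lowExpressed <- rowSums(counts < 10) > 0.9 * ncol(counts)
    countsFiltered <- counts[!lowExpressed, ]
    ```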

    We then recommend a variance-stabilizing transformation. For example, package DESeq2 implements the function varianceStabilizingTransformation which we have found useful, but one could also start with normalized counts (or RPKM/FPKM data) and log-transform them using log2(x+1). For highly expressed features, the differences between full variance stabilization and a simple log transformation are small.
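    A sketch of both options, assuming a filtered count matrix countsFiltered (genes in rows) and a sample annotation data frame sampleInfo; both names are placeholders:

    ```r
    library(DESeq2)

    # Variance-stabilizing transformation via DESeq2 (design ~ 1, i.e., blind
    # to any sample groups).
    dds <- DESeqDataSetFromMatrix(countData = countsFiltered,
                                  colData = sampleInfo,
                                  design = ~ 1)
    vsd <- varianceStabilizingTransformation(dds)
    exprVST <- t(assay(vsd))  # transpose: WGCNA wants samples in rows

    # Simpler alternative: log-transform normalized counts (or FPKM) directly.
    exprLog <- t(log2(countsFiltered + 1))
    ```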

    Whether one uses RPKM, FPKM, or simply normalized counts doesn't make a whole lot of difference for WGCNA analysis as long as all samples were processed the same way. These normalization methods make a big difference if one wants to compare expression of gene A to expression of gene B; but WGCNA calculates correlations for which gene-wise scaling factors make no difference. (Sample-wise scaling factors of course do, so samples do need to be normalized.)

    If data come from different batches, we recommend to check for batch effects and, if needed, adjust for them. We use ComBat for batch effect removal but other methods should also work.
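    A minimal ComBat sketch (exprMat and batch are assumed names; ComBat expects features in rows and samples in columns, i.e., the transpose of WGCNA's orientation):

    ```r
    library(sva)

    # exprMat: normalized, log-scale expression, genes in rows, samples in
    # columns; 'batch' is a vector of batch labels, one per sample.
    exprNoBatch <- ComBat(dat = exprMat, batch = batch)
    ```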

    Finally, we usually check quantile scatterplots to make sure there are no systematic shifts between samples; if sample quantiles show correlations (which they usually do), quantile normalization can be used to remove this effect.
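    One way to do this is limma's quantile normalization (a sketch; exprMat is an assumed genes-by-samples, log-scale matrix):

    ```r
    library(limma)

    # Force all sample distributions to share the same quantiles.
    exprQN <- normalizeQuantiles(exprMat)
    ```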

  5. My data are heterogeneous. Can I still use WGCNA?

    Data heterogeneity may affect any statistical analysis, and even more so an unsupervised one such as WGCNA. What, if any, modifications should be made to the analysis depends crucially on whether the heterogeneity (or its underlying driver) is considered "interesting" for the question the analyst is trying to answer, or not. If one is lucky, the main driver of sample differences is the treatment/condition one studies, in which case WGCNA can be applied to the data as is. Unfortunately, often the heterogeneity drivers are uninteresting and should be adjusted for. Such factors can be technical (batch effects, technical variables such as post-mortem interval etc.) or biological (e.g., sex, tissue, or species differences).

    If one has a categorical source of variation (e.g., sex or tissue differences) and the number of samples in each category is large enough (at least 30, say) to construct a network in each category separately, it may be worthwhile to carry out a consensus module analysis (Tutorial II, see WGCNA Tutorials). Because this analysis constructs a network in each category separately, the between-category variation does not affect the analysis.
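    The consensus analysis can be sketched as follows (datExprA/datExprB are assumed per-category expression matrices, samples in rows; power = 12 is illustrative — see Tutorial II for the full workflow):

    ```r
    library(WGCNA)

    # One entry per category, each wrapped in list(data = ...), the multi-set
    # format that blockwiseConsensusModules expects (see checkSets()).
    multiExpr <- list(A = list(data = datExprA),
                      B = list(data = datExprB))

    net <- blockwiseConsensusModules(multiExpr,
                                     power = 12,
                                     networkType = "signed",
                                     corType = "bicor",
                                     maxPOutliers = 0.05)
    ```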

    If it is desired to construct a single network for all samples, the unwanted or uninteresting sources of large variation in the data should be adjusted for. For categorical (ordinal) factors we recommend using the function ComBat (from the package sva). Users who have never used ComBat before should read the help file for ComBat and work through the sva vignette (type vignette("sva") at the R prompt) to make sure they use ComBat correctly.

    For continuous sources of variation (e.g., postmortem interval), one can use simple linear regression to adjust the data. There may be more advanced methods out there that also allow the use of covariates and protect from over-correction.
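    A base-R sketch of such an adjustment (the data are synthetic; the covariate name pmi stands for postmortem interval):

    ```r
    # Adjust each gene for a continuous nuisance covariate by taking the
    # residuals of a per-gene linear regression; the gene's mean is added
    # back so the adjusted values stay on the original scale.
    set.seed(1)
    pmi <- runif(20, 2, 30)
    datExpr <- matrix(rnorm(20 * 100), nrow = 20) + outer(pmi, rnorm(100))

    adjustForCovariate <- function(y, covariate) {
      fit <- lm(y ~ covariate, na.action = na.exclude)
      residuals(fit) + mean(y, na.rm = TRUE)
    }
    datExprAdj <- apply(datExpr, 2, adjustForCovariate, covariate = pmi)
    ```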

    Whichever method is used, we caution the user that removal of unwanted sources of variation is never perfect and it can, in some cases, lead to removal of true interesting signal, and in rare cases it may introduce spurious association signal. Thus, only sources of relatively large variation should be removed.

  6. I can't get a good scale-free topology index no matter how high I set the soft-thresholding power.

    First, the user should ensure that variables (probesets, genes, etc.) have not been filtered by differential expression with respect to a sample trait. See item 2 above for details about beneficial and detrimental filtering of genes or probesets.

    If the scale-free topology fit index fails to reach values above 0.8 for reasonable powers (less than 15 for unsigned or signed hybrid networks, and less than 30 for signed networks) and the mean connectivity remains relatively high (in the hundreds or above), chances are that the data exhibit a strong driver that makes a subset of the samples globally different from the rest. The difference causes high correlation among large groups of genes which invalidates the assumption of the scale-free topology approximation.

    Lack of scale-free topology fit by itself does not invalidate the data, but should be looked into carefully. It always helps to plot the sample clustering tree and any technical or biological sample information below it as in Figure 2 of Tutorial I, section 1; strong clusters in the clustering tree indicate globally different groups of samples. This could be the result of a technical effect such as a batch effect, biological heterogeneity (e.g., a data set consisting of samples from 2 different tissues), or strong changes between conditions (say in a time series). One should investigate carefully whether there is sample heterogeneity, what drives the heterogeneity, and whether the data should be adjusted (see previous point).

    If the lack of scale-free topology fit turns out to be caused by an interesting biological variable that one does not want to remove (i.e., adjust the data for), the appropriate soft-thresholding power can be chosen based on the number of samples as in the table below. This table was updated in December 2017 to make the resulting networks more conservative.

    | Number of samples | Unsigned and signed hybrid networks | Signed networks |
    | --- | --- | --- |
    | Less than 20 | 9 | 18 |
    | 20-30 | 8 | 16 |
    | 30-40 | 7 | 14 |
    | More than 40 | 6 | 12 |
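    The overall power-selection workflow can be sketched as follows (datExpr is an assumed samples-by-genes matrix; the fallback value 12 corresponds to a signed network with more than 40 samples in the table above):

    ```r
    library(WGCNA)

    # Evaluate a range of candidate powers for a signed network.
    powers <- c(1:10, seq(12, 30, by = 2))
    sft <- pickSoftThreshold(datExpr,
                             powerVector = powers,
                             networkType = "signed")

    # Use the estimated power if scale-free topology fit was reached;
    # otherwise fall back to the sample-size-based table value.
    power <- if (!is.na(sft$powerEstimate)) sft$powerEstimate else 12
    ```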
