Like DESeq2, edgeR does not use RPKM, TPM, etc. This is because it needs to adjust for:
- Sequencing depth (that’s what RPKM etc. deal with).
- Library composition (different samples contain different active genes).
How edgeR normalizes libraries
Step 1 Remove all untranscribed genes (remove genes with 0 read counts in all samples).
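Step 1 can be sketched in a few lines of Python. This is only an illustration on a made-up count table (the gene names and numbers are hypothetical), not edgeR's own code:

```python
# Hypothetical toy counts: gene name -> read counts in each sample.
counts = {
    "geneA": [0, 0, 0],
    "geneB": [10, 0, 4],
    "geneC": [0, 0, 0],
    "geneD": [25, 30, 12],
}

def remove_untranscribed(counts):
    """Keep only genes that have at least one read in at least one sample."""
    return {gene: row for gene, row in counts.items() if any(c > 0 for c in row)}

filtered = remove_untranscribed(counts)
print(sorted(filtered))
```

Note that geneB is kept even though one sample has 0 reads for it; only genes with 0 reads in all samples are removed.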
Step 2 Pick one sample to be the “reference sample”, against which all of the other samples are normalized.
What is a good/bad reference sample?
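One way to make “good reference sample” concrete is to pick the most “average” sample. As far as I understand the edgeR documentation, when no reference is given it uses the library whose upper quartile is closest to the mean upper quartile. The sketch below illustrates that idea on hypothetical toy counts; the function names and the simple 75th percentile calculation are my own assumptions, not edgeR's implementation.

```python
# Hypothetical toy data: one list of raw read counts per sample.
samples = [
    [10, 20, 30, 40, 100],
    [10, 10, 10, 10, 60],
    [5, 5, 10, 30, 50],
]

def upper_quartile(values):
    """75th percentile, interpolating linearly between sorted values."""
    ordered = sorted(values)
    pos = 0.75 * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

def pick_reference(samples):
    """Index of the sample whose depth-scaled upper quartile is closest
    to the average upper quartile over all samples."""
    quartiles = []
    for counts in samples:
        total = sum(counts)  # scale by total reads to adjust for depth
        quartiles.append(upper_quartile([c / total for c in counts]))
    mean_q = sum(quartiles) / len(quartiles)
    return min(range(len(samples)), key=lambda i: abs(quartiles[i] - mean_q))

reference_index = pick_reference(samples)
print("reference sample index:", reference_index)
```

In this toy data the first sample sits in the middle, so it is picked. A bad reference would be a sample at either extreme.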
Step 3 Select the genes for calculating the scaling factors. This is done separately for each sample relative to the “reference sample”.
We’ll start by looking at the different types of genes to choose from.
edgeR selects the genes in the middle, with more effort put into excluding biased genes.
Now that we have a table of log ratios to identify biased genes, let’s make another table to identify genes that are highly and lowly transcribed in both samples.
To identify genes that are high or low in both samples, first calculate the geometric mean for each gene, here as the mean of the logs. The geometric mean is not easily influenced by outliers.
Now we have two tables, one to identify biased genes (log2(Reference/Sample2)), and one to identify genes that are highly and lowly transcribed in both samples (mean of logs).
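The two tables can be sketched as follows. This is a minimal illustration that assumes the counts are first divided by each sample's total reads and that genes with 0 reads in either sample are skipped, since their logs are undefined. The gene names, counts and function name are hypothetical.

```python
import math

# Hypothetical toy counts for the reference sample and one other sample.
reference = {"g1": 8, "g2": 4, "g3": 4, "g4": 0}
sample2 = {"g1": 2, "g2": 4, "g3": 8, "g4": 2}

def ratio_and_mean_tables(reference, sample):
    """Return (log2 ratio table, mean-of-logs table) for two samples."""
    ref_total = sum(reference.values())
    sam_total = sum(sample.values())
    log_ratio, mean_log = {}, {}
    for gene in reference:
        r, s = reference[gene], sample[gene]
        if r == 0 or s == 0:
            continue  # log2(0) is undefined, so skip this gene
        log_r = math.log2(r / ref_total)
        log_s = math.log2(s / sam_total)
        log_ratio[gene] = log_r - log_s      # log2(Reference / Sample2)
        mean_log[gene] = (log_r + log_s) / 2  # log of the geometric mean
    return log_ratio, mean_log

log_ratio, mean_log = ratio_and_mean_tables(reference, sample2)
print(log_ratio)
print(mean_log)
```

A log ratio far from 0 flags a gene that is biased towards one sample. A very high or very low mean of logs flags a gene that is highly or lowly transcribed in both.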
Filter out the genes with the top 30% and the bottom 30% of log ratios (the most biased genes).
Filter out the genes with the top 5% and the bottom 5% of mean log values (the most highly and lowly transcribed genes).
The genes that are still in both lists are used to calculate the scaling factor. (Unfortunately, the genes in our example that are in both lists are “…”.)
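The double filter can be sketched like this. It is a simplified illustration that trims by rank and ignores ties. The 30% and 5% cut-offs come from the text above, while the function names, parameter names and toy values are hypothetical.

```python
def middle_genes(values, trim):
    """Genes left after dropping the lowest and highest `trim` fraction."""
    ranked = sorted(values, key=values.get)  # gene names, lowest value first
    cut = int(len(ranked) * trim)            # number of genes cut from each end
    return set(ranked[cut:len(ranked) - cut])

def genes_for_scaling(log_ratio, mean_log, ratio_trim=0.30, mean_trim=0.05):
    """Genes that survive both the log-ratio filter and the mean-of-logs filter."""
    return middle_genes(log_ratio, ratio_trim) & middle_genes(mean_log, mean_trim)

# Hypothetical toy tables for 20 genes, g00 to g19.
genes = [f"g{i:02d}" for i in range(20)]
log_ratio = {g: i - 10.0 for i, g in enumerate(genes)}
mean_log = {g: -float(20 - i) for i, g in enumerate(genes)}
mean_log["g06"] = -100.0  # unbiased enough, but very lowly transcribed

kept = genes_for_scaling(log_ratio, mean_log)
print(sorted(kept))
```

Here g06 passes the log-ratio filter but is dropped by the mean-of-logs filter, so only the genes present in both lists remain.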
Step 4 Calculate the weighted average of the remaining log2 ratios.
FYI, edgeR calls this the “weighted trimmed mean of the log2 ratios”, because we “trimmed” off the most extreme genes.
By excluding the extreme genes, we avoid the effect of outliers (sort of like using the geometric mean).
Once you have selected which genes will be used to calculate the scaling factor, just calculate the weighted average of their log2 ratios. Genes with more reads mapped to them get more weight, because they are less noisy. This is because log ratios have more variance with low read counts.
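Here is a minimal sketch of the weighted average. The text only says that genes with more reads get more weight; the specific weight used below (the inverse of an approximate variance of the log ratio, as in the published description of the TMM method) is my assumption about how that is done, and the toy counts and function name are hypothetical.

```python
import math

# Hypothetical toy counts. "rest" stands in for all other genes and only
# contributes to the total reads of each sample.
reference = {"high": 1000, "low": 4, "rest": 996}
sample2 = {"high": 1000, "low": 2, "rest": 998}

def weighted_mean_log_ratio(reference, sample, genes):
    """Weighted average of log2(Reference / Sample) over the selected genes."""
    ref_total = sum(reference.values())
    sam_total = sum(sample.values())
    numerator = denominator = 0.0
    for gene in genes:
        r, s = reference[gene], sample[gene]
        log_ratio = math.log2((r / ref_total) / (s / sam_total))
        # Approximate variance of the log ratio: large when counts are small.
        variance = (ref_total - r) / (ref_total * r) + (sam_total - s) / (sam_total * s)
        weight = 1.0 / variance
        numerator += weight * log_ratio
        denominator += weight
    return numerator / denominator

result = weighted_mean_log_ratio(reference, sample2, ["high", "low"])
print(result)
```

The plain average of the two log2 ratios (0 and 1) would be 0.5, but the weighted average stays very close to 0 because the noisy low-count gene gets almost no weight.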