信息来源:https://www.drive5.com/usearch/manual/otu_qc.html
Amplicon reads often contain artifacts which are not filtered by my recommended pipeline because they vary widely in different datasets and it would be difficult to account for all of them in a single set of commands. It is generally easier to identify them by manually analyzing the OTU sequences rather than the reads because of the much smaller size of the dataset. Of course, if you are going to repeatedly run a pipeline with reads obtained from similar libaries, it would make sense to modify the pipeline to filter the types of artifact you find.
Here, I describe qualilty control checks that I use in my own work with links to discussion and commands. If you encounter other artifacts in your data, please let me know and I will update this page.
Do the OTU sequences align well to a reference database for your gene?
I strongly recommend aligning your OTUs to a large reference database and reviewing some selected alignments. For most samples, use the largest available reference database for your gene. e.g. SILVA for 16S or UNITE for ITS. If you have control samples (mock community or single strain), then you should align the control OTU sequences to the known sequences in the strain(s).
Check highest and lowest abundance, shortest and longest sequence
You can select a random subset to review by using the fastx_subsample command. I recommend reviewing the top few most abundant OTUs, and also a few of the least abundant. These alignments are especially likely to reveal problems. Also check the shortest and longest sequences.
Here is a typical command for making the alignments.
usearch -usearch_global otus.fa -db silva.udb -id 0.9 -strand both \
-alnout otu.aln -uc otu.uc
Review the alignments in the otu.aln file. Look for patterns which might not be biological. For example, if you see that many OTUs have mismatches near the beginning or end of the sequence, this may be due to primer-binding sequences which should be stripped because PCR tends to cause substitutions.
To get a quick overview of the identity distribution, sort field 4 of the .uc file (%id, see uc file format).
cut -f4 otu.uc | sort -g | uniq -c > id_dist.txt
Do you have many OTUs with 100% id to your reference database? If not, you should figure out why. Review the alignments for some of the OTUs with the highest identity. You may notice that most of these OTUs have mismatches at the same positions, which suggests that there are non-biological sequences in the reads which need to be stripped, e.g. a machine-specific sequence like the 454 TCAG calibration sequence.
The strand is the 5th field in the uc file, so to check that all OTUs are on the same strand:
cut -f5 otu.uc | sort | uniq -c
If you find that you have OTUs on both strands, then you need to orient the reads onto the same strand before quality filtering and re-build the OTUs.
Do all OTUs appear in the OTU table?
All OTU sequences should be found in your OTU table. Here is how to analyze the problem if some are missing.
The three main explanations are:
E1. Some reads matches more than one OTU at 97%, in which case only one of these OTUs may appear in the OTU table.
E2. Some OTUs have matches in the reads, but these matches are not found due to false negatives in the search performed by otutab.
E3. There is a mistake in your pipeline design causing some reads to be included when making the OTUs but excluded when making the OTU table.
Identify the missing OTU sequences
Find the labels of the missing OTUs and make a FASTA file missing.fa with the sequences. You can do this by cutting the labels out of the OTU table and grepping the labels from the FASTA file with the OTU sequences, then using the Linux uniq command to find labels which appear only in the FASTA file. Finally, use the fastx_getseqs command to extract those sequences. For example,
cut -f1 otutab.txt | grep -v "^#" > table_labels.txt
grep "^>" otus.fa | sed "-es/>//" > seq_labels.txt
sort seq_labels.txt table_labels.txt table_labels.txt | uniq -u > missing_labels.txt
usearch -fastx_getseqs otus.fa -labels missing_labels.txt -fastaout missing.fa
Strand duplicates and offsets
Common causes of (E1) are strand duplicates and offsets. You can check for these problems by aligning the missing OTUs to the original OTUs with the missing OTUs removed. If there are matches, then this is most likely due to either strand duplicates or offsets. This will be clear from the alignments: a strand duplicate will align on the negative strand and an offset will align with terminal gaps, which can be seen by noting that the start of the alignment in one of the sequences is not at position 1. To make the database with the not-missing OTUs, you can use the Linux uniq command again, as follows.
sort missing_labels.txt missing_labels.txt seq_labels.txt | uniq -u > notmissing_labels.txt
usearch -fastx_getseqs otus.fa -labels notmissing_labels.txt -fastaout notmissing.fa
Then align the missing OTUs to the not-missing OTUs and examine the alignments:
usearch -usearch_global missing.fa -db -notmissing.fa -strand both -id 0.97 \
-uc missnot.uc -alnout missnot.aln
False negatives by otutab
If only a few OTUs are missing, say one or two, then this may be due to (E2), i.e. a problem with the otutab command. It uses heuristics to optimize speed, and in rare cases it may fail to find a match between a query sequence and an OTU sequence. You can check this by generating an OTU table using missing.fa instead of otus.fa. If you now get non-zero counts for the missing OTUs, then one possible explanation is false negatives. It may also be due to offsets or strand duplicates as discussed below. If the problem is due to search heuristics, then you should be able to fix the problem by setting the -maxrejects option of otutab to a higher value, say -maxrejects 1000.
Tight OTUs
Another cause of (E1) is tight OTUs, i.e. OTUs that have >97% identity. This applies only to cluster_otus, not to denoising (unoise3). See tight OTUs for how to identify this problem.
Check your pipeline
You can't use the same FASTA file as input to OTU clustering (cluster_otus or unoise3) and to otutab because otutab requires sample labels in the sequence labels, so a given unique sequence usually appears many times, while the input to OTU clustering requires exactly one copy of each unique sequence with a size annotation. This adds a complication to the analysis pipeline which could be prone to mistakes, so if none of the above checks identified the problem, then you should make sure that the set of sequences used as input to otutab has all the sequences used to make the OTU sequences. In fact, the input to otutab should generally contain more sequences because I usually recommend using unfitered reads, including singletons, to make the OTU table, while the OTU sequences should be generated from filtered reads with singletons discarded.
Coverage
How much of the data is explained by the OTUs?
Explaining your data
Measuring coverage attempts to answer the question "what fraction of the reads is explained by my OTUs?". This gives a sense of whether the OTUs have good coverage. For example, if 95% of your reads can be explained by the OTUs, then you might feel that the coverage is good. The 5% of reads which are not explained probably contain a mix of rare biological sequences and artifacts such as sequencer errors and chimeras, but with low-abundance sequences these cannot be reliably distinguished so we have to accept some loss in sensitivity to get a reasonably low rate of spurious OTUs.
Analyzing a control sample such as a staggered mock community can be used to calibrate parameters to get the highest sensitivity with an acceptable rate of spurious OTUs.
This page describes how you can investigate coverage indepedently of control samples, e.g. because you didn't sequence any.
Classifying reads
Reads can be divided into five groups as shown in the figure.
(a) Reads that map to OTUs. If we assume that the OTUs are correct biological sequences, then these reads are correct or within 3% of a correct sequence. These are the reads which are counted in the OTU table if you follow my recommendations for mapping reads to OTUs. I think it is reasonable to consider these reads explained by the OTUs, while acknowledging that there may be some cases where multiple species are lumped into a single OTU.
(b) Reads of an OTU with >3% errors. The vast majority of these will have been discarded because they do not map to an OTU.
(c) Correct reads of chimeric amplicons. Most likely, only a small minority of OTU sequences will be chimeric because the cluster_otus and unoise3 commands have very effective chimera filters. If a chimera is <3% diverged from an OTU, some or all of its reads will be mapped to that OTU. I consider this to be harmless for all practical purposes because chimeras are much less abundant than their parents, so the number of chimeric reads added to the OTU count will be small compared to other fluctuations. The large majority, or perhaps all, chimeras in your reads will have parents that are in the OTUs because chimeras are strongly biased to form between high-abundance parents, for pretty obvious reasons.
(d) Chimeric reads with errors. If a chimeric read has errors, say >3% compared with the true chimeric amplicon sequence, then it is impossible to distinguish it reliably from a noisy read of one of the parent sequences or a biological sequence missing from the OTUs.
(e) Reads of biological sequences that are not in the OTUs. We would like to estimate the size of this set, but this cannot be done reliably because the only way to do that is to estimate the sizes of (a) through (d) and see what is left over, but the uncertainties in those sizes are comparable with the size of (e). The calculation ends up being something like (e) = 100% minus (90% +/- 10%), so you really don't have much of an idea how big (e) is -- it could be anywhere between 0% and 20%.
Accounting for reads which do not map to an OTU at 97% identity
Make a FASTQ file for the unmapped reads by using the -notmatchedfq option of the otutab command.
We can safely assume that a large majority of low-quality reads which did not map belong to the OTUs and chimeras formed from the OTUs. This is because the OTUs represent the most abundant template sequences, and most bad reads will be derived from those templates just as most good reads are. With this in mind, pick an expected error threshold, say 2, and consider the reads above the threshold to be explained as noisy reads of known OTUs.
usearch -fastq_filter unmapped.fq -fastq_maxee 2.0 -fastaout unmapped_hiqual.fa \
-fastaout_discarded unmapped_loqual.fa
Now let's try to find chimeric reads. Use the -chimeras option of unoise3 to get the predicted chimeric amplicons. Do this even if you're only using 97% OTUs in your analysis, because this is the best way to get an accurate set of chimera predictions.
If an unmapped read is within, say, 5% of a chimera or an OTU, then it is probably a noisy read with more errors than expected from its Phred scores. Combine the OTUs and chimeras into one database file and use this command to check.
usearch -usearch_global unmapped_hiqual.fa -db otus_chimeras.fa -strand plus -id 0.95 \
-matched unmatched_noisy.fa -notmatched unmatched_hiqual_other.fa
Now you have the number of mapped reads and counts of low quality reads and noisy reads with apparently high quality. If you add these up, it should account for most of your data.
One way (the only way?) to check for valid biological sequences that are not accounted for in the OTUs is to search the remainng reads against a large database like SILVA, e.g.
usearch -usearch_global unmatched_hiqual_other.fa -db silva.udb -strand both \
-id 0.99 -alnout unmapped_silva.aln
I specified a high identity threshold because if you use a lower threshold, say 0.97, you may get spurious alignments from noisy reads. If the identity is much below 100%, it is difficult to decide.
Bad sequencing construct created by PCR
The PCR reaction used to create a sequencing library can create anomalous short constructs due to e.g. to secondary structure formation. These can easily be recognized by using fastx_info to check the length distribution ("Lengths ... min" gives the shortest and "low" the first quartile), or use sortbylength and look at the sequences towards the end of the output file. Using sortbylength with the minseqlength option removes the short sequences. You could also use the fastq_minmergelen and fastq_maxmergelen options of fastq_mergepairs. You can use PCR amplicon prediction on a reference database for your gene to determine the valid length range for your amplicons.
Sequences of both plus and minus strands
If reads are created from both strands of the gene, then you will tend to get duplicated OTUs where one is the reverse-complement of the other.
To check for reads or OTU sequences on both strands, use the orient command with -tabbedout orient.txt. Any reference database will do for a quick check, though a large reference database is recommended for orienting the reads in a production pipeline. To get the number of sequences on each strand, use the following Linux command:
cut -f2 orient.txt | sort | uniq -c
All your OTUs should be on the same strand. If not, you need to adjust the pipeline to perform orientation before dereplication (the fastx_uniques step). You could also use:
usearch -cluster_fast otus.fa -id 0.97 -strand both \
-userout user.txt -userfields query+target+qstrand
Sequences start at different positions in the gene
Sometimes, reads start at different positions in the gene due to using a mix of primers which bind to different locations. For example, Kozich et al. 2013 used a mix with forward primers for V3 and V4 with reverse primers for V4 and V5, giving three type of amplicon: V3-V4, V4, and V3-V5. These must be separated before making OTUs. In other cases, there may be "staggered" primers with small offsets (like GATTACA and ATTACAT, which are offset by one base) to increase image diversity (similar to using a PhiX spike-in) or to reduce the number of species with mismatches. With staggered primers, the reads must be trimmed to the same position before dereplication, so in my toy example the G should be deleted from reads starting with GATTACA (this is an example of what I call global trimming). Actually, the cluster_otus and unoise3 commands are designed to handle some types of staggered primer, so this may not be necessary. You can check for offset sequences using cluster_fast because it does not count terminal gaps as differences. For example,
usearch -cluster_fast otus.fa -id 0.97 -strand both -alnout otus.aln -show_termgaps \
-userout user.txt -userfields query+target+qstrand+qlo+qhi+tlo+thi
Review the alignments for terminal gaps (note the -show_termgaps option) or look for qlo or tlo values in user.txt which are >1, e.g., this Linux command will show qlo values which are not 1:
cut -f4 user.txt | grep -v "^1$"
Another way to check is to compare the OTU sequences to the unique sequences, as follows.
usearch -usearch_global uniques.fasta -db zotus.fa -strand both \
-id 1.0 -maxaccepts 4 -maxrejects 64 -userout uniques_vs_zotus.txt \
-userfields query+target
cut -f1 uniques_vs_zotus.txt | sort | uniq -d
cut -f2 uniques_vs_zotus.txt | sort | uniq -d
If more than one unique sequence matches a given OTU with 100% identity, or vice versa, then you must either have offset sequences in your reads or strand duplicates in your OTUs.
Reads assigned to the wrong sample.
Cross-talk errors assign reads to incorrect samples
In marker gene amplicon sequencing, samples are often multiplexed into a single run by embedding index sequences into amplicons to identify the sample of origin. Reads are assigned to samples (demultiplexed) according to their index sequences. A cross-talk error occurs when a read is assigned to an incorrect sample.
Illumina has a ~0.1% cross-talk error rate
The cross-talk error rate was estimated to be ~0.1% in twelve Illumina datasets including one single-indexed GAIIx run and eleven dual-indexed MiSeq runs, as described in the UNCROSS paper. This means that on average, 0.1% of reads are assigned to the wrong sample. However, there are fluctuations, so in some OTUs the rate can be substantially higher. In a given OTU, the number of reads assigned to a single sample could be inflated by up to ~1% of the total reads in that OTU. Thus, if the OTU table shows that up to around 1% of the reads were assigned to a given sample, the correct count could be zero and this would then give a false-positive identification of the species (or group of species) in the OTU. Cross-talk thus tends to inflate estimates of richness and alpha diversity. Beta diversity may also be inflated because samples may appear to share the same spurious OTUs.
Other next-generation machines seem to generate similar cross-talk rates. The reasons are unclear, though at least some cross-talk is probably due to read errors in the index (also called tag or barcode) sequences.
Control samples
Cross-talk can be identified most reliably in control samples such as a null sample (e.g. distilled water) and designed (mock) communities where the sequences are known.
Identifing and filtering cross-talk
The otutab_xtalk command attempts to identify and filter cross-talk in an OTU table.
Polymerase errors and bad base calls
Almost certainly, some of your OTUs will be spurious due to sequence errors due to amplification (PCR substitution errors and chimeras) and sequencing. Unfortunately, these are very difficult to identify, even in control samples, which is why I strongly recommend expected error filtering and discarding singletons.
The only reasonably reliable way to test for spurious OTUs due to sequence errors is to use a control sample with known sequences, i.e. a single strain or a mock community. See control samples for discussion.
If you have a control sample, then you will probably find that the reads for the controls have spurious OTUs due to cross-talk, so you should try to distinguish these from spurious OTUs due to sequence errors. You can do this by using the uncross command and manually reviewing the predictions, or by aligning OTUs to the reference sequences in your control sample and examining those which match, but not exactly, say in the range 95% to 99% identity. These OTUs are quite likely explained by sequence errors, but unfortunately other explanations are possible. Typical mock communities have strains which are common human pathogens, so if your "real" samples are from human, they may contain closely species (this often happens in data that I analyze in my own work, so this is not as paranoid as it might sound). In that case, an OTU with some differences compared with the reference sequence might be the correct sequence of a strain in the other samples. This can be tested to some extent by making OTUs from the control samples only, keeping in mind that this will not eliminate cross-talk. Also, it is impossible to distinguish chimeras from sequence errors when the number of differences is small. Bottom line, with some effort you can probably get a good sense of which OTUs in your mock samples have sequence errors, but you can't be sure.
If it seems likely that your controls samples have an unacceptable number of OTUs with sequence errors, then you can adjust the pipeline by increasing the stringency of the quality filtering. There are two ways to do this: reduce the expected error filtering threshold, and increase the unique sequence abundance threshold to discard sequences with abundance two or higher instead of just singletons. With some trial and error, you can tune these parameters by looking at: (1) the accuracy of the control OTUs as described above, and (2) by measuring coverage. If coverage drops too far, you may be filtering too stringently.
Sequencer noise
Low-complexity reads may be generated during sequencing. For example, Illumina reads typically contain low-complexity noise if they continue past the reverse sequencing adapter due to a short construct. Low-complexity sequence can be detected and filtered using the filter_lowc command.
Unfiltered spike-in
A PhiX spike-in is used for Illumina sequencing of amplicon libraries for 16S and other marker genes. Then, PhiX is typically found in around 5% of the raw reads. Often, these are filtered before you see the FASTQ files, but not always, and third-party PhiX filtering software may not be accurate. In usearch, PhiX search is based on the ublast algorithm, and is very sensitive.
The search_phix command can be used for searching and the filter_phix command can be used for filtering.
Unfiltered PCR chimeras
The cluster_otus and unoise3 commands have built-in filters for chimeras. These filters are not perfect, so there may be cases where non-chimeric OTUs were incorrectly discarded (false positives) or chimeric OTUs were not filtered (false negatives).
In the past, I have suggested using reference-based UCHIME as a post-processing step to filter chimeric OTUs. This is a bad idea with the current implementations of cluster_otus and unoise3 because the error rate of reference-based UCHIME is much higher than the error rates of the built-in de novo filters. Therefore, if you run uchime2_ref on your OTUs, the OTUs that are discarded are more likely to be false positives than true chimeras. The high accuracy claims of the original UCHIME paper were exaggerated because of unrealistic benchmark tests; this is explained in the UCHIME2 paper.
It turns out that it is impossible in principle to distinguish chimeras from correct sequences, even when there are no sequence errors and the reference database is complete. This is a very surprising, almost shocking, result which is reported in the UCHIME2 paper. The reason is "fake models", where a correct sequence can be constructed as a chimera from two other correct sequences. Chimeras can have identical sequences to valid genes, so it is impossible for an algorithm to distinguish the two cases from a sequence alone. Fake models are common in practice, hence the problem.
So, what should you do? I would suggest running uchime2_ref with the largest possible database (which would be SILVA for 16S). Review the results with -mode balanced and -mode high_confidence. Usually, I find that -mode high_confidence is more useful because -mode balanced gives too many questionable predictions. With high_confidence, you will probably see a small number of quite convincing chimeric alignments. It is then a judgement call whether you think these are false positives due to fake models or true chimeras. It's anyone's guess, because it is impossible to distinguish these two cases. If you get a lot of convincing alignments and you think this may be due to problems with the built-in filter in cluster_otus or unoise3, then by all means send them to me for review.
Primers amplify a different region
Sometimes primers amplify the wrong target, e.g. human DNA instead of 16S.
One way to identify this type of OTU is use ublast to search a large nucleotide database (e.g. NCBI nt), or submit the OTUs to the NCBI BLAST web site. Take the top hit, or top few hits, for each OTU, delete those containing expected strings for your gene such as "16S" or "SSU" and examine the rest more closely.
Another method is to use the sintax command to predict taxonomy and examine those with low boostrap confidence at domain or phylum level. This should give a smaller set which you can examine further, e.g. by submitting to NCBI BLAST.
A third option is to use the usearch_global command to align the OTUs to a large sequence database such as SILVA. Examine any which fail to align or have low identities (say, <80%).
Self-explanatory
Contaminant species can be introduced into your samples at every stage in the experimental protocol from sample collection to sequencing (e.g., residue on the flow-cell).
Detecting contaminants is difficult/impossible except in control samples e.g., distilled water, or a mock community. Control samples can be used to detect contaminants introduced in the stages from library preparation to sequencing. Note that cross-talk can also cause unexpected sequences in a control sample.
Primer-binding sequences should be stripped at the start of the pipeline
Usually, the primer-binding sequences in your gene (e.g., 16S) are included in the reads. These should be stripped out, because the PCR reaction tends to cause substitutions in those sequences (most often to remove any mismatches).
You can check whether they appear in the OTUs by using the search_oligodb command, e.g.:
usearch -search_oligodb otus.fa -db primers.fa -strand both \
-userout primer_hits.txt -userfields query+qlo+qhi+qstrand
If primers have not been stripped, you will see many hits to the forward primer at the start of the OTUs. If you have full-length paired reads without length trimming, you may also see many hits to the reverse primer at the end of the OTUs. If the primers were trimmed incorrectly, e.g. the primer is 20nt but you only stripped 16, this command won't catch the problem because all the letters in the database sequence must be included in the alignment (substitutions are allowed but gaps are not allowed, even terminal gaps).
Primer stripping should be done before quality filtering (because every base increase expected errors) and before finding unique sequences (because variation in the primer-binding region will split over biological sequence over several uniques, degrading the calculation of unique sequence abundance), so if primers do appear in the OTUs then you should go back and fix the pipeline.
OTUs >97% identical
This problem applies to 97% OTU clustering, not to denoising (ZOTUs). The cluster_otus command is designed to generate a set of OTU sequences that are <97% identical to each other. If a pair of OTU sequences is >=97% identical, this is a mistake by the algorithm which is sometimes caused by the heuristics used for speed.
This is a pretty benign problem -- the 97% threshold is arbitrary anyway, so if some clusters are a bit "tighter", say 98% separated rather than 97%, it probably doesn't make any practical difference to the biological results.
To check for it, you can cluster with UCLUST with parameters chosen for high sensitivity rather than speed. For example,
usearch -cluster_fast otus.fa -id 0.97 -maxaccepts 4 -maxrejects 128 \
-top_hit_only -uc hits.uc -centroids new_otus.fa
With this command, there would ideally be no hits because each OTU should be in its own cluster. To check for hits and see the identities,
grep "^H" hits.uc | cut -f4 | sort -g
If you do get hits, then you can use new_otus.fa as an improved set of OTU sequences.