攻略还发在https://github.com/gotouerina/GenomeMethPipeline 这个更全,如果好用可以点个star
需要用GPU,V100或者A100都行, 4090什么的估计不行
下机是FASTA5格式,转成pod5格式储存。
pod5软件应该可以用pip装
pod5 convert fast5 fast5/*.fast5 --output pod5/ --one-to-one ./fast5
然后用dorado提取修饰,这一步要GPU操作
/groups/lzu_public/home/u220220932211/software/dorado-0.7.1-linux-x64/bin/dorado/dorado basecaller /groups/lzu_public/home/u220220932211/software/dna_r10.4.1_e8.2_400bps_hac@v5.0.0/ /groups/lzu_public/home/u220220932211/work/ --modified-bases-models /groups/lzu_public/home/u220220932211/software/dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mC_5hmC@v1/ > calls.bam
最后用modkit提取甲基化信息
modkit pileup calls.bam out.bed --log-filepath pileup.log
提取结果长这样:
官方解释:
Definitions:
Nmod - Number of calls passing filters that were classified as a residue with a specified base modification.
Ncanonical - Number of calls passing filters were classified as the canonical base rather than modified. The exact base must be inferred by the modification code. For example, if the modification code is m (5mC) then the canonical base is cytosine. If the modification code is a, the canonical base is adenosine.
Nother mod - Number of calls passing filters that were classified as modified, but where the modification is different from the listed base (and the corresponding canonical base is equal). For example, for a given cytosine there may be 3 reads with h calls, 1 with a canonical call, and 2 with m calls. In the bedMethyl row for h Nother_mod would be 2. In the m row Nother_mod would be 3.
Nvalid_cov - the valid coverage. Nvalid_cov = Nmod + Nother_mod + Ncanonical, also used as the score in the bedMethyl
Ndiff - Number of reads with a base other than the canonical base for this modification. For example, in a row for h the canonical base is cytosine, if there are 2 reads with C->A substitutions, Ndiff will be 2.
Ndelete - Number of reads with a deletion at this reference position
Nfail - Number of calls where the probability of the call was below the threshold. The threshold can be set on the command line or computed from the data (usually failing the lowest 10th percentile of calls).
Nnocall - Number of reads aligned to this reference position, with the correct canonical base, but without a base modification call. This can happen, for example, if the model requires a CpG dinucleotide and the read has a CG->CH substitution such that no modification call was produced by the basecaller.
附图