

Microbial genome component and annotation pipeline


MGCA is still under development and is designed to perform the following analyses:

Genomic Component

  • HGT
    • Genomic Island

    • Prophage

    • CRISPR-Cas

  • Repeat Sequences
    • Tandem Repeats
    • Interspersed Repeats
  • Non-coding RNA
    • rRNA
    • tRNA
    • sRNA

    Genomic Attributes

  • Genome Survey
  • Protein Properties
  • WGS-based Species Identification
  • Functional Annotation

  • General Annotation

    • SwissProt

    • Pfam

    • GO

    • KEGG

  • Target Gene Mining
    • Effectors

      • T3SS

      • T4SS

      • Secretory/Membrane/Intracellular Protein

      • Secondary Metabolite Biosynthetic Gene Clusters

    • Virulence/Pathogenicity/Resistance Gene

      • Antibiotic Resistance Genes (ARGs)

      • Pathogen Host Interactions (PHI)

      • Comprehensive Antibiotic Resistance Database (CARD)

    • Element Cycle
      • CAZyme
      • Nitrogen
      • Sulfur
      • Methane
    • Membrane Transport Protein (TCDB)

    Comparative Genomics

  • Collinearity
  • Positive Selection
  • SNP

    NOTICE: Development will take a long time to complete!


    The software has been tested successfully on Windows (WSL), Linux x64, and macOS. Because it relies on a large number of other programs, installing it with Bioconda is recommended.

    Step 1: Install MGCA

    Method 1: use mamba to install MGCA

    # Install mamba first
    conda install mamba

    # Usually specify the latest version of MGCA

    mamba create -n mgca mgca=0.0.0

    # If the command above reports that mgca cannot be found, use the following command instead
    mamba create -n mgca

    Step 2: Set up the database (run this once after installing mgca for the first time)

    conda activate mgca

    setupDB --all

    conda deactivate

    Notice: setupDB currently has a minor bug. To work around it, edit the "setupDB" file located in the mgca installation path and remove every line after line 83.
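
The edit described in this notice can also be scripted. The following is a minimal sketch, assuming `setupDB` is a plain-text script that is on your `PATH` once the `mgca` environment is activated. The helper name `truncate_after_line` is hypothetical and is not part of MGCA. It keeps a `.bak` copy so the change can be undone.

```shell
# Hypothetical helper (not shipped with MGCA): keep only the first N lines
# of a file and save the untouched original next to it as <file>.bak.
truncate_after_line() {
    file=$1
    keep=$2
    # Back up first, preserving mode and timestamps.
    cp -p "$file" "$file.bak" || return 1
    # Write into the existing file so it keeps its permissions
    # (for example the executable bit).
    head -n "$keep" "$file.bak" > "$file"
}

# Usage, inside the activated mgca environment (assumes setupDB is on PATH):
#   truncate_after_line "$(command -v setupDB)" 83
```

To undo the change, copy the `.bak` file back over the edited script.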

    Required dependencies




    Perl & the modules

  • perl-bioperl
  • phispy 4.2.21

    R & the packages

  • ggplot2
  • wget

    In the future:

        #- gtdbtk
        #- bakta (include trnascan-se infernal piler-cr)
        #- repeatmasker (include trf)
        #- mummer4
        #- artemis (include openjdk)
        #- saspector (include trf progressivemauve prokka)
        #- lastz
        #- kakscalculator2
        #- interproscan (include emboss openjdk)
        #- eggnog-mapper (include wget)


    Print the help messages:

    mgca --help

    General usage:

    mgca [modules] [options]


  • [--PI] Calculate statistics of protein properties and print pI of all protein sequences

  • [--IS] Predict genomic islands from GenBank files

  • [--PROPHAGE] Predict prophage sequences from GenBank files

  • [--CRISPR] Find CRISPR-Cas systems in genomic or metagenomic datasets

    Examples

    Example 1: Calculate statistics of protein properties and print pI of all protein sequences

    mgca --PI --AAsPath <PATH> --aa_suffix <.faa>

    Example 2: Predict genomic islands from GenBank files

    mgca --IS --gbkPath <PATH> --gbk_suffix <.gbk>

    Example 3: Predict prophage sequences from GenBank files

    mgca --PROPHAGE --gbkPath <PATH> --gbk_suffix <.gbk> --phmms <Path of pVOG.hmm> --phage_genes <1> --min_contig_size <5000> --threads <6>

    Example 4: Find CRISPR-Cas systems in genomic or metagenomic datasets

    mgca --CRISPR --scafPath <PATH> --scaf_suffix <.fa> --casDBpath <db path> --threads <6>
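
Each example pairs a directory option with a suffix option. As a pre-flight check, the sketch below counts the files in a directory that end with a given suffix, so that an empty or mistyped input folder is caught before a long run starts. It assumes MGCA picks up input files by suffix directly inside the given directory, which the options above suggest but do not confirm. `count_inputs` is a hypothetical helper, not an MGCA command.

```shell
# Hypothetical helper (not part of MGCA): count regular files directly
# inside DIR whose names end with SUFFIX.
count_inputs() {
    dir=$1
    suffix=$2
    # -maxdepth 1 restricts the search to DIR itself (no subdirectories);
    # tr strips the padding that some wc implementations add.
    find "$dir" -maxdepth 1 -type f -name "*$suffix" | wc -l | tr -d ' '
}

# Usage (the path ./gbk is a placeholder):
#   n=$(count_inputs ./gbk .gbk)
#   [ "$n" -gt 0 ] || echo "no .gbk files found in ./gbk" >&2
```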



    Results/PI/*.pepstats: Peptide statistics for each protein sequence organized by the genome.

    Results/PI/*.pI: Protein isoelectric point and its frequency.

    Results/PI/*.pI.tiff: A plot drawing 'Relative frequency' vs. 'isoelectric point'.


    Results/IS/All_island.list: A list file containing genomic island information.

    Results/IS/All_island.txt: A file containing the information and sequences of the genes in each genomic island.


    Results/PROPHAGE/*_prophage: Result for each genome.

    Results/PROPHAGE/All.prophages.txt: Summary results for all genomes, including information about the prophages in each host genome.

    Results/PROPHAGE/All.prophages.seq: Summary results for all genomes, including information about the prophage genes and their sequences.


    Results/CRISPR/*_intially: Results obtained with permissive BLAST parameters (can be ignored in most cases).

    Results/CRISPR/*_filtered: Results obtained after quality control of *_intially (the final result).

    Results/CRISPR/*_filtered/*.csv: A file containing information about the CRISPR arrays.

    Results/CRISPR/*_filtered/*.png: Visualizations of all predicted CRISPR arrays.
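
After a run, it can help to confirm which module outputs were actually produced. The following is a minimal sketch, assuming results are written to a `Results` directory with the per-module subdirectories listed above (`PI`, `IS`, `PROPHAGE`, `CRISPR`). `report_results` is a hypothetical helper and not part of MGCA.

```shell
# Hypothetical helper (not part of MGCA): report, for each module directory
# named in this README, whether it exists and contains at least one entry.
report_results() {
    root=${1:-Results}
    for module in PI IS PROPHAGE CRISPR; do
        dir=$root/$module
        # ls -A lists everything except "." and "..", so an empty
        # directory yields an empty string.
        if [ -d "$dir" ] && [ -n "$(ls -A "$dir")" ]; then
            echo "$module: present"
        else
            echo "$module: missing or empty"
        fi
    done
}

# Usage, from the directory where mgca was run:
#   report_results Results
```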


    MGCA is free software, licensed under GPLv3.

    Feedback and Issues

    Please report any issues to the issues page or email us at


    If you use this software, please cite: Hualin Liu. MGCA: microbial genome component and annotation pipeline. Available at GitHub



