Reposted from: https://github.com/sunzhengCDNM/MAP2B

Installation

System requirements

Dependencies

All scripts in MAP2B are written in Perl and Python, and running MAP2B in a conda environment is recommended. The program works on Unix systems and Mac OS X, as all required packages can be downloaded and installed appropriately.

Memory usage

> 14G RAM is required to run this pipeline.
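A quick pre-flight check of the installed memory can save a failed run. The sketch below is not part of MAP2B; it assumes a Unix-like system where `os.sysconf` exposes `SC_PAGE_SIZE` and `SC_PHYS_PAGES`, and it reads the requirement above as "at least 14 GiB". The function names are made up for illustration.

```python
import os

REQUIRED_GIB = 14  # threshold taken from the memory requirement stated above


def total_ram_gib():
    """Return installed physical memory in GiB, or None if it cannot be read."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        n_pages = os.sysconf("SC_PHYS_PAGES")
    except (ValueError, OSError, AttributeError):
        # Name not known on this platform, or os.sysconf missing entirely.
        return None
    if page_size <= 0 or n_pages <= 0:
        return None
    return page_size * n_pages / 1024 ** 3


def meets_requirement(ram_gib, required_gib=REQUIRED_GIB):
    """True when the RAM size is known and reaches the required amount."""
    return ram_gib is not None and ram_gib >= required_gib


if __name__ == "__main__":
    ram = total_ram_gib()
    if ram is None:
        print("Could not determine the installed RAM on this platform.")
    elif meets_requirement(ram):
        print(f"{ram:.1f} GiB RAM detected: enough for MAP2B.")
    else:
        print(f"{ram:.1f} GiB RAM detected: below the {REQUIRED_GIB} GiB MAP2B asks for.")
```

Note that this reports installed memory, not free memory; other jobs on the same machine may still leave too little for the pipeline.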

Download the pipeline

Clone the latest version from GitHub (recommended):

git clone https://github.com/sunzhengCDNM/MAP2B/

cd MAP2B

This makes it easy to update the software in the future using git pull as bugs are fixed and features are added. Alternatively, download the whole GitHub repository directly, without installing Git:

wget https://github.com/sunzhengCDNM/MAP2B/archive/master.zip

unzip master.zip

cd MAP2B-master

Install MAP2B in a conda environment

Conda installation

Miniconda provides the conda environment and package manager, and is the recommended way to install MAP2B.

Create a conda environment for the MAP2B pipeline:

After installing Miniconda and opening a new terminal, make sure you’re running the latest version of conda:

conda update conda

Once you have conda installed, create a conda environment with the yml file config/MAP2B-20230420-conda.yml:

conda env create -n MAP2B.1.5 --file config/MAP2B-20230420-conda.yml

Activate the MAP2B conda environment by running the following command:

conda activate MAP2B.1.5

or

source activate MAP2B.1.5

Make sure the MAP2B conda environment has been activated with the above command every time before you run MAP2B.
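One way to verify the activation from inside a script is to look at the `CONDA_DEFAULT_ENV` environment variable, which conda sets to the name of the active environment. This is a minimal sketch, not something MAP2B does itself; the helper names are hypothetical, and the environment name is the one used in the commands above.

```python
import os

EXPECTED_ENV = "MAP2B.1.5"  # environment name from the install commands above


def active_conda_env(environ=os.environ):
    """Return the name of the active conda environment, or None."""
    return environ.get("CONDA_DEFAULT_ENV")


def map2b_env_is_active(environ=os.environ, expected=EXPECTED_ENV):
    """True when the active conda environment is the MAP2B one."""
    return active_conda_env(environ) == expected


if __name__ == "__main__":
    if map2b_env_is_active():
        print(f"Conda environment {EXPECTED_ENV} is active.")
    else:
        print(f"Active environment is {active_conda_env()!r}; "
              f"run 'conda activate {EXPECTED_ENV}' first.")
```

If the environment was created under a different name, pass that name as `expected`.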

The workflow begins by checking the database's existence, and if it is not found, the corresponding database will be downloaded automatically to the software installation path. This download process may take some time, but it ensures that the necessary databases are readily available for the workflow. Alternatively, you can also download the GTDB database and RefSeq database independently using the following commands:

for GTDB database

python3 scripts/DownloadDB.py -l config/GTDB.CjePI.database.list -d database/GTDB

for RefSeq database

python3 scripts/DownloadDB.py -l config/RefSeq.CjePI.database.list -d database/RefSeq
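The "check first, download only if missing" behaviour described above can be sketched in a few lines. How MAP2B itself decides that a database is present is not documented here, so the test below (a non-empty target directory) is an assumption, and the helper names are hypothetical. The list files and target directories are copied from the two commands above.

```python
import os
import subprocess

# List file and target directory per data source, as in the commands above.
DATABASES = {
    "GTDB": ("config/GTDB.CjePI.database.list", "database/GTDB"),
    "RefSeq": ("config/RefSeq.CjePI.database.list", "database/RefSeq"),
}


def download_command(source):
    """Build the DownloadDB.py command line for one data source."""
    list_file, target_dir = DATABASES[source]
    return ["python3", "scripts/DownloadDB.py", "-l", list_file, "-d", target_dir]


def looks_installed(target_dir):
    """Assumption: a non-empty directory counts as 'database present'."""
    return os.path.isdir(target_dir) and bool(os.listdir(target_dir))


def ensure_database(source, dry_run=True):
    """Return the download command if the database seems missing, else None.

    With dry_run=False the command is also executed.
    """
    _, target_dir = DATABASES[source]
    if looks_installed(target_dir):
        return None
    cmd = download_command(source)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Dry run from the MAP2B root directory: only report what would happen.
    for name in DATABASES:
        cmd = ensure_database(name)
        print(name, "->", "already present" if cmd is None else " ".join(cmd))
```

A partially downloaded database would pass this simple check, so it is only a first filter.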

Now everything is ready for MAP2B :). Let's get started.

Using MAP2B

Quick start

MAP2B is a highly automated pipeline that needs only a few parameters.

We provide a real paired-end sequencing dataset from a mock community:

cd example

mkdir -p data/

wget -t 0 -O data/shotgun_MSA-1002_1.fq.gz https://figshare.com/ndownloader/files/38346149/shotgun_MSA-1002_1.fq.gz

wget -t 0 -O data/shotgun_MSA-1002_2.fq.gz https://figshare.com/ndownloader/files/38346155/shotgun_MSA-1002_2.fq.gz

After downloading the sequencing data, we can finally run MAP2B:

python3 ../bin/MAP2B.py -i data.list

The file data.list shows how to prepare your input data; both single-end and paired-end data can be used as input.

sample1 <tab> shotgun1_left.fastq(.gz) <tab> shotgun1_right.fastq(.gz)

sample2 <tab> shotgun2.fastq(.gz)

sample3 ...
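For many samples, the list can be generated rather than typed. The sketch below assumes a naming convention of `<sample>_1.fq.gz` / `<sample>_2.fq.gz` for paired-end files (as in the example data) and treats any other FASTQ file as single-end; both the convention and the function name are assumptions, so adjust the pattern to your own file names.

```python
import os
import re

# Assumed naming: <sample>_1.fq(.gz) and <sample>_2.fq(.gz) form a pair;
# a FASTQ file without the _1/_2 suffix is treated as single-end.
FASTQ = re.compile(r"^(?P<stem>.+?)(?:_(?P<mate>[12]))?\.(?:fq|fastq)(?:\.gz)?$")


def build_rows(filenames, data_dir="data"):
    """Group FASTQ file names by sample and return one row per sample."""
    samples = {}
    for name in sorted(filenames):  # sorting puts _1 before _2
        match = FASTQ.match(name)
        if match is None:
            continue  # not a FASTQ file
        samples.setdefault(match.group("stem"), []).append(f"{data_dir}/{name}")
    return [[sample] + paths for sample, paths in sorted(samples.items())]


if __name__ == "__main__":
    # Print tab-separated rows; redirect the output into data.list.
    if os.path.isdir("data"):
        for row in build_rows(os.listdir("data")):
            print("\t".join(row))
```

Run it from the directory that contains `data/` and redirect the output to a file, then check the result by eye before passing it to `-i`.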

Parameters

The main program is bin/MAP2B.py in this repo. You can check out the usage by printing the help information via python3 bin/MAP2B.py -h.

usage: MAP2B.py [-h] -i INPUT [-o OUTPUT] [-d DATABASE] [-p PROCESSES] [-g GSCORE]

optional arguments:

-h, --help show this help message and exit

-i INPUT The filepath of the sample list. Each line includes an input sample ID and the file path of corresponding DNA sequence data where each field should be separated by <tab>. The line in this file that begins with # will be ignored.

sample <tab> shotgun.1.fq(.gz) (<tab> shotgun.2.fq.gz)

-o OUTPUT Output directory, default ./MAP2B_result

-s {GTDB,RefSeq} Data source, choose from GTDB or RefSeq, default GTDB

-d DATABASE Database path for MAP2B pipeline, MAP2B_path/database

-p PROCESSES Number of processes, note that more threads may require more memory, default 1

-g GSCORE Using G score as the threshold for species identification, -g 5 is recommended. Enabling G score will automatically shutdown false positive recognition model, default none

author: Liu Jiang, Zheng Sun

mail: jiang.liu@oebiotech.com, spzsu@channing.harvard.edu

last update: 2023/04/20 20:03:47

version: 1.5
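Before starting a long run it can help to validate the sample list against the rules given for `-i` above: tab-separated fields, one or two read files per sample, and lines beginning with `#` ignored. The following is an illustrative sketch, not MAP2B's own parser; skipping blank lines is an extra assumption, and the function name is hypothetical.

```python
def parse_sample_list(lines):
    """Parse sample-list lines into (sample_id, [read files]) tuples.

    Raises ValueError for a line that does not have 2 or 3 tab-separated fields.
    """
    samples = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        # Comment lines are ignored; skipping blank lines is an assumption.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) not in (2, 3) or not all(fields):
            raise ValueError(
                f"line {lineno}: expected 'sample<tab>reads1[<tab>reads2]', got {line!r}"
            )
        samples.append((fields[0], fields[1:]))
    return samples


if __name__ == "__main__":
    example = [
        "# sample list\n",
        "sample1\tshotgun1_left.fastq.gz\tshotgun1_right.fastq.gz\n",
        "sample2\tshotgun2.fastq.gz\n",
    ]
    for sample_id, reads in parse_sample_list(example):
        layout = "paired-end" if len(reads) == 2 else "single-end"
        print(sample_id, layout, reads)
```

To check a real file, pass an open file handle, for example `parse_sample_list(open("data.list"))`.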

If you are dealing with low-biomass samples, we recommend using the -g 3 or -g 5 parameters to keep as many species as possible. Although false-positive detection is still a challenge for low-biomass samples, keep in mind that the G-score ranking is highly relevant to the likelihood that a species is a true positive. You can then set a G-score threshold based on your own understanding of the samples.
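As an illustration, a run with a G-score threshold can be assembled programmatically from the options listed in the help text (`-i`, `-o`, `-s`, `-d`, `-p`, `-g`). This is a small sketch with a hypothetical helper name; it only builds and prints the command, and the script path assumes you are in the MAP2B root directory.

```python
import shlex


def map2b_command(sample_list, output=None, source=None, database=None,
                  processes=None, gscore=None, script="bin/MAP2B.py"):
    """Build a MAP2B.py command line; options left as None are omitted."""
    cmd = ["python3", script, "-i", sample_list]
    options = (("-o", output), ("-s", source), ("-d", database),
               ("-p", processes), ("-g", gscore))
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    # Low-biomass example: G-score threshold of 3, four processes.
    cmd = map2b_command("data.list", output="MAP2B_result", processes=4, gscore=3)
    print(shlex.join(cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Remember that, per the help text, setting `-g` switches off the false-positive recognition model, so a run with and without `-g` is not directly comparable.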

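lishasha has the RefSeq database and the GTDB database configured (the databases for the CjePI enzyme); lishasha1 has the GTDB database configured.

config/RF_none_0238.v2.pkl is the false-positive filtering model already trained by the authors.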

Installation path
/usr/lishasha/biosoft/MAP2B-master

Installation and usage of MAP2B.1.5