Reposted from: https://github.com/sunzhengCDNM/MAP2B
Installation
System requirements
Dependencies
All scripts in MAP2B are written in Perl and Python, and running MAP2B in a conda environment is recommended. The program works on Unix systems and Mac OSX, provided that all required packages can be downloaded and installed.
Memory usage
More than 14 GB of RAM is required to run this pipeline.
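The memory requirement can be checked before starting a run. The following is a minimal sketch and not part of MAP2B: it assumes a Unix-like system where Python's os.sysconf exposes SC_PAGE_SIZE and SC_PHYS_PAGES, and the helper names total_ram_gb and has_enough_ram are hypothetical.

```python
import os

REQUIRED_GB = 14  # threshold taken from the requirement stated above


def total_ram_gb():
    """Return total physical memory in GiB, or None if it cannot be determined."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (ValueError, OSError, AttributeError):
        # sysconf or one of these names is unavailable on this platform
        return None
    if page_size <= 0 or page_count <= 0:
        return None
    return page_size * page_count / 1024 ** 3


def has_enough_ram(total_gb, required_gb=REQUIRED_GB):
    """True only if the detected memory is strictly above the requirement."""
    return total_gb is not None and total_gb > required_gb


if __name__ == "__main__":
    detected = total_ram_gb()
    if detected is None:
        print("Could not determine total RAM on this platform.")
    elif has_enough_ram(detected):
        print(f"{detected:.1f} GiB detected: above the {REQUIRED_GB} GB requirement.")
    else:
        print(f"{detected:.1f} GiB detected: below the {REQUIRED_GB} GB requirement.")
```

This reports installed memory, not free memory, so other running jobs can still cause a run to fail.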
Download the pipeline
Clone the latest version from GitHub (recommended):
git clone https://github.com/sunzhengCDNM/MAP2B/
cd MAP2B
This makes it easy to update the software in the future using git pull as bugs are fixed and features are added. Alternatively, download the whole GitHub repository as a zip archive, which does not require Git:
wget https://github.com/sunzhengCDNM/MAP2B/archive/master.zip
unzip master.zip
cd MAP2B-master
Install MAP2B in a conda environment
Conda installation
Miniconda provides the conda environment and package manager, and is the recommended way to install MAP2B.
Create a conda environment for the MAP2B pipeline:
After installing Miniconda and opening a new terminal, make sure you’re running the latest version of conda:
conda update conda
Once you have conda installed, create a conda environment with the yml file config/MAP2B-20230420-conda.yml.
conda env create -n MAP2B.1.5 --file config/MAP2B-20230420-conda.yml
Activate the MAP2B conda environment by running the following command:
conda activate MAP2B.1.5
or
source activate MAP2B.1.5
Make sure the MAP2B conda environment has been activated with the above command every time before you run MAP2B.
The workflow begins by checking the database's existence, and if it is not found, the corresponding database will be downloaded automatically to the software installation path. This download process may take some time, but it ensures that the necessary databases are readily available for the workflow. Alternatively, you can also download the GTDB database and RefSeq database independently using the following commands:
For the GTDB database:
python3 scripts/DownloadDB.py -l config/GTDB.CjePI.database.list -d database/GTDB
For the RefSeq database:
python3 scripts/DownloadDB.py -l config/RefSeq.CjePI.database.list -d database/RefSeq
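To decide which of the two download commands still needs to be run, a small check such as the following may help. It is a sketch under stated assumptions and does not reproduce MAP2B's own database check: a database is treated as present when its directory under database/ exists and is not empty, and the function names are hypothetical. Only the script path, list files, and target directories come from the commands above.

```python
from pathlib import Path

SOURCES = ("GTDB", "RefSeq")


def download_command(source):
    """Return the documented DownloadDB.py command for one data source."""
    if source not in SOURCES:
        raise ValueError(f"unknown data source: {source!r}")
    return [
        "python3", "scripts/DownloadDB.py",
        "-l", f"config/{source}.CjePI.database.list",
        "-d", f"database/{source}",
    ]


def database_present(map2b_root, source):
    """Assumption: a non-empty database/<source> directory means 'downloaded'."""
    db_dir = Path(map2b_root) / "database" / source
    return db_dir.is_dir() and any(db_dir.iterdir())


def missing_database_commands(map2b_root):
    """List the download commands for every source that looks missing."""
    return [
        download_command(source)
        for source in SOURCES
        if not database_present(map2b_root, source)
    ]


if __name__ == "__main__":
    # Run from the MAP2B installation directory.
    for command in missing_database_commands("."):
        print(" ".join(command))
```

A partially downloaded database would pass this check, so it is only a quick first look and not a completeness test.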
Now everything is ready for MAP2B :) Let's get started.
Using MAP2B
Quick start
MAP2B is a highly automated pipeline that requires only a few parameters.
We have prepared real paired-end sequencing data from a mock community:
cd example
mkdir -p data/
wget -t 0 -O data/shotgun_MSA-1002_1.fq.gz https://figshare.com/ndownloader/files/38346149/shotgun_MSA-1002_1.fq.gz
wget -t 0 -O data/shotgun_MSA-1002_2.fq.gz https://figshare.com/ndownloader/files/38346155/shotgun_MSA-1002_2.fq.gz
After downloading the sequencing data, we can finally run MAP2B:
python3 ../bin/MAP2B.py -i data.list
The file data.list shows how to prepare your input data. Both single-end and paired-end data can be used as input.
sample1 <tab> shotgun1_left.fastq(.gz) <tab> shotgun1_right.fastq(.gz)
sample2 <tab> shotgun2.fastq(.gz)
sample3 ...
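Before a run, the sample list can be checked against the layout shown above. The sketch below is not MAP2B's own parser: it follows the documented rules (tab-separated fields, a sample ID followed by one or two read files, lines starting with # ignored) and adds two assumptions of its own, namely that blank lines are skipped and that sample IDs must be unique. The function name parse_sample_list is hypothetical.

```python
def parse_sample_list(text):
    """Parse sample-list text into {sample_id: [read_file, ...]}."""
    samples = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Lines beginning with '#' are ignored; skipping blank lines is an assumption.
        if not line.strip() or line.startswith("#"):
            continue
        fields = [field.strip() for field in line.split("\t")]
        # Expect an ID plus one (single-end) or two (paired-end) files.
        if len(fields) not in (2, 3) or not all(fields):
            raise ValueError(
                f"line {line_no}: expected 2 or 3 tab-separated fields, got {fields!r}"
            )
        sample_id, *reads = fields
        # Assumption: each sample ID appears only once.
        if sample_id in samples:
            raise ValueError(f"line {line_no}: duplicate sample ID {sample_id!r}")
        samples[sample_id] = reads
    return samples


if __name__ == "__main__":
    example = (
        "# sample list\n"
        "sample1\tshotgun1_left.fastq.gz\tshotgun1_right.fastq.gz\n"
        "sample2\tshotgun2.fastq.gz\n"
    )
    for sample_id, reads in parse_sample_list(example).items():
        layout = "paired-end" if len(reads) == 2 else "single-end"
        print(sample_id, layout, reads)
```

A common mistake this catches is a list separated by spaces in place of tabs, which would otherwise be read as a single field.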
Parameters
The main program is bin/MAP2B.py in this repo. You can check out the usage by printing the help information via python3 bin/MAP2B.py -h.
usage: MAP2B.py [-h] -i INPUT [-o OUTPUT] [-d DATABASE] [-p PROCESSES] [-g GSCORE]
optional arguments:
-h, --help show this help message and exit
-i INPUT The filepath of the sample list. Each line includes an input sample ID and the file path of corresponding DNA sequence data where each field should be separated by <tab>. The line in this file that begins with # will be ignored.
sample <tab> shotgun.1.fq(.gz) (<tab> shotgun.2.fq.gz)
-o OUTPUT Output directory, default ./MAP2B_result
-s {GTDB,RefSeq} Data source, choose from GTDB or RefSeq, default GTDB
-d DATABASE Database path for MAP2B pipeline, MAP2B_path/database
-p PROCESSES Number of processes, note that more threads may require more memory, default 1
-g GSCORE Using G score as the threshold for species identification, -g 5 is recommended. Enabling G score will automatically shutdown false positive recognition model, default none
author: Liu Jiang, Zheng Sun
mail: jiang.liu@oebiotech.com, spzsu@channing.harvard.edu
last update: 2023/04/20 20:03:47
version: 1.5
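When MAP2B is launched from another script, the command line can be assembled from the options listed in the help text. The following is a minimal sketch: the function build_map2b_command and its keyword names are hypothetical, and only the flags -i, -o, -s, -d, -p and -g are taken from the help output above. It builds the argument list without running it.

```python
def build_map2b_command(sample_list, output=None, source=None, database=None,
                        processes=None, gscore=None, script="bin/MAP2B.py"):
    """Return the argument list for one MAP2B run; options left as None are omitted."""
    command = ["python3", script, "-i", str(sample_list)]
    if output is not None:
        command += ["-o", str(output)]
    if source is not None:
        # The help text allows only these two data sources.
        if source not in ("GTDB", "RefSeq"):
            raise ValueError("source must be 'GTDB' or 'RefSeq'")
        command += ["-s", source]
    if database is not None:
        command += ["-d", str(database)]
    if processes is not None:
        command += ["-p", str(processes)]
    if gscore is not None:
        command += ["-g", str(gscore)]
    return command


if __name__ == "__main__":
    command = build_map2b_command("data.list", output="MAP2B_result",
                                  source="GTDB", processes=4)
    print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) from a shell in which the MAP2B conda environment is already activated.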
If you are dealing with low-biomass samples, we recommend using the -g 3 or -g 5 parameters to keep as many species as possible. Although false positive detection is still a challenge for low-biomass samples, please keep in mind that the G-score ranking is highly relevant to the likelihood that a species is a true positive. Then, you can set up a threshold for G-score based on your understanding.
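The effect of different G-score thresholds can be explored on a ranked list of species. The sketch below is purely illustrative: it does not read MAP2B's output files, whose layout is not described here. It assumes the results have already been loaded as (species, G-score) pairs, that a species is kept when its score is at or above the threshold, and it uses placeholder names and values.

```python
def filter_by_gscore(records, threshold):
    """Keep (species, gscore) pairs at or above the threshold, highest score first."""
    kept = [(name, score) for name, score in records if score >= threshold]
    return sorted(kept, key=lambda record: record[1], reverse=True)


if __name__ == "__main__":
    # Placeholder records for illustration only, not real MAP2B results.
    records = [("species_A", 12.0), ("species_B", 4.0), ("species_C", 5.0)]
    for threshold in (3, 5):
        kept = filter_by_gscore(records, threshold)
        print(f"-g {threshold}: {len(kept)} species kept")
```

Comparing the number of species kept at each threshold gives a quick view of how strict a given cutoff is for a sample.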
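lishasha: RefSeq and GTDB databases configured (databases for the CjePI enzyme); lishasha1: GTDB database configured
config/RF_none_0238.v2.pkl is the false-positive filtering model already trained by the authors
Installation path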
/usr/lishasha/biosoft/MAP2B-master