Running Metascape enrichment analysis in a Docker container is quite simple. By default it analyzes human gene lists; pay attention to the -S option.

PS:

My test path: sftp://root@ip:22/home/softwares/MSBio/data
Command: bin/ms.sh -u -o /data/output_single_id_txt /data/example/single_list_id.txt

Pay special attention here!


Introduction

Metascape for Bioinformaticians (MSBio) enables Metascape analyses to be carried out in batch mode on users' own hardware. Metascape is a complex piece of software with many third-party dependencies (luckily no commercial ones), so Docker container technology is used to run Metascape offline. If you do not have access to a Docker infrastructure, please continue to use the Metascape.org website. Although we have tested Docker on Mac (M1/M2 chips are not supported) and Windows, the instructions below are written with Linux in mind.

To run MSBio, you need two Docker images (~3GB each) and a valid license. Free MSBio licenses are available for non-web and non-commercial use only. To run MSBio within a commercial entity, please check what a commercial license enables.

Installation

Before you start, make sure you have access to a Docker infrastructure.
The installation package can be obtained by registering for a free license (valid for one year). Create a new folder for MSBio and unzip the installation zip file into it to get three subfolders:

$ unzip msbio_v3.5.20210815.zip
$ ls

bin data license

To download MSBio docker images, enter the MSBio working folder, run:

bin/install.sh

Note: winbin/install.bat for Windows.

As each image is ~3GB in size, be patient while they download. If successful, you should see two Docker images (Image ID and Size vary):

$ docker image list

REPOSITORY           TAG      IMAGE ID       CREATED        SIZE
metadocker8/msdata   latest   99ea12d0ba82   30 hours ago   3.23GB
metadocker8/msbio    latest   7914a952382e   30 hours ago   3.34GB
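
A quick way to confirm that both images are present is to check the repository names Docker reports. The following is a minimal sketch: has_msbio_images is a hypothetical helper name, and it relies only on the two repository names shown in the listing.

```shell
# Hypothetical helper: succeed only if both MSBio repository names appear
# (one per line) in the list read from standard input.
has_msbio_images() {
    repos=$(cat)
    printf '%s\n' "$repos" | grep -qx 'metadocker8/msdata' &&
        printf '%s\n' "$repos" | grep -qx 'metadocker8/msbio'
}

# Usage (requires Docker):
#   docker image list --format '{{.Repository}}' | has_msbio_images && echo "images OK"
```

Reading the names from standard input keeps the check independent of Docker itself, so it can be tried on a saved listing as well.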

Metascape Docker container requires a minimum of 6GB memory to run, as the database consumes memory. We recommend providing 8GB+.

Note: the installation script also creates a data folder and makes it writable to all users (MSBio creates files with user ID 1002).

Usage

Container Management

The MSBio containers must be running before you can do any Metascape analysis. To launch the containers, run the following in the MSBio working folder:

bin/up.sh

After you are done with your analyses, shut down the containers to free memory:

bin/down.sh

Gene-List Analysis

To analyze your gene list(s), use bin/ms.sh. The minimum syntax is:

bin/ms.sh -o output_folder input_list_file

However, the input file format differs between single-gene-list and multi-gene-list analyses, so you must specify -u if your input follows the single-gene-list format. Another very important point: both output_folder and input_list_file must reside under the data folder, because the data working folder is mounted inside the container as /data and only files within it are visible to the Docker container. Both paths must also start with /data, since this is the path within the container! (Since version v3.5.20211016, a path that starts with "data" without the leading "/" has the "/" prepended automatically and will also work.)
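
The path rule can be illustrated with a small helper that turns a path under the data folder into the path seen inside the container. This is only a sketch: to_container_path is a hypothetical name, and it assumes paths are given either relative to the MSBio working folder (data/...) or already in container form (/data/...).

```shell
# Hypothetical helper: map a path under the MSBio data folder to the
# corresponding path inside the container, which always starts with /data/.
to_container_path() {
    case "$1" in
        /data/*) printf '%s\n' "$1" ;;    # already a container path
        data/*)  printf '/%s\n' "$1" ;;   # relative to the MSBio working folder
        *)       echo "error: $1 is not under data/" >&2; return 1 ;;
    esac
}

# Usage:
#   to_container_path data/example/single_list_id.txt
#   prints /data/example/single_list_id.txt
```

Rejecting anything outside data/ mirrors the fact that such files are not visible inside the container.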

Example for single-gene-list analysis:

bin/ms.sh -u -o /data/output_single_id_txt /data/example/single_list_id.txt

Here, -u stands for "unique", as our genes are in a single column. This matters because the file format (.txt, .csv, or .xlsx) for a single gene list differs from the format used for multiple gene lists. The exact format of the input files is described in the online manual, and example input files are available under the data/example folder. We recommend always using the multiple-gene-list format; the single-gene-list format causes some confusion and may be retired in the future.

If the gene list is not for human, use -S. Please read the next section for important options.

If your analysis command crashes without an error message, chances are the process within the container was killed due to insufficient memory before it could report anything. Make sure your Docker server allows 8GB+ for the container.
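
One way to check the memory available to Docker before starting an analysis is to compare the daemon's reported total against the recommended amount. This is a sketch under assumptions: enough_memory is a hypothetical helper, and the 8 GiB default simply mirrors the recommendation in this guide. Docker may report slightly less than the configured limit, so a lower threshold can be passed as the second argument.

```shell
# Hypothetical helper: succeed if a byte count is at least MIN_GIB gibibytes.
# MIN_GIB defaults to 8, the amount recommended in this guide.
enough_memory() {
    bytes=$1
    min_gib=${2:-8}
    [ "$bytes" -ge $((min_gib * 1024 * 1024 * 1024)) ]
}

# Usage (requires Docker; MemTotal is reported in bytes):
#   enough_memory "$(docker info --format '{{.MemTotal}}')" ||
#       echo "warning: Docker has less than 8GB of memory" >&2
```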

Example of multi-gene-list analysis:

bin/ms.sh -o /data/output_multiple_sym_txt /data/example/multiple_list_symbol.txt

Advanced Options

MSBio supports many options, but you can ignore most of them. Here we explain only a few important ones:

-o OUTPUT, --output OUTPUT

The output folder path must be provided. It must start with /data/ as this is the path within the container.

-u, --one_list

This option is required when your input uses the single-list file format.

-p, --PPI

By default, MSBio performs PPI network analysis. Use this option to skip it. (Note: MSBio alpha did not run PPI by default; this behavior changed in the beta.)

-G, --skip_go

By default, MSBio performs GO enrichment analysis. Use this option to skip it.

-t ID_TYPE, --id_type ID_TYPE

ID type of the genes in the input file. By default, you do not need to specify this; Metascape will guess the type automatically. You can also force Metascape to interpret your IDs as one of the following types: "Entrez", "RefSeq", "Symbol", or "dbxref". Type strings are case-sensitive.

-s, --skip_convert

If you are sure the input gene IDs are already correct Entrez Gene IDs, you can use this option to skip the ID conversion and speed things up slightly.

-S SOURCE_TAX_ID, --source_tax_id SOURCE_TAX_ID

By default, Metascape treats the source organism as human. If it is not, specify the source taxonomy ID using this option.

-T TARGET_TAX_ID, --target_tax_id TARGET_TAX_ID

By default, Metascape treats the target organism as human. If it is not, specify the target taxonomy ID using this option.

--option option.json

All settings for Metascape "Custom Analysis" and more can be changed using a JSON file. data/example/option.json is an example file containing all default settings; these are the settings used when --option is not provided. You can provide your own option.json file to customize ontology categories and annotation categories. Although not recommended, you can even override the gene list and PPI network size limits.

For -S and -T you can use either taxonomy IDs or common names. The supported IDs are: 9606, 10090, 10116, 4932, 5833, 6239, 7227, 7955, 3702, and 4896. The supported names are: human, mouse, rat, yeast, malaria, "c. elegans", fly, zebrafish, arabidopsis, and "s. pombe".
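
The correspondence between the names and IDs can be written out as a lookup. This is a sketch: tax_id is a hypothetical helper, and the pairing assumes the two lists in the text are given in the same order (the pairs also match the NCBI taxonomy IDs for these organisms). The file paths in the usage comment are placeholders.

```shell
# Hypothetical helper: print the taxonomy ID for a supported common name.
tax_id() {
    case "$1" in
        human)        echo 9606 ;;
        mouse)        echo 10090 ;;
        rat)          echo 10116 ;;
        yeast)        echo 4932 ;;
        malaria)      echo 5833 ;;
        "c. elegans") echo 6239 ;;
        fly)          echo 7227 ;;
        zebrafish)    echo 7955 ;;
        arabidopsis)  echo 3702 ;;
        "s. pombe")   echo 4896 ;;
        *) echo "unsupported organism: $1" >&2; return 1 ;;
    esac
}

# Usage with a mouse gene list (placeholder paths, requires running containers):
#   bin/ms.sh -u -S "$(tax_id mouse)" -o /data/output_mouse /data/my_mouse_list.txt
```

Names containing a space, such as "c. elegans", must be quoted on the command line.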

Batch Processing

At the beginning of each bin/ms.sh run, the databases must first be loaded. If you need to run multiple tasks and would like to avoid this overhead, you can use a .job file as the input file; see data/example/test.job for an example. This way the databases are loaded only once, and Metascape can then run multiple tasks.

Each line in a .job file is a JSON-format description of a Metascape task. At a minimum you must specify the input, the output, and "single":true if the input file uses the single-gene-list format (equivalent to the -u option). You can even provide a job-specific option.json file if you want to alter the default behavior.
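
A .job file can be generated with a short helper rather than written by hand. The following is a sketch under assumptions: job_line is a hypothetical name, and the JSON keys are assumed to be spelled "input" and "output" as the description suggests; check data/example/test.job for the exact keys. Paths are inserted as-is, so they must not contain double quotes or backslashes.

```shell
# Hypothetical helper: print one task line for a .job file.
# $1 = input path, $2 = output folder, $3 = "single" for the single-gene-list format.
job_line() {
    if [ "${3:-}" = "single" ]; then
        printf '{"input":"%s","output":"%s","single":true}\n' "$1" "$2"
    else
        printf '{"input":"%s","output":"%s"}\n' "$1" "$2"
    fi
}

# Usage (my.job is a placeholder file name):
#   job_line /data/example/single_list_id.txt /data/output_single_id_txt single > data/my.job
#   job_line /data/example/multiple_list_symbol.txt /data/output_multiple_sym_txt >> data/my.job
```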

To run the job file:

bin/ms.sh /data/example/test.job

Since v3.5.20211016, you may also omit the leading "/" from the input and output paths, e.g.:

bin/ms.sh data/example/test.job

For debugging purposes, if you want to skip a task, use "#" to comment out that task line.

When Metascape executes a task, it encloses the output message within two lines, starting with 'START>' and 'COMPLETE>'. For example:

START> job #12, input=/data/example/multiple_list_id_bg.xlsx, output=/data/output_multiple_id_xlsx_bg
...
Cytoscape Free Memory: 1531
COMPLETE> job #12, input=/data/example/multiple_list_id_bg.xlsx, output=/data/output_multiple_id_xlsx_bg

If a task line is commented out or the input or output path for a task is missing, there will be a line:

SKIP> job #1

This should make it easier to parse the batch processing output and identify failed tasks.
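
The markers can be parsed with a few lines of awk. This sketch assumes that a failed task leaves a START> line with no matching COMPLETE> line, and that the job number is the third whitespace-separated field, as in the example output; failed_jobs is a hypothetical name.

```shell
# Hypothetical helper: list jobs that have a START> line but no COMPLETE> line.
# Reads the log files given as arguments, or standard input if none are given.
failed_jobs() {
    awk '
        /^START>/    { id = $3; sub(/,$/, "", id); started[id] = 1 }
        /^COMPLETE>/ { id = $3; sub(/,$/, "", id); delete started[id] }
        END          { for (id in started) print id }
    ' "$@" | sort
}

# Usage (batch.log is a placeholder file name):
#   bin/ms.sh /data/example/test.job > batch.log 2>&1
#   failed_jobs batch.log
```

SKIP> lines are ignored here, since skipped tasks never start.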

Parallel Metascape Analyses

While one bin/ms.sh is running, you must not execute another bin/ms.sh command! The backend plotting components can only plot one task at a time, so if you run two ms.sh processes simultaneously, plots from the two gene lists may cross-talk with each other.

If you really need to run multiple tasks in parallel, use multiple MSBio containers, which are isolated from each other. Each ms.sh process in a container should still process only one gene list at a time!

As an example, to launch two MSBio containers, do:

bin/up.sh
bin/up.sh 2

There will be two MSBio containers running, named msbio1 and msbio2. The first command, bin/up.sh, names its container msbio1 by default; it is equivalent to bin/up.sh 1. Now you can use both containers in parallel. The following two commands can be run at once:

bin/ms.sh -u -o /data/output_single_id_txt /data/example/single_list_id.txt &
bin/ms.sh 2 -o /data/output_multiple_sym_txt /data/example/multiple_list_symbol.txt

bin/ms.sh 2 means the command runs in the msbio2 container. When bin/ms.sh is followed directly by an option (an argument starting with "-"), msbio1 is used. You may also write bin/ms.sh 1 if you want to name msbio1 explicitly.

To shut down both containers:

bin/down.sh
bin/down.sh 2

If you need more containers, just follow this usage pattern. To minimize resource consumption, only msbio1 runs the database server, and all other containers talk to msbio1. So bin/down.sh 1 will only work if there are no other containers depending on msbio1.
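
Because the other containers depend on msbio1, one cautious approach is to shut containers down in reverse order so that msbio1 goes last. The helper below is a sketch of that idea, not part of MSBio: shutdown_commands is a hypothetical name, and it only prints the commands so they can be reviewed before running.

```shell
# Hypothetical helper: print the shutdown commands for containers N..1,
# highest number first, so that msbio1 is stopped last.
shutdown_commands() {
    n=$1
    while [ "$n" -ge 1 ]; do
        echo "bin/down.sh $n"
        n=$((n - 1))
    done
}

# Usage, from the MSBio working folder, with three containers running:
#   shutdown_commands 3          # review the commands
#   shutdown_commands 3 | sh     # run them
```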

Mac and Windows

If you install Docker Desktop for Mac or Windows, MSBio works in our tests. On Mac, the commands are the same as in the examples above. On Windows, the scripts are in the winbin folder instead, so bin/up.sh, bin/down.sh, and bin/ms.sh are replaced by winbin/up.bat, winbin/down.bat, and winbin/ms.bat, respectively.
