1: Inference and train with existing models and standard datasets
MMDetection provides hundreds of pre-trained detection models in the Model Zoo, and supports multiple standard datasets, including Pascal VOC, COCO, CityScapes, LVIS, etc. This note will show how to perform common tasks on these existing models and standard datasets, including:
- Use existing models to run inference on given images.
- Test existing models on standard datasets.
- Train predefined models on standard datasets.
Inference with existing models
By inference, we mean using trained models to detect objects on images. In MMDetection, a model is defined by a configuration file, and existing model parameters are saved in a checkpoint file.
To start with, we recommend Faster RCNN with this configuration file and this checkpoint file. It is recommended to download the checkpoint file to the checkpoints directory.
High-level APIs for inference
MMDetection provides high-level Python APIs for inference on images. Here is an example of building the model and running inference on given images or videos.
from mmdet.apis import init_detector, inference_detector
import mmcv
# Specify the path to model config and checkpoint file
config_file = 'configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'
checkpoint_file = 'checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth'
# build the model from a config file and a checkpoint file
model = init_detector(config_file, checkpoint_file, device='cuda:0')
# test a single image and show the results
img = 'test.jpg' # or img = mmcv.imread(img), which will only load it once
result = inference_detector(model, img)
# visualize the results in a new window
model.show_result(img, result)
# or save the visualization results to image files
model.show_result(img, result, out_file='result.jpg')
# test a video and show the results
video = mmcv.VideoReader('video.mp4')
for frame in video:
    result = inference_detector(model, frame)
    model.show_result(frame, result, wait_time=1)
A notebook demo can be found in demo/inference_demo.ipynb.
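If you need to post-process the raw output rather than just visualize it, the structure of result is straightforward to unpack. The following is a minimal sketch, assuming a detector without masks (mask models return a (bbox_results, segm_results) tuple instead); filter_detections is a hypothetical helper, not part of the MMDetection API.

# Minimal sketch: unpack the raw output of inference_detector.
# For a pure detector, `result` is a list with one (N, 5) array per class,
# where each row is [x1, y1, x2, y2, score] and the index follows model.CLASSES.
import numpy as np

def filter_detections(result, classes, score_thr=0.3):
    """Collect (class_name, bbox, score) tuples above a score threshold."""
    detections = []
    for class_id, bboxes in enumerate(result):
        for x1, y1, x2, y2, score in bboxes:
            if score >= score_thr:
                detections.append((classes[class_id],
                                   np.array([x1, y1, x2, y2]), score))
    return detections

# e.g., detections = filter_detections(result, model.CLASSES)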
Note: inference_detector only supports single-image inference for now.
Asynchronous interface - supported for Python 3.7+
For Python 3.7+, MMDetection also supports async interfaces.
By utilizing CUDA streams, it prevents the CPU from blocking on GPU-bound inference code and enables better CPU/GPU utilization for single-threaded applications. Inference can be done concurrently either between different input data samples or between different models of an inference pipeline.
See tests/async_benchmark.py to compare the speed of synchronous and asynchronous interfaces.
import asyncio
import torch
from mmdet.apis import init_detector, async_inference_detector
from mmdet.utils.contextmanagers import concurrent
async def main():
    config_file = 'configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'
    checkpoint_file = 'checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth'
    device = 'cuda:0'
    model = init_detector(config_file, checkpoint=checkpoint_file, device=device)

    # queue is used for concurrent inference of multiple images
    streamqueue = asyncio.Queue()
    # queue size defines concurrency level
    streamqueue_size = 3

    for _ in range(streamqueue_size):
        streamqueue.put_nowait(torch.cuda.Stream(device=device))

    # test a single image and show the results
    img = 'test.jpg'  # or img = mmcv.imread(img), which will only load it once

    async with concurrent(streamqueue):
        result = await async_inference_detector(model, img)

    # visualize the results in a new window
    model.show_result(img, result)
    # or save the visualization results to image files
    model.show_result(img, result, out_file='result.jpg')

asyncio.run(main())
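The example above runs a single image through the model. As a rough sketch of how several inputs could be processed concurrently with the same stream queue, one might write something like the following (detect_many is a hypothetical helper, not part of MMDetection):

async def detect_many(model, streamqueue, imgs):
    # run async_inference_detector on several images concurrently,
    # bounded by the number of CUDA streams in streamqueue
    async def detect_one(img):
        async with concurrent(streamqueue):
            return await async_inference_detector(model, img)
    return await asyncio.gather(*(detect_one(img) for img in imgs))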
Demos
We also provide three demo scripts, implemented with the high-level APIs and supporting functionality code. The source code is available here.
Image demo
This script performs inference on a single image.
python demo/image_demo.py \
${IMAGE_FILE} \
${CONFIG_FILE} \
${CHECKPOINT_FILE} \
[--device ${GPU_ID}] \
[--score-thr ${SCORE_THR}]
Examples:
python demo/image_demo.py demo/demo.jpg \
configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py \
checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth \
--device cpu
Webcam demo
This is a live demo from a webcam.
python demo/webcam_demo.py \
${CONFIG_FILE} \
${CHECKPOINT_FILE} \
[--device ${GPU_ID}] \
[--camera-id ${CAMERA-ID}] \
[--score-thr ${SCORE_THR}]
Examples:
python demo/webcam_demo.py \
configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py \
checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
Video demo
This script performs inference on a video.
python demo/video_demo.py \
${VIDEO_FILE} \
${CONFIG_FILE} \
${CHECKPOINT_FILE} \
[--device ${GPU_ID}] \
[--score-thr ${SCORE_THR}] \
[--out ${OUT_FILE}] \
[--show] \
[--wait-time ${WAIT_TIME}]
Examples:
python demo/video_demo.py demo/demo.mp4 \
configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py \
checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth \
--out result.mp4
Test existing models on standard datasets
To evaluate a model's accuracy, one usually tests the model on some standard datasets.
MMDetection supports multiple public datasets including COCO, Pascal VOC, CityScapes, and more. This section will show how to test existing models on supported datasets.
Prepare datasets
Public datasets like Pascal VOC (or its mirror) and COCO are available from official websites or mirrors. Note: In the detection task, Pascal VOC 2012 is an extension of Pascal VOC 2007 without overlap, and we usually use them together.
It is recommended to download and extract the dataset somewhere outside the project directory and symlink the dataset root to $MMDETECTION/data as below.
If your folder structure is different, you may need to change the corresponding paths in config files, as sketched after the directory tree below.
mmdetection
├── mmdet
├── tools
├── configs
├── data
│ ├── coco
│ │ ├── annotations
│ │ ├── train2017
│ │ ├── val2017
│ │ ├── test2017
│ ├── cityscapes
│ │ ├── annotations
│ │ ├── leftImg8bit
│ │ │ ├── train
│ │ │ ├── val
│ │ ├── gtFine
│ │ │ ├── train
│ │ │ ├── val
│ ├── VOCdevkit
│ │ ├── VOC2007
│ │ ├── VOC2012
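If the data cannot be symlinked, the dataset paths can instead be pointed at their real location in the config. Below is a minimal sketch of the relevant keys for COCO; the data_root value is a hypothetical location.

# Sketch: dataset path keys to edit in a config when data lives elsewhere.
data_root = '/path/to/my/coco/'  # hypothetical location
data = dict(
    train=dict(
        ann_file=data_root + 'annotations/instances_train2017.json',
        img_prefix=data_root + 'train2017/'),
    val=dict(
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/'),
    test=dict(
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/'))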
Some models require additional COCO-stuff datasets, such as HTC, DetectoRS and SCNet. You can download and unzip them, and then move them to the coco folder. The directory should look like this.
mmdetection
├── data
│ ├── coco
│ │ ├── annotations
│ │ ├── train2017
│ │ ├── val2017
│ │ ├── test2017
│ │ ├── stuffthingmaps
The cityscapes annotations need to be converted into the coco format using tools/dataset_converters/cityscapes.py:
pip install cityscapesscripts
python tools/dataset_converters/cityscapes.py \
./data/cityscapes \
--nproc 8 \
--out-dir ./data/cityscapes/annotations
Test existing models
We provide testing scripts for evaluating an existing model on the whole dataset (COCO, PASCAL VOC, Cityscapes, etc.).
The following testing environments are supported:

- single GPU
- single node multiple GPUs
- multiple nodes
Choose the proper script to perform testing depending on the testing environment.
# single-gpu testing
python tools/test.py \
${CONFIG_FILE} \
${CHECKPOINT_FILE} \
[--out ${RESULT_FILE}] \
[--eval ${EVAL_METRICS}] \
[--show]
# multi-gpu testing
bash tools/dist_test.sh \
${CONFIG_FILE} \
${CHECKPOINT_FILE} \
${GPU_NUM} \
[--out ${RESULT_FILE}] \
[--eval ${EVAL_METRICS}]
tools/dist_test.sh also supports multi-node testing, but relies on PyTorch's launch utility.
Optional arguments:

- `RESULT_FILE`: Filename of the output results in pickle format. If not specified, the results will not be saved to a file. (A saved results file can also be evaluated offline; see the sketch after this list.)
- `EVAL_METRICS`: Items to be evaluated on the results. Allowed values depend on the dataset, e.g., `proposal_fast`, `proposal`, `bbox`, `segm` are available for COCO, and `mAP`, `recall` for PASCAL VOC. Cityscapes could be evaluated by `cityscapes` as well as all COCO metrics.
- `--show`: If specified, detection results will be plotted on the images and shown in a new window. It is only applicable to single GPU testing and used for debugging and visualization. Please make sure that GUI is available in your environment. Otherwise, you may encounter an error like `cannot connect to X server`.
- `--show-dir`: If specified, detection results will be plotted on the images and saved to the specified directory. It is only applicable to single GPU testing and used for debugging and visualization. You do NOT need a GUI available in your environment for using this option.
- `--show-score-thr`: If specified, detections with scores below this threshold will be removed.
- `--cfg-options`: If specified, the key-value pair optional cfg will be merged into the config file.
- `--eval-options`: If specified, the key-value pair optional eval cfg will be kwargs for the `dataset.evaluate()` function; it is only for evaluation.
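As a concrete illustration of how `RESULT_FILE` and `EVAL_METRICS` fit together, a results file saved with `--out` can be evaluated offline as well. The following is a minimal sketch, assuming a COCO-style test config; it relies on mmcv.load and the dataset's evaluate() method.

# Minimal sketch: evaluate a saved results.pkl offline instead of via --eval.
import mmcv
from mmcv import Config
from mmdet.datasets import build_dataset

cfg = Config.fromfile('configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py')
dataset = build_dataset(cfg.data.test, dict(test_mode=True))
results = mmcv.load('results.pkl')  # output of tools/test.py --out
print(dataset.evaluate(results, metric='bbox'))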
Examples
Assume that you have already downloaded the checkpoints to the directory checkpoints/.
- Test Faster R-CNN and visualize the results. Press any key for the next image. Config and checkpoint files are available here.
python tools/test.py \
configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py \
checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth \
--show
- Test Faster R-CNN and save the painted images for future visualization. Config and checkpoint files are available here.
python tools/test.py \
configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py \
checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth \
--show-dir faster_rcnn_r50_fpn_1x_results
- Test Faster R-CNN on PASCAL VOC (without saving the test results) and evaluate the mAP. Config and checkpoint files are available here.
python tools/test.py \
configs/pascal_voc/faster_rcnn_r50_fpn_1x_voc.py \
checkpoints/faster_rcnn_r50_fpn_1x_voc0712_20200624-c9895d40.pth \
--eval mAP
- Test Mask R-CNN with 8 GPUs, and evaluate the bbox and mask AP. Config and checkpoint files are available here.
./tools/dist_test.sh \
configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py \
checkpoints/mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth \
8 \
--out results.pkl \
--eval bbox segm
- Test Mask R-CNN with 8 GPUs, and evaluate the classwise bbox and mask AP. Config and checkpoint files are available here.
./tools/dist_test.sh \
configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py \
checkpoints/mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth \
8 \
--out results.pkl \
--eval bbox segm \
--options "classwise=True"
- Test Mask R-CNN on COCO test-dev with 8 GPUs, and generate JSON files for submitting to the official evaluation server. Config and checkpoint files are available here.
./tools/dist_test.sh \
configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py \
checkpoints/mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth \
8 \
--format-only \
--options "jsonfile_prefix=./mask_rcnn_test-dev_results"
This command generates two JSON files: `mask_rcnn_test-dev_results.bbox.json` and `mask_rcnn_test-dev_results.segm.json`.
- Test Mask R-CNN on Cityscapes test with 8 GPUs, and generate txt and png files for submitting to the official evaluation server. Config and checkpoint files are available here.
./tools/dist_test.sh \
configs/cityscapes/mask_rcnn_r50_fpn_1x_cityscapes.py \
checkpoints/mask_rcnn_r50_fpn_1x_cityscapes_20200227-afe51d5a.pth \
8 \
--format-only \
--options "txtfile_prefix=./mask_rcnn_cityscapes_test_results"
The generated png and txt files would be under the `./mask_rcnn_cityscapes_test_results` directory.
Test without Ground Truth Annotations
MMDetection supports testing models without ground-truth annotations using CocoDataset. If your dataset is not in COCO format, please convert it to COCO format first. For example, if your dataset is in VOC format, you can convert it directly to COCO format with the script in tools.
# single-gpu testing
python tools/test.py \
${CONFIG_FILE} \
${CHECKPOINT_FILE} \
--format-only \
--options ${JSONFILE_PREFIX} \
[--show]
# multi-gpu testing
bash tools/dist_test.sh \
${CONFIG_FILE} \
${CHECKPOINT_FILE} \
${GPU_NUM} \
--format-only \
--options ${JSONFILE_PREFIX} \
[--show]
Assuming that the checkpoints in the model zoo have been downloaded to the directory checkpoints/, we can test Mask R-CNN on COCO test-dev with 8 GPUs, and generate JSON files using the following command.
./tools/dist_test.sh \
configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py \
checkpoints/mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth \
8 \
--format-only \
--options "jsonfile_prefix=./mask_rcnn_test-dev_results"
This command generates two JSON files: `mask_rcnn_test-dev_results.bbox.json` and `mask_rcnn_test-dev_results.segm.json`.
Batch Inference
MMDetection supports inference with a single image or batched images in test mode. By default, we use single-image inference; you can use batch inference by modifying samples_per_gpu in the config of test data. You can do that by modifying the config as below.
data = dict(train=dict(...), val=dict(...), test=dict(samples_per_gpu=2, ...))
Or you can set it through `--cfg-options` as:

--cfg-options data.test.samples_per_gpu=2
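Equivalently, if you drive testing from your own Python script, the same setting can be applied to a loaded config before building the data loader. A minimal sketch, assuming the Faster R-CNN config shipped with MMDetection:

# Sketch: enable batch inference by editing the config programmatically.
from mmcv import Config

cfg = Config.fromfile('configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py')
cfg.data.test.samples_per_gpu = 2  # two images per GPU in test mode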
Deprecated ImageToTensor
In test mode, the ImageToTensor pipeline is deprecated. It is replaced by DefaultFormatBundle, and it is recommended to manually replace ImageToTensor in the test data pipeline in your config file. Examples:
# use ImageToTensor (deprecated)
pipelines = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', mean=[0, 0, 0], std=[1, 1, 1]),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]

# manually replace ImageToTensor with DefaultFormatBundle (recommended)
pipelines = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', mean=[0, 0, 0], std=[1, 1, 1]),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img']),
        ])
]
Train predefined models on standard datasets
MMDetection also provides out-of-the-box tools for training detection models.
This section will show how to train predefined models (under configs) on standard datasets, i.e., COCO.
Important: The default learning rate in config files is for 8 GPUs and 2 img/gpu (batch size = 8*2 = 16). According to the linear scaling rule, you need to set the learning rate proportional to the batch size if you use different GPUs or images per GPU, e.g., lr=0.01 for 4 GPUs * 2 imgs/gpu and lr=0.08 for 16 GPUs * 4 imgs/gpu.
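As a quick sanity check, the rule amounts to a single multiplication. The sketch below assumes a base learning rate of 0.02 for the default batch size of 16 (the value used by, e.g., the Faster R-CNN R-50 config); scaled_lr is a hypothetical helper, not part of MMDetection.

# Sketch of the linear scaling rule described above.
def scaled_lr(num_gpus, imgs_per_gpu, base_lr=0.02, base_batch_size=16):
    """Scale the learning rate linearly with the total batch size."""
    return base_lr * (num_gpus * imgs_per_gpu) / base_batch_size

print(scaled_lr(4, 2))   # 0.01, matching the 4 GPUs * 2 imgs/gpu example
print(scaled_lr(16, 4))  # 0.08, matching the 16 GPUs * 4 imgs/gpu example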
Prepare datasets
Training requires preparing datasets too. See section Prepare datasets above for details.
Note: Currently, the config files under configs/cityscapes use COCO pretrained weights to initialize. You could download the existing models in advance if the network connection is unavailable or slow. Otherwise, it would cause errors at the beginning of training.
Training on a single GPU
We provide tools/train.py to launch training jobs on a single GPU. The basic usage is as follows.
python tools/train.py \
${CONFIG_FILE} \
[optional arguments]
During training, log files and checkpoints will be saved to the working directory, which is specified by work_dir in the config file or via the CLI argument --work-dir.
By default, the model is evaluated on the validation set every epoch; the evaluation interval can be specified in the config file as shown below.
# evaluate the model every 12 epochs.
evaluation = dict(interval=12)
This tool accepts several optional arguments, including:
- `--no-validate` (not suggested): Disable evaluation during training.
- `--work-dir ${WORK_DIR}`: Override the working directory.
- `--resume-from ${CHECKPOINT_FILE}`: Resume from a previous checkpoint file.
- `--options 'Key=value'`: Overrides other settings in the used config.
Note: The difference between `resume-from` and `load-from`: `resume-from` loads both the model weights and optimizer status, and the epoch is also inherited from the specified checkpoint. It is usually used for resuming a training process that was interrupted accidentally. `load-from` only loads the model weights, and the training epoch starts from 0. It is usually used for finetuning. The config fields behind these options are sketched below.
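In config terms, the two options correspond to the resume_from and load_from fields (both default to None in the bundled configs); the paths below are hypothetical examples.

# Sketch: the corresponding fields in a config file.
# resume_from restores weights + optimizer state + epoch; load_from only weights.
resume_from = 'work_dirs/faster_rcnn_r50_fpn_1x_coco/latest.pth'  # resume an interrupted run
# load_from = 'checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth'  # finetune from epoch 0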
Training on multiple GPUs
We provide tools/dist_train.sh to launch training on multiple GPUs. The basic usage is as follows.
bash ./tools/dist_train.sh \
${CONFIG_FILE} \
${GPU_NUM} \
[optional arguments]
Optional arguments remain the same as stated above.
Launch multiple jobs simultaneously
If you would like to launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs, you need to specify different ports (29500 by default) for each job to avoid communication conflicts. If you use dist_train.sh to launch training jobs, you can set the port in the commands.
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG_FILE} 4
Training on multiple nodes
MMDetection relies on the torch.distributed package for distributed training. Thus, as a basic usage, one can launch distributed training via PyTorch's launch utility.
Manage jobs with Slurm
Slurm is a good job scheduling system for computing clusters. On a cluster managed by Slurm, you can use slurm_train.sh to spawn training jobs. It supports both single-node and multi-node training. The basic usage is as follows.
[GPUS=${GPUS}] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR}
Below is an example of using 16 GPUs to train Mask R-CNN on a Slurm partition named dev, setting the work-dir to a shared file system.
GPUS=16 ./tools/slurm_train.sh dev mask_r50_1x configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py /nfs/xxxx/mask_rcnn_r50_fpn_1x
You can check the source code to review full arguments and environment variables.
When using Slurm, the port option needs to be set in one of the following ways:
- Set the port through `--options`. This is more recommended since it does not change the original configs.
CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR} --options 'dist_params.port=29500'
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR} --options 'dist_params.port=29501'
- Modify the config files to set different communication ports.
In `config1.py`, set (nccl is NVIDIA's communication backend):

dist_params = dict(backend='nccl', port=29500)

In `config2.py`, set:

dist_params = dict(backend='nccl', port=29501)

Then you can launch two jobs with `config1.py` and `config2.py`:

CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR}