object_detection API Source Code Reading Notes (11: Errors I Hit While Debugging)

A collection of errors I ran into while debugging. It will keep being updated.

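GPU fully occupied, but utilization is only 1%

This applies only to the Google object_detection API and not to other setups.

I had downloaded two versions of the models repository.
I was calling the latest object detection API from the old train.py, which kept GPU utilization from going up. After I replaced the train.py file with the matching version, utilization went up.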

Can the batch size of Faster RCNN only be set to 1?

Someone proposed three ways to solve this:

I want to add an additional option to the ones mentioned above. As a summary, there are 3 possible solutions:

#  1. Add pad_to_max_dimension true in keep_aspect_ratio_resizer
keep_aspect_ratio_resizer {
  pad_to_max_dimension : true
}
#  2. Change batch size to 1:
train_config: {
  batch_size: 1
}
#  3. Use fixed_shape_resizer instead of keep_aspect_ratio_resizer
fixed_shape_resizer { 
  width: 600 
  height: 800
}

This approach does not change the original label positions, because the label positions are stored as fractions of the image size.
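To see why resizing leaves the labels intact, here is a minimal sketch in plain Python. It assumes boxes are stored as normalized [ymin, xmin, ymax, xmax] fractions; the helper name to_pixels, the box, and the original image size are made up for illustration and are not part of the API.

```python
def to_pixels(box, height, width):
    """Convert a normalized [ymin, xmin, ymax, xmax] box to pixel coordinates."""
    ymin, xmin, ymax, xmax = box
    return (ymin * height, xmin * width, ymax * height, xmax * width)


# Hypothetical label: the same normalized box is used before and after resizing.
box = (0.25, 0.125, 0.75, 0.5)

# Pixel box on a hypothetical 1200x1600 (height x width) original image.
original = to_pixels(box, height=1200, width=1600)
# Pixel box after resizing to height 800, width 600 (the fixed_shape_resizer values).
resized = to_pixels(box, height=800, width=600)

print(original)
print(resized)
```

The stored label never changes; only the pixel coordinates derived from it follow the new image size.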

The issue with the solution: https://github.com/tensorflow/models/issues/3697

TypeError: can't pickle dict_values objects

https://github.com/tensorflow/models/issues/4780

In model_lib.py, change category_index.values() to list(category_index.values()).
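The fix works because in Python 3 dict.values() returns a view object rather than a list, and that view cannot be pickled. The sketch below reproduces the same kind of error with the standard pickle module. It is not the API's own code path, and the category_index contents and the can_pickle helper are made up for illustration.

```python
import pickle

# Hypothetical category index in the same shape the API uses (id -> category dict).
category_index = {1: {"id": 1, "name": "cat"}, 2: {"id": 2, "name": "dog"}}


def can_pickle(obj):
    """Return True if obj can be serialized with pickle, False on TypeError."""
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False


# The dict_values view fails, while the list built from it is fine.
print(can_pickle(category_index.values()))
print(can_pickle(list(category_index.values())))
```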

For how to set min_dimension and max_dimension in keep_aspect_ratio_resizer, see:

model {
  faster_rcnn {
    num_classes: 37
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_resnet101'
      first_stage_features_stride: 16
    }

https://github.com/tensorflow/models/issues/1794
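As a rough sketch of what these two values do, as far as I understand the resizer: the image is scaled so that its shorter side equals min_dimension, unless that would push the longer side past max_dimension, in which case the longer side is scaled to max_dimension instead. The function name keep_aspect_ratio_size and the example image sizes are my own and hypothetical; this is not the API's implementation.

```python
def keep_aspect_ratio_size(height, width, min_dimension=600, max_dimension=1024):
    """Return the (height, width) after an aspect-ratio-preserving resize."""
    # First try: scale so that the shorter side equals min_dimension.
    scale = min_dimension / min(height, width)
    new_size = (int(round(height * scale)), int(round(width * scale)))
    # If the longer side is now too large, scale it to max_dimension instead.
    if max(new_size) > max_dimension:
        scale = max_dimension / max(height, width)
        new_size = (int(round(height * scale)), int(round(width * scale)))
    return new_size


# A small image is scaled up until its shorter side reaches 600.
print(keep_aspect_ratio_size(375, 500))
# A large wide image is limited by max_dimension instead.
print(keep_aspect_ratio_size(1080, 1920))
```

Two images with different aspect ratios end up with different output shapes, which is why this resizer does not combine with a batch size above 1 unless the images are padded.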

Training stops on its own

INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 1: loss = 0.3352 (11.099 sec/step)
INFO:tensorflow:global step 2: loss = 0.3352 (4.418 sec/step)
INFO:tensorflow:global step 3: loss = 0.3352 (5.504 sec/step)
INFO:tensorflow:global step 4: loss = 0.3352 (7.470 sec/step)
INFO:tensorflow:global step 5: loss = 0.3352 (5.705 sec/step)
Killed

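This is a huge pitfall when training Faster RCNN: no error is reported and the program simply stops, with Killed as the last line of output, presumably because the process runs out of memory. The cause is that your images are too large. Resize the images when you create the TFRecord file. You don't need to worry about the labels becoming wrong, because min_dimension and max_dimension rescale the images to a suitable size again at training time.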
https://github.com/tensorflow/models/issues/1760
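To get a feel for why large images are a problem, here is a back-of-the-envelope calculation of the memory one decoded image takes. The image sizes are hypothetical and decoded_size_mb is my own helper; real usage depends on how many images the input pipeline buffers, so treat this only as a rough illustration.

```python
def decoded_size_mb(height, width, channels=3, bytes_per_value=1):
    """Memory of one decoded image in MiB (uint8 by default)."""
    return height * width * channels * bytes_per_value / (1024 ** 2)


# Hypothetical 4000x6000 camera image, as uint8 and as float32.
large_uint8 = decoded_size_mb(4000, 6000)
large_float32 = decoded_size_mb(4000, 6000, bytes_per_value=4)
# The same image after being resized to 1024x1024 before writing the record.
small_uint8 = decoded_size_mb(1024, 1024)

print(round(large_uint8, 2), round(large_float32, 2), small_uint8)
```

A single decoded large image is dozens of times bigger than a resized one, and the input pipeline keeps several of them in memory at once.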
My modification:

# .................................................................................
  img_path = os.path.join(data['folder'], image_subdirectory, data['filename'])
  full_path = os.path.join(dataset_directory, img_path)
  with tf.gfile.GFile(full_path, 'rb') as fid:
    encoded_jpg = fid.read()
  # Resize the image here, before it is written to the TFRecord.
  decoded_image = tf.image.decode_jpeg(encoded_jpg)
  # resize_images returns float32; note that [1024, 1024] ignores the aspect ratio.
  decoded_image_resized = tf.image.resize_images(decoded_image, [1024, 1024])
  decoded_image_resized = tf.cast(decoded_image_resized, tf.uint8)
  encoded_jpg   = tf.image.encode_jpeg(decoded_image_resized) # expects uint8
  #encoded_jpg   = bytes(encoded_jpg) #  I think this may not be the right way of doing this
  # Evaluate the tensor to get the JPEG bytes, and close the session afterwards.
  with tf.Session() as sess:
    encoded_jpg = sess.run(encoded_jpg)
  encoded_jpg_io = io.BytesIO(encoded_jpg)
  image = PIL.Image.open(encoded_jpg_io)
# .............................................................................................

tensorflow.python.framework.errors_impl.UnknownError: train/pipeline.config; Input/output error

This means your train_dir is not set. Pass it on the command line:

--train_dir=train

ValueError: First step cannot be zero.

This error only occurs with the latest version of model_train.py.
The solution is described here:
https://github.com/tensorflow/models/issues/3794

Delete the following part of the config file:

          schedule {
            step: 0
            learning_rate: .0001
          }
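If you have several config files to fix, the deletion can be scripted. The following is only a sketch: drop_zero_step_schedules is a hypothetical helper of mine, and it assumes each schedule block lists step before learning_rate exactly as in the snippet above. The second schedule entry in SAMPLE is invented for the demonstration.

```python
import re

# Matches one `schedule { step: 0 learning_rate: ... }` block, including its indentation.
ZERO_STEP_SCHEDULE = re.compile(
    r"[ \t]*schedule\s*\{\s*step:\s*0\s+learning_rate:\s*[^\s}]+\s*\}[ \t]*\n?"
)


def drop_zero_step_schedules(config_text):
    """Return config_text without schedule blocks whose step is exactly 0."""
    return ZERO_STEP_SCHEDULE.sub("", config_text)


# Hypothetical excerpt: the first block is the offending one from the notes.
SAMPLE = """\
          schedule {
            step: 0
            learning_rate: .0001
          }
          schedule {
            step: 900000
            learning_rate: .00001
          }
"""

print(drop_zero_step_schedules(SAMPLE))
```

Only the block with step: 0 is removed; blocks with any other step value are left as they are.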

Why is there not a single box on the image when debugging?

The visualization uses this function:

def visualize_boxes_and_labels_on_image_array(
    image,
    boxes,
    classes,
    scores,
    category_index,
    instance_masks=None,
    instance_boundaries=None,
    keypoints=None,
    use_normalized_coordinates=False,
    max_boxes_to_draw=20,
    min_score_thresh=.5,
    agnostic_mode=False,
    line_thickness=4,
    groundtruth_box_visualization_color='black',
    skip_scores=False,
    skip_labels=False):

The display threshold here is min_score_thresh=.5, so boxes that score below it are not drawn. Change it to the value you want.
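The effect of the threshold can be sketched in a few lines of plain Python. This mirrors my understanding of how the function selects boxes (at most max_boxes_to_draw candidates, each kept only if its score exceeds min_score_thresh); indices_to_draw is a hypothetical helper and the scores are made up.

```python
def indices_to_draw(scores, max_boxes_to_draw=20, min_score_thresh=.5):
    """Return the indices of the boxes that would be drawn."""
    drawn = []
    for i in range(min(max_boxes_to_draw, len(scores))):
        # A box is drawn only if its score is above the threshold.
        if scores[i] > min_score_thresh:
            drawn.append(i)
    return drawn


# Hypothetical detections from a model that is not yet confident.
scores = [0.45, 0.30, 0.12]

print(indices_to_draw(scores))                       # default threshold of .5
print(indices_to_draw(scores, min_score_thresh=.2))  # lowered threshold
```

With the default threshold none of these detections is drawn, which matches the symptom of an image with no boxes at all.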

Last edited: 2018.11.22 13:41:18