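A collection of errors I have run into while debugging. I will keep updating it.
The GPU is fully occupied, but utilization is only 1%
This applies only to the Google object_detection API, not to other setups.
I had downloaded two versions of the models repository.
I was calling the latest Object Detection API from an old train.py, which kept GPU utilization from rising. After I replaced train.py with the version that matches the API, utilization went up.
Can the Faster R-CNN batch size only be set to 1?
Someone proposed three ways to solve this: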
I want to add an additional option to the ones mentioned above. As a summary, there are 3 possible solutions:
# 1. Add pad_to_max_dimension true in keep_aspect_ratio_resizer
keep_aspect_ratio_resizer {
  pad_to_max_dimension: true
}
# 2. Change batch size to 1:
train_config: {
  batch_size: 1
}
# 3. Use fixed_shape_resizer instead of keep_aspect_ratio_resizer
fixed_shape_resizer {
  width: 600
  height: 800
}
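This approach does not shift the original label positions, because the box coordinates are stored as fractions of the image width and height.
Source: https://github.com/tensorflow/models/issues/3697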
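To see why resizing leaves the labels intact, here is a minimal sketch in plain Python. It assumes the boxes are stored as fractions of the image width and height; the helper names `normalize_box` and `to_pixels` and the image sizes are made up for this illustration and are not part of the API.

```python
def normalize_box(box_px, width, height):
    """Convert a pixel box (xmin, ymin, xmax, ymax) to fractions of the image size."""
    xmin, ymin, xmax, ymax = box_px
    return (xmin / width, ymin / height, xmax / width, ymax / height)


def to_pixels(box_norm, width, height):
    """Convert a normalized box back to pixel coordinates for a given image size."""
    xmin, ymin, xmax, ymax = box_norm
    return (xmin * width, ymin * height, xmax * width, ymax * height)


# A box annotated on a hypothetical 1200x1600 (width x height) image.
box_norm = normalize_box((300, 400, 900, 1200), width=1200, height=1600)

# After resizing to 600x800 the same fractions still cover the same object.
box_resized = to_pixels(box_norm, width=600, height=800)

print(box_norm)
print(box_resized)
```

The normalized tuple never changes; only the pixel coordinates derived from it follow the new image size.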
TypeError: can't pickle dict_values objects
https://github.com/tensorflow/models/issues/4780
In model_lib.py, change category_index.values() to list(category_index.values()).
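The same error can be reproduced outside the API. The sketch below uses only the standard library; the `category_index` contents and the `can_pickle` helper are invented for the example. It shows that in Python 3 a dict values view cannot be pickled, while a list built from it can.

```python
import pickle

# A made-up category index in the same shape the API uses: id -> {id, name}.
category_index = {1: {"id": 1, "name": "cat"}, 2: {"id": 2, "name": "dog"}}


def can_pickle(obj):
    """Return True if obj can be pickled, False if pickle raises TypeError."""
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False


values_view = category_index.values()        # a dict_values view in Python 3
values_list = list(category_index.values())  # a plain list of the same items

print(can_pickle(values_view))
print(can_pickle(values_list))
```

In Python 2, `dict.values()` already returned a list, which is probably why older code did not hit this.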
How to set min_dimension and max_dimension in keep_aspect_ratio_resizer:
model {
  faster_rcnn {
    num_classes: 37
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_resnet101'
      first_stage_features_stride: 16
    }
    # ... remaining faster_rcnn settings omitted
  }
}
https://github.com/tensorflow/models/issues/1794
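As a rough guide to what these two values do, the sketch below mimics the resizing rule in plain Python: scale the image so its shorter side equals min_dimension, unless that would push the longer side past max_dimension, in which case scale so the longer side equals max_dimension. The function name `keep_aspect_ratio_size` is hypothetical, and the rounding may differ slightly from what the API does internally.

```python
def keep_aspect_ratio_size(height, width, min_dimension=600, max_dimension=1024):
    """Estimate the (height, width) an image is resized to, keeping its aspect ratio."""
    short_side = min(height, width)
    long_side = max(height, width)

    # First try: make the shorter side exactly min_dimension.
    scale = min_dimension / short_side

    # If the longer side would overshoot, cap it at max_dimension instead.
    if long_side * scale > max_dimension:
        scale = max_dimension / long_side

    return (round(height * scale), round(width * scale))


print(keep_aspect_ratio_size(480, 640))
print(keep_aspect_ratio_size(1080, 1920))
```

A 480x640 image is enlarged until its short side reaches 600, while a 1080x1920 image is limited by the 1024 cap on its long side.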
Training stops on its own
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 1: loss = 0.3352 (11.099 sec/step)
INFO:tensorflow:global step 2: loss = 0.3352 (4.418 sec/step)
INFO:tensorflow:global step 3: loss = 0.3352 (5.504 sec/step)
INFO:tensorflow:global step 4: loss = 0.3352 (7.470 sec/step)
INFO:tensorflow:global step 5: loss = 0.3352 (5.705 sec/step)
Killed
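This is a huge pitfall when training Faster R-CNN: no error is reported and the program simply stops. The cause is that the images are too large. Resize them when you create the TFRecord file. There is no need to worry about the labels becoming wrong, because min_dimension and max_dimension will rescale the images to a suitable size anyway.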
https://github.com/tensorflow/models/issues/1760
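A bare "Killed" with no Python traceback usually means the operating system terminated the process, most often because it ran out of memory. The back-of-the-envelope sketch below shows how quickly large images add up. It assumes 3 channels and float32 values after decoding; the 4000x6000 size and both helper names are hypothetical.

```python
def decoded_image_bytes(height, width, channels=3, bytes_per_value=4):
    """Bytes needed to hold one decoded image as float32 values."""
    return height * width * channels * bytes_per_value


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 * 1024)


large = decoded_image_bytes(4000, 6000)  # a hypothetical full-size camera image
small = decoded_image_bytes(1024, 1024)  # the size used after resizing

print(f"4000x6000: {to_mib(large):.1f} MiB per decoded image")
print(f"1024x1024: {to_mib(small):.1f} MiB per decoded image")
```

The input pipeline typically holds several decoded images at once, so the per-image figure is only a lower bound on what training needs.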
My modification:
# .................................................................................
img_path = os.path.join(data['folder'], image_subdirectory, data['filename'])
full_path = os.path.join(dataset_directory, img_path)
with tf.gfile.GFile(full_path, 'rb') as fid:
    encoded_jpg = fid.read()
# Resize the image here, before it is written to the TFRecord.
decoded_image = tf.image.decode_jpeg(encoded_jpg)
decoded_image_resized = tf.image.resize_images(decoded_image, [1024, 1024])  # this returns float32
decoded_image_resized = tf.cast(decoded_image_resized, tf.uint8)
encoded_jpg = tf.image.encode_jpeg(decoded_image_resized)  # expects uint8
# encoded_jpg = bytes(encoded_jpg)  # I think this may not be the right way of doing this
# Evaluate the tensor to get the encoded JPEG bytes; close the session afterwards.
with tf.Session() as sess:
    encoded_jpg = sess.run(encoded_jpg)
encoded_jpg_io = io.BytesIO(encoded_jpg)
image = PIL.Image.open(encoded_jpg_io)
# .............................................................................................
tensorflow.python.framework.errors_impl.UnknownError: train/pipeline.config; Input/output error
This means train_dir was not set. Pass it on the command line:
--train_dir=train
ValueError: First step cannot be zero.
This error only occurs with the latest version of model_train.py.
Solution:
https://github.com/tensorflow/models/issues/3794
Delete the following block from the config file:
schedule {
  step: 0
  learning_rate: .0001
}
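To check a config for this problem before starting a run, a small text scan is enough. The sketch below is a hypothetical helper, not part of the API: it uses a regular expression that assumes each schedule block lists step before learning_rate, and the sample config text is made up for the example.

```python
import re

# Matches blocks of the form: schedule { step: N learning_rate: R }
SCHEDULE_RE = re.compile(
    r"schedule\s*\{\s*step:\s*(\d+)\s*learning_rate:\s*([0-9.eE+-]+)\s*\}"
)


def zero_step_schedules(config_text):
    """Return the learning rates of schedule blocks that start at step 0."""
    return [
        float(rate)
        for step, rate in SCHEDULE_RE.findall(config_text)
        if int(step) == 0
    ]


sample = """
manual_step_learning_rate {
  initial_learning_rate: 0.0003
  schedule {
    step: 0
    learning_rate: .0001
  }
  schedule {
    step: 900000
    learning_rate: .00003
  }
}
"""

print(zero_step_schedules(sample))
```

A non-empty result means the config still contains a schedule entry at step 0 that needs to be removed.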
No boxes at all appear on the image during debugging
Visualization uses this function:
def visualize_boxes_and_labels_on_image_array(
    image,
    boxes,
    classes,
    scores,
    category_index,
    instance_masks=None,
    instance_boundaries=None,
    keypoints=None,
    use_normalized_coordinates=False,
    max_boxes_to_draw=20,
    min_score_thresh=.5,
    agnostic_mode=False,
    line_thickness=4,
    groundtruth_box_visualization_color='black',
    skip_scores=False,
    skip_labels=False):
The display threshold here is min_score_thresh=.5, so detections with lower scores are not drawn. Change it to the value you want.
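The effect of the threshold can be illustrated with a simplified model of the filtering. This is only a sketch under the assumption that the function looks at the first max_boxes_to_draw detections and draws those scoring above min_score_thresh; it is not the library's code, and the `boxes_drawn` helper and the scores are invented.

```python
def boxes_drawn(scores, min_score_thresh=0.5, max_boxes_to_draw=20):
    """Count how many detections would pass the score threshold."""
    candidates = scores[:max_boxes_to_draw]
    return sum(1 for score in candidates if score > min_score_thresh)


# Hypothetical scores from a model that is still early in training.
scores = [0.45, 0.38, 0.12]

print(boxes_drawn(scores))                        # default threshold of 0.5
print(boxes_drawn(scores, min_score_thresh=0.3))  # lowered threshold
```

With the default threshold none of these detections would be drawn, which matches the empty image; lowering the threshold makes the two strongest ones visible.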