Video-Summarization-with-LSTM
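The code is written in Theano + Python + MATLAB and feels somewhat bloated.
The code runs on an ordinary machine not because it is efficient, but because all the data comes prepackaged: the pretrained weights, the precomputed image features, and the labels have all been processed elsewhere.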
dppLSTM_main.py
dataset_testing = 'SumMe' # testing dataset: SumMe or TVSum
model_type = 2 # 1 for vsLSTM and 2 for dppLSTM, please refer to the readme file for more detail
train_set, val_set, val_idx, test_set, te_idx = data_loader.load_data(data_dir = '../data/', dataset_testing = dataset_testing, model_type = model_type)
train(model_idx = model_idx, train_set = train_set, val_set = val_set, model_saved = model_file)
inference(model_file=model_file, model_idx = model_idx, test_set=test_set, test_dir='./res_LSTM/', te_idx=te_idx)
About model_type
- vsLSTM: 2-layer LSTM + MLP (multi-layer perceptron)
- dppLSTM: 2-layer LSTM + MLP + DPP (determinantal point process), which adds a determinant computation
data_loader.py
Downloads and builds the datasets (with 'SumMe' chosen as the test set)
train: 120, valid: 19, test: 25
#format of the input data
>>> len(a[0])
291
>>> len(a[1])
291
>>> len(a[2])
11
>>> a[2]
array([156, 172, 176, 184, 189, 195, 199, 228, 230, 256, 272], dtype=int32)
>>> a[1][156]
1.0
>>> a[3]
array(1.0)
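All the data is packaged in HDF5 files (read with h5py). Taking OVP as an example, each file holds three parts, [feature, label, weight], where feature is the pool5 output of GoogLeNet; whether gt_1 or gt_2 is used as the label depends on the model (vs or dpp).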
[feature, label, weight] = load_dataset_h5(data_dir, 'OVP', model_type)
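label_tmp = [numpy.where(l)[0].astype('int32') for l in label]# indices where the label is 1; these appear to be the targets
One point I am still unsure about: are the positions where the label is 1 extracted because those frames are the targets? And do those targets, put together, form the "micro-movie" (the summary)?
To judge whether the model is well trained, do we compare the positions of the frames we select against the target positions?
The paper gives the evaluation metric: F-score (P: precision, R: recall).
Since all the data is prepackaged and we only have per-frame labels and features, how is the intersection of A and B determined?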
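One possible answer to the intersection question: since both the prediction and the ground truth are per-frame selections, A∩B can be taken as the set of frame indices chosen by both. The sketch below assumes exactly that. The function name `f_score` and the sample prediction are hypothetical and not from the repository, and the paper's actual evaluation may measure overlap differently (for example by temporal duration of segments rather than by frame count).

```python
def f_score(pred_idx, target_idx):
    """Precision, recall and F-score between two sets of selected frame indices."""
    pred, target = set(pred_idx), set(target_idx)
    overlap = len(pred & target)          # |A ∩ B|: frames chosen by both
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred)       # P = |A ∩ B| / |A|
    recall = overlap / len(target)        # R = |A ∩ B| / |B|
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Ground-truth indices taken from the labelS example in these notes.
target = [156, 172, 176, 184, 189, 195, 199, 228, 230, 256, 272]
# Hypothetical prediction: four correct frames plus one wrong frame.
pred = [156, 172, 176, 184, 300]

p, r, f = f_score(pred, target)
print(p, r, f)
```

With frame-index sets, the "position difference" between prediction and target reduces to counting shared indices.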
summ_dppLSTM.py
The model code; defines class summ_dppLSTM()
video = inputs[0]
label = inputs[1]
labelS = inputs[2]
dpp_weight = inputs[3]
#format of the input data
>>> len(a[0])#video
291
>>> len(a[1])#label
291
>>> len(a[2])#labelS
11
>>> a[2]#labelS
array([156, 172, 176, 184, 189, 195, 199, 228, 230, 256, 272], dtype=int32)
>>> a[1][156]#labelS_2_label
1.0
>>> a[3]#dpp_weight: scales the DPP term in the loss further down; it seems to act as a regularizer against overfitting
array(1.0)
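The relationship between label (dense, one value per frame) and labelS (sparse, indices of the selected frames) in the dump above can be reproduced with a few lines of NumPy. This is only an illustration built from the numbers shown in the dump; the variable names `n_frames` and `recovered` are made up here.

```python
import numpy as np

n_frames = 291
labelS = np.array([156, 172, 176, 184, 189, 195, 199, 228, 230, 256, 272],
                  dtype='int32')

# Rebuild the dense per-frame label from the sparse index list.
label = np.zeros(n_frames, dtype='float32')
label[labelS] = 1.0

# Same expression as the one used in data_loader.py to derive the indices.
recovered = np.where(label)[0].astype('int32')

print(len(label), len(recovered), label[156])
```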
Parameter initialization for the four LSTM gates (input gate, candidate cell update, forget gate, output gate)
f = h5py.File(model_file)#'../models/model_trained_TVSum'
# image feature projection
self.c_init_mlp = mlp(model_file=model_file, layer_name='c_init_mlp', inputs=[T.mean(video, axis=0)], net_type='tanh')
self.h_init_mlp = mlp(model_file=model_file, layer_name='h_init_mlp', inputs=[T.mean(video, axis=0)], net_type='tanh')
# input gate
self.Wi = theano.shared(numpy.array(f[self.layer_name+'_Wi']).astype(theano.config.floatX))
self.Wi.name = self.layer_name + '_Wi'
self.bi = theano.shared(numpy.array(f[self.layer_name+'_bi']).astype(theano.config.floatX))
self.bi.name = self.layer_name + '_bi'
# input modulator
self.Wc = theano.shared(numpy.array(f[self.layer_name+'_Wc']).astype(theano.config.floatX))
self.Wc.name = self.layer_name + '_Wc'
self.bc = theano.shared(numpy.array(f[self.layer_name+'_bc']).astype(theano.config.floatX))
self.bc.name = self.layer_name + '_bc'
# forget gate
self.Wf = theano.shared(numpy.array(f[self.layer_name+'_Wf']).astype(theano.config.floatX))
self.Wf.name = self.layer_name + '_Wf'
self.bf = theano.shared(numpy.array(f[self.layer_name+'_bf']).astype(theano.config.floatX))
self.bf.name = self.layer_name + '_bf'
# output gate
self.Wo = theano.shared(numpy.array(f[self.layer_name+'_Wo']).astype(theano.config.floatX))
self.Wo.name = self.layer_name + '_Wo'
self.bo = theano.shared(numpy.array(f[self.layer_name+'_bo']).astype(theano.config.floatX))
self.bo.name = self.layer_name + '_bo'
# close the hdf5 model file
f.close()
The LSTM computation
def one_step(self, x_t, c_tm1, h_tm1):
    x_and_h = T.concatenate([x_t, h_tm1], axis=0)# current input joined with the previous hidden state
    i_t = T.nnet.sigmoid(T.dot(x_and_h, self.Wi) + self.bi)# input gate
    c_tilde = T.tanh(T.dot(x_and_h, self.Wc) + self.bc)# candidate update for the cell state
    f_t = T.nnet.sigmoid(T.dot(x_and_h, self.Wf) + self.bf)# forget gate
    o_t = T.nnet.sigmoid(T.dot(x_and_h, self.Wo) + self.bo)# output gate
    c_t = i_t * c_tilde + f_t * c_tm1# cell state update: input gate × candidate, plus forget gate × previous cell state
    h_t = o_t * T.tanh(c_t)# hidden state update
    return [c_t, h_t]
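The same update can be written in plain NumPy, which makes the shapes easy to check outside Theano. This is a minimal sketch, not the repository's code: the parameter dictionary `params`, the helper `sigmoid`, and the toy sizes `d_in` and `d_hid` are assumptions made for illustration, and the weights are random rather than loaded from the model file.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def one_step(x_t, c_tm1, h_tm1, p):
    """One LSTM step mirroring the Theano one_step; p holds Wi, bi, Wc, bc, Wf, bf, Wo, bo."""
    x_and_h = np.concatenate([x_t, h_tm1], axis=0)    # shape (d_in + d_hid,)
    i_t = sigmoid(x_and_h @ p['Wi'] + p['bi'])        # input gate
    c_tilde = np.tanh(x_and_h @ p['Wc'] + p['bc'])    # candidate cell update
    f_t = sigmoid(x_and_h @ p['Wf'] + p['bf'])        # forget gate
    o_t = sigmoid(x_and_h @ p['Wo'] + p['bo'])        # output gate
    c_t = i_t * c_tilde + f_t * c_tm1                 # new cell state
    h_t = o_t * np.tanh(c_t)                          # new hidden state
    return c_t, h_t


rng = np.random.default_rng(0)
d_in, d_hid = 8, 4                                    # assumed toy sizes
params = {}
for g in 'icfo':
    params['W' + g] = rng.normal(scale=0.1, size=(d_in + d_hid, d_hid))
    params['b' + g] = np.zeros(d_hid)

c, h = one_step(rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid), params)
print(c.shape, h.shape)
```

Each weight matrix has d_in + d_hid rows because the input and the previous hidden state are concatenated before the single matrix product.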
forwards
self.c0 = self.c_init_mlp.h[-1]# wherever [-1] appears it takes the last element, i.e. the latest state
self.h0 = self.h_init_mlp.h[-1]
([self.c, self.h], updates) = theano.scan(fn=self.one_step, sequences=[video], outputs_info=[self.c0, self.h0])
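#theano.scan builds a symbolic loop into the graph: it applies one_step to each row of video and carries c and h from one step to the next. It is closer to tf.scan than to sess.run(x, feed_dict={x: ...})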
backwards
self.c0_back = self.c_init_mlp.h[-1]
self.h0_back = self.h_init_mlp.h[-1]
([self.c_back, self.h_back], updates) = theano.scan(fn=self.one_step, sequences=[video[::-1, :]], outputs_info=[self.c0_back, self.h0_back])
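Comparing the forwards and backwards computations raises a few questions:
- Both use one_step for the iterative update, so why is one a forward pass and the other backward? Judging from the code, the only difference is the input sequence: video versus video[::-1, :]. "backwards" here means reading the frames in reverse order, as in a bidirectional LSTM, not backpropagation.
- self.c0 and self.c0_back both take the value self.c_init_mlp.h[-1], so are the two equal? True.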
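To make the forwards/backwards pair concrete, here is a plain-Python stand-in for what theano.scan does in the two calls: loop over the rows of the sequence, carry (c, h) along, and collect every intermediate state. Everything in it is hypothetical scaffolding: `scan` is a hand-written loop rather than Theano's API, and `toy_step` is a deliberately simple recurrence standing in for one_step.

```python
import numpy as np


def scan(step, sequence, c0, h0):
    """Apply step to each row of sequence, threading (c, h) through the loop."""
    c, h = c0, h0
    cs, hs = [], []
    for x_t in sequence:
        c, h = step(x_t, c, h)
        cs.append(c)
        hs.append(h)
    return np.stack(cs), np.stack(hs)


def toy_step(x_t, c_tm1, h_tm1):
    # Hypothetical recurrence standing in for the LSTM's one_step.
    c_t = 0.5 * c_tm1 + x_t
    h_t = np.tanh(c_t)
    return c_t, h_t


video = np.arange(6, dtype=float).reshape(3, 2)   # 3 "frames", 2 features each
c0 = np.zeros(2)
h0 = np.zeros(2)

# Same step function and same initial state; only the frame order differs.
c_fwd, h_fwd = scan(toy_step, video, c0, h0)
c_bwd, h_bwd = scan(toy_step, video[::-1, :], c0, h0)

print(c_fwd[-1], c_bwd[-1])
```

In this sketch row k of `h_bwd` belongs to frame T-1-k, so it would have to be flipped again (`h_bwd[::-1]`) before being paired frame by frame with `h_fwd`.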
Cost computation
vsLSTM formula: f_I(·) gives the frame-level importance
self.classify_mlp = mlp(model_file=model_file,
layer_name='classify_mlp',
# inputs=[self.h[-1, :]],
inputs =[self.h],
net_type='linear')
self.kernel_mlp = mlp(model_file=model_file,
layer_name='kernel_mlp',
# inputs=[self.h[-1, :]],
inputs =[self.h],
net_type='linear')
self.pred = self.classify_mlp.h[-1]
self.pred_k = self.kernel_mlp.h[-1]
kv = self.pred_k # kv means kernel_vector
qv = self.pred
K_mat = T.dot(kv, kv.T)#square matrix of inner products between kernel vectors
Q_mat = T.outer(qv, qv)#square matrix
L = K_mat * Q_mat
Ly = L[labelS, :][:, labelS]#L_z
dpp_loss = (- (T.log(T.nlinalg.Det()(Ly)) - T.log(T.nlinalg.Det()(L + T.identity_like(L)))))# corresponds to equations 2 and 5
if not T.isnan(dpp_loss):
    loss = T.mean(T.sqr(self.pred.flatten() - label)) + dpp_weight * dpp_loss
else:
    loss = T.mean(T.sqr(self.pred.flatten() - label)) + dpp_weight * T.nlinalg.Det()(Ly + T.identity_like(Ly)) # when the dpp_loss is nan, just randomly fill in a number (noted here as akin to gradient clipping)
acc = T.log(T.nlinalg.Det()(L + T.identity_like(L)))
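To summarize the computation: it comes down to equations 2 and 5. From equation 2, Lz is a submatrix of L (the rows and columns of the selected frames). If Lz contains two identical rows or two identical columns, then by the properties of determinants |Lz| = 0, which gives a zero-valued determinant.
Question: under what circumstances do identical rows or columns appear, and what does that imply?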
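A small NumPy experiment hints at an answer. If two selected frames have the same kernel vector and the same score, their rows in Lz coincide and the determinant collapses to zero, so a DPP assigns that subset zero probability; this is how the determinant discourages redundant selections. The sketch mirrors the Theano expressions above, but the helper names `build_L` and `dpp_loss` and the hand-picked `kv` and `qv` values are assumptions for illustration only.

```python
import numpy as np


def build_L(kv, qv):
    K_mat = kv @ kv.T                # [N, N] inner products of kernel vectors
    Q_mat = np.outer(qv, qv)         # [N, N] pairwise products of scores
    return K_mat * Q_mat             # element-wise product, as in the notes


def dpp_loss(L, labelS):
    Ly = L[labelS, :][:, labelS]     # rows and columns of the selected frames
    log_det_Ly = np.log(np.linalg.det(Ly))
    log_det_norm = np.log(np.linalg.det(L + np.eye(L.shape[0])))
    return -(log_det_Ly - log_det_norm)


# Four toy frames; frame 1 is an exact copy of frame 0.
kv = np.array([[1.0, 0.0, 0.0],
               [1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
qv = np.array([1.0, 1.0, 2.0, 0.5])

L = build_L(kv, qv)
det_duplicate = np.linalg.det(L[[0, 1], :][:, [0, 1]])   # redundant pair
det_diverse = np.linalg.det(L[[0, 2], :][:, [0, 2]])     # dissimilar pair
loss = dpp_loss(L, [0, 2])

print(det_duplicate, det_diverse, loss)
```

Taking the log of a zero determinant yields -inf (and NaN can follow downstream), which is presumably why the original code carries a NaN fallback branch.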
#print some of the intermediate values
……
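labelS = inputs[2]#9: the breakpoint (keyframe) labels cover 9 frames
pred = pred_values[0]#[280,1], corresponds to qv; after simple subsampling (one frame in every 15) only 280 frames remain; each frame is fed to the LSTM as x_t, and the outputs are either 0 or 1
pred_k = pred_values[1]#[280,256], corresponds to kv; taken from the current hidden-layer state, i.e. the LSTM's h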
……
K_mat = T.dot(kv, kv.T)#square matrix of inner products, [280,280]; kv is [280,256], so kv times kv.T is square
Q_mat = T.outer(qv, qv)#square matrix, outer product, [280,280]
L = K_mat * Q_mat#[280,280]
Ly = L[labelS, :][:, labelS]#L_z#[9,9]
#combined with the inputs data shown earlier
>>> a[2]#labelS
array([156, 172, 176, 184, 189, 195, 199, 228, 230], dtype=int32)
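#L is an N×N matrix
#L_z is obtained by taking row i of L for every i in a[2], which gives a temporary array B=L[labelS, :], and then taking column j of every row of B for every j in a[2]
Q_mat holds the pairwise products of the per-frame scores in qv, Ly is the Lz of the equations, and labelS is a fixed list of indices.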
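The two-step indexing described above is easy to verify in NumPy on a small matrix. The 6×6 matrix and the index list here are made-up toy values; `np.ix_` is shown as an equivalent one-step form.

```python
import numpy as np

L = np.arange(36, dtype=float).reshape(6, 6)     # toy stand-in for the N×N matrix
labelS = np.array([1, 3, 4], dtype='int32')      # toy stand-in for a[2]

B = L[labelS, :]                    # step 1: keep rows 1, 3, 4 -> shape (3, 6)
Ly = B[:, labelS]                   # step 2: keep columns 1, 3, 4 -> shape (3, 3)

Ly_ix = L[np.ix_(labelS, labelS)]   # same submatrix in a single indexing step
diag_only = L[labelS, labelS]       # NOT the same: pairs the indices elementwise

print(Ly)
print(diag_only)
```

Indexing with `L[labelS, labelS]` directly would return only the three diagonal entries, which is why the code indexes rows and columns in two separate steps.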
Second version
Implementation details of the TensorFlow version