Video-Summarization-with-LSTM
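The code is written in Theano + Python + MATLAB and feels somewhat bloated.
It runs on an ordinary machine not because it is efficient, but because all of the data is pre-packaged: the pretrained weights, the initial image features, and the labels have all been processed elsewhere beforehand.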

dppLSTM_main.py

dataset_testing = 'SumMe' # testing dataset: SumMe or TVSum
model_type = 2 # 1 for vsLSTM and 2 for dppLSTM, please refer to the readme file for more detail
train_set, val_set, val_idx, test_set, te_idx = data_loader.load_data(data_dir = '../data/', dataset_testing = dataset_testing, model_type = model_type)
train(model_idx = model_idx, train_set = train_set, val_set = val_set, model_saved = model_file)
inference(model_file=model_file, model_idx = model_idx, test_set=test_set, test_dir='./res_LSTM/', te_idx=te_idx)

About model_type

vsLSTM: 2 LSTM layers + MLP (multi-layer perceptron)
dppLSTM: 2 LSTM layers + MLP + DPP (determinantal point process), which adds a determinant computation

vsLSTM.png

dppLSTM.png

data_loader.py

Downloads and builds the dataset (with 'SumMe' chosen as the test set)
train: 120, valid: 19, test: 25

# format of the input data
>>> len(a[0])
    291
>>> len(a[1])
    291
>>> len(a[2])
    11
>>> a[2]
    array([156, 172, 176, 184, 189, 195, 199, 228, 230, 256, 272], dtype=int32)
>>> a[1][156]
    1.0
>>> a[3]
    array(1.0)

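All of the data is packaged as HDF5 files (read with h5py). Taking OVP as an example, each dataset consists of three parts, [feature, label, weight], where feature is the output of GoogLeNet's pool5 layer, and whether gt_1 or gt_2 is used depends on the chosen model (vs or dpp).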

 [feature, label, weight] = load_dataset_h5(data_dir, 'OVP', model_type)
 label_tmp = [numpy.where(l)[0].astype('int32') for l in label]  # indices of the entries equal to 1 (the target frames?)
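
As a small illustration of what the numpy.where line does, here is the same expression applied to a made-up label vector (the values are hypothetical, not taken from the dataset):

```python
import numpy

# Hypothetical per-frame 0/1 label vector for a 10-frame video.
label = numpy.array([0., 1., 0., 0., 1., 1., 0., 0., 0., 1.])

# numpy.where(cond) returns a tuple of index arrays; [0] picks the
# indices along the first (and only) axis where the label is non-zero.
labelS = numpy.where(label)[0].astype('int32')

print(labelS.tolist())        # positions of the frames labeled 1
print(label[labelS].tolist()) # every selected entry is 1.0
```

This matches the sample shown earlier, where a[2] lists indices and a[1][156] is 1.0.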

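One point I am still unsure about: are all the positions labeled '1' extracted because those frames are the targets? Can these target frames then be assembled into a "mini-movie"?
To judge whether the model is well trained, do we compare the positions of the frames we extract against the positions of the targets?
The paper uses the F-score as its evaluation metric (P: precision, R: recall).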

F-score.png

Since all of the data is pre-packaged and we only have per-frame information such as labels and features, how is the intersection of A and B determined?
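
One way to picture the intersection, assuming A is the generated summary and B is the ground-truth summary and both are given as 0/1 masks over the same frames: the overlap is just the number of frames selected in both. The function name f_score and the two masks below are made up for illustration; the repository's own evaluation may work on segments rather than single frames.

```python
import numpy

def f_score(pred_mask, gt_mask):
    """F-score between two binary per-frame masks of equal length."""
    pred_mask = numpy.asarray(pred_mask, dtype=bool)
    gt_mask = numpy.asarray(gt_mask, dtype=bool)
    # Frames selected in both summaries: the intersection of A and B.
    overlap = float(numpy.logical_and(pred_mask, gt_mask).sum())
    if overlap == 0:
        return 0.0
    precision = overlap / pred_mask.sum()  # overlap relative to A
    recall = overlap / gt_mask.sum()       # overlap relative to B
    return 2 * precision * recall / (precision + recall)

# Hypothetical 8-frame example.
A = [1, 1, 0, 0, 1, 0, 0, 0]  # generated summary
B = [0, 1, 1, 0, 1, 0, 0, 1]  # ground-truth summary
score = f_score(A, B)
print(score)
```

Here two frames overlap, so P = 2/3, R = 2/4, and F = 4/7.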

summ_dppLSTM.py

The model definition; it defines class summ_dppLSTM()

video = inputs[0]
label = inputs[1]
labelS = inputs[2]
dpp_weight = inputs[3]

# format of the input data
>>> len(a[0])#video
    291
>>> len(a[1])#label
    291
>>> len(a[2])#labelS
    11
>>> a[2]#labelS
    array([156, 172, 176, 184, 189, 195, 199, 228, 230, 256, 272], dtype=int32)
>>> a[1][156]#labelS_2_label
    1.0
>>> a[3]#dpp_weight; it weights the DPP term in the loss
    array(1.0)

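Initializing the parameters of the four LSTM gates (input gate, candidate cell update, forget gate, output gate) from the model file

        f = h5py.File(model_file, 'r')  # e.g. '../models/model_trained_TVSum'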
        # image feature projection
        self.c_init_mlp = mlp(model_file=model_file, layer_name='c_init_mlp', inputs=[T.mean(video, axis=0)], net_type='tanh')
        self.h_init_mlp = mlp(model_file=model_file, layer_name='h_init_mlp', inputs=[T.mean(video, axis=0)], net_type='tanh')
        # input gate
        self.Wi = theano.shared(numpy.array(f[self.layer_name+'_Wi']).astype(theano.config.floatX))
        self.Wi.name = self.layer_name + '_Wi'
        self.bi = theano.shared(numpy.array(f[self.layer_name+'_bi']).astype(theano.config.floatX))
        self.bi.name = self.layer_name + '_bi'
        # input modulator
        self.Wc = theano.shared(numpy.array(f[self.layer_name+'_Wc']).astype(theano.config.floatX))
        self.Wc.name = self.layer_name + '_Wc'
        self.bc = theano.shared(numpy.array(f[self.layer_name+'_bc']).astype(theano.config.floatX))
        self.bc.name = self.layer_name + '_bc'
        # forget gate
        self.Wf = theano.shared(numpy.array(f[self.layer_name+'_Wf']).astype(theano.config.floatX))
        self.Wf.name = self.layer_name + '_Wf'
        self.bf = theano.shared(numpy.array(f[self.layer_name+'_bf']).astype(theano.config.floatX))
        self.bf.name = self.layer_name + '_bf'
        # output gate
        self.Wo = theano.shared(numpy.array(f[self.layer_name+'_Wo']).astype(theano.config.floatX))
        self.Wo.name = self.layer_name + '_Wo'
        self.bo = theano.shared(numpy.array(f[self.layer_name+'_bo']).astype(theano.config.floatX))
        self.bo.name = self.layer_name + '_bo'
        # close the hdf5 model file
        f.close()

The LSTM computation

def one_step(self, x_t, c_tm1, h_tm1):
    x_and_h = T.concatenate([x_t, h_tm1], axis=0)  # current input joined with the previous hidden state
    i_t = T.nnet.sigmoid(T.dot(x_and_h, self.Wi) + self.bi)  # input gate
    c_tilde = T.tanh(T.dot(x_and_h, self.Wc) + self.bc)  # candidate cell state
    f_t = T.nnet.sigmoid(T.dot(x_and_h, self.Wf) + self.bf)  # forget gate
    o_t = T.nnet.sigmoid(T.dot(x_and_h, self.Wo) + self.bo)  # output gate
    c_t = i_t * c_tilde + f_t * c_tm1  # cell update: input gate × candidate, plus forget gate × previous cell state
    h_t = o_t * T.tanh(c_t)  # new hidden state
    return [c_t, h_t]
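
For checking shapes by hand, the same step can be written in plain NumPy. This is a minimal sketch: the params dictionary, the random weights, and the sizes n_in and n_hid are assumptions for illustration, whereas the real code loads Wi, Wc, Wf, Wo and the biases from the model file.

```python
import numpy

def sigmoid(z):
    return 1.0 / (1.0 + numpy.exp(-z))

def one_step(x_t, c_tm1, h_tm1, params):
    """One LSTM step mirroring the Theano code, in plain NumPy."""
    x_and_h = numpy.concatenate([x_t, h_tm1], axis=0)
    i_t = sigmoid(x_and_h.dot(params['Wi']) + params['bi'])          # input gate
    c_tilde = numpy.tanh(x_and_h.dot(params['Wc']) + params['bc'])   # candidate cell state
    f_t = sigmoid(x_and_h.dot(params['Wf']) + params['bf'])          # forget gate
    o_t = sigmoid(x_and_h.dot(params['Wo']) + params['bo'])          # output gate
    c_t = i_t * c_tilde + f_t * c_tm1
    h_t = o_t * numpy.tanh(c_t)
    return c_t, h_t

# Hypothetical sizes and random weights, for illustration only.
rng = numpy.random.RandomState(0)
n_in, n_hid = 4, 3
params = {}
for gate in 'icfo':
    params['W' + gate] = rng.randn(n_in + n_hid, n_hid) * 0.1
    params['b' + gate] = numpy.zeros(n_hid)

x_t = rng.randn(n_in)
c_t, h_t = one_step(x_t, numpy.zeros(n_hid), numpy.zeros(n_hid), params)
print(c_t.shape, h_t.shape)
```

Each weight matrix has n_in + n_hid rows because the input and the previous hidden state are concatenated before the dot product.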

forwards

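    self.c0 = self.c_init_mlp.h[-1]  # [-1] takes the last entry, i.e. the most recent state
    self.h0 = self.h_init_mlp.h[-1]
    ([self.c, self.h], updates) = theano.scan(fn=self.one_step, sequences=[video], outputs_info=[self.c0, self.h0])
    # theano.scan adds a symbolic loop to the graph: it applies one_step to each row of video in order, feeding the c and h returned by one step into the next. It is closer to tf.scan than to sess.run(x, feed_dict={x: ...}).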

backwards

    self.c0_back = self.c_init_mlp.h[-1]
    self.h0_back = self.h_init_mlp.h[-1]
    ([self.c_back, self.h_back], updates) = theano.scan(fn=self.one_step, sequences=[video[::-1, :]], outputs_info=[self.c0_back, self.h0_back])

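Comparing the forwards and backwards computations raises a couple of questions:

Both iterate with the same one_step function, so why is one a forward computation and the other "backwards"? Judging from the code, the only difference is the input order: backwards scans video[::-1, :], the frames in reverse temporal order. That makes it the second direction of a bidirectional LSTM, not backpropagation.
self.c0 and self.c0_back both appear to be set to self.c_init_mlp.h[-1], so are the two equal? True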
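
To see what reversing the input does, here is a toy stand-in for theano.scan in plain NumPy. The scan helper and the running-sum step are illustrative assumptions, not code from the repository; the real model would plug in one_step and carry both c and h.

```python
import numpy

def scan(step, sequence, h0):
    """Minimal stand-in for theano.scan: apply `step` row by row, carrying the state."""
    h, outputs = h0, []
    for x_t in sequence:
        h = step(x_t, h)
        outputs.append(h)
    return numpy.array(outputs)

def step(x_t, h_tm1):
    # Toy recurrence (a running sum) standing in for the LSTM's one_step.
    return h_tm1 + x_t

video = numpy.array([[1.], [2.], [3.], [4.]])  # 4 hypothetical "frames"
h0 = numpy.zeros(1)                            # same initial state for both passes

h_fwd = scan(step, video, h0)            # sees frames in order 1, 2, 3, 4
h_back = scan(step, video[::-1, :], h0)  # sees frames in order 4, 3, 2, 1

print(h_fwd.ravel().tolist())   # [1.0, 3.0, 6.0, 10.0]
print(h_back.ravel().tolist())  # [4.0, 7.0, 9.0, 10.0]
```

Note that the first row of h_back belongs to the last frame, so the backward states are in reversed order relative to the forward ones.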

Cost computation

Equation 1.png

The vsLSTM formula; f_I(·) computes the frame-level importance

Equation 2.png

Equation 3.png

Equation 4.png

Equation 5.png

  self.classify_mlp = mlp(model_file=model_file,
                                layer_name='classify_mlp',
                                # inputs=[self.h[-1, :]],
                                inputs =[self.h],
                                net_type='linear')

   self.kernel_mlp = mlp(model_file=model_file,
                                layer_name='kernel_mlp',
                                # inputs=[self.h[-1, :]],
                                inputs =[self.h],
                                net_type='linear')
    self.pred = self.classify_mlp.h[-1]
    self.pred_k = self.kernel_mlp.h[-1]

    kv = self.pred_k # kv means kernel_vector
    qv = self.pred
    K_mat = T.dot(kv, kv.T)  # square matrix of pairwise inner products
    Q_mat = T.outer(qv, qv)  # square matrix (outer product)
    L = K_mat * Q_mat
    Ly = L[labelS, :][:, labelS]  # L_z

    dpp_loss = (- (T.log(T.nlinalg.Det()(Ly)) - T.log(T.nlinalg.Det()(L + T.identity_like(L)))))  # corresponds to Equations 2 and 5
    if not T.isnan(dpp_loss):
        loss = T.mean(T.sqr(self.pred.flatten() - label)) + dpp_weight * dpp_loss
    else:
        loss = T.mean(T.sqr(self.pred.flatten() - label)) + dpp_weight * T.nlinalg.Det()(Ly + T.identity_like(Ly)) # when the dpp_loss is nan, just randomly fill in a number (a NaN fallback, not gradient clipping in the usual sense)

    acc = T.log(T.nlinalg.Det()(L + T.identity_like(L)))
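
The DPP term can be reproduced in plain NumPy to check shapes and values. This is only a sketch: the function name dpp_loss, the random kv and qv, and the toy labelS are made up, and numpy.linalg.slogdet is swapped in for log(det(·)) because it is less prone to overflow and underflow; the repository itself uses T.nlinalg.Det().

```python
import numpy

def dpp_loss(kv, qv, labelS):
    """Negative log-likelihood of the subset labelS under L = (kv kv^T) * (qv qv^T)."""
    K_mat = kv.dot(kv.T)           # [N, N] pairwise inner products
    Q_mat = numpy.outer(qv, qv)    # [N, N] pairwise score products
    L = K_mat * Q_mat              # element-wise product
    Ly = L[labelS, :][:, labelS]   # submatrix for the selected frames
    sign_y, logdet_y = numpy.linalg.slogdet(Ly)
    _, logdet_full = numpy.linalg.slogdet(L + numpy.eye(L.shape[0]))
    if sign_y <= 0:
        # det(Ly) is zero (or numerically negative): the log is undefined.
        return numpy.inf
    return -(logdet_y - logdet_full)

# Hypothetical inputs: 6 frames, 4-dimensional kernel vectors.
rng = numpy.random.RandomState(0)
kv = rng.randn(6, 4)
qv = rng.rand(6) + 0.5
labelS = numpy.array([1, 3])
loss = dpp_loss(kv, qv, labelS)
print(loss)
```

Returning infinity for a non-positive determinant is a choice made for this sketch; it makes the degenerate case explicit instead of producing NaN.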

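An overview of the computation, which mainly follows Equation 2 and Equation 5. From Equation 2 it is easy to see that L_z is a submatrix of L. If L_z contains two identical rows or two identical columns, then by the properties of determinants |L_z| = 0, which gives a zero-valued determinant.
Question: under what circumstances do identical rows or columns appear, and what does it mean when they do?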
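
One case that can be checked numerically: if two selected frames have the same kernel vector and the same score, their rows in L are identical, and any L_z containing both has a zero determinant. In DPP terms such a subset gets zero probability, which is how redundant frames are penalized. The sizes and random values below are made up for illustration.

```python
import numpy

rng = numpy.random.RandomState(0)
kv = rng.randn(5, 3)      # hypothetical kernel vectors for 5 frames
qv = rng.rand(5) + 0.5    # hypothetical per-frame scores

# Make frame 3 an exact copy of frame 1.
kv[3] = kv[1]
qv[3] = qv[1]

L = kv.dot(kv.T) * numpy.outer(qv, qv)

# Subset containing both copies: two identical rows, determinant ~ 0.
det_dup = numpy.linalg.det(L[[1, 3], :][:, [1, 3]])
# Subset of two distinct frames: determinant is positive.
det_div = numpy.linalg.det(L[[0, 1], :][:, [0, 1]])

print(det_dup, det_div)
```

Near-duplicate frames behave the same way in a softer form: the more similar two selected frames are, the smaller the determinant becomes.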

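# print some intermediate values
……
labelS = inputs[2]  # length 9: 9 frames are labeled
pred = pred_values[0]  # [280,1], corresponds to qv; after simple subsampling (one frame in every 15) only 280 frames remain, each frame is fed to the LSTM as x_t, and the per-frame output targets a 0/1 label
pred_k = pred_values[1]  # [280,256], corresponds to kv; computed from the current hidden state, i.e. the LSTM's h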
……

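K_mat = T.dot(kv, kv.T)  # square matrix of inner products, [280,280]; kv is [280,256], so kv · kv.T is square
Q_mat = T.outer(qv, qv)  # square matrix, outer product, [280,280]
L = K_mat * Q_mat  # [280,280], element-wise product
Ly = L[labelS, :][:, labelS]  # L_z, [9,9]

# combined with the inputs data shown earlier
>>> a[2]#labelS
    array([156, 172, 176, 184, 189, 195, 199, 228, 230], dtype=int32)
    # L is an N×N matrix
    # L_z: take row i of L for every i in a[2] to form a temporary array B = L[labelS, :], then from every row of B take column j for every j in a[2]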
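
The two-step indexing can be tried on a small matrix. The 6×6 matrix and the three indices below are hypothetical stand-ins for L and labelS; numpy.ix_ is shown as an equivalent one-step form in NumPy.

```python
import numpy

L = numpy.arange(36, dtype=float).reshape(6, 6)  # stand-in for the N×N matrix L
labelS = numpy.array([1, 3, 4], dtype='int32')

B = L[labelS, :]    # rows 1, 3, 4      -> shape (3, 6)
Ly = B[:, labelS]   # columns 1, 3, 4   -> shape (3, 3)

# numpy.ix_ builds the same row/column selection in one step.
Ly_ix = L[numpy.ix_(labelS, labelS)]

# L[labelS, labelS] is different: it pairs the indices and returns
# only the diagonal entries L[1,1], L[3,3], L[4,4].
diag_only = L[labelS, labelS]

print(Ly.tolist())
print(diag_only.tolist())
```

The chained form L[labelS, :][:, labelS] is what gives the full square submatrix, which is why the code does not write L[labelS, labelS].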

Q_mat holds the pairwise products of the per-frame scores qv, Ly is the L_z in the equations, and labelS is a fixed list of frame indices.

Second version

Implementation details of the TensorFlow version

Video-LSTM: understanding the code (in progress)