A little progress every day - tricks

Since my research is in deep learning, my main language is Python. While writing programs I often run into small tricks, so I record them here and will keep updating the list. If you have any questions, please leave a comment below or email strikerklaas@gmail.com.


one-hot vector

A one-hot vector is a fundamental tool in natural language processing: as the name suggests, it is a vector in which exactly one element is 1 and all the others are 0. One-hot vectors are frequently used as neural-network inputs, where they act as an indexing mechanism. Suppose we have a vocabulary of 4000 words for text generation; then each word gets its own unique 4000-dimensional one-hot vector. How do we build such a matrix in practice? Different tasks initialize the vectors differently, so let us start with a small example: data labeled with two classes, 0 and 1.

  • classification
    Suppose there are only two classes, 0 and 1. The corresponding one-hot vectors are [1,0] and [0,1]. Now suppose we have six training samples whose labels are stored in an array [0,1,0,1,1,0]. We first build an identity (eye) matrix and then let the label array index into it, producing a matrix that contains the one-hot vector for every sample.
>>> import numpy as np
>>> x = np.eye(2) # Two types of vectors
>>> y = np.array([0,1,0,1,1,0]) # classes
>>> x
array([[ 1.,  0.],
       [ 0.,  1.]])
>>> y
array([0, 1, 0, 1, 1, 0])
>>> x[y] # By indexing, we generate a matrix for learning
array([[ 1.,  0.],
       [ 0.,  1.],
       [ 1.,  0.],
       [ 0.,  1.],
       [ 0.,  1.],
       [ 1.,  0.]])
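The same indexing trick scales to a word vocabulary. A minimal sketch, assuming a hypothetical four-word vocabulary (for the 4000-word case, the eye matrix would simply be np.eye(4000)):

```python
import numpy as np

# Hypothetical toy vocabulary; each word gets one row of the eye matrix.
vocab = ['the', 'cat', 'sat', 'down']
word_to_index = {w: i for i, w in enumerate(vocab)}

eye = np.eye(len(vocab))                      # one one-hot vector per word
sentence = ['the', 'cat', 'sat']
indices = [word_to_index[w] for w in sentence]
one_hot = eye[indices]                        # same indexing trick as above
print(one_hot)
```

Each row of `one_hot` is the one-hot vector of the corresponding word in the sentence.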

float32 (theano)

The default floating-point data type in NumPy is float64; however, data must be converted to float32 before it can be stored on the GPU.

  • convert to float32
epilson = np.float32(0.01)
  • use shared statement
import theano
import theano.tensor as T
w = theano.shared(np.random.randn(input_dimension, output_dimension).astype('float32'), name='w')
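The dtype conversion itself can be checked with NumPy alone (Theano is not needed to see the effect):

```python
import numpy as np

x = np.random.randn(3, 4)       # NumPy defaults to float64
print(x.dtype)                  # float64

w = x.astype('float32')         # what we would hand to theano.shared
print(w.dtype)                  # float32

epsilon = np.float32(0.01)      # scalar constants need conversion too
print(epsilon.dtype)            # float32
```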

MNIST dataset

The MNIST dataset is a universally used benchmark for digit recognition; its characteristics can be summarized as follows:

  1. train set:50,000, validation set:10,000,test set:10,000
  2. 28 x 28 pixels (each training example is represented as a 1-dimensional array of length 784)
Now, let us open the dataset in Python and prepare it for GPU acceleration.

import cPickle, gzip, numpy, theano
import theano.tensor as T

Load the dataset

f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()
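Each row of train_set[0] is one image flattened to a length-784 vector; the reshape can be sketched with a dummy image (hypothetical data, no download needed):

```python
import numpy as np

# A dummy 28x28 "image" standing in for one MNIST digit.
image = np.arange(28 * 28, dtype='float32').reshape(28, 28)

row = image.reshape(-1)          # flatten to a length-784 vector, as stored in the dataset
print(row.shape)                 # (784,)

# Going back the other way, e.g. for plotting:
restored = row.reshape(28, 28)
print(np.array_equal(image, restored))  # True
```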

Next, store the data into GPU memory

def share_dataset(data_xy):
    # Store the data as Theano shared variables
    data_x, data_y = data_xy
    shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
    shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
    '''
    The following syntax also works:
    shared_x = theano.shared(data_x.astype('float32'))
    shared_y = theano.shared(data_y.astype('float32'))
    '''
    # Since the labels should be integers, not floats, we cast them back
    return shared_x, T.cast(shared_y, 'int32')

Now try it!

test_set_x, test_set_y = share_dataset(test_set)
valid_set_x, valid_set_y = share_dataset(valid_set)
train_set_x, train_set_y = share_dataset(train_set)
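The dtype round-trip inside share_dataset can be verified with NumPy alone (T.cast performs the final cast symbolically on the Theano side); a sketch with hypothetical toy arrays:

```python
import numpy as np

# Toy stand-in for one (data_x, data_y) pair.
data_x = np.random.randn(6, 784)           # float64 features
data_y = np.array([5, 0, 4, 1, 9, 2])      # integer labels

# What share_dataset does before handing the arrays over:
shared_x = np.asarray(data_x, dtype='float32')        # floatX for GPU storage
shared_y_float = np.asarray(data_y, dtype='float32')  # labels stored as floats too
shared_y = shared_y_float.astype('int32')             # then cast back, like T.cast

print(shared_x.dtype, shared_y.dtype)      # float32 int32
```

Small integer labels survive the float32 round-trip exactly, which is why the float-then-cast pattern is safe here.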


