2020-02-14

1. Linear regression: vectorized computation

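When training a model or making predictions, we often process many data examples at once and rely on vectorized computation. Before giving the vectorized expression for linear regression, consider two ways of adding two vectors.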

1. One way to add two vectors is to add them element by element, one scalar addition at a time.

2. The other way is to add the two vectors directly with a single vector addition.

To test this, first create two 1000-dimensional vectors and define a timer class.

import torch
import time

# initialize a and b as 1000-dimensional vectors
n = 1000
a = torch.ones(n)
b = torch.ones(n)

# define a timer class to record running times
class Timer(object):
    """Record multiple running times."""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        # start the timer
        self.start_time = time.time()

    def stop(self):
        # stop the timer and append the elapsed time to the list
        self.times.append(time.time() - self.start_time)
        return self.times[-1]

    def avg(self):
        # return the average of the recorded times
        return sum(self.times) / len(self.times)

    def sum(self):
        # return the sum of the recorded times
        return sum(self.times)

Now we can run the test. First, add the two vectors element by element in a for loop, one scalar addition at a time.

timer = Timer()
c = torch.zeros(n)
for i in range(n):
    c[i] = a[i] + b[i]
'%.5f sec' % timer.stop()

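Next, use torch to add the two vectors directly as a single vector operation:

timer.start()
d = a + b
'%.5f sec' % timer.stop()

The vectorized addition is clearly faster than the loop. We should therefore use vectorized computation wherever possible to improve efficiency.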
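A single timing can be noisy, so the comparison can be repeated several times and averaged. The following is a minimal sketch, assuming `time.perf_counter` as the clock and a hypothetical helper `time_it`; neither appears in the original post. It also checks that both approaches produce the same result before their speed is compared.

```python
import time
import torch

n = 1000
a = torch.ones(n)
b = torch.ones(n)

def time_it(fn, repeats=5):
    """Hypothetical helper: average wall-clock time of fn over several runs."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)

def loop_add():
    # element-wise scalar addition in a Python for loop
    c = torch.zeros(n)
    for i in range(n):
        c[i] = a[i] + b[i]
    return c

def vector_add():
    # one vectorized addition over the whole tensor
    return a + b

# both approaches must agree before their speed is compared
same = torch.equal(loop_add(), vector_add())
print('same result:', same)
print('loop:   %.5f sec' % time_it(loop_add))
print('vector: %.5f sec' % time_it(vector_add))
```

The exact timings depend on the machine, so no particular numbers should be expected.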

Implementing linear regression from scratch

# import packages and modules
%matplotlib inline
import torch
from IPython import display
from matplotlib import pyplot as plt
import numpy as np
import random

print(torch.__version__)

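Generating the dataset

We use a linear model to generate a dataset of 1000 examples. The data follow this linear relationship:

$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$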

# set the number of input features
num_inputs = 2
# set the number of examples
num_examples = 1000

# set the true weights and bias used to generate the corresponding labels
true_w = [2, -3.4]
true_b = 4.2

features = torch.randn(num_examples, num_inputs,
                       dtype=torch.float32)
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()),
                       dtype=torch.float32)
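In matrix form, the generation code corresponds to the model below. This is a sketch of the notation only; the symbols $\mathbf{X}$, $\mathbf{w}$, $\mathbf{y}$, and $\boldsymbol{\epsilon}$ are introduced here for illustration and are not named in the original post. The constants are read off the code: 1000 examples, 2 features, true weights 2 and -3.4, bias 4.2, and Gaussian noise with standard deviation 0.01.

```latex
\mathbf{y} = \mathbf{X}\mathbf{w} + b + \boldsymbol{\epsilon},
\qquad
\mathbf{X} \in \mathbb{R}^{1000 \times 2},\quad
\mathbf{w} = \begin{bmatrix} 2 \\ -3.4 \end{bmatrix},\quad
b = 4.2,\quad
\epsilon_i \sim \mathcal{N}(0,\ 0.01^2)
```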

Plot the generated data:

plt.scatter(features[:, 1].numpy(), labels.numpy(), 1);

Reading the dataset

def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # shuffle so that examples are read in random order
    for i in range(0, num_examples, batch_size):
        # the last batch may be smaller than batch_size
        j = torch.LongTensor(indices[i: min(i + batch_size, num_examples)])
        yield features.index_select(0, j), labels.index_select(0, j)

batch_size = 10

for X, y in data_iter(batch_size, features, labels):
    print(X, '\n', y)
    break

Initializing the model parameters

w = torch.tensor(np.random.normal(0, 0.01, (num_inputs, 1)), dtype=torch.float32)
b = torch.zeros(1, dtype=torch.float32)

w.requires_grad_(requires_grad=True)
b.requires_grad_(requires_grad=True)

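Defining the model

Define the model whose parameters will be trained:

$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$

def linreg(X, w, b):
    return torch.mm(X, w) + b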
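The training loop later in this section calls `squared_loss`, but the post never shows its definition. The following is a minimal sketch, assuming the halved squared error $\frac{1}{2}(\hat{y} - y)^2$ per example; the definition the author actually used is not shown, so treat this as an assumption.

```python
import torch

def squared_loss(y_hat, y):
    # assumed definition: halved squared error for each example
    # y is reshaped to match y_hat so that (n, 1) - (n,) does not broadcast to (n, n)
    return (y_hat - y.view(y_hat.size())) ** 2 / 2

y_hat = torch.tensor([[1.0], [2.0], [3.0]])  # predictions, shape (3, 1)
y = torch.tensor([1.0, 0.0, 5.0])            # labels, shape (3,)
print(squared_loss(y_hat, y))                # per-example losses 0, 2, 2 with shape (3, 1)
```

The function returns one loss per example, which is why the training loop sums the result before calling `backward()`.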

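Defining the optimization function

The optimization function used here is minibatch stochastic gradient descent:

$$(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w}, b)} l^{(i)}(\mathbf{w}, b)$$

def sgd(params, lr, batch_size):
    for param in params:
        # use .data to update param without tracking the operation in autograd
        param.data -= lr * param.grad / batch_size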
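Writing through `.data` is one way to update parameters without recording the update in autograd. A commonly used alternative is to wrap the update in `torch.no_grad()`. The sketch below uses a hypothetical name `sgd_no_grad` and a toy parameter; it is not part of the original post.

```python
import torch

def sgd_no_grad(params, lr, batch_size):
    """Hypothetical variant of sgd that uses torch.no_grad() instead of .data."""
    with torch.no_grad():
        for param in params:
            # in-place update that autograd does not track
            param -= lr * param.grad / batch_size

w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (w ** 2).sum()   # gradient of the sum of squares is 2 * w = [2, 4]
loss.backward()
sgd_no_grad([w], lr=0.1, batch_size=2)
print(w)                # each entry moves by 0.1 * grad / 2, giving about [0.9, 1.8]
```

As with the `.data` version, the gradients still need to be reset to zero before the next backward pass.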

Training

Once the dataset, model, loss function, and optimization function are defined, we are ready to train the model.

# initialize hyperparameters
lr = 0.03
num_epochs = 5

net = linreg
# squared_loss is the per-example squared loss; it must be defined before this cell runs
loss = squared_loss

# training
for epoch in range(num_epochs):  # training repeats num_epochs times
    # in each epoch, every sample in the dataset is used once
    # X is the features and y is the labels of a minibatch
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y).sum()
        # compute the gradient of the minibatch loss
        l.backward()
        # update the model parameters with minibatch stochastic gradient descent
        sgd([w, b], lr, batch_size)
        # reset the parameter gradients
        w.grad.data.zero_()
        b.grad.data.zero_()
    train_l = loss(net(features, w, b), labels)
    print('epoch %d, loss %f' % (epoch + 1, train_l.mean().item()))

w, true_w, b, true_b

Concise implementation of linear regression with PyTorch

import torch
from torch import nn
import numpy as np

torch.manual_seed(1)

print(torch.__version__)
torch.set_default_tensor_type('torch.FloatTensor')

Generating the dataset

The dataset is generated in the same way as in the from-scratch implementation.

num_inputs = 2
num_examples = 1000

true_w = [2, -3.4]
true_b = 4.2

features = torch.tensor(np.random.normal(0, 1, (num_examples, num_inputs)), dtype=torch.float)
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()), dtype=torch.float)

Reading the dataset

import torch.utils.data as Data

batch_size = 10

# combine the features and labels of the dataset
dataset = Data.TensorDataset(features, labels)

# put the dataset into a DataLoader
data_iter = Data.DataLoader(
    dataset=dataset,            # torch TensorDataset format
    batch_size=batch_size,      # minibatch size
    shuffle=True,               # whether to shuffle the data
    num_workers=2,              # load data with two worker subprocesses
)

for X, y in data_iter:
    print(X, '\n', y)
    break

Defining the model

class LinearNet(nn.Module):
    def __init__(self, n_feature):
        super(LinearNet, self).__init__()      # call the parent constructor
        self.linear = nn.Linear(n_feature, 1)  # function prototype: `torch.nn.Linear(in_features, out_features, bias=True)`

    def forward(self, x):
        y = self.linear(x)
        return y

net = LinearNet(num_inputs)
print(net)

# ways to initialize a multilayer network
# method one
net = nn.Sequential(
    nn.Linear(num_inputs, 1)
    # other layers can be added here
    )

# method two
net = nn.Sequential()
net.add_module('linear', nn.Linear(num_inputs, 1))
# net.add_module ......

# method three
from collections import OrderedDict
net = nn.Sequential(OrderedDict([
          ('linear', nn.Linear(num_inputs, 1))
          # ......
        ]))

print(net)
print(net[0])

Initializing the model parameters

from torch.nn import init

init.normal_(net[0].weight, mean=0.0, std=0.01)
init.constant_(net[0].bias, val=0.0)  # or use `net[0].bias.data.fill_(0)` to modify it directly

for param in net.parameters():
    print(param)

Defining the loss function

loss = nn.MSELoss()    # built-in squared loss function in nn
# function prototype: `torch.nn.MSELoss(size_average=None, reduce=None, reduction='mean')`

Defining the optimization function

import torch.optim as optim

optimizer = optim.SGD(net.parameters(), lr=0.03)   # built-in stochastic gradient descent
print(optimizer)  # function prototype: `torch.optim.SGD(params, lr=, momentum=0, dampening=0, weight_decay=0, nesterov=False)`

Training

num_epochs = 3
for epoch in range(1, num_epochs + 1):
    for X, y in data_iter:
        output = net(X)
        l = loss(output, y.view(-1, 1))
        optimizer.zero_grad()  # reset the gradients, equivalent to net.zero_grad()
        l.backward()
        optimizer.step()
    print('epoch %d, loss: %f' % (epoch, l.item()))
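The `y.view(-1, 1)` in the loop matters because `net(X)` returns shape `(batch_size, 1)` while `y` has shape `(batch_size,)`. A small sketch with made-up values (not from the original post) shows what broadcasting would do if the shapes were left mismatched:

```python
import torch

output = torch.tensor([[1.0], [2.0], [3.0]])  # shape (3, 1), like net(X)
y = torch.tensor([1.0, 2.0, 3.0])             # shape (3,), like a label batch

# mismatched shapes broadcast to a (3, 3) matrix of pairwise differences
wrong = output - y
# reshaping y to (3, 1) gives the intended element-wise difference
right = output - y.view(-1, 1)

print(wrong.shape)  # torch.Size([3, 3])
print(right.shape)  # torch.Size([3, 1])
```

With matching shapes every difference here is zero, whereas the broadcast version contains non-zero off-diagonal entries that would inflate the loss.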

# compare the results
dense = net[0]
print(true_w, dense.weight.data)
print(true_b, dense.bias.data)

2. Supplement: [NumPy library function] how to use reshape, including the -1 argument

(Original post: https://blog.csdn.net/qq_37791134/article/details/90543879)

numpy.reshape: gives an array a new shape without changing its data.

numpy.reshape(a, newshape, order='C')

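Parameters:

1. a: array_like

The array to be reshaped.

2. newshape: int or tuple of ints

The new shape must be compatible with the original shape. If it is an integer, the result is a 1-D array of that length. One shape dimension can be -1, in which case its value is inferred from the length of the array and the remaining dimensions.

3. order: {'C', 'F', 'A'}, optional

Read the elements of a using this index order, and place them into the reshaped array using the same index order. 'C' means C-like index order, with the last axis index changing fastest and the first axis index changing slowest. 'F' means Fortran-like index order, with the first index changing fastest and the last index changing slowest. Note that 'C' and 'F' take no account of the memory layout of the underlying array and refer only to the order of indexing. 'A' means Fortran-like index order if a is Fortran contiguous in memory, and C-like order otherwise.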
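The difference between the 'C' and 'F' orders is easiest to see on a small array. A minimal sketch, using a made-up six-element array rather than anything from the original post:

```python
import numpy as np

a = np.arange(6)  # [0, 1, 2, 3, 4, 5]

# 'C' order: the last axis changes fastest, so rows are filled first
c_order = a.reshape(2, 3, order='C')
# 'F' order: the first axis changes fastest, so columns are filled first
f_order = a.reshape(2, 3, order='F')

print(c_order)  # rows are [0 1 2] and [3 4 5]
print(f_order)  # rows are [0 2 4] and [1 3 5]
```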

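Note on newshape (int or tuple of ints):

In short, the new shape must be compatible with the original one. If one dimension is set to -1, NumPy computes that dimension from the array size and the remaining dimensions.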

Example: take an array z whose shape is (4, 4).

z = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12],
              [13, 14, 15, 16]])
z.shape

(4, 4)

z.reshape(-1)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16])

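z.reshape(-1, 1) means that we do not know how many rows the new array should have, but we want it to have exactly one column. NumPy works out that there must be 16 rows, so the new shape is (16, 1), which is compatible with the original (4, 4).

z.reshape(-1, 1)

array([[ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10],
       [11],
       [12],
       [13],
       [14],
       [15],
       [16]])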

z.reshape(-1, 2): the number of rows is unknown and the number of columns is 2, so the shape after reshape is (8, 2).

z.reshape(-1, 2)

array([[ 1,  2],
       [ 3,  4],
       [ 5,  6],
       [ 7,  8],
       [ 9, 10],
       [11, 12],
       [13, 14],
       [15, 16]])

Likewise, if only the number of rows is given, the number of columns can be set to -1 and NumPy will compute it automatically.
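A short sketch of the row-fixed case, and of what happens when the known dimensions do not divide the array size. The array reuses the values of z from the example; the variable `inferred_ok` is introduced here only for illustration.

```python
import numpy as np

z = np.arange(1, 17).reshape(4, 4)  # same values as z in the example

# rows fixed at 2, column count inferred as 16 / 2 = 8
print(z.reshape(2, -1).shape)  # (2, 8)

# -1 only works when the total size is divisible by the known dimensions
try:
    z.reshape(-1, 3)
    inferred_ok = True
except ValueError as err:
    inferred_ok = False
    print('cannot reshape:', err)
```

Only one dimension may be given as -1, since NumPy needs all the others to solve for it.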

3. Maximum likelihood estimation and minimizing the cross-entropy loss function

(https://blog.csdn.net/zgcr654321/article/details/85204049)

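The concept of likelihood: in plain terms, "likelihood" means how probable something is, and maximum likelihood means the greatest such probability.

The likelihood function: the likelihood function is a function of a set of probabilities in a statistical model (whose true values we do not know). Its value expresses how likely those probability parameters are.

Maximum likelihood estimation: after writing down the likelihood function, we draw a batch of n samples from real events. Maximum likelihood estimation then looks for the most likely values of those probabilities given the n samples; that is, among all possible values, it finds the set of probabilities that maximizes the "likelihood" of the n observed samples.

Sampling for maximum likelihood estimation must satisfy one important assumption: all samples are independent and identically distributed.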
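Under the independent and identically distributed assumption, the likelihood factorizes into a product over the samples, which is what turns the log-likelihood into a sum. As a sketch of the notation, with $\theta$ standing for the unknown probability parameters and $x_1, \dots, x_n$ for the samples (symbols chosen here, not taken from the original post):

```latex
L(\theta) = \prod_{i=1}^{n} P(x_i; \theta),
\qquad
\log L(\theta) = \sum_{i=1}^{n} \log P(x_i; \theta),
\qquad
\hat{\theta} = \arg\max_{\theta} \log L(\theta)
```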

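The Bernoulli distribution: also known as the two-point distribution or 0-1 distribution. Before introducing it, we first need the notion of a Bernoulli trial.

A Bernoulli trial is a single random experiment with only two possible outcomes. That is, for a random variable X:

P(X=1)=p

P(X=0)=1−p

A Bernoulli trial can be phrased as a yes-or-no question.

If experiment E is a Bernoulli trial and E is repeated independently n times, the sequence of independent repetitions is called an n-fold Bernoulli trial.

If a single Bernoulli trial succeeds with probability p and fails with probability 1-p, the random variable X is said to follow a Bernoulli distribution.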

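Its probability mass function is:

$$P(X = x) = p^{x} (1 - p)^{1 - x}, \quad x \in \{0, 1\}$$

The Bernoulli distribution is a discrete probability distribution, the special case of the binomial distribution with N=1.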

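Deriving the cross-entropy loss function from maximum likelihood estimation under the Bernoulli distribution:

Assume

P(X=1)=p, P(X=0)=1−p

Then the probability mass function is

$$P(X = x) = p^{x} (1 - p)^{1 - x}$$

We only have a sampled dataset D, from which we can count the values of X and 1-X, but the value of p (the probability) is unknown. We now build the log-likelihood function and use the sampled dataset D to solve for p.

For samples $x_1, \dots, x_n$ in D, the log-likelihood function is:

$$\log L(p) = \sum_{i=1}^{n} \left[ x_i \log p + (1 - x_i) \log (1 - p) \right]$$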

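We can see that the log-likelihood has almost the same form as the cross-entropy loss function in deep learning models. Its value is always less than 0, and maximum likelihood estimation seeks its maximum. In other words, during gradient descent in deep learning this value rises from a negative number toward 0 while always staying below 0. To be consistent with other loss functions, we put a minus sign in front, so that, like other losses, it decreases from a large value toward 0.

The cross-entropy function in deep learning models has the form below, where $y_i$ is the label and $\hat{y}_i$ is the predicted probability:

$$\mathrm{loss} = -\sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$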

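Now we find the maximum likelihood estimate by locating the extremum through differentiation. Write the log-likelihood of samples $x_1, \dots, x_n$ as $\log L(p) = \sum_{i=1}^{n} \left[ x_i \log p + (1 - x_i) \log (1 - p) \right]$, differentiate it with respect to p, and set the derivative to 0:

$$\frac{\partial \log L(p)}{\partial p} = \sum_{i=1}^{n} \left[ \frac{x_i}{p} - \frac{1 - x_i}{1 - p} \right] = 0$$

Clearing the denominators gives:

$$\sum_{i=1}^{n} \left[ x_i (1 - p) - (1 - x_i) p \right] = 0 \quad \Longrightarrow \quad p = \frac{1}{n} \sum_{i=1}^{n} x_i$$

This is the value of p given by maximum likelihood estimation under the Bernoulli distribution.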
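The maximum likelihood estimate of a Bernoulli parameter is the fraction of ones in the sample, and this can be checked numerically. The sketch below uses made-up samples and a hypothetical helper `log_likelihood`; it compares the sample mean with a brute-force grid search over p.

```python
import math

samples = [1, 0, 1, 1, 0, 1, 0, 1]  # made-up Bernoulli outcomes

def log_likelihood(p, xs):
    # sum of x*log(p) + (1 - x)*log(1 - p) over all samples
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in xs)

# closed-form estimate: the fraction of ones in the sample
p_closed = sum(samples) / len(samples)

# brute-force search over a grid of candidate p values in (0, 1)
grid = [k / 1000 for k in range(1, 1000)]
p_grid = max(grid, key=lambda p: log_likelihood(p, samples))

print(p_closed, p_grid)
```

Negating `log_likelihood` gives the binary cross-entropy summed over the samples, so minimizing that loss selects the same p.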
