A simple introduction to PyTorch by example
Translated from the GitHub repo jcjohnson/pytorch-examples.
I have added my own notes and comments; for example, backpropagation gets a diagram to make the process clearer.
The English was also translated (here rendered back into English) to make it easier to read.
I particularly like this repo: it goes from numpy to Tensors, from hand-written backpropagation to PyTorch's automatic differentiation, and from implementing the model, loss function, and weight updates by hand to defining custom models and calling the built-in loss functions and optimizers.
After working through the whole repo you should have a fairly solid grasp of how PyTorch works.
Simple examples to introduce PyTorch
This repo introduces the fundamental concepts of PyTorch through self-contained examples.
At its core, PyTorch provides two main features:
- An n-dimensional Tensor, similar to numpy but able to run on GPUs.
- Automatic differentiation for building and training neural networks.
We will use a fully-connected ReLU network as our running example. The network has a single hidden layer and is trained with gradient descent to fit random data by minimizing the Euclidean distance between the network output and the true output.
Note: these examples have been updated for PyTorch 0.4, which made several major changes to the core PyTorch API. Most notably, prior to 0.4 Tensors had to be wrapped in Variable objects in order to use autograd; this functionality has now been added directly to Tensors, and Variables are deprecated.
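For readers migrating old code, the difference looks roughly like this (a minimal sketch; the Variable import still exists for backward compatibility but is no longer needed):

import torch
# Pre-0.4 style (deprecated): wrap Tensors in Variable to use autograd
#   from torch.autograd import Variable
#   x = Variable(torch.randn(3), requires_grad=True)
# 0.4+ style: Tensors support autograd directly
x = torch.randn(3, requires_grad=True)
y = (x ** 2).sum()
y.backward()
print(x.grad)  # the gradient dy/dx = 2 * x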
Table of Contents
- Warm-up: numpy
- PyTorch: Tensors
- PyTorch: Autograd
- PyTorch: Defining new autograd functions
- TensorFlow: Static Graphs
- PyTorch: nn
- PyTorch: optim
- PyTorch: Custom nn Modules
- PyTorch: Control Flow and Weight Sharing
1. Warm-up: numpy
Before introducing PyTorch, we will first implement the network using numpy.
Numpy provides an n-dimensional array object and many functions for manipulating these arrays. Numpy is a generic framework for scientific computing; it knows nothing about computation graphs, deep learning, or gradients. However, we can easily fit a two-layer network to random data by manually implementing the forward and backward passes through the network using numpy operations:
import numpy as np

"""
A fully-connected ReLU network with one hidden layer and no biases, trained to
predict y from x using Euclidean error.

This implementation uses numpy to manually compute the forward pass, loss, and
backward pass.

A numpy array is a generic n-dimensional array; it does not know anything about
deep learning or gradients or computational graphs, and is just a way to perform
generic numeric computations.
"""

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)   # input (64, 1000)
y = np.random.randn(N, D_out)  # output (64, 10)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)   # input-to-hidden weights (1000, 100)
w2 = np.random.randn(H, D_out)  # hidden-to-output weights (100, 10)

learning_rate = 1e-6

for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)              # matrix multiply; hidden layer pre-activation (64, 100)
    h_relu = np.maximum(h, 0)  # ReLU activation
    # np.maximum(X, Y) takes the element-wise maximum of X and Y;
    # np.max(a, axis=None) takes the maximum of a single array along an axis.
    y_pred = h_relu.dot(w2)    # output layer (64, 10)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()  # .sum() adds up all elements
    print(t, loss)  # the goal is to drive the loss down over iterations

    # Backprop to compute gradients of w1 and w2 with respect to loss
    # (the tricky part -- see the derivation below)
    grad_y_pred = 2.0 * (y_pred - y)     # (64, 10)
    grad_w2 = h_relu.T.dot(grad_y_pred)  # (64,100)^T dot (64,10) = (100, 10)
    grad_h_relu = grad_y_pred.dot(w2.T)  # (64, 100)
    grad_h = grad_h_relu.copy()          # deep copy (64, 100)
    grad_h[h < 0] = 0                    # backprop through the ReLU
    grad_w1 = x.T.dot(grad_h)            # (1000, 100)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
(Figure: a derivation diagram for the backpropagation step above.)
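In case the figure is not visible here, the same derivation in equations: with $L = \sum (y_{\mathrm{pred}} - y)^2$, the chain rule gives

$$
\begin{aligned}
h &= x\,w_1, \qquad h_{\mathrm{relu}} = \max(h, 0), \qquad y_{\mathrm{pred}} = h_{\mathrm{relu}}\,w_2,\\
\frac{\partial L}{\partial y_{\mathrm{pred}}} &= 2\,(y_{\mathrm{pred}} - y),\\
\frac{\partial L}{\partial w_2} &= h_{\mathrm{relu}}^{\top}\,\frac{\partial L}{\partial y_{\mathrm{pred}}}, \qquad
\frac{\partial L}{\partial h_{\mathrm{relu}}} = \frac{\partial L}{\partial y_{\mathrm{pred}}}\,w_2^{\top},\\
\frac{\partial L}{\partial h} &= \frac{\partial L}{\partial h_{\mathrm{relu}}} \odot \mathbf{1}[h > 0], \qquad
\frac{\partial L}{\partial w_1} = x^{\top}\,\frac{\partial L}{\partial h},
\end{aligned}
$$

where $\odot$ is element-wise multiplication and $\mathbf{1}[h>0]$ is the ReLU mask; each line corresponds one-to-one with a grad_* assignment in the code.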
2. PyTorch: Tensors
Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or more, so unfortunately numpy alone is not enough for modern deep learning.
Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Any computation you might want to perform with numpy can also be done with PyTorch Tensors; you should think of them as a generic tool for scientific computing.
Unlike numpy, however, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To run a PyTorch Tensor on a GPU, use the device argument when constructing the Tensor to place it on the GPU.
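A common pattern (a small sketch, not part of the original examples) is to pick the device once at the top of a script and reuse it everywhere:

import torch
# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.randn(64, 1000, device=device)  # created directly on that device
w = torch.randn(1000, 100).to(device)     # or moved there after creation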
Here we use PyTorch Tensors to fit a two-layer network to random data. Like the numpy example above, we manually implement the network's forward and backward passes using operations on PyTorch Tensors:
import torch

"""
A fully-connected ReLU network with one hidden layer and no biases, trained to
predict y from x by minimizing squared Euclidean distance.

This implementation uses PyTorch tensors to manually compute the forward pass,
loss, and backward pass.

A PyTorch Tensor is basically the same as a numpy array: it does not know
anything about deep learning or computational graphs or gradients, and is just
a generic n-dimensional array to be used for arbitrary numeric computation.

The biggest difference between a numpy array and a PyTorch Tensor is that
a PyTorch Tensor can run on either CPU or GPU. To run operations on the GPU,
just pass a different value to the `device` argument when constructing the
Tensor.
"""

device = torch.device('cpu')
# device = torch.device('cuda')  # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device)   # input (64, 1000)
y = torch.randn(N, D_out, device=device)  # output (64, 10)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device)   # input-to-hidden weights (1000, 100)
w2 = torch.randn(H, D_out, device=device)  # hidden-to-output weights (100, 10)

learning_rate = 1e-6

for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)  # matrix multiply; hidden layer (64, 100)
    # torch.mm() is matrix multiplication;
    # torch.mul() is element-wise multiplication.
    h_relu = h.clamp(min=0)  # ReLU activation
    # torch.clamp(input, min, max) clamps every element of input into the
    # interval [min, max] and returns the result as a new tensor.
    y_pred = h_relu.mm(w2)  # output layer (64, 10)

    # Compute and print loss; loss is a scalar, and is stored in a PyTorch Tensor
    # of shape (); we can get its value as a Python number with loss.item().
    loss = (y_pred - y).pow(2).sum()  # .sum() adds up all elements; shape torch.Size([])
    print(t, loss.item())  # .item() converts a zero-dimensional tensor to a Python number

    # Backprop to compute gradients of w1 and w2 with respect to loss
    # (same derivation as the numpy version above, unchanged)
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    # tensor.clone() plays the role .copy() played in the numpy version:
    # it returns a new tensor backed by its own memory.
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
3. PyTorch: Autograd
In the examples above we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but it can quickly get very hairy for large, complex networks.
Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network defines a computational graph; nodes in the graph are Tensors, and edges are functions that produce output Tensors from input Tensors. Backpropagating through this graph then lets you compute gradients easily.
This sounds complicated but is quite simple to use in practice. If we want to compute gradients with respect to some Tensor, we set requires_grad=True when constructing it. Any PyTorch operation on that Tensor will then cause a computational graph to be built, allowing us to later perform backpropagation through the graph. If x is a Tensor with requires_grad=True, then after backpropagation x.grad will be another Tensor holding the gradient of x with respect to some scalar value.
Sometimes you may wish to prevent PyTorch from building computational graphs when performing certain operations on Tensors with requires_grad=True; for example, when training a neural network we usually don't want to backpropagate through the weight-update step. In such cases we can use the torch.no_grad() context manager to prevent a computational graph from being built.
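Before the full example, here is the whole mechanism in miniature (a minimal sketch):

import torch
a = torch.tensor([2.0, 3.0], requires_grad=True)
loss = (a ** 3).sum()   # every operation on `a` is recorded in a graph
loss.backward()         # backpropagate through that graph
print(a.grad)           # d(loss)/da = 3 * a**2 -> tensor([12., 27.])
with torch.no_grad():
    a -= 0.1 * a.grad   # update in place without recording a graph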
Here we use PyTorch Tensors and autograd to implement our two-layer network; now we no longer need to manually implement the backward pass through the network:
import torch

"""
A fully-connected ReLU network with one hidden layer and no biases, trained to
predict y from x by minimizing squared Euclidean distance.

This implementation computes the forward pass using operations on PyTorch
Tensors, and uses PyTorch autograd to compute gradients.

When we create a PyTorch Tensor with requires_grad=True, then operations
involving that Tensor will not just compute values; they will also build up
a computational graph in the background, allowing us to easily backpropagate
through the graph to compute gradients of some downstream (scalar) loss with
respect to a Tensor. Concretely if x is a Tensor with x.requires_grad == True
then after backpropagation x.grad will be another Tensor holding the gradient
of x with respect to some scalar value.
"""

device = torch.device('cpu')
# device = torch.device('cuda')  # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs
x = torch.randn(N, D_in, device=device)   # input (64, 1000)
y = torch.randn(N, D_out, device=device)  # output (64, 10)

# Create random Tensors for weights; setting requires_grad=True means that we
# want to compute gradients for these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, requires_grad=True)   # input-to-hidden weights (1000, 100)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)  # hidden-to-output weights (100, 10)

learning_rate = 1e-6

for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors. Since w1 and
    # w2 have requires_grad=True, operations involving these Tensors will cause
    # PyTorch to build a computational graph, allowing automatic computation of
    # gradients. Since we are no longer implementing the backward pass by hand we
    # don't need to keep references to intermediate values.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)  # (64, 10)

    # Compute and print loss. Loss is a Tensor of shape (), and loss.item()
    # is a Python number giving its value.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()
    # In other words, loss.backward() only *computes* the gradients and stores
    # them in w1.grad and w2.grad; it does not update the weights itself.

    # Update weights using gradient descent. For this step we just want to mutate
    # the values of w1 and w2 in-place; we don't want to build up a computational
    # graph for the update steps, so we use the torch.no_grad() context manager
    # to prevent PyTorch from building a computational graph for the updates.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after running the backward pass
        w1.grad.zero_()
        w2.grad.zero_()
4. PyTorch: Defining new autograd functions
Under the hood, each primitive autograd operator is really two functions that operate on Tensors.
The forward function computes output Tensors from input Tensors.
The backward function receives the gradient of the output Tensors with respect to some scalar value, and computes the gradient of the input Tensors with respect to that same scalar value.
In PyTorch we can easily define our own autograd operator by defining a subclass of torch.autograd.Function and implementing the forward and backward functions. We can then use the new autograd operator by calling its apply method with Tensors containing input data.
In this example we define our own custom autograd function for performing the ReLU nonlinearity, and use it to implement our two-layer network:
import torch

"""
A fully-connected ReLU network with one hidden layer and no biases, trained to
predict y from x by minimizing squared Euclidean distance.

This implementation computes the forward pass using operations on PyTorch
Tensors, and uses PyTorch autograd to compute gradients.

In this implementation we implement our own custom autograd function to perform
the ReLU function.
"""

# Define a custom class inheriting from torch.autograd.Function
class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """
    @staticmethod
    def forward(ctx, x):  # x is a Tensor; ctx is a context object
        """
        In the forward pass we receive a context object and a Tensor containing the
        input; we must return a Tensor containing the output, and we can use the
        context object to cache objects for use in the backward pass.
        """
        ctx.save_for_backward(x)  # stash the input for use in backward
        return x.clamp(min=0)     # return the ReLU of the input, as a Tensor

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive the context object and a Tensor containing
        the gradient of the loss with respect to the output produced during the
        forward pass. We can retrieve cached data from the context object, and must
        compute and return the gradient of the loss with respect to the input to the
        forward function.
        """
        x, = ctx.saved_tensors        # retrieve the Tensor saved in forward
        grad_x = grad_output.clone()  # deep copy
        grad_x[x < 0] = 0             # gradient of the ReLU
        # Compare with the manual version above: grad_output plays the role of
        # grad_h_relu, x plays the role of h, and grad_x of grad_h:
        #   grad_h_relu = grad_y_pred.mm(w2.t())
        #   grad_h = grad_h_relu.clone()
        #   grad_h[h < 0] = 0
        return grad_x

device = torch.device('cpu')
# device = torch.device('cuda')  # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and output
x = torch.randn(N, D_in, device=device)   # input (64, 1000)
y = torch.randn(N, D_out, device=device)  # output (64, 10)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, requires_grad=True)   # input-to-hidden weights (1000, 100)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)  # hidden-to-output weights (100, 10)

learning_rate = 1e-6

for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors; we call our
    # custom ReLU implementation using the MyReLU.apply function
    y_pred = MyReLU.apply(x.mm(w1)).mm(w2)  # (64, 10)
    # previously: y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    with torch.no_grad():
        # Update weights using gradient descent
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after running the backward pass
        w1.grad.zero_()
        w2.grad.zero_()
5. TensorFlow: Static Graphs
PyTorch autograd looks a lot like TensorFlow: in both frameworks we define a computational graph and use automatic differentiation to compute gradients. The biggest difference between the two is that TensorFlow's computational graphs are static, while PyTorch uses dynamic computational graphs.
In TensorFlow, we define the computational graph once and then execute the same graph over and over again, possibly feeding different input data to the graph. In PyTorch, each forward pass defines a new computational graph.
Static graphs are nice because you can optimize the graph up front; for example, a framework might decide to fuse some graph operations for efficiency, or come up with a strategy for distributing the graph across many GPUs or many machines. If you are reusing the same graph over and over, this potentially costly up-front optimization can be amortized as the same graph is rerun repeatedly.
One aspect where static and dynamic graphs differ is control flow. For some models we may wish to perform different computation for each data point; for example, a recurrent network might be unrolled for different numbers of time steps for each data point, and this unrolling can be implemented as a loop. With a static graph, the loop construct needs to be part of the graph; for this reason TensorFlow provides operators such as tf.scan for embedding loops into the graph. With dynamic graphs the situation is simpler: since we build graphs on the fly for each example, we can use normal imperative flow control to perform computation that differs for each input.
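As a tiny illustration (my own sketch, not from the repo), ordinary Python control flow shapes the PyTorch graph anew on every run:

import torch
x = torch.randn(10, requires_grad=True)
if x.sum() > 0:           # a data-dependent branch, written as plain Python
    y = (x * 2).sum()
else:
    y = (x ** 2).sum()
y.backward()              # gradients flow through whichever branch actually ran
print(x.grad)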
To contrast with the PyTorch autograd example above, here we use TensorFlow to fit a simple two-layer network:
import tensorflow as tf
import numpy as np

"""
A fully-connected ReLU network with one hidden layer and no biases, trained to
predict y from x by minimizing squared Euclidean distance.

This implementation uses basic TensorFlow operations to set up a computational
graph, then executes the graph many times to actually train the network.

One of the main differences between TensorFlow and PyTorch is that TensorFlow
uses static computational graphs while PyTorch uses dynamic computational
graphs.

In TensorFlow we first set up the computational graph, then execute the same
graph many times.
"""

# First we set up the computational graph:

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create placeholders for the input and target data; these will be filled
# with real data when we execute the graph.
x = tf.placeholder(tf.float32, shape=(None, D_in))   # input, fed as (64, 1000)
y = tf.placeholder(tf.float32, shape=(None, D_out))  # output, fed as (64, 10)
# tf.placeholder() declares a slot for the real input data and target labels; it
# only allocates the necessary memory. Once a session is running, data is fed
# into the placeholders through the feed_dict argument.
# Compare with the PyTorch version:
#   x = torch.randn(N, D_in, device=device)
#   y = torch.randn(N, D_out, device=device)

# Create Variables for the weights and initialize them with random data.
# A TensorFlow Variable persists its value across executions of the graph.
w1 = tf.Variable(tf.random_normal((D_in, H)))   # input-to-hidden weights (1000, 100)
w2 = tf.Variable(tf.random_normal((H, D_out)))  # hidden-to-output weights (100, 10)
# tf.Variable() is used for trainable, mutable values such as weights and biases,
# and must be given an initial value; tf.constant() creates a constant.
# Compare with the PyTorch version:
#   w1 = torch.randn(D_in, H, device=device, requires_grad=True)
#   w2 = torch.randn(H, D_out, device=device, requires_grad=True)

# Forward pass: Compute the predicted y using operations on TensorFlow Tensors.
# Note that this code does not actually perform any numeric operations; it
# merely sets up the computational graph that we will later execute.
h = tf.matmul(x, w1)                 # matrix multiply; hidden layer (64, 100)
h_relu = tf.maximum(h, tf.zeros(1))  # ReLU activation
y_pred = tf.matmul(h_relu, w2)       # output layer (64, 10)
# Compare with the PyTorch version:
#   h = x.mm(w1)
#   h_relu = h.clamp(min=0)
#   y_pred = h_relu.mm(w2)

# Compute loss using operations on TensorFlow Tensors
loss = tf.reduce_sum((y - y_pred) ** 2.0)
# Compare with the PyTorch version:
#   loss = (y_pred - y).pow(2).sum()

# Compute gradient of the loss with respect to w1 and w2.
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])
# Compare with the PyTorch version:
#   loss.backward()

# Update the weights using gradient descent. To actually update the weights
# we need to evaluate new_w1 and new_w2 when executing the graph. Note that
# in TensorFlow the act of updating the value of the weights is part of
# the computational graph; in PyTorch this happens outside the computational
# graph.
learning_rate = 1e-6
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)
# Compare with the PyTorch version:
#   with torch.no_grad():
#       w1 -= learning_rate * w1.grad
#       w2 -= learning_rate * w2.grad
#       w1.grad.zero_()
#       w2.grad.zero_()

# Now we have built our computational graph, so we enter a TensorFlow session to
# actually execute the graph.
with tf.Session() as sess:
    # Run the graph once to initialize the Variables w1 and w2.
    sess.run(tf.global_variables_initializer())

    # Create numpy arrays holding the actual data for the inputs x and targets y
    x_value = np.random.randn(N, D_in)
    y_value = np.random.randn(N, D_out)
    for _ in range(500):
        # Execute the graph many times. Each time it executes we want to bind
        # x_value to x and y_value to y, specified with the feed_dict argument.
        # Each time we execute the graph we want to compute the values for loss,
        # new_w1, and new_w2; the values of these Tensors are returned as numpy
        # arrays.
        loss_value, _, _ = sess.run([loss, new_w1, new_w2],
                                    feed_dict={x: x_value, y: y_value})
        print(loss_value)
6. PyTorch: nn
Computational graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking derivatives; however, for large neural networks raw autograd can be a bit too low-level.
When building neural networks we frequently think of arranging the computation into layers, some of which have learnable parameters that will be optimized during learning.
In TensorFlow, packages like Keras, TensorFlow-Slim, and TFLearn provide higher-level abstractions over raw computational graphs that are useful for building neural networks.
In PyTorch, the nn package serves this same purpose. The nn package defines a set of Modules, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, but may also hold internal state such as Tensors containing learnable parameters. The nn package also defines a set of useful loss functions that are commonly used when training neural networks.
In this example we use the nn package to implement our two-layer network:
import torch

"""
A fully-connected ReLU network with one hidden layer, trained to predict y from x
by minimizing squared Euclidean distance.

This implementation uses the nn package from PyTorch to build the network.
PyTorch autograd makes it easy to define computational graphs and take gradients,
but raw autograd can be a bit too low-level for defining complex neural networks;
this is where the nn package can help. The nn package defines a set of Modules,
which you can think of as a neural network layer that produces output from
input and may have some trainable weights or other state.
"""

device = torch.device('cpu')
# device = torch.device('cuda')  # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in, device=device)   # input (64, 1000)
y = torch.randn(N, D_out, device=device)  # output (64, 10)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
# After constructing the model we use the .to() method to move it to the
# desired device.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),   # hidden layer output (64, 100)
    torch.nn.ReLU(),            # ReLU activation (64, 100)
    torch.nn.Linear(H, D_out),  # output layer (64, 10)
).to(device)
# This replaces the three hand-written lines:
#   h = x.mm(w1)
#   h_relu = h.clamp(min=0)
#   y_pred = h_relu.mm(w2)
# The model is defined outside the for loop; inside the loop we just call it.

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function. Setting
# reduction='sum' means that we are computing the *sum* of squared errors rather
# than the mean; this is for consistency with the examples above where we
# manually compute the loss, but in practice it is more common to use mean
# squared error as a loss by setting reduction='elementwise_mean'
# (simply reduction='mean' in more recent PyTorch versions).
loss_fn = torch.nn.MSELoss(reduction='sum')
# This replaces: loss = (y_pred - y).pow(2).sum()
# The loss function is defined outside the for loop and called inside it.

learning_rate = 1e-4

for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()
    # This replaces the two lines:
    #   w1.grad.zero_()
    #   w2.grad.zero_()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()
    # Note that the gradients must be zeroed before calling loss.backward(),
    # otherwise the new gradients would be accumulated onto the old ones.

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its data and gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():  # iterate over the model's parameters
            param -= learning_rate * param.grad
7. PyTorch: optim
Up to this point we have updated the weights of our models by manually mutating the Tensors holding learnable parameters. This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers like AdaGrad, RMSProp, Adam, etc.
The optim package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms.
In this example we will use the nn package to define our model as before, but we will optimize the model with the Adam algorithm provided by the optim package:
import torch

"""
A fully-connected ReLU network with one hidden layer, trained to predict y from x
by minimizing squared Euclidean distance.

This implementation uses the nn package from PyTorch to build the network.

Rather than manually updating the weights of the model as we have been doing,
we use the optim package to define an Optimizer that will update the weights
for us. The optim package defines many optimization algorithms that are commonly
used for deep learning, including SGD+momentum, RMSProp, Adam, etc.
"""

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs.
x = torch.randn(N, D_in)   # input (64, 1000)
y = torch.randn(N, D_out)  # output (64, 10)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update; lr sets the learning rate.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the Tensors it will update (which are the learnable weights
    # of the model)
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its parameters
    optimizer.step()
To summarize, each iteration consists of five steps (a reusable skeleton follows the list):
- Forward pass through the model: y_pred = model(x)
- Compute the loss: loss = loss_fn(y_pred, y)
- Zero the gradients: optimizer.zero_grad()
- Backward pass: loss.backward()
- Let the optimizer update the parameters: optimizer.step()
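Putting the five steps together, nearly every PyTorch training loop is a variation of this skeleton (reusing model, loss_fn, optimizer, x, and y from the listing above):

for t in range(500):
    y_pred = model(x)          # 1. forward pass
    loss = loss_fn(y_pred, y)  # 2. compute the loss
    optimizer.zero_grad()      # 3. zero stale gradients
    loss.backward()            # 4. backward pass
    optimizer.step()           # 5. parameter update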
8. PyTorch: Custom nn Modules
Sometimes you will want to specify models that are more complex than a sequence of existing Modules. For these cases you can define your own Module by subclassing nn.Module and defining a forward method that receives input Tensors and produces output Tensors using other Modules or other autograd operations on Tensors.
In this example we implement our two-layer network as a custom Module subclass:
import torch

"""
A fully-connected ReLU network with one hidden layer, trained to predict y from x
by minimizing squared Euclidean distance.

This implementation defines the model as a custom Module subclass. Whenever you
want a model more complex than a simple sequence of existing Modules you will
need to define your model this way.
"""

# Define a custom class inheriting from torch.nn.Module
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):  # constructor: self, then the layer sizes
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()  # standard boilerplate
        self.linear1 = torch.nn.Linear(D_in, H)   # linear layer 1
        self.linear2 = torch.nn.Linear(H, D_out)  # linear layer 2

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary (differentiable) operations on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)   # input (64, 1000)
y = torch.randn(N, D_out)  # output (64, 10)

# Construct our model by instantiating the class defined above.
model = TwoLayerNet(D_in, H, D_out)  # instantiation; arguments match __init__

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)  # calling the model runs forward

    # Compute and print loss
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Subclassing nn.Module like this is the most common way to define models.
Be careful about which arguments go where: constructing the model runs __init__, while calling the model runs forward (see the snippet below).
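A two-line reminder of that distinction, using the TwoLayerNet defined above:

model = TwoLayerNet(D_in, H, D_out)  # instantiation: arguments go to __init__
y_pred = model(x)                    # call: nn.Module.__call__ dispatches to forward(x)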
9. PyTorch: Control Flow and Weight Sharing
As an example of dynamic graphs and weight sharing, we implement a very strange model: a fully-connected ReLU network that on each forward pass chooses a random number between 1 and 4 and uses that many hidden layers, reusing the same weights multiple times to compute the innermost hidden layers.
For this model we can use normal Python flow control to implement the loop, and we can implement weight sharing among the innermost layers simply by reusing the same Module multiple times when defining the forward pass.
We can easily implement this model as a Module subclass:
import random
import torch

"""
To showcase the power of PyTorch dynamic graphs, we will implement a very strange
model: a fully-connected ReLU network that on each forward pass randomly chooses
a number between 1 and 4 and has that many hidden layers, reusing the same
weights multiple times to compute the innermost hidden layers.
"""

# Define a custom network
class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()  # standard boilerplate
        self.input_linear = torch.nn.Linear(D_in, H)    # input layer
        self.middle_linear = torch.nn.Linear(H, H)      # middle layer
        self.output_linear = torch.nn.Linear(H, D_out)  # output layer

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)  # input layer
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)  # reuse the middle layer a random number of times
        y_pred = self.output_linear(h_relu)  # output layer
        return y_pred

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs.
x = torch.randn(N, D_in)   # input (64, 1000)
y = torch.randn(N, D_out)  # output (64, 10)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()