Back to Regression with PyTorch

Let's start with a simple, thoroughly unremarkable linear regression model implemented in PyTorch:

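import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Assumes a dataset of tensors `inputs` and `labels` already exists,
# e.g. each of shape (N, 1) to match nn.Linear(1, 1)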
dataLoader = DataLoader(TensorDataset(inputs, labels), batch_size=32, shuffle=True)

class LinearRegression(nn.Module):
    def __init__(self):
        super(LinearRegression, self).__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

model = LinearRegression()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

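for epoch in range(1000):
    # Use distinct names so the batches do not shadow the dataset tensors
    for batch_inputs, batch_labels in dataLoader:
        optimizer.zero_grad()
        # Call the module itself rather than .forward() so that hooks still run
        loss = criterion(model(batch_inputs), batch_labels)
        loss.backward()
        optimizer.step()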

At first glance this looks simple, but it raises a couple of questions:

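The loss function criterion and the optimizer optimizer appear to have no direct connection, so how does the optimizer know what the loss is at each step? More to the point, how does it know how to update the parameters?
SGD presumably stands for Stochastic Gradient Descent, yet the SGD object never touches the dataset. How can it control the batch size used at each step?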

TL;DR

Here are the answers up front; the explanations follow:

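The link between criterion and optimizer is established by loss.backward(): backpropagating from the loss computes the gradient of every parameter and stores it on the parameter, and the optimizer then updates each parameter using that stored gradient.
The name SGD is misleading. PyTorch's SGD would be better called GD, because the randomness actually comes from the DataLoader: each step takes one batch from the DataLoader, computes the gradient, and updates the parameters. Whether that amounts to batch, mini-batch, or stochastic gradient descent depends on the DataLoader's batch_size (together with shuffle=True).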
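
The first answer can be checked directly. The following is a minimal sketch, assuming a toy three-sample dataset and a learning rate of 0.1 chosen only for illustration. It shows that the parameters have no gradient before backward(), and that optimizer.step() produces exactly p - lr * p.grad without ever being handed the loss:

```python
import torch
from torch import nn

torch.manual_seed(0)

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
lr = 0.1  # illustrative value, not taken from the original example
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

# Toy data, made up for this sketch
x = torch.tensor([[1.0], [2.0], [3.0]])
y = torch.tensor([[2.0], [4.0], [6.0]])

# Before any backward pass the parameters carry no gradient.
grads_before = [p.grad for p in model.parameters()]

loss = criterion(model(x), y)
loss.backward()  # writes d(loss)/d(param) into each param.grad

# What a plain gradient-descent step should produce: p - lr * p.grad
expected = [p.detach() - lr * p.grad for p in model.parameters()]

optimizer.step()  # never sees `loss`; it only reads param.grad

matches = all(
    torch.allclose(p.detach(), e)
    for p, e in zip(model.parameters(), expected)
)
print(grads_before, matches)
```

With the default settings used here (no momentum, no weight decay), the optimizer's update is just this subtraction.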

Tensor

To answer the first question, we need to look under the hood and see what torch.nn.Linear is built on. The source code tells us:

I have removed some parts of the source code below to make it easier to follow.

class Linear(Module):
    __constants__ = ['in_features', 'out_features']
    in_features: int
    out_features: int
    weight: Tensor
    bias: Optional[Tensor]

    def __init__(self, in_features: int, out_features: int, bias: bool = True) -> None:
        super(Linear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = Parameter(torch.empty((out_features, in_features)))
        if bias:
            self.bias = Parameter(torch.empty(out_features))
        else:
            self.register_parameter('bias', None)
        self.reset_parameters()

    def forward(self, input: Tensor) -> Tensor:
        return F.linear(input, self.weight, self.bias)

The key takeaway is that weight and bias, i.e. the $W$ and $b$ in $y = X \cdot W^T + b$, are torch.Tensor objects.

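More precisely, weight and bias are torch.nn.Parameter objects. This class inherits from torch.Tensor, and a Parameter assigned as an attribute of a Module is automatically added to that module's parameter list.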
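
A small sketch of that registration behavior. The class name WithPlainTensor and its attributes are hypothetical and exist only for this illustration. A Parameter attribute shows up in named_parameters(), while a plain tensor attribute does not:

```python
import torch
from torch import nn


class WithPlainTensor(nn.Module):  # hypothetical module for illustration
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(1, 1))  # registered as a parameter
        self.scale = torch.ones(1)                # plain tensor: not registered


m = WithPlainTensor()
names = [name for name, _ in m.named_parameters()]
is_tensor = isinstance(m.w, torch.Tensor)  # Parameter is a Tensor subclass
requires_grad = m.w.requires_grad          # Parameters require grad by default
print(names, is_tensor, requires_grad)
```

This is why model.parameters() in the first example is enough to hand the optimizer everything it needs to update.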

So what is a tensor?

According to the PyTorch documentation, "A torch.Tensor is a multi-dimensional matrix containing elements of a single data type."

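That does not resolve our question, though. If a tensor were just a multi-dimensional matrix, why not use numpy.ndarray directly? Granted, a torch.Tensor can live on a GPU, but that is not the whole story.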

The other feature I have in mind is automatic differentiation.

The beginner tutorial Automatic Differentiation with torch.autograd explains this well, so I will not repeat it here.

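In short, whenever we perform operations (addition, multiplication, and so on) on torch.Tensor objects that require gradients, PyTorch builds a computation graph in the background that records every operation and its dependencies.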

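The tensors involved all belong to the same computation graph, so calling backward() on the final tensor propagates gradients backward through the graph and computes the gradient of every parameter.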
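
The graph can be inspected through each tensor's grad_fn attribute. This is a minimal sketch with made-up values. The exact node class names are an internal detail and may differ between PyTorch versions, so treat the printed names as indicative only:

```python
import torch

w = torch.tensor([[2.0]], requires_grad=True)
b = torch.tensor([[1.0]], requires_grad=True)
x = torch.tensor([[1.0], [2.0]])
target = torch.tensor([[3.0], [5.0]])

y = x * w + b
loss = ((y - target) ** 2).mean()

# Leaf tensors created by the user have no grad_fn.
leaf_has_no_grad_fn = w.grad_fn is None

# Results of operations remember the operation that produced them.
y_op = type(y.grad_fn).__name__        # the addition node
loss_op = type(loss.grad_fn).__name__  # the mean node

print(leaf_has_no_grad_fn, y_op, loss_op)
```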

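As an example, suppose we have $y = X \cdot W^T + b$ and the loss function $L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $y_i$ is the model's output for sample $i$ and $\hat{y}_i$ is the corresponding label.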

Then, at the final node $L$, we can compute $\frac{\partial L}{\partial y}$. We pass this partial derivative back to the previous node $y$, where we can compute $\frac{\partial y}{\partial W}$ and $\frac{\partial y}{\partial b}$.

By the chain rule we obtain $\frac{\partial L}{\partial W}$ and $\frac{\partial L}{\partial b}$, and these two gradients are stored in the .grad attribute of $W$ and $b$.
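
For the scalar case used in this post (one input feature, one output), the chain rule can be written out explicitly. The following is a worked sketch, assuming $y_i = w x_i + b$ is the model's output for sample $i$ and $\hat{y}_i$ is its label, matching the loss above:

$$
\frac{\partial L}{\partial y_i} = \frac{2}{n}\,(y_i - \hat{y}_i), \qquad
\frac{\partial y_i}{\partial w} = x_i, \qquad
\frac{\partial y_i}{\partial b} = 1
$$

$$
\frac{\partial L}{\partial w} = \sum_{i=1}^{n} \frac{\partial L}{\partial y_i}\,\frac{\partial y_i}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)\,x_i, \qquad
\frac{\partial L}{\partial b} = \sum_{i=1}^{n} \frac{\partial L}{\partial y_i}\,\frac{\partial y_i}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)
$$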

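As you can see, each tensor only needs to remember the operation that produced it and the tensors it depends on for the gradients to be computed.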
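
To make this concrete, here is a short sketch that compares the gradients autograd produces with the ones obtained by applying the chain rule by hand to the mean squared error of a one-feature linear model. The data and starting values are invented for illustration:

```python
import torch

x = torch.tensor([[1.0], [2.0], [3.0]])
target = torch.tensor([[2.0], [4.0], [6.0]])
w = torch.tensor([[0.5]], requires_grad=True)
b = torch.tensor([[0.0]], requires_grad=True)

y = x * w + b
loss = torch.mean((y - target) ** 2)
loss.backward()  # autograd fills w.grad and b.grad

# Hand-derived gradients of the mean squared error:
#   dL/dw = (2/n) * sum((y_i - target_i) * x_i)
#   dL/db = (2/n) * sum(y_i - target_i)
with torch.no_grad():
    residual = y - target
    n = x.shape[0]
    manual_dw = (2.0 / n) * (residual * x).sum().reshape(1, 1)
    manual_db = (2.0 / n) * residual.sum().reshape(1, 1)

w_ok = torch.allclose(w.grad, manual_dw)
b_ok = torch.allclose(b.grad, manual_db)
print(w_ok, b_ok)
```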

In fact, for a demo we can implement all of this ourselves:

from typing import List

import torch
from torch import Tensor

class LinearRegression:
    def __init__(self):
        self.w = torch.randn(size=(1, 1), requires_grad=True)
        self.b = torch.randn(size=(1, 1), requires_grad=True)

    def forward(self, x: Tensor) -> Tensor:
        # Q: b has shape (1, 1) while w * x has shape (100, 1). Why can they be added?
        # A: PyTorch broadcasting automatically expands (1, 1) to (100, 1) before the addition
        return self.w * x + self.b

    def __call__(self, x: Tensor) -> Tensor:
        return self.forward(x)

    def parameters(self) -> List[Tensor]:
        return [self.w, self.b]


def mean_square_loss(y_pred: Tensor, y: Tensor) -> Tensor:
    return torch.mean((y_pred - y) ** 2)

class Optimizer:
    def __init__(self, parameters, lr: float):
        self.parameters = parameters
        self.lr = lr
    
    def step(self) -> None:
        with torch.no_grad():
            for param in self.parameters:
                param -= self.lr * param.grad

    def zero_grad(self) -> None:
        for param in self.parameters:
            # .grad is None until the first backward() call
            if param.grad is not None:
                param.grad.zero_()
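
To see these pieces working end to end, here is a condensed, self-contained sketch that uses bare w and b tensors in place of the classes above. The synthetic data (y = 3x + 2), the learning rate, and the step count are assumptions chosen for illustration. The whole dataset is used at every step, so this is plain batch gradient descent:

```python
import torch

torch.manual_seed(0)

# Synthetic, noise-free data for illustration: y = 3x + 2
x = torch.linspace(-1, 1, 100).reshape(100, 1)
y = 3 * x + 2

w = torch.randn(1, 1, requires_grad=True)
b = torch.randn(1, 1, requires_grad=True)
lr = 0.1

for step in range(500):
    # Forward pass and mean squared error (b broadcasts to (100, 1))
    loss = torch.mean((w * x + b - y) ** 2)
    loss.backward()

    # The "optimizer step": update in place without recording the update itself
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad

    # The "zero_grad": gradients accumulate across backward() calls otherwise
    w.grad.zero_()
    b.grad.zero_()

print(w.item(), b.item())
```

The learned values should end up close to the slope 3 and intercept 2 used to generate the data.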

SGD in torch.optim is actually GD

For the second question, let's look at an answer on StackOverflow:

Your understanding is correct. SGD is just updating weights based on the gradient computed by back propagation. The flavor of gradient descent that it performs is therefore determined by the data loader.

Gradient descent (aka batch gradient descent): Batch size equal to the size of the entire training dataset.
Stochastic gradient descent: Batch size equal to one and shuffle=True.
Mini-batch gradient descent: Any other batch size and shuffle=True. By far the most common in practical applications.

There is also a discussion thread on the PyTorch Forums worth reading.

I think this reply puts it well:

Yeah - newcomer to PyTorch here and I find the SGD name really confusing too. I understand SGD as gradient descent with a batch size of 1, but in reality the batch size is determined by the user. So I agree that it would be much less confusing if it was named just GD because that’s what it is.

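In short, PyTorch's SGD optimizer is really a GD optimizer. The randomness comes from the DataLoader: each step takes one batch from it, computes the gradient, and updates the parameters. Whether the result is batch, mini-batch, or stochastic gradient descent depends on the DataLoader's batch_size.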
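
The three flavors differ only in how the DataLoader is configured. Here is a small sketch, assuming a made-up dataset of 100 random samples and a hypothetical helper steps_per_epoch, that counts how many optimizer steps one epoch would contain for each setting:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up dataset of 100 samples, for illustration only
dataset = TensorDataset(torch.randn(100, 1), torch.randn(100, 1))


def steps_per_epoch(batch_size: int) -> int:
    """Number of batches (and hence optimizer steps) in one epoch."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    return len(loader)


full_batch = steps_per_epoch(len(dataset))  # batch gradient descent: 1 step
stochastic = steps_per_epoch(1)             # stochastic gradient descent: 100 steps
mini_batch = steps_per_epoch(32)            # mini-batch: 4 steps, the last with 4 samples

print(full_batch, stochastic, mini_batch)
```

The optimizer object is identical in all three cases. Only the batches it is fed change.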

Another oddly named class is torch.nn.CrossEntropyLoss, which is actually log-softmax followed by a negative log likelihood loss, rather than cross-entropy alone.

See one of the answers to this StackOverflow question:

I would like to add an important note, as this often leads to confusion.

Softmax is not a loss function, nor is it really an activation function. It has a very specific task: It is used for multi-class classification to normalize the scores for the given classes. By doing so we get probabilities for each class that sum up to 1.

Softmax is combined with Cross-Entropy-Loss to calculate the loss of a model.

Unfortunately, because this combination is so common, it is often abbreviated. Some are using the term Softmax-Loss, whereas PyTorch calls it only Cross-Entropy-Loss.
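
The point about torch.nn.CrossEntropyLoss can be verified numerically. This minimal sketch uses random logits and arbitrary class labels, both made up for illustration. It compares CrossEntropyLoss applied to raw logits with NLLLoss applied to the output of LogSoftmax:

```python
import torch
from torch import nn

torch.manual_seed(0)

logits = torch.randn(4, 3)             # 4 samples, 3 classes, raw scores
targets = torch.tensor([0, 2, 1, 2])   # arbitrary class indices

# CrossEntropyLoss takes raw logits, not probabilities
ce = nn.CrossEntropyLoss()(logits, targets)

# The same computation done in two explicit stages
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)

same = torch.allclose(ce, nll)
print(same)
```

A practical consequence is that a model trained with CrossEntropyLoss should output raw logits. Adding a softmax layer before the loss would apply the normalization twice.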