Like logistic regression, Softmax regression is used for classification. The difference is that logistic regression handles binary classification, and multi-class problems must be reduced to it via strategies such as OvO, OvR, or MvM, whereas softmax regression handles multi-class problems directly.
In binary classification, the label \(y\) can simply be written as {0, 1}, but multi-class problems need another representation. For the categories {infant, child, teenager, young adult, middle-aged, elderly}, it is natural to use the labels {1, 2, 3, 4, 5, 6}; this is reasonable here because the categories have a clear ordering, so the numbers carry meaning. For {pencil, fountain pen, gel pen}, however, ordered numeric labels make no sense. In general, therefore, one-hot encoding is used:
\[y \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}. \]Like logistic regression, softmax regression builds on linear regression: for each sample it predicts one score per class, converts the scores into "probabilities" with the softmax function, and then picks the class with the largest predicted "probability".
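A minimal sketch of how such one-hot vectors can be built (plain NumPy, purely for illustration):

```python
import numpy as np

# Class indices for, e.g., pencil, gel pen, fountain pen
labels = np.array([0, 2, 1])

# Row j of the identity matrix is the one-hot vector for class j
one_hot = np.eye(3)[labels]
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```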
\[\vec{o}_i = W\vec{x}_i + \vec{b} \]In non-matrix form (here with 4 features and 3 classes):
\begin{split}\begin{aligned}
o_1 &= x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1,\\
o_2 &= x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2,\\
o_3 &= x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3.
\end{aligned}\end{split}
A neural-network diagram shows this process more clearly:
The prediction is the vector \(\vec{o}\), whose entries can be any real numbers, so we apply softmax to transform it into something that can be interpreted as probabilities:
\[\hat{y}_i = \mathrm{Softmax}(\vec{o}_i) \quad\text{where}\quad \hat{y}_j = \frac{\exp(o_j)}{\sum\limits_{a=1}^k \exp(o_a)},\quad j = 1,2,\dots,k \]For example, if \(\vec{o}_i=(1,2,3)\), then \(\hat{y}_i=(\frac e{e+e^2+e^3},\frac {e^2}{e+e^2+e^3},\frac {e^3}{e+e^2+e^3})\). The elements of \(\hat{y}_i\) sum to 1, so they can be read as the probabilities that sample \(i\) belongs to each of the three classes, and we choose the class with the largest probability as the final prediction.
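The worked example above can be checked numerically (a small sketch, not part of the implementation below):

```python
import numpy as np

def softmax(o):
    """Softmax over a score vector o."""
    e = np.exp(o)
    return e / e.sum()

y_hat = softmax(np.array([1.0, 2.0, 3.0]))
print(y_hat)          # approx [0.0900, 0.2447, 0.6652]
print(y_hat.sum())    # 1.0
```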
The previous section introduced the basic idea of Softmax regression; the remaining problem is how to compute the parameter matrix \(W\) and bias vector \(\vec{b}\) (for convenience, \(\vec{b}\) can be absorbed into \(W\) by appending a constant feature 1 to each \(\vec{x}_i\), which is what the implementation below does). Suppose we have a sample matrix \(X_{m\times n}\) with sample size \(m\) and \(n\) features, along with the label matrix \(Y_{m\times k}\), where \(k\) is the number of classes. As with logistic regression, maximum likelihood estimation gives the likelihood
\[L=\prod\limits_{i=1}^m P(\vec{y}_i\mid\vec{x}_i) \]so the negative log-likelihood (the log loss) is
\[-\ln L = \sum\limits_{i=1}^m-\ln P(\vec{y}_i\mid\vec{x}_i)=\sum\limits_{i=1}^m l(\vec{y}_i,\hat {y}_i) \]where
\[l(\vec y_i,\hat y_i)=-\vec y_i \cdot \ln\hat y_i=-\sum\limits_{j=1}^k y_j^{(i)}\ln\hat y_j^{(i)} \]Here \(\vec y_i\) is the label vector of the \(i\)-th sample, and \(\hat y_j^{(i)}\) is the \(j\)-th entry of the prediction vector for sample \(i\).
Note that the conditional probability inside the log loss is just the entry of the predicted probability vector at the position where the one-hot label is 1, which is what allows it to be written compactly as \(l(\vec y_i,\hat y_i)\).
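A tiny numeric illustration of this point, with hypothetical values:

```python
import numpy as np

# Hypothetical prediction for one sample whose one-hot label is (0, 1, 0)
y = np.array([0.0, 1.0, 0.0])
y_hat = np.array([0.2, 0.7, 0.1])

# l(y, y_hat) = -sum_j y_j * ln(y_hat_j) picks out -ln of the entry where y_j = 1
loss = -np.sum(y * np.log(y_hat))
print(loss)   # equals -ln(0.7), about 0.357
```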
Simplify \(l(\vec y_i,\hat y_i)\) as follows:
\[\begin{split}l(\vec y_i,\hat y_i)&=-\sum\limits_{j=1}^k y_j\ln\frac{\exp(o_j)}{\sum\limits_{a=1}^k \exp(o_a)}\\ &=\sum\limits_{j=1}^k y_j\ln\frac{\sum\limits_{a=1}^k \exp(o_a)}{\exp(o_j)}\\ &=\sum\limits_{j=1}^k y_j\ln{\sum\limits_{a=1}^k \exp(o_a)}-{\sum\limits_{j=1}^k y_jo_j}\\ &=\ln{\sum\limits_{a=1}^k \exp(o_a)}-{y_bo_b}\end{split} \]In the last step, \(b\) denotes the position where the one-hot label is 1, and \(\sum_j y_j = 1\). Differentiating,
\[\frac {\partial l(\vec y_i,\hat y_i)}{\partial o_j}=\frac{\exp(o_j)}{\sum\limits_{a=1}^k\exp(o_a)}-y_j=\mathrm{Softmax}(o_j)-y_j \]or in vector form
\[\frac {\partial l(\vec y_i,\hat y_i)}{\partial \vec o_i}=\mathrm{Softmax}(\vec o_i)-\vec y_i \]Since
\[dl_i = \operatorname{tr}\left(\left(\frac{\partial l_i}{\partial \vec o_i}\right)^{T} d\vec o_i\right)= \operatorname{tr}\left(\left(\frac{\partial l_i}{\partial \vec o_i}\right)^{T} dW\,\vec x_i\right)= \operatorname{tr}\left(\left(\frac{\partial l_i}{\partial \vec o_i}\vec x_i^T\right)^{T} dW\right) \]we have
\[\frac{\partial l_i}{\partial W}=\frac{\partial l_i}{\partial \vec o_i}\vec x_i^T=[\mathrm{Softmax}(W\vec x_i)-\vec y_i]\vec x_i^T \]\[\frac {\partial (-\ln L)}{\partial W}=\sum\limits_{i=1}^m[\mathrm{Softmax}(W\vec x_i)-\vec y_i]\vec x_i^T=[\mathrm{Softmax}(WX^T)-y^T]X \]and the gradient-descent update is
\[W \leftarrow W-\alpha\,[\mathrm{Softmax}(WX^T)-y^T]X \]where
\[X=\begin{bmatrix} \vec x_1^T \\ \vec x_2^T \\ \vdots \\ \vec x_m^T\end{bmatrix},\quad y=\begin{bmatrix} \vec y_1^T \\ \vec y_2^T\\ \vdots \\ \vec y_m^T\end{bmatrix} \]and \(\vec x_i\), \(\vec y_i\) are column vectors.
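As a sanity check on this derivation, the sketch below (self-contained, with random data rather than the dataset used later) compares the analytic gradient \(\mathrm{Softmax}(\vec o_i)-\vec y_i\) against central finite differences, and the vectorized batch gradient against the per-sample sum:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 5, 4, 3                        # samples, features, classes
X = rng.normal(size=(m, n))              # row i is x_i^T
Y = np.eye(k)[rng.integers(0, k, m)]     # row i is the one-hot y_i^T
W = rng.normal(size=(k, n))

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def loss(o, y):
    return -np.sum(y * np.log(softmax(o)))   # cross-entropy for one sample

# 1) dl/do = Softmax(o) - y, checked by central differences
o, y = W @ X[0], Y[0]
analytic = softmax(o) - y
eps = 1e-6
numeric = np.array([
    (loss(o + eps * np.eye(k)[j], y) - loss(o - eps * np.eye(k)[j], y)) / (2 * eps)
    for j in range(k)
])
assert np.allclose(analytic, numeric, atol=1e-5)

# 2) [Softmax(W X^T) - Y^T] X equals the sum of per-sample outer products
def softmax_cols(O):
    E = np.exp(O - O.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)     # column-wise softmax

grad_batch = (softmax_cols(W @ X.T) - Y.T) @ X
grad_loop = sum(np.outer(softmax(W @ X[i]) - Y[i], X[i]) for i in range(m))
assert np.allclose(grad_batch, grad_loop)
print("gradient checks passed")
```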
Here we use the image dataset from 李沐 (Mu Li)'s course.
```python
import matplotlib.pyplot as plt
%matplotlib inline
import torch
import torchvision
from torch.utils import data
from torchvision import transforms
import warnings
warnings.filterwarnings('ignore')
```

```python
# The ToTensor transform converts the PIL image to 32-bit floats
# and divides by 255 so that all pixel values lie in [0, 1]
trans = transforms.ToTensor()
mnist_train = torchvision.datasets.FashionMNIST(root="./data", train=True, transform=trans, download=True)
mnist_test = torchvision.datasets.FashionMNIST(root="./data", train=False, transform=trans, download=True)
```
Once the dataset is downloaded, we can see that the training set contains 60000 samples. mnist_train[0] returns the first sample as a pair: the first item is the image data, a [1, 28, 28] tensor, and the second is the label. The image can be displayed with plt.imshow.
len(mnist_train)
60000
len(mnist_train[0])
2
mnist_train[0][0].shape
torch.Size([1, 28, 28])
mnist_train[0][1]
9
```python
plt.imshow(mnist_train[0][0][0])
plt.show()
```
In sklearn's LogisticRegression class, setting the multi_class parameter to 'multinomial' makes it perform softmax regression.
```python
from sklearn.linear_model import LogisticRegression

# Load the full datasets as single batches
X_train, y_train = next(iter(data.DataLoader(mnist_train, batch_size=len(mnist_train))))
X_train = X_train.reshape((len(mnist_train), -1))
X_test, y_test = next(iter(data.DataLoader(mnist_test, batch_size=len(mnist_test))))
X_test = X_test.reshape((len(mnist_test), -1))

# Train the model
soft_sk = LogisticRegression(multi_class='multinomial').fit(X_train.numpy(), y_train.numpy())

# Accuracy on the training and test sets
soft_sk.score(X_train, y_train), soft_sk.score(X_test, y_test)
```
(0.8659833333333333, 0.8438)
The parameters are optimized with gradient descent following the method of Section 2. Since computing Softmax easily overflows, a very small learning rate is used. Even so, I have not been able to push the accuracy above 80%; suggestions from readers are welcome.
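The overflow mentioned above comes from taking \(\exp\) of large scores. The standard remedy, subtracting the maximum score before exponentiating, leaves the result unchanged because softmax is invariant to adding a constant to every score. A minimal sketch:

```python
import numpy as np

o = np.array([10.0, 1000.0, 1000.0])

with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(o) / np.exp(o).sum()   # exp(1000) overflows to inf -> nan entries

shifted = np.exp(o - o.max())             # shift every score by max(o) first
stable = shifted / shifted.sum()          # same value in exact arithmetic

print(naive)    # [0., nan, nan]
print(stable)   # [~0., 0.5, 0.5]
```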
```python
import random

import numpy as np
import pandas as pd
import torchvision
from torch.utils import data
from torchvision import transforms


class Softmax:
    def __init__(self, X, y, batch_size=5, epoch=3, alpha=0.00001):
        # prepend a constant-1 feature so the bias is folded into W
        self.features = np.array(np.insert(X, 0, 1, axis=1))
        self.labels_original = y
        self.labels = pd.get_dummies(self.labels_original).values  # one-hot labels
        self.batch = batch_size
        self.epoch = epoch
        self.alpha = alpha
        self.n_class = len(y.unique())
        self.n_features = self.features.shape[1]
        self.W = np.random.normal(0, 0.01, (self.n_class, self.n_features))

    def softmax(self, X):
        # X has shape (n_class, batch); normalize over the class axis (axis=0),
        # subtracting the per-column max first to avoid overflow in exp
        X = np.array(X)
        X = X - X.max(axis=0, keepdims=True)
        return np.exp(X) / np.sum(np.exp(X), axis=0, keepdims=True)

    def data_iter(self):
        range_list = np.arange(self.features.shape[0])
        random.shuffle(range_list)
        for i in range(0, len(range_list), self.batch):
            batch_indices = range_list[i:min(i + self.batch, len(range_list))]
            yield self.features[batch_indices], self.labels[batch_indices]

    def fit(self):
        for i in range(self.epoch):
            for X, y in self.data_iter():
                # gradient step: W -= alpha * [Softmax(W X^T) - y^T] X
                self.W -= self.alpha * np.matmul(self.softmax(np.matmul(self.W, X.T)) - y.T, X)

    def predict(self, X_pre):
        X_pre = np.array(np.insert(X_pre, 0, 1, axis=1))
        return np.argmax(self.softmax(np.matmul(self.W, X_pre.T)), axis=0)

    def score(self, y_true, y_pre):
        return np.sum(np.ravel(y_true) == np.ravel(y_pre)) / len(y_true)


def main():
    trans = transforms.ToTensor()
    mnist_train = torchvision.datasets.FashionMNIST(root="./data", train=True, transform=trans, download=True)
    mnist_test = torchvision.datasets.FashionMNIST(root="./data", train=False, transform=trans, download=True)
    X_train, y_train = next(iter(data.DataLoader(mnist_train, batch_size=len(mnist_train))))
    X_train = X_train.reshape((len(mnist_train), -1))
    X_test, y_test = next(iter(data.DataLoader(mnist_test, batch_size=len(mnist_test))))
    X_test = X_test.reshape((len(mnist_test), -1))
    soft_max = Softmax(X_train, y_train)
    soft_max.fit()
    y_train_pre = soft_max.predict(X_train)
    y_test_pre = soft_max.predict(X_test)
    print(f"Training accuracy: {soft_max.score(y_train, y_train_pre)}")
    print(f"Test accuracy: {soft_max.score(y_test, y_test_pre)}")


if __name__ == '__main__':
    main()
```
Training accuracy: 0.6503333333333333
Test accuracy: 0.6419
The same model can also be implemented concisely with PyTorch's nn module:

```python
import torch
import torchvision
from torch import nn
from torch.utils import data
from torchvision import transforms


class SoftmaxPytorch:
    def __init__(self, X, y, batch_size=256, epoch=5, lr=0.1):
        self.features = torch.tensor(X)
        self.labels = torch.tensor(y).reshape(-1, 1)
        self.batch = batch_size
        self.epoch = epoch
        self.lr = lr
        self.n_features = self.features.shape[1]
        self.n_class = len(self.labels.unique())
        # CrossEntropyLoss applies softmax internally, so the net only outputs scores
        self.loss = nn.CrossEntropyLoss()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(self.n_features, self.n_class))
        self.trainer = torch.optim.SGD(self.net.parameters(), self.lr)

    def data_iter(self):
        dataset = data.TensorDataset(self.features, self.labels)
        return data.DataLoader(dataset, self.batch, shuffle=True)

    def init_weights(self, model):
        if type(model) == nn.Linear:
            nn.init.normal_(model.weight, std=0.01)

    def fit(self):
        self.net.apply(self.init_weights)
        for i in range(self.epoch):
            for X, y in self.data_iter():
                y_hat = self.net(X)
                l = self.loss(y_hat, y.ravel())
                self.trainer.zero_grad()
                l.backward()
                self.trainer.step()
            print(f'epoch:{i},loss:{self.loss(self.net(self.features), self.labels.ravel())}')

    def predict(self, X_pre):
        y_hat = self.net(X_pre)
        return torch.argmax(y_hat, axis=1)

    def score(self, y_hat, y_true):
        return sum(y_hat.type(y_true.dtype).ravel() == y_true.ravel()) / len(y_true)


def main():
    trans = transforms.ToTensor()
    mnist_train = torchvision.datasets.FashionMNIST(root="./data", train=True, transform=trans, download=True)
    mnist_test = torchvision.datasets.FashionMNIST(root="./data", train=False, transform=trans, download=True)
    X_train, y_train = next(iter(data.DataLoader(mnist_train, batch_size=len(mnist_train))))
    X_train = X_train.reshape((len(mnist_train), -1))
    X_test, y_test = next(iter(data.DataLoader(mnist_test, batch_size=len(mnist_test))))
    X_test = X_test.reshape((len(mnist_test), -1))
    sf = SoftmaxPytorch(X_train, y_train)
    sf.fit()
    y_train_pre = sf.predict(X_train)
    train_score = sf.score(y_train_pre, y_train)
    y_test_pre = sf.predict(X_test)
    test_score = sf.score(y_test_pre, y_test)
    print(f'Training accuracy: {train_score}')
    print(f'Test accuracy: {test_score}')


if __name__ == '__main__':
    main()
```
epoch:0,loss:0.6310750842094421
epoch:1,loss:0.5460468530654907
epoch:2,loss:0.5175894498825073
epoch:3,loss:0.49569806456565857
epoch:4,loss:0.473165899515152
Training accuracy: 0.84211665391922
Test accuracy: 0.8271999955177307