Ridge regression is a biased-estimation regression method designed for analyzing collinear data. It is essentially an improved form of ordinary least squares: by giving up the unbiasedness of least squares, it trades away some information and precision in exchange for regression coefficients that are more realistic and more reliable, and it fits ill-conditioned (collinear) data better than ordinary least squares does.
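To make the trade-off concrete, here is a minimal NumPy sketch (not part of the original text) comparing the closed-form ridge solution w = (XᵀX + αI)⁻¹Xᵀy with the ordinary least-squares solution on nearly collinear synthetic data; the data, sample size, and α = 1.0 are all illustrative assumptions.

```python
import numpy as np

# Synthetic, nearly collinear data: x2 is x1 plus tiny noise.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-3, size=n)
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)

alpha = 1.0
I = np.eye(X.shape[1])

# Ridge closed form: w = (X^T X + alpha * I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + alpha * I, X.T @ y)

# Ordinary least squares: w = (X^T X)^{-1} X^T y (ill-conditioned here)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("OLS coefficients:  ", w_ols)
print("Ridge coefficients:", w_ridge)
```

On collinear inputs the OLS coefficients tend to blow up with opposite signs, while the ridge penalty shrinks them toward small, stable values that share the weight between the two correlated features.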
In scikit-learn, the ridge regression algorithm is implemented by the Ridge class.
Let's validate the ridge regression algorithm on the Boston housing dataset.
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

# load the data
filename = 'data/boston_housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
         'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
data = read_csv(filename, names=names, delim_whitespace=False)
data = data.dropna(axis=0, how='any')

# split the data into input features and target
array = data.values
X = array[:, 0:13]
Y = array[:, 13]

n_splits = 10
seed = 7
kfold = KFold(n_splits=n_splits, random_state=seed, shuffle=True)
model = Ridge()
scoring = 'neg_mean_squared_error'
result = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print('Ridge Regression:%.3f' % result.mean())
The output is as follows:
Ridge Regression:-22.304
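The code above uses Ridge's default regularization strength (alpha=1.0). A natural next step, sketched below, is to search over alpha with GridSearchCV using the same 10-fold setup; since the CSV path in the example is local, this sketch substitutes synthetic regression data, so the grid values and dataset are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the Boston data (13 features, like the CSV above).
X, Y = make_regression(n_samples=200, n_features=13, noise=10.0,
                       random_state=7)

# Same cross-validation scheme as the main example.
kfold = KFold(n_splits=10, random_state=7, shuffle=True)

# Candidate regularization strengths to try (illustrative values).
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
grid = GridSearchCV(Ridge(), param_grid, cv=kfold,
                    scoring='neg_mean_squared_error')
grid.fit(X, Y)

print('Best alpha:', grid.best_params_['alpha'])
print('Best score: %.3f' % grid.best_score_)
```

Because the scoring is negative mean squared error, scores closer to zero are better; that is also why the result above (-22.304) is negative.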