Cross-Validation in Sklearn

Converted by SimpRead; original article: https://www.jianshu.com/p/a4e94e72a46d

For linear regression:
Method 1: the old cross_validation module offered train_test_split, which now lives in model_selection. It randomly partitions the data into training and test sets; by default, 25 percent of the data is assigned to the test set. Because this evaluates only a single random split, the resulting score is not a reliable estimate.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
model = LinearRegression()
model.fit(X_train, y_train)   # fit on the training set only
model.score(X_test, y_test)   # evaluate on the held-out test set
```

Method 2: use cross_val_score from the model_selection module.

```python
from sklearn.model_selection import cross_val_score, ShuffleSplit

model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation

# Alternatively, pass a splitter object as cv:
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
cross_val_score(model, X, y, cv=cv)
```
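As a runnable sketch (the synthetic data from make_regression is an illustration, not part of the original post), cross_val_score returns one score per fold, and the fold scores are typically averaged:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, assumed purely for illustration
X, y = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=0)

model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5)  # five R^2 scores, one per fold
print(scores.shape)   # (5,)
print(scores.mean())  # averaged estimate of generalization performance
```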

For logistic regression:
Logistic regression handles classification problems. Linear regression's evaluation, based on how far a prediction falls from the decision boundary (a distance), is clearly unsuitable for classification.
The most common metrics are accuracy, precision, recall, and the F1 measure, all built from counts of true positives, true negatives, false positives, and false negatives.
1. Compute the confusion matrix
The confusion matrix is made up of the counts of true positives, true negatives, false positives, and false negatives.

```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)  # use a name that doesn't shadow the function
```
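sklearn lays out the binary confusion matrix with true classes as rows and predicted classes as columns, i.e. [[TN, FP], [FN, TP]]. A toy example (the labels are made up for illustration):

```python
from sklearn.metrics import confusion_matrix

# Invented labels, purely to show the layout
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[2 1]
           #  [1 2]]  -> 2 TN, 1 FP, 1 FN, 2 TP
```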

2. accuracy: measures the fraction of the classifier's predictions that are correct.

```python
from sklearn.metrics import accuracy_score

accuracy_score(y_true, y_pred)
```
LogisticRegression.score() uses accuracy by default.
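A toy illustration (labels invented for the example): accuracy is simply the number of correct predictions divided by the total:

```python
from sklearn.metrics import accuracy_score

# Invented labels: 3 of 4 predictions match the true labels
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

print(accuracy_score(y_true, y_pred))  # 0.75
```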

3. precision: for example, of the cases we predicted to have cancer, the percentage that actually have the disease.

```python
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
```

4. recall: for example, of the cases that actually have cancer, the percentage we managed to predict.

```python
recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
```

5. Precision and recall trade off against each other; the F1 score summarizes both, and a higher F1 score is better.

```python
f1s = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
```
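The F1 score is the harmonic mean of precision and recall. A toy check (labels invented for illustration) that f1_score matches the formula 2PR/(P+R):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Invented labels: 3 TP, 1 FP, 1 FN
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]

p = precision_score(y_true, y_pred)  # 3 / (3 + 1) = 0.75
r = recall_score(y_true, y_pred)     # 3 / (3 + 1) = 0.75
print(f1_score(y_true, y_pred))      # 0.75, equal to 2*p*r/(p+r)
```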

6. The ROC curve and the AUC value
The ROC curve plots the false positive rate (FPR) on the x-axis against the true positive rate (TPR) on the y-axis.
AUC = the area under the ROC curve.

```python
from sklearn.metrics import roc_curve, auc

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict_proba(X_test)
false_positive_rate, recall, thresholds = roc_curve(y_test, predictions[:, 1])
roc_auc = auc(false_positive_rate, recall)
```
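Putting the pieces together, a self-contained sketch (the make_classification data set is an assumption for illustration) that fits a classifier, builds the ROC curve from predicted probabilities, and integrates it to get the AUC:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, assumed for illustration
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
probs = classifier.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)
roc_auc = auc(fpr, tpr)  # area under the ROC curve, between 0 and 1
print(roc_auc)
```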