Data Mining from Scratch: Financial Risk Control and Loan Default Prediction
Contents: Abstract, 1. Data Overview, 2. Reading the Data, 3. Computing Classification Metrics
Abstract
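Learning by doing: I'm glad to have this chance to study alongside like-minded peers. This round uses a real Tianchi competition project to learn the theory and analysis workflow of data mining. This post covers two things: understanding the competition data and objective, and getting clear on the scoring system. [Tianchi competition page](/competition/entrance/531830/introduction)
1. Data Overview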
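Put simply, this step is about finding out how much data there is, what it contains, which fields exist, and what each field means; then, with a solid grasp of the competition's business objective, choosing a suitable analysis method. The competition data has roughly a million-plus records and nearly 50 fields. With supervised learning, the samples must be divided into a training set and a test set, commonly at an 8:2 ratio. The competition data is already split, so no further processing is needed.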
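For a dataset that has not been pre-split, the 8:2 split described above could be sketched roughly as follows. This is only an illustration using the standard library: `split_indices`, its parameters, and the seed value are hypothetical names, not part of the competition code. In practice scikit-learn's `train_test_split` is the more common choice.

```python
import random


def split_indices(n_rows, test_ratio=0.2, seed=42):
    """Shuffle row indices and split them into (train, test) lists.

    Hypothetical helper for illustration; test_ratio=0.2 gives an 8:2 split.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = int(round(n_rows * test_ratio))
    return indices[n_test:], indices[:n_test]


# Toy usage: 10 rows become 8 training rows and 2 test rows.
train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 8 2
```

The returned index lists can then be used to select rows from whatever table holds the samples.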
2. Reading the Data
Read the data files and take a look at their basic properties.
The code is as follows:
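import ssl
import warnings

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

warnings.filterwarnings('ignore')  # silence library warnings in the notebook
# Disable HTTPS certificate verification for downloads (kept from the original setup)
ssl._create_default_https_context = ssl._create_unverified_context

train = pd.read_csv('train.csv')  # read the training set
testA = pd.read_csv('testA.csv')  # read test set A
print("train data shape:", train.shape)  # data size as (rows, columns)
print("testA data shape:", testA.shape)

pd.set_option('display.max_columns', None)  # display all columns
# pd.set_option('display.max_rows', None)
train.head(10)  # first 10 rows; head() shows 5 rows by default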
The result is as follows:
3. Computing Classification Metrics
Common evaluation metrics for classification algorithms are:
1. Confusion Matrix
2. Accuracy
3. Precision
4. Recall
5. F1 Score
6. P-R Curve (Precision-Recall Curve)
7. ROC (Receiver Operating Characteristic)
8. AUC (Area Under Curve)
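As a rough sketch of how the first five metrics relate to one another, the snippet below derives accuracy, precision, recall, and F1 directly from the four confusion-matrix counts. `confusion_counts` is a hypothetical helper written for this illustration, label 1 is assumed to be the positive class, and the toy labels are the same ones used in this post's sample code.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels, treating 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp


y_pred = [0, 1, 0, 1, 1, 0, 1, 1, 1, 1]  # predicted labels
y_true = [0, 1, 1, 0, 0, 0, 1, 0, 0, 1]  # true labels

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are truly positive
recall = tp / (tp + fn)                     # share of true positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(tn, fp, fn, tp)  # 2 4 1 3
print(accuracy, recall)  # 0.5 0.75
```

On this sample, precision works out to 3/7 and F1 to 6/11, which shows how a model can reach high recall while still producing many false positives.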
Common evaluation metrics for financial risk control prediction are:
1. KS (Kolmogorov-Smirnov)
2. ROC
3. AUC
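To make KS and AUC more concrete, here is a minimal sketch that computes both by hand from predicted scores. It assumes the usual definitions: KS is the largest gap between TPR and FPR over all thresholds, and AUC is the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. `ks_statistic` and `auc_pairwise` are hypothetical helper names, and the toy scores are the ones from this post's AUC sample.

```python
def ks_statistic(y_true, y_scores):
    """Largest |TPR - FPR| over all score thresholds (label 1 is positive)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = 0.0
    for threshold in sorted(set(y_scores), reverse=True):
        # Predict positive when the score reaches the threshold.
        tp = sum(1 for y, s in zip(y_true, y_scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_scores) if s >= threshold and y == 0)
        best = max(best, abs(tp / n_pos - fp / n_neg))
    return best


def auc_pairwise(y_true, y_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_scores) if y == 1]
    neg = [s for y, s in zip(y_true, y_scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 1, 1, 1, 0, 0]
y_scores = [0.1, 0.5, 0.45, 0.75, 0.8, 0.3]

ks = ks_statistic(y_true, y_scores)
auc = auc_pairwise(y_true, y_scores)
print(round(ks, 4), round(auc, 4))  # 0.6667 0.6667
```

The pairwise AUC loop is quadratic in the number of samples, so it is only suitable for small illustrations like this one; library implementations use sorting instead.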
Code implementing each of these metrics is as follows:
## 1. Compute and print the confusion matrix
import numpy as np
from sklearn.metrics import confusion_matrix
y_pred = [0, 1, 0, 1, 1, 0, 1, 1, 1, 1]  # predicted labels
y_true = [0, 1, 1, 0, 0, 0, 1, 0, 0, 1]  # true labels
print('Confusion matrix:\n', confusion_matrix(y_true, y_pred))

## 2. Accuracy
from sklearn.metrics import accuracy_score
y_pred = [0, 1, 0, 1]
y_true = [0, 0, 0, 1]
print('ACC:', accuracy_score(y_true, y_pred))

## 3. Precision, recall, F1 score
from sklearn import metrics
y_pred = [0, 1, 0, 1, 1, 0, 1, 1, 1, 1]
y_true = [0, 1, 1, 0, 0, 0, 1, 0, 0, 1]
print('Precision:', metrics.precision_score(y_true, y_pred))
print('Recall:', metrics.recall_score(y_true, y_pred))
print('F1-score:', metrics.f1_score(y_true, y_pred))

## 4. P-R curve
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
y_pred = [0, 1, 1, 0, 1, 1, 0, 1, 1, 1]
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 0]
precision, recall, thresholds = precision_recall_curve(y_true, y_pred)
plt.plot(recall, precision)  # recall on the x-axis, precision on the y-axis

## 5. ROC curve
from sklearn.metrics import roc_curve
y_pred = [1, 0, 1, 1, 1, 0, 0, 0, 1, 0]
y_true = [1, 0, 0, 0, 1, 1, 0, 1, 1, 0]
FPR, TPR, thresholds = roc_curve(y_true, y_pred)
plt.figure()  # new figure so the ROC curve is not drawn over the P-R curve
plt.title('ROC')
plt.plot(FPR, TPR, 'b')
plt.plot([0, 1], [0, 1], 'r--')  # diagonal: a random classifier
plt.ylabel('TPR')
plt.xlabel('FPR')

## 6. AUC
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 1, 1, 1, 0, 0])
y_scores = np.array([0.1, 0.5, 0.45, 0.75, 0.8, 0.3])
print('AUC score:', roc_auc_score(y_true, y_scores))

## 7. KS value; in practice KS is usually derived from the ROC curve
from sklearn.metrics import roc_curve
y_pred = [1, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_true = [0, 1, 1, 1, 0, 0, 0, 0, 0, 1]
FPR, TPR, thresholds = roc_curve(y_true, y_pred)
KS = abs(FPR - TPR).max()  # largest gap between the FPR and TPR curves
print('KS:', KS)
Running the code produces the following output: