
Financial Risk Control for Beginners - Loan Default Prediction - Data Analysis

Date: 2023-10-16 04:56:55


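1. Data analysis

Inspect the data and get familiar with it in preparation for the feature engineering that follows. The main goals are:

1. The value of EDA lies mainly in getting to know the basic state of the whole dataset (value types, categories, value ranges, missing values, outliers, and so on) and in judging whether the dataset is suitable for further modeling;

2. Understand the relationships among the variables, and between each variable and the prediction target;

3. Prepare for feature engineering;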

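2. Main analysis steps

2.1 Overall view of the data:

Read the dataset and check its size and the number of raw features;

Use info to get familiar with the data types;

Take a rough look at the basic statistics of each feature;

2.2 Missing values and unique values:

Check the missing values;

Check for features with a single unique value;

2.3 A closer look at the data types:

Categorical data;

Numerical data;

Discrete numerical data

Continuous numerical data

2.4 Relationships in the data:

Relationships between features;

Relationships between features and the target variable;

2.5 Generate a data report with pandas_profiling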

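3. Code examples

3.1 Import libraries

# Import libraries
import datetime
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

warnings.filterwarnings('ignore')
pd.options.display.max_columns = None
pd.options.display.max_rows = None
# None means "no limit"; the older value -1 has been deprecated since pandas 1.0
pd.set_option('display.max_colwidth', None)
# Jupyter magic, only valid inside a notebook
%matplotlib inline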

3.2 Read the data files

# Read the data
train_data = pd.read_csv('./RawData/train.csv')
test_data = pd.read_csv('./RawData/testA.csv')

3.3 First look

# Look at the first rows of the training data
train_data.head()

# Columns of the training data
train_data.columns

Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
       'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
       'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
       'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
       'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
       'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
       'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
       'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8',
       'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
      dtype='object')

# Columns of the test data
test_data.columns

Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
       'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
       'annualIncome', 'verificationStatus', 'issueDate', 'purpose',
       'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow',
       'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal',
       'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType',
       'earliesCreditLine', 'title', 'policyCode', 'n0', 'n1', 'n2', 'n3',
       'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
      dtype='object')

# Data dimensions
print(train_data.shape)
print(test_data.shape)

(800000, 47)
(200000, 46)

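The training set has one more variable than the test set: isDefault, the label that says whether the borrower defaulted. All other variables are inputs;

Total rows: training set 800000, test set 200000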
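Rather than comparing the two column listings by eye, the difference can be computed with a set operation. The sketch below uses two tiny stand-in frames, `train_demo` and `test_demo`, whose names and values are made up for illustration; with the real data the same two lines would be applied to `train_data` and `test_data`.

```python
import pandas as pd

# Hypothetical stand-ins for train_data / test_data (values are invented)
train_demo = pd.DataFrame({'id': [0, 1], 'loanAmnt': [35000.0, 18000.0], 'isDefault': [1, 0]})
test_demo = pd.DataFrame({'id': [800000, 800001], 'loanAmnt': [14000.0, 20000.0]})

# Columns present in one frame but not in the other
only_in_train = sorted(set(train_demo.columns) - set(test_demo.columns))
only_in_test = sorted(set(test_demo.columns) - set(train_demo.columns))

print(only_in_train)  # ['isDefault']
print(only_in_test)   # []
```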

# Variable types and null values
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
id                    800000 non-null int64
loanAmnt              800000 non-null float64
term                  800000 non-null int64
interestRate          800000 non-null float64
installment           800000 non-null float64
grade                 800000 non-null object
subGrade              800000 non-null object
employmentTitle       799999 non-null float64
employmentLength      753201 non-null object
homeOwnership         800000 non-null int64
annualIncome          800000 non-null float64
verificationStatus    800000 non-null int64
issueDate             800000 non-null object
isDefault             800000 non-null int64
purpose               800000 non-null int64
postCode              799999 non-null float64
regionCode            800000 non-null int64
dti                   799761 non-null float64
delinquency_2years    800000 non-null float64
ficoRangeLow          800000 non-null float64
ficoRangeHigh         800000 non-null float64
openAcc               800000 non-null float64
pubRec                800000 non-null float64
pubRecBankruptcies    799595 non-null float64
revolBal              800000 non-null float64
revolUtil             799469 non-null float64
totalAcc              800000 non-null float64
initialListStatus     800000 non-null int64
applicationType       800000 non-null int64
earliesCreditLine     800000 non-null object
title                 799999 non-null float64
policyCode            800000 non-null float64
n0                    759730 non-null float64
n1                    759730 non-null float64
n2                    759730 non-null float64
n3                    759730 non-null float64
n4                    766761 non-null float64
n5                    759730 non-null float64
n6                    759730 non-null float64
n7                    759730 non-null float64
n8                    759729 non-null float64
n9                    759730 non-null float64
n10                   766761 non-null float64
n11                   730248 non-null float64
n12                   759730 non-null float64
n13                   759730 non-null float64
n14                   759730 non-null float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB

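The training set has 800000 rows, and some attributes contain null values:

employmentTitle (employment title): float64, 1 null;
employmentLength (employment length in years): object, non-numeric (string), many nulls;
postCode (first 3 digits of the postal code the borrower provided in the loan application): float64, 1 null;
dti (debt-to-income ratio): float64, several nulls;
pubRecBankruptcies (number of public records cleared): float64, several nulls;
revolUtil (revolving line utilization rate, i.e. the amount of credit the borrower is using relative to all available revolving credit): float64, several nulls;
title (loan title provided by the borrower): float64, 1 null;
n0-n14 (count features): float64, all with a large number of nulls;

The data also contains some non-numeric columns:

grade (loan grade), subGrade (loan sub-grade), employmentLength (employment length in years), issueDate (month in which the loan was issued), earliesCreditLine (month in which the borrower's earliest reported credit line was opened)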

# List the variables with null values and their counts
train_data.isnull().sum()

id                        0
loanAmnt                  0
term                      0
interestRate              0
installment               0
grade                     0
subGrade                  0
employmentTitle           1
employmentLength      46799
homeOwnership             0
annualIncome              0
verificationStatus        0
issueDate                 0
isDefault                 0
purpose                   0
postCode                  1
regionCode                0
dti                     239
delinquency_2years        0
ficoRangeLow              0
ficoRangeHigh             0
openAcc                   0
pubRec                    0
pubRecBankruptcies      405
revolBal                  0
revolUtil               531
totalAcc                  0
initialListStatus         0
applicationType           0
earliesCreditLine         0
title                     1
policyCode                0
n0                    40270
n1                    40270
n2                    40270
n3                    40270
n4                    33239
n5                    40270
n6                    40270
n7                    40270
n8                    40271
n9                    40270
n10                   33239
n11                   69752
n12                   40270
n13                   40270
n14                   40270
dtype: int64

Take a rough look at the summary statistics of each variable

train_data.describe()

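The first pass above shows that:

1. The training set has 800000 rows and 47 variables; the test set has 200000 rows and 46 variables; isDefault is the label of the training set;

2. 22 variables in the training set contain null values; the largest null count is 69752 (n11), about 8.7% of the rows;

3. Judging from the statistics of isDefault, the labels are imbalanced, with 0 being the more common value;

4. The variables include many continuous variables and some categorical ones;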
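The two percentages behind points 2 and 3 can be checked with a few lines of arithmetic. This is only a worked calculation on numbers already printed in this article: the null count of n11 from `isnull().sum()` and the isDefault value counts reported in section 3.6. The variable names are chosen here for illustration.

```python
n_rows = 800000
max_missing = 69752                          # null count of n11
n_default, n_non_default = 159610, 640390    # isDefault == 1 / isDefault == 0

missing_rate = max_missing / n_rows
default_rate = n_default / (n_default + n_non_default)
imbalance_ratio = n_non_default / n_default

print(f'{missing_rate:.1%}')      # 8.7%
print(f'{default_rate:.1%}')      # 20.0%
print(f'{imbalance_ratio:.2f}')   # 4.01
```

So roughly one loan in five is a default, which is imbalanced but not extreme.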

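3.4 Missing values and missing rates in detail

# Visualize the share of NaN values per column
missing = train_data.isnull().sum() / len(train_data)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()

Looking column by column shows which columns contain NaN and how many. The point is to see whether the NaN count of any column is really large: a column that is mostly NaN carries little usable information about the label and can be considered for removal, while a column with only a small share of missing values is usually better filled.

You can also look row by row: if most columns of some samples are missing and there are enough samples overall, those rows can be considered for removal.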
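The column-wise and row-wise rules above can be sketched as follows. The frame `demo`, the two 0.5 thresholds, and the median fill are all assumptions made for this illustration; the article does not prescribe specific cut-offs or a fill strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [np.nan, np.nan, np.nan, np.nan, 1.0],   # 80% missing
    'c': [1.0, np.nan, 3.0, 4.0, np.nan],         # 40% missing
    'd': [1.0, 2.0, np.nan, 4.0, np.nan],         # 40% missing
})
col_threshold = 0.5   # assumed cut-off: drop columns with more than 50% NaN
row_threshold = 0.5   # assumed cut-off: drop rows with more than 50% NaN

# Column-wise: keep columns whose missing rate is at or below the threshold
col_missing = demo.isnull().mean()
kept_cols = col_missing[col_missing <= col_threshold].index
reduced = demo[kept_cols]

# Row-wise: drop samples where most of the remaining columns are missing
row_missing = reduced.isnull().mean(axis=1)
cleaned = reduced[row_missing <= row_threshold]

# Fill what is left; the median is one common choice among several
filled = cleaned.fillna(cleaned.median())
print(filled)
```

Here column b is dropped, the last row (missing two of the three remaining columns) is dropped, and the two remaining gaps are filled with column medians.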

# Check for variables that take only a single value
one_value_feature = [col for col in train_data.columns if train_data[col].nunique() <= 1]
one_value_feature_test = [col for col in test_data.columns if test_data[col].nunique() <= 1]
print('train_data : the feature of only having one value is ', one_value_feature)
print('test_data: the feature of only having one value is ', one_value_feature_test)

train_data : the feature of only having one value is ['policyCode']
test_data: the feature of only having one value is ['policyCode']
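A feature with a single value, like policyCode here, cannot help separate defaults from non-defaults, so a natural follow-up is to drop it. The article itself only detects such columns; the sketch below shows the removal on a made-up frame `demo`.

```python
import pandas as pd

# Hypothetical toy data with one constant column
demo = pd.DataFrame({
    'loanAmnt': [35000.0, 18000.0, 12000.0],
    'policyCode': [1.0, 1.0, 1.0],
})

# Same detection rule as above, then drop the constant columns
constant_cols = [col for col in demo.columns if demo[col].nunique() <= 1]
demo_reduced = demo.drop(columns=constant_cols)

print(constant_cols)               # ['policyCode']
print(list(demo_reduced.columns))  # ['loanAmnt']
```

Note that `nunique()` ignores NaN by default, so a column that is constant apart from missing values is also flagged.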

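3.5 Feature types

Features are generally divided into categorical and numerical features, and numerical features are further divided into continuous and discrete ones;

The levels of a categorical feature sometimes have no numerical relationship and sometimes do (for example, grades such as A, B, C may be plain labels or may imply an order);

Numerical features could be fed to a model directly, but risk-control practitioners often bin them and convert the bins to WOE encoding, for example to build a standard scorecard. In terms of model behavior, binning is mainly meant to reduce model complexity, reduce the influence of noise in the variable, and strengthen the relationship between the independent and dependent variables, which makes the model more stable;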
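To make the binning and WOE step concrete, here is a minimal sketch on invented data. It assumes equal-frequency bins from `pd.qcut` and the convention WOE = ln(share of bads in the bin / share of goods in the bin), so a positive WOE marks a riskier bin; some texts use the inverse ratio, which only flips the sign. The column values and the choice of three bins are illustrative, not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: higher interest rates default more often
demo = pd.DataFrame({
    'interestRate': [5.0, 6.0, 7.0, 8.0, 12.0, 13.0, 14.0, 15.0, 20.0, 21.0, 22.0, 23.0],
    'isDefault':    [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1],
})

# Equal-frequency binning into three bins
demo['bin'] = pd.qcut(demo['interestRate'], q=3)

# Count bads (defaults) and goods per bin
grouped = demo.groupby('bin', observed=True)['isDefault'].agg(bad='sum', total='count')
grouped['good'] = grouped['total'] - grouped['bad']

# Share of all bads / all goods that fall into each bin
bad_share = grouped['bad'] / grouped['bad'].sum()
good_share = grouped['good'] / grouped['good'].sum()

# WOE per bin and each bin's contribution to the information value (IV).
# A bin with zero bads or zero goods would give an infinite WOE and needs smoothing.
grouped['woe'] = np.log(bad_share / good_share)
grouped['iv'] = (bad_share - good_share) * grouped['woe']

print(grouped[['bad', 'good', 'woe', 'iv']])
print('IV:', grouped['iv'].sum())
```

In this toy case the WOE rises from about -1.10 in the lowest-rate bin to about +1.10 in the highest, which is the monotone pattern a scorecard builder usually looks for.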

# Numerical features
numerical_features = list(train_data.select_dtypes(exclude=['object']).columns)
# Categorical features
category_features = list(filter(lambda x: x not in numerical_features, list(train_data.columns)))
print('numerical_features : ', numerical_features)
print('category_features : ', category_features)

numerical_features : ['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle', 'homeOwnership', 'annualIncome', 'verificationStatus', 'isDefault', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType', 'title', 'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']
category_features : ['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']

Numerical variables are split into continuous and discrete ones by counting their distinct values: a variable with no more than 10 distinct values is treated as discrete, otherwise as continuous;

def get_numerical_serial_features(data, features):
    # "serial" = continuous, "noserial" = discrete
    numerical_serial_features = []
    numerical_noserial_features = []
    for feature in features:
        temp = data[feature].nunique()
        if temp <= 10:
            numerical_noserial_features.append(feature)
            continue
        numerical_serial_features.append(feature)
    return numerical_serial_features, numerical_noserial_features

numerical_serial_features, numerical_noserial_features = get_numerical_serial_features(train_data, numerical_features)
print('numerical_serial_features: ', numerical_serial_features)
print('numerical_noserial_features: ', numerical_noserial_features)

numerical_serial_features: ['id', 'loanAmnt', 'interestRate', 'installment', 'employmentTitle', 'annualIncome', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'title', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n13', 'n14']
numerical_noserial_features: ['term', 'homeOwnership', 'verificationStatus', 'isDefault', 'initialListStatus', 'applicationType', 'policyCode', 'n11', 'n12']

3.6 Value counts of the discrete numerical features

for feature in numerical_noserial_features:
    print('****************************')
    print(feature, ': \n', train_data[feature].value_counts())

****************************
term :
3    606902
5    193098
Name: term, dtype: int64
****************************
homeOwnership :
0    395732
1    317660
2     86309
3       185
5        81
4        33
Name: homeOwnership, dtype: int64
****************************
verificationStatus :
1    309810
2    248968
0    241222
Name: verificationStatus, dtype: int64
****************************
isDefault :
0    640390
1    159610
Name: isDefault, dtype: int64
****************************
initialListStatus :
0    466438
1    333562
Name: initialListStatus, dtype: int64
****************************
applicationType :
0    784586
1     15414
Name: applicationType, dtype: int64
****************************
policyCode :
1.0    800000
Name: policyCode, dtype: int64
****************************
n11 :
0.0    729682
1.0       540
2.0        24
4.0         1
3.0         1
Name: n11, dtype: int64
****************************
n12 :
0.0    757315
1.0      2281
2.0       115
3.0        16
4.0         3
Name: n12, dtype: int64

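Continuous numerical feature analysis

# Plot the distribution of every continuous numerical feature
f = pd.melt(train_data, value_vars=numerical_serial_features)
g = sns.FacetGrid(f, col='variable', col_wrap=3, sharex=False, sharey=False)
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest replacement
g = g.map(sns.histplot, 'value', kde=True)

Look at the distribution of each continuous variable and check whether it is roughly normal; if it is not, take the log and check again;

If you plan to standardize a batch of features in one pass, first set aside the ones that have already been normalized;

Why normalize: in some cases roughly normal inputs let a model converge faster, and some models assume or benefit from them (GMM models each component as Gaussian, and distance-based methods such as KNN are sensitive to scale and heavy skew). It is enough to keep the data from being too skewed; strong skew can hurt prediction results;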
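One simple way to judge whether a log transform helps is to compare the sample skewness before and after. The sketch below uses synthetic log-normal data as a stand-in for a right-skewed variable such as income; the series `income`, its parameters, and the seed are assumptions made for illustration, and `np.log1p` is used so that zeros would not turn into -inf.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (hypothetical stand-in for a real feature)
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10.0, sigma=1.0, size=10_000))

skew_raw = income.skew()            # strongly positive for log-normal data
skew_log = np.log1p(income).skew()  # close to 0 after the transform

print(f'skew before: {skew_raw:.2f}')
print(f'skew after:  {skew_log:.2f}')
```

A skewness near zero after the transform suggests the logged variable is roughly symmetric; a histogram remains the more informative check.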

3.7 Categorical variable analysis

for feature in category_features:
    print('********************')
    print(feature, ':\n', train_data[feature].value_counts())

********************
grade :
B    233690
C    227118
A    139661
D    119453
E     55661
F     19053
G      5364
Name: grade, dtype: int64
********************
subGrade :
C1    50763
B4    49516
B5    48965
B3    48600
C2    47068
C3    44751
C4    44272
B2    44227
B1    42382
C5    40264
A5    38045
A4    30928
D1    30538
D2    26528
A1    25909
D3    23410
A3    22655
A2    22124
D4    21139
D5    17838
E1    14064
E2    12746
E3    10925
E4     9273
E5     8653
F1     5925
F2     4340
F3     3577
F4     2859
F5     2352
G1     1759
G2     1231
G3      978
G4      751
G5      645
Name: subGrade, dtype: int64
********************
employmentLength :
10+ years    262753
2 years       72358
< 1 year      64237
3 years       64152
1 year        52489
5 years       50102
4 years       47985
6 years       37254
8 years       36192
7 years       35407
9 years       30272
Name: employmentLength, dtype: int64
********************
issueDate :
(one row per monthly issue date, sorted by count from 29066 down to 1; the year of every date is missing from this listing, so the rows are not reproduced)
Name: issueDate, dtype: int64
********************
earliesCreditLine :
Aug-2001    5567
Aug-2002    5403
...
Dec-1960       1
Aug-1958       1
Name: earliesCreditLine, dtype: int64
(several hundred month-year rows omitted; counts fall from 5567 to 1, and many rows in this listing have lost their year)

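# Visualize the 20 most frequent values of each categorical feature
for feature in category_features:
    counts = train_data[feature].value_counts(dropna=False)[:20]
    plt.figure(figsize=(8, 8))
    # keyword arguments are required by recent seaborn versions; counts on x gives horizontal bars
    sns.barplot(x=counts.values, y=counts.index.astype(str))
    plt.show()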

3.8 Visualize a feature by label class

# Split the training data by label: defaulted (1) and not defaulted (0)
train_loan_fraud = train_data.loc[train_data['isDefault'] == 1]
train_loan_nofraud = train_data.loc[train_data['isDefault'] == 0]

# Distribution of the categorical variables under each label value
fig, axes = plt.subplots(len(category_features) - 2, 2, figsize=(15, 12))
i_axes = 0
for feature in category_features:
    # The two date-like columns have too many levels for a bar chart
    if feature == 'issueDate' or feature == 'earliesCreditLine':
        continue
    train_loan_fraud.groupby(feature)[feature].count().plot(kind='barh', ax=axes[i_axes, 0], title='Count of ' + feature + ' fraud')
    train_loan_nofraud.groupby(feature)[feature].count().plot(kind='barh', ax=axes[i_axes, 1], title='Count of ' + feature + ' nofraud')
    i_axes += 1
plt.show()
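Side-by-side count plots are hard to compare because the two classes differ in size. A complementary view, sketched here on invented data, is the default rate per category level, computed with a groupby; the frame `demo` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({
    'grade': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
    'isDefault': [0, 0, 0, 1, 0, 0, 1, 1, 1, 1],
})

# The mean of a 0/1 label within a group is the default rate of that group
rate_by_grade = demo.groupby('grade')['isDefault'].agg(['count', 'mean'])
rate_by_grade = rate_by_grade.rename(columns={'mean': 'default_rate'})

print(rate_by_grade)
```

Keeping the count next to the rate matters: a rate computed on a handful of rows is much less reliable than one computed on thousands.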

# Distribution of the continuous variables under each label value
# Only 'id' needs skipping here: 'isDefault' is not in numerical_serial_features
plot_features = [f for f in numerical_serial_features if f not in ('isDefault', 'id')]
fig, axes = plt.subplots(len(plot_features), 2, figsize=(15, 80))
for i_axes, feature in enumerate(plot_features):
    # log1p(0) = 0, so zeros do not become -inf; clipping at 0 guards against negative values
    train_loan_fraud[feature].clip(lower=0).apply(np.log1p).plot(kind='hist', bins=100, title='Log ' + feature + ' - fraud', color='r', ax=axes[i_axes, 0])
    train_loan_nofraud[feature].clip(lower=0).apply(np.log1p).plot(kind='hist', bins=100, title='Log ' + feature + ' - nofraud', color='b', ax=axes[i_axes, 1])
plt.show()

Two details matter in this cell. With np.log instead of np.log1p, every feature that contains zeros yields -inf and its histogram fails. And allocating len(numerical_serial_features) - 2 rows leaves one row too few, because only id is actually skipped. A run with np.log, 31 rows, and a try/except around the plotting calls printed:

***************************************************
autodetected range of [-inf, 12.843577615174548] is not finite
employmentTitle
***************************************************
autodetected range of [-inf, 15.979589849307954] is not finite
annualIncome
***************************************************
autodetected range of [-inf, 2.5649493574615367] is not finite
purpose
***************************************************
autodetected range of [-inf, 6.842683282238422] is not finite
postCode
***************************************************
autodetected range of [-inf, 3.91005428146] is not finite
regionCode
***************************************************
autodetected range of [-inf, 6.906754778648554] is not finite
dti
***************************************************
autodetected range of [-inf, 3.295836866004329] is not finite
delinquency_2years
***************************************************
autodetected range of [-inf, 4.330733340286331] is not finite
openAcc
***************************************************
autodetected range of [-inf, 4.454347296253507] is not finite
pubRec
***************************************************
autodetected range of [-inf, 2.1972245773362196] is not finite
pubRecBankruptcies
***************************************************
autodetected range of [-inf, 14.373248011505062] is not finite
revolBal
***************************************************
autodetected range of [-inf, 5.194622130209272] is not finite
revolUtil
***************************************************
autodetected range of [-inf, 11.029666368922044] is not finite
title
***************************************************
autodetected range of [-inf, 3.5263605246161616] is not finite
n0
***************************************************
autodetected range of [-inf, 3.4011973816621555] is not finite
n1
***************************************************
autodetected range of [-inf, 3.784189633918261] is not finite
n2
***************************************************
autodetected range of [-inf, 3.784189633918261] is not finite
n3
***************************************************
autodetected range of [-inf, 3.7376696182833684] is not finite
n4
***************************************************
autodetected range of [-inf, 4.02535169073515] is not finite
n5
***************************************************
autodetected range of [-inf, 4.852030263919617] is not finite
n6
***************************************************
autodetected range of [-inf, 4.276666119016055] is not finite
n7
***************************************************
autodetected range of [-inf, 3.784189633918261] is not finite
n9
***************************************************
autodetected range of [-inf, 4.31748811353631] is not finite
n10
***************************************************
autodetected range of [-inf, 3.258096538021482] is not finite
n13
***************************************************
index 31 is out of bounds for axis 0 with size 31
n14

3.9 Inspecting and processing dates

# Convert to datetime. issueDateDT is the number of days between each issue date
# and the earliest issue date in the dataset.
# The source hard-codes that date as '-06-01' with the year missing, so it is derived from the data here.
train_data['issueDate'] = pd.to_datetime(train_data['issueDate'], format='%Y-%m-%d')
startdate = train_data['issueDate'].min()
train_data['issueDateDT'] = (train_data['issueDate'] - startdate).dt.days

# Convert the test set the same way, from its own issueDate column and with the same start date
test_data['issueDate'] = pd.to_datetime(test_data['issueDate'], format='%Y-%m-%d')
test_data['issueDateDT'] = (test_data['issueDate'] - startdate).dt.days

plt.hist(train_data['issueDateDT'], label='train')
plt.hist(test_data['issueDateDT'], label='test')
plt.legend()
plt.title('Distribution of issueDateDT dates')
# The issueDateDT ranges of train and test overlap, so a time-based split is not a sensible validation scheme
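The overlap that the histogram shows visually can also be checked numerically by comparing the date ranges of the two sets. The dates below are invented for illustration; with the real data the two series would be `train_data['issueDate']` and `test_data['issueDate']` after conversion to datetime.

```python
import pandas as pd

# Hypothetical issue dates for the two sets
train_dates = pd.to_datetime(pd.Series(['2015-01-01', '2016-03-01', '2017-06-01']))
test_dates = pd.to_datetime(pd.Series(['2015-06-01', '2016-09-01', '2017-02-01']))

# The ranges overlap if the later of the two starts is not after the earlier of the two ends
overlap_start = max(train_dates.min(), test_dates.min())
overlap_end = min(train_dates.max(), test_dates.max())
has_overlap = overlap_start <= overlap_end

# Share of test rows whose date lies inside the training date range
share_inside = test_dates.between(train_dates.min(), train_dates.max()).mean()

print(has_overlap)    # True
print(share_inside)   # 1.0
```

If the test dates all came after the training dates, `has_overlap` would be False and a time-ordered validation split would mimic the test setting; with a large overlap, a random or stratified split is the closer match.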

3.10 Using pivot tables

# Pivot table: index may hold several keys, columns is optional, and the aggregation
# function aggfunc is applied to the items listed in values.
pivot = pd.pivot_table(train_data, index=['grade'], columns=['issueDateDT'], values=['loanAmnt'], aggfunc='sum')
pivot
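Because the real pivot has one column per distinct issueDateDT value, its shape is easier to understand on a small example first. The frame `demo` below is invented; it shows how rows with the same (grade, issueDateDT) pair are summed into one cell.

```python
import pandas as pd

# Hypothetical toy data: grade B has two loans on day 0
demo = pd.DataFrame({
    'grade': ['A', 'A', 'B', 'B', 'B'],
    'issueDateDT': [0, 31, 0, 0, 31],
    'loanAmnt': [1000.0, 2000.0, 1500.0, 500.0, 3000.0],
})

# One row per grade, one column per issueDateDT, cells hold the summed loan amount
pivot_demo = pd.pivot_table(demo, index='grade', columns='issueDateDT',
                            values='loanAmnt', aggfunc='sum')

print(pivot_demo)
# issueDateDT      0       31
# grade
# A            1000.0  2000.0
# B            2000.0  3000.0
```

Combinations that never occur in the data would show up as NaN unless `fill_value` is passed.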
