金融风控-贷款违约预测 来源:金融风控-贷款违约预测
赛题以预测用户贷款是否违约为任务,数据集报名后可见并可下载,该数据来自某信贷平台的贷款记录,总数据量超过120w,包含47列变量信息,其中15列为匿名变量。为了保证比赛的公平性,将会从中抽取80万条作为训练集,20万条作为测试集A,20万条作为测试集B,同时会对employmentTitle、purpose、postCode和title等信息进行脱敏。
import pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport seaborn as sns import datetimeimport warningswarnings.filterwarnings('ignore' ) data_tr = pd.read_csv('data/金融贷款数据/train.csv' ) data_tt = pd.read_csv('data/金融贷款数据/testA.csv' )
查看数据规模 print ('Train data shape:' ,data_tr.shape)print ('TestA data shape:' ,data_tt.shape)
Train data shape: (800000, 47)
TestA data shape: (200000, 46)
Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8',
'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
dtype='object')
查看数据前10行
id loanAmnt term interestRate installment grade subGrade employmentTitle employmentLength homeOwnership ... n5 n6 n7 n8 n9 n10 n11 n12 n13 n14 0 0 35000.0 5 19.52 917.97 E E2 320.0 2 years 2 ... 9.0 8.0 4.0 12.0 2.0 7.0 0.0 0.0 0.0 2.0 1 1 18000.0 5 18.49 461.90 D D2 219843.0 5 years 0 ... NaN NaN NaN NaN NaN 13.0 NaN NaN NaN NaN 2 2 12000.0 5 16.99 298.17 D D3 31698.0 8 years 0 ... 0.0 21.0 4.0 5.0 3.0 11.0 0.0 0.0 0.0 4.0 3 3 11000.0 3 7.26 340.96 A A4 46854.0 10+ years 1 ... 16.0 4.0 7.0 21.0 6.0 9.0 0.0 0.0 0.0 1.0 4 4 3000.0 3 12.99 101.07 C C2 54.0 NaN 1 ... 4.0 9.0 10.0 15.0 7.0 12.0 0.0 0.0 0.0 4.0 5 5 11000.0 3 7.99 344.65 A A5 51727.0 7 years 0 ... 1.0 48.0 2.0 3.0 2.0 19.0 0.0 0.0 0.0 0.0 6 6 2050.0 3 7.69 63.95 A A4 180083.0 9 years 0 ... 11.0 3.0 10.0 18.0 3.0 12.0 0.0 0.0 0.0 3.0 7 7 11500.0 3 14.98 398.54 C C3 214017.0 1 year 1 ... 16.0 10.0 5.0 21.0 4.0 8.0 0.0 0.0 0.0 2.0 8 8 12000.0 3 12.99 404.27 C C2 188.0 5 years 2 ... 7.0 2.0 13.0 17.0 11.0 15.0 NaN 0.0 0.0 6.0 9 9 6500.0 3 10.99 212.78 B B4 54.0 NaN 1 ... 21.0 24.0 6.0 39.0 5.0 7.0 0.0 0.0 0.0 8.0
10 rows × 47 columns
🌟 查看数据类型 挺重要的,要将object类型的数据转换成float
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 800000 non-null int64
1 loanAmnt 800000 non-null float64
2 term 800000 non-null int64
3 interestRate 800000 non-null float64
4 installment 800000 non-null float64
5 grade 800000 non-null object
6 subGrade 800000 non-null object
7 employmentTitle 799999 non-null float64
8 employmentLength 753201 non-null object
9 homeOwnership 800000 non-null int64
10 annualIncome 800000 non-null float64
11 verificationStatus 800000 non-null int64
12 issueDate 800000 non-null object
13 isDefault 800000 non-null int64
14 purpose 800000 non-null int64
15 postCode 799999 non-null float64
16 regionCode 800000 non-null int64
17 dti 799761 non-null float64
18 delinquency_2years 800000 non-null float64
19 ficoRangeLow 800000 non-null float64
20 ficoRangeHigh 800000 non-null float64
21 openAcc 800000 non-null float64
22 pubRec 800000 non-null float64
23 pubRecBankruptcies 799595 non-null float64
24 revolBal 800000 non-null float64
25 revolUtil 799469 non-null float64
26 totalAcc 800000 non-null float64
27 initialListStatus 800000 non-null int64
28 applicationType 800000 non-null int64
29 earliesCreditLine 800000 non-null object
30 title 799999 non-null float64
31 policyCode 800000 non-null float64
32 n0 759730 non-null float64
33 n1 759730 non-null float64
34 n2 759730 non-null float64
35 n3 759730 non-null float64
36 n4 766761 non-null float64
37 n5 759730 non-null float64
38 n6 759730 non-null float64
39 n7 759730 non-null float64
40 n8 759729 non-null float64
41 n9 759730 non-null float64
42 n10 766761 non-null float64
43 n11 730248 non-null float64
44 n12 759730 non-null float64
45 n13 759730 non-null float64
46 n14 759730 non-null float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB
查看数据分布
id loanAmnt term interestRate installment employmentTitle homeOwnership annualIncome verificationStatus isDefault ... n5 n6 n7 n8 n9 n10 n11 n12 n13 n14 count 800000.000000 800000.000000 800000.000000 800000.000000 800000.000000 799999.000000 800000.000000 8.000000e+05 800000.000000 800000.000000 ... 759730.000000 759730.000000 759730.000000 759729.000000 759730.000000 766761.000000 730248.000000 759730.000000 759730.000000 759730.000000 mean 399999.500000 14416.818875 3.482745 13.238391 437.947723 72005.351714 0.614213 7.613391e+04 1.009683 0.199513 ... 8.107937 8.575994 8.282953 14.622488 5.592345 11.643896 0.000815 0.003384 0.089366 2.178606 std 230940.252013 8716.086178 0.855832 4.765757 261.460393 106585.640204 0.675749 6.894751e+04 0.782716 0.399634 ... 4.799210 7.400536 4.561689 8.124610 3.216184 5.484104 0.030075 0.062041 0.509069 1.844377 min 0.000000 500.000000 3.000000 5.310000 15.690000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 ... 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 25% 199999.750000 8000.000000 3.000000 9.750000 248.450000 427.000000 0.000000 4.560000e+04 0.000000 0.000000 ... 5.000000 4.000000 5.000000 9.000000 3.000000 8.000000 0.000000 0.000000 0.000000 1.000000 50% 399999.500000 12000.000000 3.000000 12.740000 375.135000 7755.000000 1.000000 6.500000e+04 1.000000 0.000000 ... 7.000000 7.000000 7.000000 13.000000 5.000000 11.000000 0.000000 0.000000 0.000000 2.000000 75% 599999.250000 20000.000000 3.000000 15.990000 580.710000 117663.500000 1.000000 9.000000e+04 2.000000 0.000000 ... 11.000000 11.000000 10.000000 19.000000 7.000000 14.000000 0.000000 0.000000 0.000000 3.000000 max 799999.000000 40000.000000 5.000000 30.990000 1715.420000 378351.000000 5.000000 1.099920e+07 2.000000 1.000000 ... 70.000000 132.000000 79.000000 128.000000 45.000000 82.000000 4.000000 4.000000 39.000000 30.000000
8 rows × 42 columns
查看数据集中的特征缺失值,唯一值等 缺失值 print (f'There are {data_tr.isnull().any ().sum ()} columns in train dataset with missing values.' )
There are 22 columns in train dataset with missing values.
查看缺失率大于50%的特征
have_null_fea_dict = (data_tr.isnull().sum ()/ len (data_tr)).to_dict() fea_null_moreThanHalf = {} for key, value in have_null_fea_dict.items(): if value > 0.5 : fea_null_moreThanHalf[key] = value
{}
具体的查看缺失特征及缺失率
missing = data_tr.isnull().sum () / len (data_tr) missing = missing[missing > 0 ]
employmentTitle 0.000001
employmentLength 0.058499
postCode 0.000001
dti 0.000299
pubRecBankruptcies 0.000506
revolUtil 0.000664
title 0.000001
n0 0.050338
n1 0.050338
n2 0.050338
n3 0.050338
n4 0.041549
n5 0.050338
n6 0.050338
n7 0.050338
n8 0.050339
n9 0.050338
n10 0.041549
n11 0.087190
n12 0.050338
n13 0.050338
n14 0.050338
dtype: float64
missing.sort_values(inplace=True ) missing.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x15c297ee0>
纵向了解哪些列存在 “nan”, 并可以把nan的个数打印,主要的目的在于查看某一列nan存在的个数是否真的很大,如果nan存在的过多,说明这一列对label的影响几乎不起作用了,可以考虑删掉。如果缺失值很小一般可以选择填充。 另外可以横向比较,如果在数据集中,某些样本数据的大部分列都是缺失的且样本足够的情况下可以考虑删除。
Tips: 比赛大杀器lgb模型可以自动处理缺失值,Task4模型会具体学习模型了解模型哦! 查看数据值全是1的特征字段 one_value_fea = [col for col in data_tr.columns if data_tr[col].nunique() <= 1 ] one_value_fea_test = [col for col in data_tt.columns if data_tt[col].nunique() <= 1 ] one_value_fea
['policyCode']
data_tr.get('policyCode' )
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
...
799995 1.0
799996 1.0
799997 1.0
799998 1.0
799999 1.0
Name: policyCode, Length: 800000, dtype: float64
['policyCode']
print (f'There are {len (one_value_fea)} columns in train dataset with one unique value.' )print (f'There are {len (one_value_fea_test)} columns in test dataset with one unique value.' )
There are 1 columns in train dataset with one unique value.
There are 1 columns in test dataset with one unique value.
总结: 47列数据中有22列都缺少数据,这在现实世界中很正常。‘policyCode’具有一个唯一值(或全部缺失)。有很多连续变量和一些分类变量。
处理数据类型、对象类型 filter(function, iterable) filter函数用于过滤序列,过滤掉不符合条件的元素,返回一个迭代器对象,如果要转换为列表,可以使用 list() 来转换。
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 800000 non-null int64
1 loanAmnt 800000 non-null float64
2 term 800000 non-null int64
3 interestRate 800000 non-null float64
4 installment 800000 non-null float64
5 grade 800000 non-null object
6 subGrade 800000 non-null object
7 employmentTitle 799999 non-null float64
8 employmentLength 753201 non-null object
9 homeOwnership 800000 non-null int64
10 annualIncome 800000 non-null float64
11 verificationStatus 800000 non-null int64
12 issueDate 800000 non-null object
13 isDefault 800000 non-null int64
14 purpose 800000 non-null int64
15 postCode 799999 non-null float64
16 regionCode 800000 non-null int64
17 dti 799761 non-null float64
18 delinquency_2years 800000 non-null float64
19 ficoRangeLow 800000 non-null float64
20 ficoRangeHigh 800000 non-null float64
21 openAcc 800000 non-null float64
22 pubRec 800000 non-null float64
23 pubRecBankruptcies 799595 non-null float64
24 revolBal 800000 non-null float64
25 revolUtil 799469 non-null float64
26 totalAcc 800000 non-null float64
27 initialListStatus 800000 non-null int64
28 applicationType 800000 non-null int64
29 earliesCreditLine 800000 non-null object
30 title 799999 non-null float64
31 policyCode 800000 non-null float64
32 n0 759730 non-null float64
33 n1 759730 non-null float64
34 n2 759730 non-null float64
35 n3 759730 non-null float64
36 n4 766761 non-null float64
37 n5 759730 non-null float64
38 n6 759730 non-null float64
39 n7 759730 non-null float64
40 n8 759729 non-null float64
41 n9 759730 non-null float64
42 n10 766761 non-null float64
43 n11 730248 non-null float64
44 n12 759730 non-null float64
45 n13 759730 non-null float64
46 n14 759730 non-null float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB
numerical_fea = list (data_tr.select_dtypes(exclude=['object' ]).columns)
category_fea = list (filter (lambda x: x not in numerical_fea, list (data_tr.columns))) category_fea
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
查看字段值 0 E
1 D
2 D
3 A
4 C
..
799995 C
799996 A
799997 C
799998 A
799999 B
Name: grade, Length: 800000, dtype: object
data_tr.get('earliesCreditLine' )
0 Aug-2001
1 May-2002
2 May-2006
3 May-1999
4 Aug-1977
...
799995 Aug-2011
799996 May-1989
799997 Jul-2002
799998 Jan-1994
799999 Feb-2002
Name: earliesCreditLine, Length: 800000, dtype: object
0 2014-07-01
1 2012-08-01
2 2015-10-01
3 2015-08-01
4 2016-03-01
...
799995 2016-07-01
799996 2013-04-01
799997 2015-10-01
799998 2015-02-01
799999 2018-08-01
Name: issueDate, Length: 800000, dtype: object
数值型变量分析 数值型变量分析,数值型肯定是包括连续型变量和离散型变量的,找出来然后 划分数值型变量中的连续变量和离散型变量
Pandas nunique() 用于获取唯一值的统计次数。
‘’’ 例子: // 统计“Team”列中不同值的个数,不包括null值 unique_value = data[“Team”].nunique() ‘’’
def get_numerical_serial_fea (data, feas ): numerical_serial_fea = [] numerical_noserial_fea = [] for fea in feas: temp = data.get(fea).nunique() if temp <= 10 : numerical_noserial_fea.append(fea) continue numerical_serial_fea.append(fea) return numerical_serial_fea, numerical_noserial_fea numerical_fea = list (data_tr.select_dtypes(exclude=['object' ]).columns) numerical_serial_fea, numerical_noserial_fea = get_numerical_serial_fea(data_tr, numerical_fea)
['term',
'homeOwnership',
'verificationStatus',
'isDefault',
'initialListStatus',
'applicationType',
'policyCode',
'n11',
'n12']
0 5
1 5
2 5
3 3
4 3
..
799995 3
799996 3
799997 3
799998 3
799999 3
Name: term, Length: 800000, dtype: int64
data_tr['term' ].value_counts()
3 606902
5 193098
Name: term, dtype: int64
data_tr['homeOwnership' ].value_counts()
0 395732
1 317660
2 86309
3 185
5 81
4 33
Name: homeOwnership, dtype: int64
data_tr['policyCode' ].value_counts()
1.0 800000
Name: policyCode, dtype: int64
data_tr['n11' ].value_counts()
0.0 729682
1.0 540
2.0 24
4.0 1
3.0 1
Name: n11, dtype: int64
['id',
'loanAmnt',
'interestRate',
'installment',
'employmentTitle',
'annualIncome',
'purpose',
'postCode',
'regionCode',
'dti',
'delinquency_2years',
'ficoRangeLow',
'ficoRangeHigh',
'openAcc',
'pubRec',
'pubRecBankruptcies',
'revolBal',
'revolUtil',
'totalAcc',
'title',
'n0',
'n1',
'n2',
'n3',
'n4',
'n5',
'n6',
'n7',
'n8',
'n9',
'n10',
'n13',
'n14']
''' pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None) 参数解释: frame: 要处理的数据集。 id_vars: 不需要被转换的列名。 value_vars: 需要转换的列名,如果剩下的列全部都要转换,就不用写了。 var_name和value_name是自定义设置对应的列名。 col_level: 如果列是MultiIndex,则使用此级别。 ''' ''' class seaborn.FacetGrid(data, row=None, col=None, hue=None, col_wrap=None, sharex=True, sharey=True, height=3, aspect=1, palette=None, row_order=None, col_order=None, hue_order=None, hue_kws=None, dropna=True, legend_out=True, despine=True, margin_titles=False, xlim=None, ylim=None, subplot_kws=None, gridspec_kws=None, size=None) Parameters data:DataFrame数据。 row, col, hue:字符串。 定义数据子集的变量,这些变量将在网格的不同方面绘制。 col_wrap:整形数值,可选参数。 以此参数值来限制网格的列维度,以便列面跨越多行。与row面不兼容。 share{x,y}:布尔值,'col' 或 'row'可选 如果为true,则跨列共享y轴或者跨行共享x轴。 height:标量,可选参数。 aspect:标量,可选参数。 每个图片的纵横比,因此aspect * height给出每个图片的宽度,单位为英寸。 palette:调色板名称,列表或字典,可选参数。 ''' f = pd.melt(data_tr, value_vars=numerical_serial_fea) g = sns.FacetGrid(f, col="variable" , col_wrap=2 , sharex=False , sharey=False ) g = g.map (sns.distplot, "value" )
查看某一个数值型变量的分布,查看变量是否符合正态分布,如果不符合正太分布的变量可以log化后再观察下是否符合正态分布。 如果想统一处理一批数据变标准化 必须把这些之前已经正态化的数据提出 正态化的原因:一些情况下正态非正态可以让模型更快的收敛,一些模型要求数据正态(eg. GMM、KNN),保证数据不要过偏态即可,过于偏态可能会影响模型预测结果。 plt.figure(figsize=(16 ,12 )) plt.suptitle('Transaction Values Distribution' , fontsize=22 ) plt.subplot(221 ) sub_plot_1 = sns.distplot(data_tr['loanAmnt' ]) sub_plot_1.set_title("loanAmnt Distribuition" , fontsize=18 ) sub_plot_1.set_xlabel("" ) sub_plot_1.set_ylabel("Probability" , fontsize=15 ) plt.subplot(222 ) sub_plot_2 = sns.distplot(np.log(data_tr['loanAmnt' ])) sub_plot_2.set_title("loanAmnt (Log) Distribuition" , fontsize=18 ) sub_plot_2.set_xlabel("" ) sub_plot_2.set_ylabel("Probability" , fontsize=15 )
Text(0,0.5,'Probability')
非数值类别型变量分析 要映射为数值,不然没办法拉长为一个向量
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
data_tr['grade' ].value_counts()
B 233690
C 227118
A 139661
D 119453
E 55661
F 19053
G 5364
Name: grade, dtype: int64
data_tr['subGrade' ].value_counts()
C1 50763
B4 49516
B5 48965
B3 48600
C2 47068
C3 44751
C4 44272
B2 44227
B1 42382
C5 40264
A5 38045
A4 30928
D1 30538
D2 26528
A1 25909
D3 23410
A3 22655
A2 22124
D4 21139
D5 17838
E1 14064
E2 12746
E3 10925
E4 9273
E5 8653
F1 5925
F2 4340
F3 3577
F4 2859
F5 2352
G1 1759
G2 1231
G3 978
G4 751
G5 645
Name: subGrade, dtype: int64
data_tr['isDefault' ].value_counts()
0 640390
1 159610
Name: isDefault, dtype: int64
总结 上面我们用value_counts()等函数看了特征属性的分布,但是图表是概括原始信息最便捷的方式。 数无形时少直觉。 同一份数据集,在不同的尺度刻画上显示出来的图形反映的规律是不一样的。python将数据转化成图表,但结论是否正确需要由你保证。 变量分布可视化 单一变量分布可视化 plt.figure(figsize=(8 , 8 )) sns.barplot(data_tr["employmentLength" ].value_counts(dropna=False )[:20 ], data_tr["employmentLength" ].value_counts(dropna=False ).keys()[:20 ]) plt.show()
根据y值不同可视化x某个特征的分布 首先查看类别型变量在不同y值上的分布
train_loan_fr = data_tr.loc[data_tr['isDefault' ] == 1 ] train_loan_nofr = data_tr.loc[data_tr['isDefault' ] == 0 ]
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2 , 2 , figsize=(15 , 8 )) train_loan_fr.groupby('grade' )['grade' ].count().plot(kind='barh' , ax=ax1, title='Count of grade fraud' ) train_loan_nofr.groupby('grade' )['grade' ].count().plot(kind='barh' , ax=ax2, title='Count of grade non-fraud' ) train_loan_fr.groupby('employmentLength' )['employmentLength' ].count().plot(kind='barh' , ax=ax3, title='Count of employmentLength fraud' ) train_loan_nofr.groupby('employmentLength' )['employmentLength' ].count().plot(kind='barh' , ax=ax4, title='Count of employmentLength non-fraud' ) plt.show()
其次查看连续型变量在不同y值上的分布
fig, ((ax1, ax2)) = plt.subplots(1 , 2 , figsize=(15 , 6 )) data_tr.loc[data_tr['isDefault' ] == 1 ] \ ['loanAmnt' ].apply(np.log) \ .plot(kind='hist' , bins=100 , title='Log Loan Amt - Fraud' , color='r' , xlim=(-3 , 10 ), ax= ax1) data_tr.loc[data_tr['isDefault' ] == 0 ] \ ['loanAmnt' ].apply(np.log) \ .plot(kind='hist' , bins=100 , title='Log Loan Amt - Not Fraud' , color='b' , xlim=(-3 , 10 ), ax=ax2)
<matplotlib.axes._subplots.AxesSubplot at 0x15ccbd430>
total = len (data_tr) total_amt = data_tr.groupby(['isDefault' ])['loanAmnt' ].sum ().sum () plt.figure(figsize=(12 ,5 )) plt.subplot(121 ) plot_tr = sns.countplot(x='isDefault' ,data=data_tr) plot_tr.set_title("Fraud Loan Distribution \n 0: good user | 1: bad user" , fontsize=14 ) plot_tr.set_xlabel("Is fraud by count" , fontsize=16 ) plot_tr.set_ylabel('Count' , fontsize=16 ) for p in plot_tr.patches: height = p.get_height() plot_tr.text(p.get_x()+p.get_width()/2. , height + 3 , '{:1.2f}%' .format (height/total*100 ), ha="center" , fontsize=15 ) percent_amt = (data_tr.groupby(['isDefault' ])['loanAmnt' ].sum ()) percent_amt = percent_amt.reset_index() plt.subplot(122 ) plot_tr_2 = sns.barplot(x='isDefault' , y='loanAmnt' , dodge=True , data=percent_amt) plot_tr_2.set_title("Total Amount in loanAmnt \n 0: good user | 1: bad user" , fontsize=14 ) plot_tr_2.set_xlabel("Is fraud by percent" , fontsize=16 ) plot_tr_2.set_ylabel('Total Loan Amount Scalar' , fontsize=16 ) for p in plot_tr_2.patches: height = p.get_height() plot_tr_2.text(p.get_x()+p.get_width()/2. , height + 3 , '{:1.2f}%' .format (height/total_amt * 100 ), ha="center" , fontsize=15 )
时间格式数据处理及查看 data_tr['issueDate' ] = pd.to_datetime(data_tr['issueDate' ],format ='%Y-%m-%d' ) startdate = datetime.datetime.strptime('2007-06-01' , '%Y-%m-%d' ) data_tr['issueDateDT' ] = data_tr['issueDate' ].apply(lambda x: x-startdate).dt.days
data_tt['issueDate' ] = pd.to_datetime(data_tr['issueDate' ],format ='%Y-%m-%d' ) startdate = datetime.datetime.strptime('2007-06-01' , '%Y-%m-%d' ) data_tt['issueDateDT' ] = data_tt['issueDate' ].apply(lambda x: x-startdate).dt.days
plt.hist(data_tr['issueDateDT' ], label='train' ); plt.hist(data_tt['issueDateDT' ], label='test' ); plt.legend(); plt.title('Distribution of issueDateDT dates' );
掌握透视图可以让我们更好的了解数据
pivot = pd.pivot_table(data_tr, index=['grade' ], columns=['issueDateDT' ], values=['loanAmnt' ], aggfunc=np.sum ) pivot
loanAmnt issueDateDT 0 30 61 92 122 153 183 214 245 274 ... 3926 3957 3987 4018 4048 4079 4110 4140 4171 4201 grade A NaN 53650.0 42000.0 19500.0 34425.0 63950.0 43500.0 168825.0 85600.0 101825.0 ... 13093850.0 11757325.0 11945975.0 9144000.0 7977650.0 6888900.0 5109800.0 3919275.0 2694025.0 2245625.0 B NaN 13000.0 24000.0 32125.0 7025.0 95750.0 164300.0 303175.0 434425.0 538450.0 ... 16863100.0 17275175.0 16217500.0 11431350.0 8967750.0 7572725.0 4884600.0 4329400.0 3922575.0 3257100.0 C NaN 68750.0 8175.0 10000.0 61800.0 52550.0 175375.0 151100.0 243725.0 393150.0 ... 17502375.0 17471500.0 16111225.0 11973675.0 10184450.0 7765000.0 5354450.0 4552600.0 2870050.0 2246250.0 D NaN NaN 5500.0 2850.0 28625.0 NaN 167975.0 171325.0 192900.0 269325.0 ... 11403075.0 10964150.0 10747675.0 7082050.0 7189625.0 5195700.0 3455175.0 3038500.0 2452375.0 1771750.0 E 7500.0 NaN 10000.0 NaN 17975.0 1500.0 94375.0 116450.0 42000.0 139775.0 ... 3983050.0 3410125.0 3107150.0 2341825.0 2225675.0 1643675.0 1091025.0 1131625.0 883950.0 802425.0 F NaN NaN 31250.0 2125.0 NaN NaN NaN 49000.0 27000.0 43000.0 ... 1074175.0 868925.0 761675.0 685325.0 665750.0 685200.0 316700.0 315075.0 72300.0 NaN G NaN NaN NaN NaN NaN NaN NaN 24625.0 NaN NaN ... 56100.0 243275.0 224825.0 64050.0 198575.0 245825.0 53125.0 23750.0 25100.0 1000.0
7 rows × 139 columns
用pandas_profiling生成数据报告
import pandas_profilingimport matplotlibpfr = pandas_profiling.ProfileReport(data_tr) pfr.to_file("./example.html" )
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render HTML: 0%| | 0/1 [00:00<?, ?it/s]
Export report to file: 0%| | 0/1 [00:00<?, ?it/s]
总结 数据探索性分析是我们初步了解数据,熟悉数据为特征工程做准备的阶段,甚至很多时候EDA阶段提取出来的特征可以直接当作规则来用。可见EDA的重要性,这个阶段的主要工作还是借助于各个简单的统计量来对数据整体的了解,分析各个类型变量相互之间的关系,以及用合适的图形可视化出来直观观察。希望本节内容能给初学者带来帮助,更期待各位学习者对其中的不足提出建议。