
This article walks through the code accompanying the paper "Causal Analysis Churn".


Pipeline:

  • 1. Import the datasets
  • 2. Concatenate the first two datasets
  • 3. Feature engineering:
    • 3.1 Create a copy and drop duplicate rows
    • 3.2 Identify categorical and numerical columns
    • 3.3 Drop highly correlated features
    • 3.4 Preprocessing: one-hot encode the categorical features
    • 3.5 Split into feature and label dataframes
  • 4. Model evaluation
    • 4.1 Undersample the training data
    • 4.2 Evaluation metrics
  • 5. Ensemble learning
    • 5.1 Build an ensemble classifier - hard voting
    • 5.2 Build an ensemble classifier - soft voting
  • 6. Feature selection - random forest variable importance
    • 6.1 Train a random forest classifier
    • 6.2 Plot the top 180 features by importance
    • 6.3 Store the top 100 features in a list and keep only those features in the training data
    • 6.4 Retrain on the reduced dataset and record the evaluation metrics
  • 7. Feature selection - recursive feature elimination
    • 7.1 Oversample to balance the classes
    • 7.2 Define the feature selector and run the selection
    • 7.3 Keep the selected features
    • 7.4 Create the metric lists and add the classifiers to the models list
    • 7.5 Train the models
  • 8. Ensembles over simple models
    • 8.1 Simple voting
    • 8.2 Weighted voting
    • 8.3 Weighted voting 2
  • 9. Deep-learning-based ensemble
    • 9.1 Scale the training and test features
    • 9.2 Define the deep learning models: ANN1, ANN2
    • 9.3 Train the models
    • 9.4 Print the model evaluation metrics
  • 10. ROC AUC curves
    • 10.1 Define the models
    • 10.2 Train the models
    • 10.3 Define the roc_auc plotting function
    • 10.4 Show the plot
  • 11. Precision-recall curves
    • 11.1 Define the function
    • 11.2 Plot
  • 12. DoWhy causal attribution analysis
  • 13. Model interpretability
    • 13.1 eli5 feature importance
    • 13.2 PDP (partial dependence plots)
    • 13.3 SHAP

1. Import the datasets

```python
import pandas as pd

# Two observation windows (features) and one outcome window (labels)
observation_window1 = pd.read_csv('new_data/features_201506.csv')
observation_window2 = pd.read_csv('new_data/features_201512.csv')
outcome_window = pd.read_csv('new_data/features_201606.csv')
```

2. Concatenate the first two datasets, dropping the observation windows' churn column and attaching the outcome window's churn column

```python
# Concatenate the two observation windows
# axis=0 stacks vertically, axis=1 would join horizontally
# ignore_index=True discards the original indices so the new index is contiguous
features = pd.concat([observation_window1, observation_window2], axis=0, ignore_index=True)
```

Drop the churn column

```python
# Drop customers with a balance below 1500 or a tenure below 6 months;
# this filtering improves prediction quality
features = features[features['acc_balance'] > 1500]
features = features[features['acc_tenure'] > 6]
features = features.reset_index(drop=True)  # reset the index
features.drop(['churn'], axis=1, inplace=True)  # drop the churn column: 88 - 1 = 87 columns remain
print('Shape of Features : ', features.shape)
```

Attach the outcome window's churn column

```python
# Keep only the id and churn columns of outcome_window, then merge it with the combined features.
# features has fewer ids and no churn column, so the merge attaches
# outcome_window's churn value to every matching id
outcome_window = outcome_window[['new_id', 'churn']]
df = pd.merge(features, outcome_window, on='new_id')
```

3. Feature engineering

3.1 Create a copy and drop duplicate rows

```python
# Create a copy of the merged dataset and check for duplicate rows
# df is the original; finalDF is the working copy
finalDF = df.copy()
finalDF = finalDF.drop_duplicates()
print("Shape of Combined Dataframe : ", finalDF.shape)

# Identify duplicates by id only, rather than by all attributes
# keep='last' keeps the last occurrence of each duplicate
# inplace=True writes the result back to the dataframe
finalDF.drop_duplicates(subset='new_id', keep='last', inplace=True, ignore_index=False)
print("Shape of Combined Dataframe : ", finalDF.shape)
```

3.2 Identify categorical and numerical columns

Display the uniqueness of each column: the unique values and missing-value counts of every column in df

```python
def summarize_categoricals(df, show_levels=False):
    # For each column record its unique values, the number of unique values,
    # and the number of missing values
    data = [[df[c].unique(), len(df[c].unique()), df[c].isnull().sum()] for c in df.columns]
    df_temp = pd.DataFrame(data, index=df.columns,
                           columns=['Levels', 'No. of Levels', 'No. of Missing Values'])
    return df_temp.iloc[:, 0 if show_levels else 1:]  # iloc gives flexible integer-position indexing
```

cutoff is the threshold for deciding that a column is categorical: if a column has at most `cutoff` unique values (10 by default), it is treated as categorical and its name is added to the returned list. A good categorical column should have relatively few unique values.

```python
def find_categorical(df, cutoff=10):
    # Columns with at most `cutoff` unique values are treated as categorical
    cat_cols = []
    for col in df.columns:
        if len(df[col].unique()) <= cutoff:
            cat_cols.append(col)
    return cat_cols
```

Convert the specified columns to the categorical dtype. In pandas, the categorical dtype represents categorical data and is more efficient than the plain object dtype, especially in memory usage.

```python
def to_categorical(columns, df):
    # Cast each listed column to pandas' categorical dtype
    for col in columns:
        df[col] = df[col].astype('category')
    return df
```

💡 Summary: first use summarize_categoricals to count each feature's unique and missing values, then use those counts to decide automatically which features are categorical, and finally convert the detected categorical features to the categorical dtype.


Execution:

Find the categorical columns

```python
categoricals = find_categorical(finalDF, cutoff=12)
```

Find the numerical columns (see the dtype-conversion sketch after this block)

```python
# Everything that is not categorical is numerical
# (the original also added set(categoricals) - set(finalDF.columns), which is always empty)
numericals = list(set(finalDF.columns.tolist()) - set(categoricals))
```
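
Note that the post never shows a call to to_categorical, yet section 3.5 later selects columns with select_dtypes(include='category'), which requires the categorical dtype. Presumably the helpers are chained roughly like this (a sketch):

```python
# Inspect unique/missing counts, then cast the detected categorical columns
print(summarize_categoricals(finalDF, show_levels=False))
finalDF = to_categorical(categoricals, finalDF)
```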

3.3 Drop highly correlated features

Objective: remove feature columns whose pairwise correlation with another feature is too high

Inputs:

  • x: the dataframe
  • threshold: features with a pairwise correlation above this value are dropped

Output: the dataframe

```python
def remove_collinear_features(x, threshold=0.99):
    # Correlation matrix of the features
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)  # number of columns minus 1, i.e. the largest valid index
    drop_cols = []  # columns to drop

    # Iterate through the correlation matrix and compare correlations
    for i in iters:
        for j in range(i + 1):
            # The slice [j:j+1, i+1:i+2] selects the single correlation between columns j and i+1
            item = corr_matrix.iloc[j:(j + 1), (i + 1):(i + 2)]
            col = item.columns   # column name of this correlation (an Index object)
            row = item.index     # row name of this correlation (also an Index object)
            val = abs(item.values)  # absolute value of the correlation, an ndarray

            # If the correlation exceeds the threshold
            if val >= threshold:
                # col.values holds a single element but is an ndarray, so index [0]
                # to get the string; append that column name to drop_cols
                print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(col.values[0])

    # Drop the collected columns
    drops = set(drop_cols)
    x = x.drop(columns=drops)

    return x
```

Execution

```python
# Remove (and print) features with pairwise correlation above 0.9
finalDF = remove_collinear_features(finalDF, threshold=0.9)
```

3.4 Preprocessing: one-hot encode the categorical features

One-hot encoding

```python
def one_hot_encoding(df, cols):
    """
    @param df pandas DataFrame
    @param cols a list of columns to encode
    @return a DataFrame with one-hot encoding
    """
    for each in cols:  # iterate over the columns to encode
        dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)  # one-hot encode the column
        df = pd.concat([df, dummies], axis=1)  # join the dummy columns onto the original frame
        df = df.drop(each, axis=1)  # drop the original column (the positional axis argument is removed in pandas 2.x)
    return df
```

Normalization

```python
def normalize(df, cols):
    """
    @param df pandas DataFrame
    @param cols a list of columns to normalize
    @return a DataFrame with the specified features min-max scaled
    """
    result = df.copy()  # the original df is not modified
    for feature_name in cols:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        if max_value > min_value:  # guard against division by zero for constant columns
            result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result
```
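
normalize is defined but never called in the post; if it were used, a call over the numerical columns from section 3.2 would look like this (a sketch):

```python
# Hypothetical usage: min-max scale the numerical columns of finalDF
finalDF = normalize(finalDF, numericals)
```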

Execution: one-hot encode the categorical features

```python
finalDF = one_hot_encoding(finalDF, categoricals)
```

3.5 Split into feature and label dataframes

Standardize the numerical features

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
numeric_cols = list(finalDF.dtypes[finalDF.dtypes != 'object'].index)  # all non-object columns
finalDF.loc[:, numeric_cols] = scaler.fit_transform(finalDF.loc[:, numeric_cols])  # scale to mean 0, std 1
```
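
The variables x and y used below are never defined in the post; presumably they are split from finalDF, with churn as the label. A minimal sketch of that missing step:

```python
# Hypothetical reconstruction of the feature/label split (not shown in the post)
x = finalDF.drop(columns=['churn'])
y = finalDF['churn']
```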

Create lists of categorical and numerical features

```python
categorical_columns = list(x.select_dtypes(include='category').columns)
numeric_columns = list(x.select_dtypes(exclude='category').columns)
```

Split the dataset

```python
from sklearn.model_selection import train_test_split

data_splits = train_test_split(x, y, test_size=0.2, random_state=0, shuffle=True)
x_train, x_test, y_train, y_test = data_splits
```

4. Model evaluation

4.1 Undersample the training data

Undersampling removes part of the training data so that the class counts become balanced.

```python
from imblearn.under_sampling import RandomUnderSampler

# Note: the notebook switches to uppercase X names from here on
X_train, y_train = RandomUnderSampler().fit_resample(x_train, y_train)
X_test = x_test
Y_train = y_train.copy()
```
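
A quick sanity check that the two classes are now balanced (a sketch):

```python
# After undersampling, both classes should have the same number of rows
print(pd.Series(y_train).value_counts())
```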

4.2 Evaluation metrics

Define several lists to store the different evaluation metrics

```python
train_accuracy = []  # training-set accuracy
test_accuracy = []   # test-set accuracy
precision = []       # precision
recall = []          # recall
f1 = []              # F1 score
cohen_kappa = []     # Cohen's kappa
models = ["Naive Bayes", "Logistic Regression", "Decision Tree", "RandomForest",
          "AdaBoost", "ExtraTrees", "GradientBoosting", "XGboost"]
roc = []             # AUC of the ROC curve
mathew = []          # Matthews correlation coefficient
random_state = 2     # random seed
classifiers = []
```

Add the classifiers

```python
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

classifiers.append(BernoulliNB())
classifiers.append(LogisticRegression())
classifiers.append(DecisionTreeClassifier())
classifiers.append(RandomForestClassifier(random_state=random_state, max_depth=10, max_features='sqrt', n_estimators=300))
classifiers.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state), random_state=random_state, learning_rate=0.5))
classifiers.append(ExtraTreesClassifier(random_state=random_state, criterion='entropy', max_features='sqrt', min_samples_leaf=20, min_samples_split=15))
classifiers.append(GradientBoostingClassifier(random_state=random_state, learning_rate=0.2, max_depth=10, n_estimators=200))
classifiers.append(XGBClassifier(random_state=random_state))
```

Train each classification model and compute the key evaluation metrics

```python
from sklearn import metrics
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             cohen_kappa_score, classification_report, confusion_matrix)

for classifier, model in zip(classifiers, models):
    print('=' * len(model))
    print(model)
    print('=' * len(model))
    classifier.fit(X_train, y_train)                 # train the classifier
    trainprediction = classifier.predict(X_train)
    prediction = classifier.predict(X_test)          # predict on the held-out data

    trainaccuracy = accuracy_score(y_train, trainprediction)  # accuracy
    testaccuracy = accuracy_score(y_test, prediction)
    train_accuracy.append(trainaccuracy)
    test_accuracy.append(testaccuracy)

    precision.append(precision_score(y_test, prediction, average='macro'))  # precision
    recall.append(recall_score(y_test, prediction, average='macro'))        # recall
    cohen_kappa.append(cohen_kappa_score(y_test, prediction))               # kappa
    f1.append(f1_score(y_test, prediction, average='macro'))                # F1
    roc.append(metrics.roc_auc_score(y_test, prediction))                   # ROC AUC

    mathew.append(metrics.matthews_corrcoef(y_test, prediction))            # Matthews

    print('\n classification report:\n', classification_report(y_test, prediction))
    print('\n confusion matrix:\n', confusion_matrix(y_test, prediction))
    print('\n')
```

5. Ensemble learning

5.1 Build the ensemble classifier (hard voting) and store its metrics

```python
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

ensemble = VotingClassifier(estimators=[('Logistic Regression', LogisticRegression(random_state=random_state)),
                                        ('Naive Bayes', GaussianNB()),
                                        ('RF', RandomForestClassifier(random_state=random_state)),
                                        ('KNN', KNeighborsClassifier()),
                                        ('Decision Tree', DecisionTreeClassifier(random_state=random_state))],
                            voting='hard').fit(X_train, y_train)
```

Predict with the ensemble and store its metrics

```python
y_train_ensemble = ensemble.predict(X_train)
y_pred_ensemble = ensemble.predict(X_test)

trainaccuracy = accuracy_score(y_train, y_train_ensemble)
testaccuracy = accuracy_score(y_test, y_pred_ensemble)
train_accuracy.append(trainaccuracy)
test_accuracy.append(testaccuracy)

precision.append(precision_score(y_test, y_pred_ensemble, average='macro'))
recall.append(recall_score(y_test, y_pred_ensemble, average='macro'))
cohen_kappa.append(cohen_kappa_score(y_test, y_pred_ensemble))
f1.append(f1_score(y_test, y_pred_ensemble, average='macro'))
roc.append(metrics.roc_auc_score(y_test, y_pred_ensemble))
mathew.append(metrics.matthews_corrcoef(y_test, y_pred_ensemble))
```

5.2 Build the ensemble classifier (soft voting) and store its metrics

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Unlike the first ensemble, this one uses soft voting
ensemble2 = VotingClassifier(estimators=[('Logistic Regression', LogisticRegression(random_state=random_state)),
                                         ('Naive Bayes', GaussianNB()),
                                         ('RF', RandomForestClassifier(random_state=random_state)),
                                         ('KNN', KNeighborsClassifier()),
                                         ('Decision Tree', DecisionTreeClassifier(random_state=random_state))],
                             voting='soft').fit(X_train, y_train)

y_train_ensemble2 = ensemble2.predict(X_train)
y_pred_ensemble2 = ensemble2.predict(X_test)

trainaccuracy = accuracy_score(y_train, y_train_ensemble2)
testaccuracy = accuracy_score(y_test, y_pred_ensemble2)
train_accuracy.append(trainaccuracy)
test_accuracy.append(testaccuracy)

precision.append(precision_score(y_test, y_pred_ensemble2, average='macro'))
recall.append(recall_score(y_test, y_pred_ensemble2, average='macro'))
cohen_kappa.append(cohen_kappa_score(y_test, y_pred_ensemble2))
f1.append(f1_score(y_test, y_pred_ensemble2, average='macro'))
roc.append(metrics.roc_auc_score(y_test, y_pred_ensemble2))
mathew.append(metrics.matthews_corrcoef(y_test, y_pred_ensemble2))

print('\n Accuracy Score:\n', np.round(accuracy_score(y_test, y_pred_ensemble2), 3))
sns.heatmap(confusion_matrix(y_test, y_pred_ensemble2), annot=True, fmt='2.0f')
print('\n classification report:\n', classification_report(y_test, y_pred_ensemble2))
print('\n Cohen Kappa Score:\n', metrics.cohen_kappa_score(y_test, y_pred_ensemble2))
plt.show()
```

6. Feature selection - random forest variable importance

6.1 Train a random forest classifier

```python
# Initialize a random forest classifier with 500 trees
clf = RandomForestClassifier(n_estimators=500, max_depth=15, random_state=0)
# Fit the model on the training set
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)
```

6.2 Plot the top 180 features by importance

```python
# Create a Series of the random forest's feature importance scores
feat_importances = pd.Series(clf.feature_importances_, index=X_train.columns)

# Plot the 180 most important features
feat_importances.nlargest(180).plot(kind='barh')
plt.title('Correlation by Weights')
plt.yticks(fontsize=12)
plt.show()
```

6.3 Store the 100 most important features in a list, then keep only those features in the training data

```python
# Store the names of the 100 most important features in a list
features_to_select = feat_importances.nlargest(100).index.tolist()

X_train = X_train[features_to_select]
X_test = X_test[features_to_select]
```

6.4 Retrain on the reduced dataset and record the evaluation metrics

Reset the metric lists and rebuild the classifier list

```python
train_accuracy = []
test_accuracy = []
precision = []
recall = []
f1 = []
cohen_kappa = []
models = ["Naive Bayes", "Logistic Regression", "Decision Tree", "RandomForest",
          "AdaBoost", "ExtraTrees", "GradientBoosting", "XGboost"]
roc = []
mathew = []
random_state = 2
classifiers = []
classifiers.append(BernoulliNB())
classifiers.append(LogisticRegression())
classifiers.append(DecisionTreeClassifier())
# max_features='auto' was removed in scikit-learn 1.3; 'sqrt' is the equivalent for classifiers
classifiers.append(RandomForestClassifier(random_state=random_state, max_depth=15, max_features='sqrt', n_estimators=500))
classifiers.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state), random_state=random_state, learning_rate=0.5))
classifiers.append(ExtraTreesClassifier(random_state=random_state, criterion='gini', max_features='sqrt', min_samples_leaf=20, min_samples_split=15))
classifiers.append(GradientBoostingClassifier(random_state=random_state, learning_rate=0.01, max_depth=15, n_estimators=500))
classifiers.append(XGBClassifier(random_state=random_state))
```

Train the models and store the metrics in the lists

```python
for classifier, model in zip(classifiers, models):
    print('=' * len(model))
    print(model)
    print('=' * len(model))

    classifier.fit(X_train, y_train)

    trainprediction = classifier.predict(X_train)
    prediction = classifier.predict(X_test)

    trainaccuracy = accuracy_score(y_train, trainprediction)
    testaccuracy = accuracy_score(y_test, prediction)
    train_accuracy.append(trainaccuracy)
    test_accuracy.append(testaccuracy)

    precision.append(precision_score(y_test, prediction))
    recall.append(recall_score(y_test, prediction))
    cohen_kappa.append(cohen_kappa_score(y_test, prediction))
    f1.append(f1_score(y_test, prediction))
    roc.append(metrics.roc_auc_score(y_test, prediction))
    mathew.append(metrics.matthews_corrcoef(y_test, prediction))

    print('\n classification report:\n', classification_report(y_test, prediction))
    print('\n confusion matrix:\n', confusion_matrix(y_test, prediction))
    print('\n')
```

Collect the metrics into a summary dataframe

```python
scoreDF = pd.DataFrame({'Model': models})
scoreDF['Train Accuracy'] = train_accuracy
scoreDF['Test Accuracy'] = test_accuracy
scoreDF['Precision'] = precision
scoreDF['Recall'] = recall
scoreDF['F1 Score'] = f1
scoreDF['AUC Score'] = roc
scoreDF['Matthew Correlation Coefficient'] = mathew
scoreDF['Cohen Kappa Score'] = cohen_kappa
```
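
To compare the models at a glance, the dataframe can then be ranked by test AUC (a usage sketch):

```python
# Rank the models by AUC on the test set
print(scoreDF.sort_values('AUC Score', ascending=False).to_string(index=False))
```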

7. Feature selection - recursive feature elimination (an alternative to the approach in section 6)

7.1 Oversample to balance the classes

```python
from imblearn.over_sampling import SMOTE

# sampling_strategy='minority' oversamples only the minority class
X_train, y_train = SMOTE(sampling_strategy='minority').fit_resample(X_train, y_train)
Y_train = y_train.copy()

print('Train data shape: {}'.format(X_train.shape))
print('Test data shape: {}'.format(X_test.shape))
```

7.2 Define the feature selector and run the selection

```python
from sklearn.feature_selection import RFE

# Define the feature selector: RFE drops 10 features per step until 100 remain
rfe_selector = RFE(estimator=GradientBoostingClassifier(), n_features_to_select=100, step=10, verbose=3)
rfe_selector.fit(X_train, y_train)

rfe_support = rfe_selector.get_support()  # boolean mask of the selected features
rfe_feature = X_train.loc[:, rfe_support].columns.tolist()  # names of the selected features
print(str(len(rfe_feature)), 'selected features')
```

7.3 Keep the selected features

```python
X_train = X_train[rfe_feature]
X_test = X_test[rfe_feature]
```

7.4 Create the metric lists and add the classifiers to the models list

```python
train_accuracy = []
test_accuracy = []
precision = []
recall = []
f1 = []
cohen_kappa = []
models = ["Naive Bayes", "Logistic Regression", "Decision Tree", "RandomForest",
          "AdaBoost", "ExtraTrees", "GradientBoosting", "XGboost"]
roc = []
mathew = []
random_state = 2
classifiers = []
classifiers.append(BernoulliNB())
classifiers.append(LogisticRegression())
classifiers.append(DecisionTreeClassifier())
# max_features='auto' was removed in scikit-learn 1.3; 'sqrt' is the equivalent for classifiers
classifiers.append(RandomForestClassifier(random_state=random_state, max_depth=15, max_features='sqrt', n_estimators=500))
classifiers.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state), random_state=random_state, learning_rate=0.5))
classifiers.append(ExtraTreesClassifier(random_state=random_state, criterion='gini', max_features='sqrt', min_samples_leaf=20, min_samples_split=15))
classifiers.append(GradientBoostingClassifier(random_state=random_state, learning_rate=0.01, max_depth=15, n_estimators=500))
classifiers.append(XGBClassifier(random_state=random_state))
```

7.5 Train the models

```python
for classifier, model in zip(classifiers, models):
    print('=' * len(model))
    print(model)
    print('=' * len(model))

    classifier.fit(X_train, y_train)

    trainprediction = classifier.predict(X_train)
    prediction = classifier.predict(X_test)

    trainaccuracy = accuracy_score(y_train, trainprediction)
    testaccuracy = accuracy_score(y_test, prediction)
    train_accuracy.append(trainaccuracy)
    test_accuracy.append(testaccuracy)

    precision.append(precision_score(y_test, prediction))
    recall.append(recall_score(y_test, prediction))
    cohen_kappa.append(cohen_kappa_score(y_test, prediction))
    f1.append(f1_score(y_test, prediction))
    roc.append(metrics.roc_auc_score(y_test, prediction))
    mathew.append(metrics.matthews_corrcoef(y_test, prediction))

    print('\n classification report:\n', classification_report(y_test, prediction))
    print('\n confusion matrix:\n', confusion_matrix(y_test, prediction))
    print('\n')

scoreDF = pd.DataFrame({'Model': models})
scoreDF['Train Accuracy'] = train_accuracy
scoreDF['Test Accuracy'] = test_accuracy
scoreDF['Precision'] = precision
scoreDF['Recall'] = recall
scoreDF['F1 Score'] = f1
scoreDF['AUC Score'] = roc
scoreDF['Matthew Correlation Coefficient'] = mathew
scoreDF['Cohen Kappa Score'] = cohen_kappa
```

8. Ensembles over simple models

8.1 Simple voting

Train the classifiers

```python
from sklearn.ensemble import BaggingClassifier

LR = LogisticRegression(random_state=random_state, n_jobs=-1)
NB = BernoulliNB()
KNN = KNeighborsClassifier(n_jobs=-1)
DT = DecisionTreeClassifier(random_state=random_state)
RF = RandomForestClassifier(random_state=random_state)
BG = BaggingClassifier()
XGB = XGBClassifier()
ADA = AdaBoostClassifier()
GBM = GradientBoostingClassifier()
ET = ExtraTreesClassifier()

LR.fit(X_train, y_train)
NB.fit(X_train, y_train)
RF.fit(X_train, y_train)
KNN.fit(X_train, y_train)
DT.fit(X_train, y_train)
BG.fit(X_train, y_train)
XGB.fit(X_train, y_train)
ADA.fit(X_train, y_train)
GBM.fit(X_train, y_train)
ET.fit(X_train, y_train)
```

Make predictions

```python
LR_pred = LR.predict(X_test)
NB_pred = NB.predict(X_test)
RF_pred = RF.predict(X_test)
KNN_pred = KNN.predict(X_test)
DT_pred = DT.predict(X_test)
BG_pred = BG.predict(X_test)
# Fixed: the original reused BG.predict for the next four predictions
ADA_pred = ADA.predict(X_test)
XGB_pred = XGB.predict(X_test)
ET_pred = ET.predict(X_test)
GBM_pred = GBM.predict(X_test)
```

Combine the predictions by averaging (see the note after the block)

```python
# Integer division of the summed 0/1 predictions by 10: this yields 1 only
# when all ten models vote 1, i.e. a unanimous rather than a majority vote
averaged_preds = (LR_pred + NB_pred + RF_pred + KNN_pred + DT_pred
                  + BG_pred + ADA_pred + XGB_pred + ET_pred + GBM_pred) // 10
acc = accuracy_score(y_test, averaged_preds)
print('\n Accuracy Score:\n', np.round(acc, 3))

print('\n classification report:\n', classification_report(y_test, averaged_preds))
print('\n confusion matrix:\n', metrics.confusion_matrix(y_test, averaged_preds))
print('\n Cohen Kappa Score:\n', metrics.cohen_kappa_score(y_test, averaged_preds))
```
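
Note that integer division by 10 over ten 0/1 predictions returns 1 only on a unanimous vote. A true majority vote would threshold the vote count instead (a sketch):

```python
# Majority vote: predict 1 when at least five of the ten models vote 1
vote_sum = (LR_pred + NB_pred + RF_pred + KNN_pred + DT_pred
            + BG_pred + ADA_pred + XGB_pred + ET_pred + GBM_pred)
majority_preds = (vote_sum >= 5).astype(int)
```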

8.2 Weighted voting

Define the ensemble classifier

```python
ensemble = VotingClassifier(estimators=[('Logistic Regression', LogisticRegression(random_state=random_state)),
                                        ('Naive Bayes', GaussianNB()),
                                        ('RF', RandomForestClassifier(random_state=random_state)),
                                        ('KNN', KNeighborsClassifier()),
                                        ('Decision Tree', DecisionTreeClassifier(random_state=random_state)),
                                        ('Bagging Classifier', BaggingClassifier(random_state=random_state)),
                                        ('GBM', GradientBoostingClassifier()),
                                        ('ADA', AdaBoostClassifier(random_state=random_state)),
                                        ('ET', ExtraTreesClassifier()),
                                        ('XGB', XGBClassifier(random_state=random_state))],
                            voting='hard').fit(X_train, y_train)

y_pred_ensemble = ensemble.predict(X_test)
```

Compute the evaluation metrics

```python
y_train_ensemble = ensemble.predict(X_train)
y_pred_ensemble = ensemble.predict(X_test)

trainaccuracy = accuracy_score(y_train, y_train_ensemble)
testaccuracy = accuracy_score(y_test, y_pred_ensemble)
train_accuracy.append(trainaccuracy)
test_accuracy.append(testaccuracy)

precision.append(precision_score(y_test, y_pred_ensemble, average='macro'))
recall.append(recall_score(y_test, y_pred_ensemble, average='macro'))
cohen_kappa.append(cohen_kappa_score(y_test, y_pred_ensemble))
f1.append(f1_score(y_test, y_pred_ensemble, average='macro'))
roc.append(metrics.roc_auc_score(y_test, y_pred_ensemble))
mathew.append(metrics.matthews_corrcoef(y_test, y_pred_ensemble))

print('\n Accuracy Score:\n', np.round(accuracy_score(y_test, y_pred_ensemble), 3))
sns.heatmap(confusion_matrix(y_test, y_pred_ensemble), annot=True, fmt='2.0f')
print('\n classification report:\n', classification_report(y_test, y_pred_ensemble))
print('\n Cohen Kappa Score:\n', metrics.cohen_kappa_score(y_test, y_pred_ensemble))

plt.show()
```

8.3 Weighted voting 2

Define and train the ensemble classifier

```python
ensemble = VotingClassifier(estimators=[('Logistic Regression', LogisticRegression(random_state=random_state)),
                                        ('Naive Bayes', GaussianNB()),
                                        ('RF', RandomForestClassifier(random_state=random_state)),
                                        ('KNN', KNeighborsClassifier()),
                                        ('Decision Tree', DecisionTreeClassifier(random_state=random_state)),
                                        ('Bagging Classifier', BaggingClassifier(random_state=random_state)),
                                        ('GBM', GradientBoostingClassifier()),
                                        ('ADA', AdaBoostClassifier(random_state=random_state)),
                                        ('ET', ExtraTreesClassifier()),
                                        ('XGB', XGBClassifier(random_state=random_state))],
                            voting='soft').fit(X_train, y_train)

y_train_ensemble = ensemble.predict(X_train)
y_pred_ensemble = ensemble.predict(X_test)
```

Compute the evaluation metrics

```python
trainaccuracy = accuracy_score(y_train, y_train_ensemble)
testaccuracy = accuracy_score(y_test, y_pred_ensemble)
train_accuracy.append(trainaccuracy)
test_accuracy.append(testaccuracy)

precision.append(precision_score(y_test, y_pred_ensemble, average='macro'))
recall.append(recall_score(y_test, y_pred_ensemble, average='macro'))
cohen_kappa.append(cohen_kappa_score(y_test, y_pred_ensemble))
f1.append(f1_score(y_test, y_pred_ensemble, average='macro'))
roc.append(metrics.roc_auc_score(y_test, y_pred_ensemble))
mathew.append(metrics.matthews_corrcoef(y_test, y_pred_ensemble))

print('\n Accuracy Score:\n', np.round(accuracy_score(y_test, y_pred_ensemble), 3))
sns.heatmap(confusion_matrix(y_test, y_pred_ensemble), annot=True, fmt='2.0f')
print('\n classification report:\n', classification_report(y_test, y_pred_ensemble))
print('\n Cohen Kappa Score:\n', metrics.cohen_kappa_score(y_test, y_pred_ensemble))
plt.show()
```

9. Deep-learning-based ensemble

9.1 Scale the training and test features

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
X_train = scaler.fit_transform(X_train)  # fit the scaler on the training data, then scale it to [0, 1]
X_test = scaler.transform(X_test)        # reuse the training min/max on the test data
```

9.2 Define the deep learning models: ANN1, ANN2

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam

def model_ANN1(input_shape=X_train.shape[1], num_classes=2):
    model = Sequential()

    model.add(Dense(128, activation='tanh', input_dim=X_train.shape[1]))
    model.add(Dropout(0.2))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(32, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(16, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(8, activation='relu'))
    model.add(Dropout(0.2))
    # A single sigmoid-activated output neuron for binary classification
    model.add(Dense(1, activation="sigmoid"))
    # Compile the model with loss and metrics
    model.compile(optimizer=Adam(learning_rate=0.00001, decay=1e-5),
                  loss="binary_crossentropy", metrics=["accuracy"])

    return model
```
```python
def model_ANN2(input_shape=X_train.shape[1], num_classes=2):
    model = Sequential()
    model.add(Dense(128, activation='tanh', input_dim=X_train.shape[1]))
    model.add(Dropout(0.4))
    model.add(Dense(32, activation='relu'))
    model.add(Dropout(0.4))
    model.add(Dense(16, activation='relu'))
    model.add(Dropout(0.4))
    model.add(Dense(8, activation='relu'))
    model.add(Dropout(0.4))
    model.add(Dense(4, activation='relu'))
    model.add(Dropout(0.4))

    # A single sigmoid-activated output neuron for binary classification
    model.add(Dense(1, activation="sigmoid"))
    # Compile the model with loss and metrics
    model.compile(optimizer=Adam(learning_rate=0.00001, decay=1e-5),
                  loss="binary_crossentropy", metrics=["accuracy"])

    return model
```

9.3 Train the models

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Build the two networks defined above; `model` is the list the loop indexes into
model = [model_ANN1(), model_ANN2()]

# Note: this reuses the name `models`, shadowing the earlier list of model names
models = []
for i in range(len(model)):
    model[i].fit(X_train, y_train, batch_size=512,
                 epochs=100,
                 validation_data=(X_test, y_test),
                 callbacks=[ReduceLROnPlateau(monitor='loss', patience=3, factor=0.1)],
                 verbose=2)
    models.append(model[i])
```
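
Section 9.4 prints metric variables (train_acc, acc, prec, rec, f1, roc, mathew, ck) whose computation the post omits. A plausible reconstruction, averaging the two networks' predicted probabilities and thresholding at 0.5 (note that these scalar names shadow the metric lists used earlier):

```python
# Hypothetical reconstruction of the ANN-ensemble metrics used in 9.4
ann_test = np.mean([m.predict(X_test).ravel() for m in models], axis=0)
ann_train = np.mean([m.predict(X_train).ravel() for m in models], axis=0)
pred_test = (ann_test >= 0.5).astype(int)
pred_train = (ann_train >= 0.5).astype(int)

train_acc = accuracy_score(y_train, pred_train)
acc = accuracy_score(y_test, pred_test)
prec = precision_score(y_test, pred_test, average='macro')
rec = recall_score(y_test, pred_test, average='macro')
f1 = f1_score(y_test, pred_test, average='macro')
roc = metrics.roc_auc_score(y_test, ann_test)
mathew = metrics.matthews_corrcoef(y_test, pred_test)
ck = cohen_kappa_score(y_test, pred_test)
```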


9.4 Print the model evaluation metrics

```python
# train_acc, acc, prec, rec, f1, roc, mathew and ck hold the metrics of the ANN
# ensemble (their computation is not shown in the post; see the sketch above)
model_results = pd.DataFrame([['ANN Ensemble Classifier', train_acc, acc, prec, rec, f1, roc, mathew, ck]],
                             columns=['Model', 'Train Accuracy', 'Test Accuracy', 'Precision', 'Recall',
                                      'F1 Score', 'AUC Score', 'Matthew Correlation Coefficient',
                                      'Cohen Kappa Score'])

model_results = model_results.set_index('Model')
model_results.index.name = None
model_results
```

10. ROC AUC curves

10.1 Define the models

Same as above.

10.2 Train the models

Same as above.

10.3 Define the roc_auc plotting function

```python
def roc_auc_plot(y_true, y_proba, label=' ', l='-', lw=1.0):
    # Draws one ROC curve onto the global axes `ax`
    from sklearn.metrics import roc_curve, roc_auc_score
    fpr, tpr, _ = roc_curve(y_true, y_proba[:, 1])
    ax.plot(fpr, tpr, linestyle=l, linewidth=lw,
            label="%s (area=%.3f)" % (label, roc_auc_score(y_true, y_proba[:, 1])))
```

10.4 Show the plot

```python
fig, ax = plt.subplots(figsize=(10, 8))  # the global axes used by roc_auc_plot

roc_auc_plot(y_test, LR.predict_proba(X_test), label='Logistic Regression Classifier', l='-')
roc_auc_plot(y_test, NB.predict_proba(X_test), label='Naive Bayes Classifier', l='-')
roc_auc_plot(y_test, DT.predict_proba(X_test), label='Decision Tree Classifier', l='-')
roc_auc_plot(y_test, RF.predict_proba(X_test), label='Random Forest Classifier', l='-')
roc_auc_plot(y_test, KNN.predict_proba(X_test), label='KNN Classifier', l='-')

roc_auc_plot(y_test, BG.predict_proba(X_test), label='Bagging Classifier', l='-')
roc_auc_plot(y_test, XGB.predict_proba(X_test), label='XGBoost', l='-')
roc_auc_plot(y_test, ADA.predict_proba(X_test), label='AdaBoost Classifier', l='-')
roc_auc_plot(y_test, GBM.predict_proba(X_test), label='Gradient Boosting Machine Classifier', l='-')
roc_auc_plot(y_test, ET.predict_proba(X_test), label='Extra Trees Classifier', l='-')
roc_auc_plot(y_test, ensemble2.predict_proba(X_test), label='Ensemble Classifier (Soft Voting)', l='-')
# The hard-voting ensemble does not expose predict_proba, so the original reuses ensemble2 here
roc_auc_plot(y_test, ensemble2.predict_proba(X_test), label='Ensemble Classifier (Hard Voting)', l='-')

# The ANN outputs a single probability column; stack it into two columns so y_proba[:, 1] works
ann_proba = model[0].predict(X_test)
roc_auc_plot(y_test, np.hstack([1 - ann_proba, ann_proba]), label='ANN Classifier', l='-')

ax.legend()
plt.show()
```

11. Precision-recall curves

11.1 Define the function

```python
def precision_recall_plot(y_true, y_proba, label=' ', l='-', lw=1.0):
    # Draws one precision-recall curve onto the global axes `ax`
    from sklearn.metrics import precision_recall_curve, average_precision_score
    precision, recall, _ = precision_recall_curve(y_true, y_proba[:, 1])  # fixed: was y_test
    average_precision = average_precision_score(y_true, y_proba[:, 1],
                                                average="micro")  # micro-averaged average precision
    ax.plot(recall, precision, label='%s (average=%.3f)' % (label, average_precision),
            linestyle=l, linewidth=lw)
```

11.2 Plot

```python
fig, ax = plt.subplots(figsize=(10, 8))  # the global axes used by precision_recall_plot

precision_recall_plot(y_test, LR.predict_proba(X_test), label='Logistic Regression Classifier', l='-')
precision_recall_plot(y_test, NB.predict_proba(X_test), label='Naive Bayes Classifier', l='-')
precision_recall_plot(y_test, DT.predict_proba(X_test), label='Decision Tree Classifier', l='-')
precision_recall_plot(y_test, RF.predict_proba(X_test), label='Random Forest Classifier', l='-')
precision_recall_plot(y_test, KNN.predict_proba(X_test), label='KNN Classifier', l='-')

precision_recall_plot(y_test, BG.predict_proba(X_test), label='Bagging Classifier', l='-')
precision_recall_plot(y_test, XGB.predict_proba(X_test), label='XGBoost', l='-')
precision_recall_plot(y_test, ADA.predict_proba(X_test), label='AdaBoost Classifier', l='-')
precision_recall_plot(y_test, GBM.predict_proba(X_test), label='Gradient Boosting Machine Classifier', l='-')
precision_recall_plot(y_test, ET.predict_proba(X_test), label='Extra Trees Classifier', l='-')
precision_recall_plot(y_test, ensemble2.predict_proba(X_test), label='Ensemble Classifier (Soft Voting)', l='-')
# The hard-voting ensemble does not expose predict_proba, so the original reuses ensemble2 here
precision_recall_plot(y_test, ensemble2.predict_proba(X_test), label='Ensemble Classifier (Hard Voting)', l='-')

# As in 10.4, stack the ANN's single probability column into two columns
ann_proba = model[0].predict(X_test)
precision_recall_plot(y_test, np.hstack([1 - ann_proba, ann_proba]), label='ANN Classifier', l='-')

ax.legend()
plt.show()
```

12. DoWhy causal attribution analysis
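
The post leaves this section empty. For orientation, a minimal DoWhy sketch: the treatment and confounder column names below are illustrative placeholders, not the ones the paper actually uses.

```python
from dowhy import CausalModel

# Hypothetical example: 'has_email_0' as treatment and two assumed confounders
causal_model = CausalModel(data=finalDF,
                           treatment='has_email_0',
                           outcome='churn',
                           common_causes=['acc_tenure', 'acc_balance'])

# Identify the causal effect from the assumed graph, then estimate it
identified_estimand = causal_model.identify_effect(proceed_when_unidentifiable=True)
estimate = causal_model.estimate_effect(identified_estimand,
                                        method_name='backdoor.propensity_score_matching')
print(estimate.value)
```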

13. Model interpretability

13.1 eli5 feature importance

```python
import eli5
from eli5.sklearn import PermutationImportance

# final_model is the fitted classifier chosen earlier (its selection is not shown in the post)
perm = PermutationImportance(final_model, random_state=1).fit(X_test, y_test)
eli5.show_weights(perm, feature_names=X_test.columns.tolist())
```

13.2 PDP (partial dependence plots)

```python
# Feature names to plot
features = ['acc_tenure', 'cust_tenure', 'acc_balance_change_amount', 'fund_performance',
            'acc_balance_change_ratio', 'insurance_recency', 'account_growth', 'has_mobile_0',
            'num_contacts_0', 'returned_mail_count_0', 'login_recency', 'num_accounts_1.0',
            'promotional_pref_M', 'postcode_change_changed', 'has_email_0', 'stmt_pref_M',
            'account_growth_change', 'home_tel_change_changed', 'email_change_same',
            'stmt_pref_change_same']

from pdpbox import pdp, get_dataset, info_plots

# tree_model is a fitted tree-based classifier (its training is not shown in the post)
for feature_name in features:
    # Compute the partial dependence of the prediction on this feature
    pdp_goals = pdp.pdp_isolate(model=tree_model, dataset=X_test,
                                model_features=X_train.columns.tolist(), feature=feature_name)

    # Plot it
    pdp.pdp_plot(pdp_goals, feature_name)
    plt.show()
```

13.3 SHAP

SHAP summary plot over the test set

```python
import shap

# Create an object that can calculate SHAP values
explainer = shap.TreeExplainer(tree_model)

# Calculate shap_values for all of X_test rather than a single row,
# so the plot has more data behind it
shap_values = explainer.shap_values(X_test)

# For a binary tree classifier shap_values is a list of two arrays;
# index [1] selects the SHAP values of the positive (churn) class
shap.summary_plot(shap_values[1], X_test)
```

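
The original also had a heading promising a single-example SHAP plot, but its code block was a verbatim duplicate of the summary plot above. For an individual customer, a force plot would look like this (a sketch using the same explainer):

```python
# Hypothetical: explain the prediction for one test-set row
shap.initjs()
row = X_test.iloc[[0]]
shap.force_plot(explainer.expected_value[1], explainer.shap_values(row)[1], row)
```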