Feature Engineering

Adding Polynomial Features

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 50)

import numpy as np

import os
os.chdir(r'D:\Data\Projects\Klassifikation\Heart Disease')  # raw string avoids backslash escapes

import plotly.express as px  # the standalone plotly_express package was merged into plotly

import matplotlib.pyplot as plt
plt.style.use('Solarize_Light2')

import seaborn as sns

from warnings import filterwarnings
filterwarnings('ignore')

from IPython.core.pylabtools import figsize
figsize(10, 10)
plt.rcParams['font.size'] = 15
In [2]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestClassifier

Original DataFrame

In [3]:
df = pd.read_csv('heart.csv')
df.shape
Out[3]:
(303, 14)
In [4]:
df.head()
Out[4]:
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
0 63 1 3 145 233 1 0 150 0 2.3 0 0 1 1
1 37 1 2 130 250 0 1 187 0 3.5 0 0 2 1
2 41 0 1 130 204 0 0 172 0 1.4 2 0 2 1
3 56 1 1 120 236 0 1 178 0 0.8 2 0 2 1
4 57 0 0 120 354 0 1 163 1 0.6 2 0 2 1

Linear Correlation with the Target Variable

In [5]:
df.corr()['target'].sort_values(ascending=False)
Out[5]:
target      1.000000
cp          0.433798
thalach     0.421741
slope       0.345877
restecg     0.137230
fbs        -0.028046
chol       -0.085239
trestbps   -0.144931
age        -0.225439
sex        -0.280937
thal       -0.344029
ca         -0.391724
oldpeak    -0.430696
exang      -0.436757
Name: target, dtype: float64

Feature Importances with the Random Forest Classifier

Fit a random forest on the complete dataset to learn the feature importances.

In [6]:
plt.rcParams['font.size'] = 20
x = df.drop('target', axis=1)
y = df.target

params = {'random_state': 0, 'n_jobs': 4, 'n_estimators': 5000, 'max_depth': 8}

clf = RandomForestClassifier(**params)
clf = clf.fit(x, y)

# Plot feature importances
fi = pd.Series(data=clf.feature_importances_, index=x.columns).sort_values(ascending=False)
plt.figure(figsize=(15,10))
plt.title("Feature Importances", fontsize = 25)
ax = sns.barplot(y=fi.index, x=fi.values, palette="Blues_d", orient='h')
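The manual pick of the six most important features in the next cells can also be scripted from the fitted forest. A minimal sketch on synthetic data (make_classification and k=3 are used here purely for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data standing in for the heart dataset
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
cols = [f'f{i}' for i in range(8)]
X = pd.DataFrame(X, columns=cols)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Importances sum to 1; sort and take the top k feature names
fi = pd.Series(clf.feature_importances_, index=cols).sort_values(ascending=False)
top3 = fi.index[:3].tolist()
```

The resulting list can be fed straight into `df[top3]` as the input to the polynomial expansion, instead of hard-coding the column names.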

DataFrame with Polynomial Features and Interactions from the Most Important Numeric Features

Combining existing variables and raising them to powers can reveal relationships with the target variable that were not apparent before. Even when a single variable shows little correlation with the target, the product of two variables may correlate with it meaningfully. This technique must be applied with care to avoid overfitting, and the rapidly growing number of columns carries the risk of the "curse of dimensionality".

In [7]:
poly_features = df[['cp', 'thalach', 'ca', 'oldpeak', 'thal', 'age']]
poly_transformer = PolynomialFeatures(degree = 3)
In [8]:
# Fit the polynomial transformer
poly_transformer.fit(poly_features)

# Transform the features
poly_features = poly_transformer.transform(poly_features)
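The width of the expanded matrix follows directly from combinatorics: a degree-d expansion of n features yields C(n + d, d) columns, bias term included. For the six features chosen here, degree 3 gives the 84 columns seen below. A quick sanity check:

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A degree-d expansion of n input features yields C(n + d, d) output columns
n, d = 6, 3
print(comb(n + d, d))  # 84

# Cross-check with a toy two-column input: C(2 + 3, 3) = 10 columns
poly = PolynomialFeatures(degree=3)
out = poly.fit_transform(np.ones((1, 2)))
assert out.shape[1] == comb(2 + 3, 3)
```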
In [9]:
cols = poly_transformer.get_feature_names_out(['cp', 'thalach', 'ca', 'oldpeak', 'thal', 'age'])  # get_feature_names() in scikit-learn < 1.0
In [10]:
poly_features = pd.DataFrame(poly_features, columns = cols)
poly_features.head()
Out[10]:
1 cp thalach ca oldpeak thal age cp^2 cp thalach cp ca cp oldpeak cp thal cp age thalach^2 thalach ca thalach oldpeak thalach thal thalach age ca^2 ca oldpeak ca thal ca age oldpeak^2 oldpeak thal oldpeak age ... thalach oldpeak thal thalach oldpeak age thalach thal^2 thalach thal age thalach age^2 ca^3 ca^2 oldpeak ca^2 thal ca^2 age ca oldpeak^2 ca oldpeak thal ca oldpeak age ca thal^2 ca thal age ca age^2 oldpeak^3 oldpeak^2 thal oldpeak^2 age oldpeak thal^2 oldpeak thal age oldpeak age^2 thal^3 thal^2 age thal age^2 age^3
0 1.0 3.0 150.0 0.0 2.3 1.0 63.0 9.0 450.0 0.0 6.9 3.0 189.0 22500.0 0.0 345.0 150.0 9450.0 0.0 0.0 0.0 0.0 5.29 2.3 144.9 ... 345.0 21735.0 150.0 9450.0 595350.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 12.167 5.29 333.27 2.3 144.9 9128.7 1.0 63.0 3969.0 250047.0
1 1.0 2.0 187.0 0.0 3.5 2.0 37.0 4.0 374.0 0.0 7.0 4.0 74.0 34969.0 0.0 654.5 374.0 6919.0 0.0 0.0 0.0 0.0 12.25 7.0 129.5 ... 1309.0 24216.5 748.0 13838.0 256003.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 42.875 24.50 453.25 14.0 259.0 4791.5 8.0 148.0 2738.0 50653.0
2 1.0 1.0 172.0 0.0 1.4 2.0 41.0 1.0 172.0 0.0 1.4 2.0 41.0 29584.0 0.0 240.8 344.0 7052.0 0.0 0.0 0.0 0.0 1.96 2.8 57.4 ... 481.6 9872.8 688.0 14104.0 289132.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.744 3.92 80.36 5.6 114.8 2353.4 8.0 164.0 3362.0 68921.0
3 1.0 1.0 178.0 0.0 0.8 2.0 56.0 1.0 178.0 0.0 0.8 2.0 56.0 31684.0 0.0 142.4 356.0 9968.0 0.0 0.0 0.0 0.0 0.64 1.6 44.8 ... 284.8 7974.4 712.0 19936.0 558208.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.512 1.28 35.84 3.2 89.6 2508.8 8.0 224.0 6272.0 175616.0
4 1.0 0.0 163.0 0.0 0.6 2.0 57.0 0.0 0.0 0.0 0.0 0.0 0.0 26569.0 0.0 97.8 326.0 9291.0 0.0 0.0 0.0 0.0 0.36 1.2 34.2 ... 195.6 5574.6 652.0 18582.0 529587.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.216 0.72 20.52 2.4 68.4 1949.4 8.0 228.0 6498.0 185193.0

5 rows × 84 columns

In [11]:
# Drop the original columns, since they are already present in the original DataFrame
poly_features = poly_features.drop(['cp', 'thalach', 'ca', 'oldpeak', 'thal', 'age'], axis=1)
In [12]:
# Merge with the original DataFrame
df_poly = df.join(poly_features, how='outer')
In [13]:
df_poly.head()
Out[13]:
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target 1 cp^2 cp thalach cp ca cp oldpeak cp thal cp age thalach^2 thalach ca thalach oldpeak thalach thal ... thalach oldpeak thal thalach oldpeak age thalach thal^2 thalach thal age thalach age^2 ca^3 ca^2 oldpeak ca^2 thal ca^2 age ca oldpeak^2 ca oldpeak thal ca oldpeak age ca thal^2 ca thal age ca age^2 oldpeak^3 oldpeak^2 thal oldpeak^2 age oldpeak thal^2 oldpeak thal age oldpeak age^2 thal^3 thal^2 age thal age^2 age^3
0 63 1 3 145 233 1 0 150 0 2.3 0 0 1 1 1.0 9.0 450.0 0.0 6.9 3.0 189.0 22500.0 0.0 345.0 150.0 ... 345.0 21735.0 150.0 9450.0 595350.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 12.167 5.29 333.27 2.3 144.9 9128.7 1.0 63.0 3969.0 250047.0
1 37 1 2 130 250 0 1 187 0 3.5 0 0 2 1 1.0 4.0 374.0 0.0 7.0 4.0 74.0 34969.0 0.0 654.5 374.0 ... 1309.0 24216.5 748.0 13838.0 256003.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 42.875 24.50 453.25 14.0 259.0 4791.5 8.0 148.0 2738.0 50653.0
2 41 0 1 130 204 0 0 172 0 1.4 2 0 2 1 1.0 1.0 172.0 0.0 1.4 2.0 41.0 29584.0 0.0 240.8 344.0 ... 481.6 9872.8 688.0 14104.0 289132.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.744 3.92 80.36 5.6 114.8 2353.4 8.0 164.0 3362.0 68921.0
3 56 1 1 120 236 0 1 178 0 0.8 2 0 2 1 1.0 1.0 178.0 0.0 0.8 2.0 56.0 31684.0 0.0 142.4 356.0 ... 284.8 7974.4 712.0 19936.0 558208.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.512 1.28 35.84 3.2 89.6 2508.8 8.0 224.0 6272.0 175616.0
4 57 0 0 120 354 0 1 163 1 0.6 2 0 2 1 1.0 0.0 0.0 0.0 0.0 0.0 0.0 26569.0 0.0 97.8 326.0 ... 195.6 5574.6 652.0 18582.0 529587.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.216 0.72 20.52 2.4 68.4 1949.4 8.0 228.0 6498.0 185193.0

5 rows × 92 columns

Save the Dataset with Polynomial Features

In [14]:
#df_poly.to_csv('df_poly.csv', sep=',', encoding='utf-8', index=False)