Feature Engineering

Adding Polynomial Features

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 50)

import numpy as np

import os
os.chdir(r'D:\Data\Projects\Klassifikation\Heart Disease')  # raw string avoids backslash escapes

import plotly.express as px  # the standalone plotly_express package was merged into plotly

import matplotlib.pyplot as plt
plt.style.use('Solarize_Light2')

import seaborn as sns

from warnings import filterwarnings
filterwarnings('ignore')

from IPython.core.pylabtools import figsize
figsize(10, 10)
plt.rcParams['font.size'] = 15
In [2]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestClassifier

Original DataFrame

In [3]:
df = pd.read_csv('heart.csv')
df.shape
Out[3]:
(303, 14)
In [4]:
df.head()
Out[4]:
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
0 63 1 3 145 233 1 0 150 0 2.3 0 0 1 1
1 37 1 2 130 250 0 1 187 0 3.5 0 0 2 1
2 41 0 1 130 204 0 0 172 0 1.4 2 0 2 1
3 56 1 1 120 236 0 1 178 0 0.8 2 0 2 1
4 57 0 0 120 354 0 1 163 1 0.6 2 0 2 1

Linear Correlation with the Target Variable

In [5]:
df.corr()['target'].sort_values(ascending=False)
Out[5]:
target      1.000000
cp          0.433798
thalach     0.421741
slope       0.345877
restecg     0.137230
fbs        -0.028046
chol       -0.085239
trestbps   -0.144931
age        -0.225439
sex        -0.280937
thal       -0.344029
ca         -0.391724
oldpeak    -0.430696
exang      -0.436757
Name: target, dtype: float64

Feature Importances with the Random Forest Classifier

Fit a random forest on the complete dataset to learn the feature importances.

In [6]:
plt.rcParams['font.size'] = 20
x = df.drop('target', axis=1)
y = df.target

params = {'random_state': 0, 'n_jobs': 4, 'n_estimators': 5000, 'max_depth': 8}

clf = RandomForestClassifier(**params)
clf = clf.fit(x, y)

# Plot feature importances
fi = pd.Series(data=clf.feature_importances_, index=x.columns).sort_values(ascending=False)
plt.figure(figsize=(15,10))
plt.title("Feature Importances", fontsize = 25)
ax = sns.barplot(y=fi.index, x=fi.values, palette="Blues_d", orient='h')
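The manual pick of the six most important features in the next cells can also be scripted from the fitted forest. A minimal sketch on synthetic data (make_classification and k=3 are used here purely for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data standing in for the heart dataset
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
cols = [f'f{i}' for i in range(8)]
X = pd.DataFrame(X, columns=cols)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Importances sum to 1; sort and take the top k feature names
fi = pd.Series(clf.feature_importances_, index=cols).sort_values(ascending=False)
top3 = fi.index[:3].tolist()
```

The resulting list can be fed straight into `df[top3]` as the input to the polynomial expansion, instead of hard-coding the column names.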

DataFrame with Polynomial Features and Interactions from the Most Important Numeric Features

Combining existing variables and raising them to powers can reveal relationships with the target variable that were not apparent before. Even when a single variable shows little correlation with the target, the product of two variables may correlate with it meaningfully. This technique must be applied with care to avoid overfitting, and the rapidly growing number of columns carries the risk of the "curse of dimensionality".

In [7]:
poly_features = df[['cp', 'thalach', 'ca', 'oldpeak', 'thal', 'age']]
poly_transformer = PolynomialFeatures(degree = 3)
In [8]:
# Fit the polynomial transformer
poly_transformer.fit(poly_features)

# Transform the features
poly_features = poly_transformer.transform(poly_features)
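The width of the expanded matrix follows directly from combinatorics: a degree-d expansion of n features yields C(n + d, d) columns, bias term included. For the six features chosen here, degree 3 gives the 84 columns seen below. A quick sanity check:

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A degree-d expansion of n input features yields C(n + d, d) output columns
n, d = 6, 3
print(comb(n + d, d))  # 84

# Cross-check with a toy two-column input: C(2 + 3, 3) = 10 columns
poly = PolynomialFeatures(degree=3)
out = poly.fit_transform(np.ones((1, 2)))
assert out.shape[1] == comb(2 + 3, 3)
```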
In [9]:
cols = poly_transformer.get_feature_names_out(['cp', 'thalach', 'ca', 'oldpeak', 'thal', 'age'])  # get_feature_names() in scikit-learn < 1.0
In [10]:
poly_features = pd.DataFrame(poly_features, columns = cols)
poly_features.head()
Out[10]:
1 cp thalach ca oldpeak thal age cp^2 cp thalach cp ca cp oldpeak cp thal cp age thalach^2 thalach ca thalach oldpeak thalach thal thalach age ca^2 ca oldpeak ca thal ca age oldpeak^2 oldpeak thal oldpeak age ... thalach oldpeak thal thalach oldpeak age thalach thal^2 thalach thal age thalach age^2 ca^3 ca^2 oldpeak ca^2 thal ca^2 age ca oldpeak^2 ca oldpeak thal ca oldpeak age ca thal^2 ca thal age ca age^2 oldpeak^3 oldpeak^2 thal oldpeak^2 age oldpeak thal^2 oldpeak thal age oldpeak age^2 thal^3 thal^2 age thal age^2 age^3
0 1.0 3.0 150.0 0.0 2.3 1.0 63.0 9.0 450.0 0.0 6.9 3.0 189.0 22500.0 0.0 345.0 150.0 9450.0 0.0 0.0 0.0 0.0 5.29 2.3 144.9 ... 345.0 21735.0 150.0 9450.0 595350.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 12.167 5.29 333.27 2.3 144.9 9128.7 1.0 63.0 3969.0 250047.0
1 1.0 2.0 187.0 0.0 3.5 2.0 37.0 4.0 374.0 0.0 7.0 4.0 74.0 34969.0 0.0 654.5 374.0 6919.0 0.0 0.0 0.0 0.0 12.25 7.0 129.5 ... 1309.0 24216.5 748.0 13838.0 256003.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 42.875 24.50 453.25 14.0 259.0 4791.5 8.0 148.0 2738.0 50653.0
2 1.0 1.0 172.0 0.0 1.4 2.0 41.0 1.0 172.0 0.0 1.4 2.0 41.0 29584.0 0.0 240.8 344.0 7052.0 0.0 0.0 0.0 0.0 1.96 2.8 57.4 ... 481.6 9872.8 688.0 14104.0 289132.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.744 3.92 80.36 5.6 114.8 2353.4 8.0 164.0 3362.0 68921.0
3 1.0 1.0 178.0 0.0 0.8 2.0 56.0 1.0 178.0 0.0 0.8 2.0 56.0 31684.0 0.0 142.4 356.0 9968.0 0.0 0.0 0.0 0.0 0.64 1.6 44.8 ... 284.8 7974.4 712.0 19936.0 558208.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.512 1.28 35.84 3.2 89.6 2508.8 8.0 224.0 6272.0 175616.0
4 1.0 0.0 163.0 0.0 0.6 2.0 57.0 0.0 0.0 0.0 0.0 0.0 0.0 26569.0 0.0 97.8 326.0 9291.0 0.0 0.0 0.0 0.0 0.36 1.2 34.2 ... 195.6 5574.6 652.0 18582.0 529587.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.216 0.72 20.52 2.4 68.4 1949.4 8.0 228.0 6498.0 185193.0

5 rows × 84 columns

In [11]:
# Drop the original columns, since they are already present in the original DataFrame
poly_features = poly_features.drop(['cp', 'thalach', 'ca', 'oldpeak', 'thal', 'age'], axis=1)
In [12]:
# Merge with the original DataFrame
df_poly = df.join(poly_features, how='outer')
In [13]:
df_poly.head()
Out[13]:
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target 1 cp^2 cp thalach cp ca cp oldpeak cp thal cp age thalach^2 thalach ca thalach oldpeak thalach thal ... thalach oldpeak thal thalach oldpeak age thalach thal^2 thalach thal age thalach age^2 ca^3 ca^2 oldpeak ca^2 thal ca^2 age ca oldpeak^2 ca oldpeak thal ca oldpeak age ca thal^2 ca thal age ca age^2 oldpeak^3 oldpeak^2 thal oldpeak^2 age oldpeak thal^2 oldpeak thal age oldpeak age^2 thal^3 thal^2 age thal age^2 age^3
0 63 1 3 145 233 1 0 150 0 2.3 0 0 1 1 1.0 9.0 450.0 0.0 6.9 3.0 189.0 22500.0 0.0 345.0 150.0 ... 345.0 21735.0 150.0 9450.0 595350.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 12.167 5.29 333.27 2.3 144.9 9128.7 1.0 63.0 3969.0 250047.0
1 37 1 2 130 250 0 1 187 0 3.5 0 0 2 1 1.0 4.0 374.0 0.0 7.0 4.0 74.0 34969.0 0.0 654.5 374.0 ... 1309.0 24216.5 748.0 13838.0 256003.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 42.875 24.50 453.25 14.0 259.0 4791.5 8.0 148.0 2738.0 50653.0
2 41 0 1 130 204 0 0 172 0 1.4 2 0 2 1 1.0 1.0 172.0 0.0 1.4 2.0 41.0 29584.0 0.0 240.8 344.0 ... 481.6 9872.8 688.0 14104.0 289132.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.744 3.92 80.36 5.6 114.8 2353.4 8.0 164.0 3362.0 68921.0
3 56 1 1 120 236 0 1 178 0 0.8 2 0 2 1 1.0 1.0 178.0 0.0 0.8 2.0 56.0 31684.0 0.0 142.4 356.0 ... 284.8 7974.4 712.0 19936.0 558208.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.512 1.28 35.84 3.2 89.6 2508.8 8.0 224.0 6272.0 175616.0
4 57 0 0 120 354 0 1 163 1 0.6 2 0 2 1 1.0 0.0 0.0 0.0 0.0 0.0 0.0 26569.0 0.0 97.8 326.0 ... 195.6 5574.6 652.0 18582.0 529587.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.216 0.72 20.52 2.4 68.4 1949.4 8.0 228.0 6498.0 185193.0

5 rows × 92 columns

Save the Dataset with Polynomial Features

In [14]:
#df_poly.to_csv('df_poly.csv', sep=',', encoding='utf-8', index=False)