Since the relationships between the predictors and the price appear to be linear, the Linear Regression algorithm is tested first. Its main advantage is that the model is easy to interpret. The features do, however, need to be scaled first so that the coefficients end up on a comparable scale.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('Solarize_Light2')
import folium

import os
os.chdir(r'D:\Data\Projects\Regression\Taxi Fare Prediction_Linear Regression')  # raw string avoids backslash escapes

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import StandardScaler
In [2]:
df = pd.read_csv('train_clean_features.csv')
df.head()
Out[2]:
fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count year month day hour minute dist
0 4.5 2009-06-15 17:26:21+00:00 -73.844311 40.721319 -73.841610 40.712278 1 2009 6 15 17 26 1.030765
1 16.9 2010-01-05 16:52:16+00:00 -74.016048 40.711303 -73.979268 40.782004 1 2010 1 5 16 52 8.450145
2 5.7 2011-08-18 00:35:00+00:00 -73.982738 40.761270 -73.991242 40.750562 2 2011 8 18 0 35 1.389527
3 7.7 2012-04-21 04:30:42+00:00 -73.987130 40.733143 -73.991567 40.758092 1 2012 4 21 4 30 2.799274
4 5.3 2010-03-09 07:51:00+00:00 -73.968095 40.768008 -73.956655 40.783762 1 2010 3 9 7 51 1.999160

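As a quick visual check of the linearity claim from the introduction, the trip distance can be plotted against the fare on a random sample (a minimal sketch; sampling just keeps the scatter plot readable given the several million rows):

# Scatter plot of distance vs. fare on a random sample
sample = df.sample(5000, random_state=42)
plt.figure(figsize=(8, 6))
plt.scatter(sample['dist'], sample['fare_amount'], s=5, alpha=0.3)
plt.xlabel('dist')
plt.ylabel('fare_amount')
plt.show()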
Train and Test

In [3]:
x = df.drop(['fare_amount', 'pickup_datetime'], axis=1)
y = df.fare_amount
In [4]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[4]:
((3657294, 11), (1219099, 11), (3657294,), (1219099,))

Scaling

In [5]:
# Fit the scaler on the training data only, then apply it to both splits
sc = StandardScaler()
sc.fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
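StandardScaler standardizes each feature to z = (x − mean) / std, with the statistics estimated on the training split only. A quick sanity check (minimal sketch) confirms the transformed training columns have roughly zero mean and unit variance:

# Means should be ~0 and standard deviations ~1 after standardization
print(X_train.mean(axis=0).round(6))
print(X_train.std(axis=0).round(6))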

Defining RMSE

In [15]:
def rmse(pred, true):
    return np.sqrt(((pred - true) ** 2).mean())
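The same value can be obtained from scikit-learn's mean_squared_error by taking the square root, which is a useful cross-check of the hand-rolled definition (minimal sketch):

# Equivalent via scikit-learn: RMSE is the square root of the MSE
def rmse_sklearn(pred, true):
    return np.sqrt(mean_squared_error(true, pred))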

Baseline

In [16]:
from sklearn.dummy import DummyRegressor

dummy = DummyRegressor(strategy='mean')

# Train the model
dummy.fit(X_train, y_train)

# Predict and evaluate; in a notebook only the last expression's value is shown
dummpred = dummy.predict(X_test)
mean_absolute_error(y_test, dummpred)
rmse(y_test, dummpred)
Out[16]:
9.734292771468382
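Since this dummy always predicts the training-set mean, its RMSE should be close to the standard deviation of the test-set fares; checking that (minimal sketch) shows the baseline is essentially just the spread of the target:

# The mean-predictor's RMSE approximates the standard deviation of the target
print(y_test.std())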

Linear Regression

In [6]:
lr = LinearRegression()
lr.fit(X_train, y_train)
Out[6]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [8]:
y_pred = lr.predict(X_test)
In [17]:
mean_absolute_error(y_test, y_pred)  # value not displayed; only the last expression is shown
rmse(y_test, y_pred)
Out[17]:
5.425372571151715

Better than the baseline!

In [19]:
lr.intercept_
Out[19]:
11.351723205736699
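Because the features were standardized to zero mean, the fitted intercept should equal the mean fare in the training set; this can be verified directly (minimal sketch):

# With zero-mean features the OLS intercept equals the mean of the target
print(y_train.mean())  # expected to match lr.intercept_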
In [34]:
coeff = pd.DataFrame({'Feature': x.columns, 'Coeff': list(lr.coef_)})
In [38]:
coeff = coeff.sort_values('Coeff', ascending=False)
coeff
Out[38]:
Feature Coeff
10 dist 7.418728
5 year 0.999735
0 pickup_longitude 0.540838
2 dropoff_longitude 0.312643
6 month 0.256595
8 hour 0.063478
4 passenger_count 0.037717
7 day 0.012083
9 minute -0.017158
3 dropoff_latitude -0.463891
1 pickup_latitude -0.578175
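Since all predictors are on the same standardized scale, the coefficients are directly comparable, and dist clearly dominates. A horizontal bar plot (minimal sketch) makes this easier to see:

# Bar plot of the standardized coefficients, largest effect on top
plt.figure(figsize=(8, 6))
sns.barplot(x='Coeff', y='Feature', data=coeff)
plt.show()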
In [39]:
import statsmodels.api as sm
In [41]:
model = sm.OLS(y_train, X_train)
results = model.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            fare_amount   R-squared:                       0.289
Model:                            OLS   Adj. R-squared:                  0.289
Method:                 Least Squares   F-statistic:                 1.353e+05
Date:                Fri, 05 Jul 2019   Prob (F-statistic):               0.00
Time:                        13:15:16   Log-Likelihood:            -1.4454e+07
No. Observations:             3657294   AIC:                         2.891e+07
Df Residuals:                 3657283   BIC:                         2.891e+07
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.5408      0.008     69.531      0.000       0.526       0.556
x2            -0.5782      0.008    -75.781      0.000      -0.593      -0.563
x3             0.3126      0.008     41.244      0.000       0.298       0.328
x4            -0.4639      0.008    -60.163      0.000      -0.479      -0.449
x5             0.0377      0.007      5.726      0.000       0.025       0.051
x6             0.9997      0.007    150.662      0.000       0.987       1.013
x7             0.2566      0.007     38.675      0.000       0.244       0.270
x8             0.0121      0.007      1.834      0.067      -0.001       0.025
x9             0.0635      0.007      9.615      0.000       0.051       0.076
x10           -0.0172      0.007     -2.605      0.009      -0.030      -0.004
x11            7.4187      0.008    951.748      0.000       7.403       7.434
==============================================================================
Omnibus:                  3494350.798   Durbin-Watson:                   0.375
Prob(Omnibus):                  0.000   Jarque-Bera (JB):      61565815951.408
Skew:                           2.989   Prob(JB):                         0.00
Kurtosis:                     638.589   Cond. No.                         1.96
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
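One caveat: sm.OLS does not add an intercept automatically, so the summary above is for a model without a constant term (which also changes how R-squared is computed). To include an intercept, the design matrix can be augmented with sm.add_constant (minimal sketch):

# Add an explicit constant column so the OLS model fits an intercept
X_train_const = sm.add_constant(X_train)
results_const = sm.OLS(y_train, X_train_const).fit()
print(results_const.summary())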

Improving the prediction with the Random Forest Regressor

In [42]:
rf = RandomForestRegressor(n_estimators=10)  # explicit value matching the default used for the run below
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
rmse(y_test, y_pred_rf)
Out[42]:
3.819548577248603

The Random Forest Regressor is a little better still.

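Analogous to the regression coefficients above, the fitted forest exposes feature_importances_, which can be inspected to see which predictors drive the splits (minimal sketch):

# Impurity-based feature importances of the fitted forest
importances = pd.DataFrame({'Feature': x.columns, 'Importance': rf.feature_importances_})
print(importances.sort_values('Importance', ascending=False))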
In [45]:
# Collection of the fitted sub-estimators
rf.estimators_
Out[45]:
[DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       presort=False, random_state=557674527, splitter='best'),
 ...,
 DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       presort=False, random_state=1537777518, splitter='best')]

(10 DecisionTreeRegressor entries in total, differing only in random_state)
In [47]:
rf.decision_path
Out[47]:
<bound method BaseForest.decision_path of RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)>
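Note that rf.decision_path without parentheses merely returns the bound method, which is what the repr above shows. To actually trace which nodes a sample passes through, the method has to be called with data (minimal sketch):

# Sparse indicator matrix of the nodes each sample visits, per tree
indicator, n_nodes_ptr = rf.decision_path(X_test[:1])
print(indicator.shape)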

Plotting the true and predicted prices

Put y_test and y_pred_rf into a DataFrame
Plot the first 100 values with seaborn

The order of the elements in a list is fixed

In [57]:
prices = pd.DataFrame({'Real': y_test[:100], 'Predicted': list(y_pred_rf)[:100]})
In [61]:
plt.figure(figsize=(10,8))
sns.kdeplot(prices.Real)
sns.kdeplot(prices.Predicted);

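An alternative view (minimal sketch) is a scatter plot of predicted against true fares, where a perfect model would put every point on the diagonal:

# Predicted vs. true fares; the diagonal marks perfect predictions
plt.figure(figsize=(8, 8))
plt.scatter(prices.Real, prices.Predicted, alpha=0.5)
lims = [0, max(prices.Real.max(), prices.Predicted.max())]
plt.plot(lims, lims, 'r--')
plt.xlabel('Real')
plt.ylabel('Predicted')
plt.show()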
Saving the model

Saving the trained Random Forest model to disk with pickle

In [43]:
import pickle
In [48]:
# Save the trained model to disk
filename = 'finalized_rf.sav'
with open(filename, 'wb') as f:
    pickle.dump(rf, f)
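To reuse the model later, it can be loaded back the same way (a minimal sketch of the standard pickle round trip):

# Load the persisted model and check that it still predicts
with open(filename, 'rb') as f:
    loaded_rf = pickle.load(f)
loaded_rf.predict(X_test[:5])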