Since the relationships between the predictors and the price appear to be linear, the Linear Regression algorithm is tested first. Its main advantage is that the model is easy to interpret. The features do, however, need to be scaled first so that the coefficients end up on a comparable scale.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('Solarize_Light2')
import folium

import os
os.chdir(r'D:\Data\Projects\Regression\Taxi Fare Prediction_Linear Regression')  # raw string avoids backslash escapes

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import StandardScaler
In [2]:
df = pd.read_csv('train_clean_features.csv')
df.head()
Out[2]:
fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count year month day hour minute dist
0 4.5 2009-06-15 17:26:21+00:00 -73.844311 40.721319 -73.841610 40.712278 1 2009 6 15 17 26 1.030765
1 16.9 2010-01-05 16:52:16+00:00 -74.016048 40.711303 -73.979268 40.782004 1 2010 1 5 16 52 8.450145
2 5.7 2011-08-18 00:35:00+00:00 -73.982738 40.761270 -73.991242 40.750562 2 2011 8 18 0 35 1.389527
3 7.7 2012-04-21 04:30:42+00:00 -73.987130 40.733143 -73.991567 40.758092 1 2012 4 21 4 30 2.799274
4 5.3 2010-03-09 07:51:00+00:00 -73.968095 40.768008 -73.956655 40.783762 1 2010 3 9 7 51 1.999160

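As a quick visual check of the linearity claim from the introduction, the trip distance can be plotted against the fare on a random sample (a minimal sketch; sampling just keeps the scatter plot readable given the several million rows):

# Scatter plot of distance vs. fare on a random sample
sample = df.sample(5000, random_state=42)
plt.figure(figsize=(8, 6))
plt.scatter(sample['dist'], sample['fare_amount'], s=5, alpha=0.3)
plt.xlabel('dist')
plt.ylabel('fare_amount')
plt.show()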
Train and Test

In [3]:
x = df.drop(['fare_amount', 'pickup_datetime'], axis=1)
y = df.fare_amount
In [4]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[4]:
((3657294, 11), (1219099, 11), (3657294,), (1219099,))

Scaling

In [5]:
# Fit the scaler on the training data only, then apply it to both splits
sc = StandardScaler()
sc.fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
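StandardScaler standardizes each feature to z = (x − mean) / std, with the statistics estimated on the training split only. A quick sanity check (minimal sketch) confirms the transformed training columns have roughly zero mean and unit variance:

# Means should be ~0 and standard deviations ~1 after standardization
print(X_train.mean(axis=0).round(6))
print(X_train.std(axis=0).round(6))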

Defining RMSE

In [15]:
def rmse(pred, true):
    return np.sqrt(((pred - true) ** 2).mean())
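The same value can be obtained from scikit-learn's mean_squared_error by taking the square root, which is a useful cross-check of the hand-rolled definition (minimal sketch):

# Equivalent via scikit-learn: RMSE is the square root of the MSE
def rmse_sklearn(pred, true):
    return np.sqrt(mean_squared_error(true, pred))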

Baseline

In [16]:
from sklearn.dummy import DummyRegressor

dummy = DummyRegressor(strategy='mean')

# Train the model
dummy.fit(X_train, y_train)

# Predict and evaluate; in a notebook only the last expression's value is shown
dummpred = dummy.predict(X_test)
mean_absolute_error(y_test, dummpred)
rmse(y_test, dummpred)
Out[16]:
9.734292771468382
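Since this dummy always predicts the training-set mean, its RMSE should be close to the standard deviation of the test-set fares; checking that (minimal sketch) shows the baseline is essentially just the spread of the target:

# The mean-predictor's RMSE approximates the standard deviation of the target
print(y_test.std())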

Linear Regression

In [6]:
lr = LinearRegression()
lr.fit(X_train, y_train)
Out[6]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [8]:
y_pred = lr.predict(X_test)
In [17]:
mean_absolute_error(y_test, y_pred)  # value not displayed; only the last expression is shown
rmse(y_test, y_pred)
Out[17]:
5.425372571151715

Better than the baseline!

In [19]:
lr.intercept_
Out[19]:
11.351723205736699
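Because the features were standardized to zero mean, the fitted intercept should equal the mean fare in the training set; this can be verified directly (minimal sketch):

# With zero-mean features the OLS intercept equals the mean of the target
print(y_train.mean())  # expected to match lr.intercept_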
In [34]:
coeff = pd.DataFrame({'Feature': x.columns, 'Coeff': list(lr.coef_)})
In [38]:
coeff = coeff.sort_values('Coeff', ascending=False)
coeff
Out[38]:
Feature Coeff
10 dist 7.418728
5 year 0.999735
0 pickup_longitude 0.540838
2 dropoff_longitude 0.312643
6 month 0.256595
8 hour 0.063478
4 passenger_count 0.037717
7 day 0.012083
9 minute -0.017158
3 dropoff_latitude -0.463891
1 pickup_latitude -0.578175
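Since all predictors are on the same standardized scale, the coefficients are directly comparable, and dist clearly dominates. A horizontal bar plot (minimal sketch) makes this easier to see:

# Bar plot of the standardized coefficients, largest effect on top
plt.figure(figsize=(8, 6))
sns.barplot(x='Coeff', y='Feature', data=coeff)
plt.show()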
In [39]:
import statsmodels.api as sm
In [41]:
model = sm.OLS(y_train, X_train)
results = model.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            fare_amount   R-squared:                       0.289
Model:                            OLS   Adj. R-squared:                  0.289
Method:                 Least Squares   F-statistic:                 1.353e+05
Date:                Fri, 05 Jul 2019   Prob (F-statistic):               0.00
Time:                        13:15:16   Log-Likelihood:            -1.4454e+07
No. Observations:             3657294   AIC:                         2.891e+07
Df Residuals:                 3657283   BIC:                         2.891e+07
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.5408      0.008     69.531      0.000       0.526       0.556
x2            -0.5782      0.008    -75.781      0.000      -0.593      -0.563
x3             0.3126      0.008     41.244      0.000       0.298       0.328
x4            -0.4639      0.008    -60.163      0.000      -0.479      -0.449
x5             0.0377      0.007      5.726      0.000       0.025       0.051
x6             0.9997      0.007    150.662      0.000       0.987       1.013
x7             0.2566      0.007     38.675      0.000       0.244       0.270
x8             0.0121      0.007      1.834      0.067      -0.001       0.025
x9             0.0635      0.007      9.615      0.000       0.051       0.076
x10           -0.0172      0.007     -2.605      0.009      -0.030      -0.004
x11            7.4187      0.008    951.748      0.000       7.403       7.434
==============================================================================
Omnibus:                  3494350.798   Durbin-Watson:                   0.375
Prob(Omnibus):                  0.000   Jarque-Bera (JB):      61565815951.408
Skew:                           2.989   Prob(JB):                         0.00
Kurtosis:                     638.589   Cond. No.                         1.96
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
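One caveat: sm.OLS does not add an intercept automatically, so the summary above is for a model without a constant term (which also changes how R-squared is computed). To include an intercept, the design matrix can be augmented with sm.add_constant (minimal sketch):

# Add an explicit constant column so the OLS model fits an intercept
X_train_const = sm.add_constant(X_train)
results_const = sm.OLS(y_train, X_train_const).fit()
print(results_const.summary())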

Improving the prediction with the Random Forest Regressor

In [42]:
rf = RandomForestRegressor(n_estimators=10)  # explicit value matching the default used for the run below
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
rmse(y_test, y_pred_rf)
Out[42]:
3.819548577248603

The Random Forest Regressor is a little better still.

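Analogous to the regression coefficients above, the fitted forest exposes feature_importances_, which can be inspected to see which predictors drive the splits (minimal sketch):

# Impurity-based feature importances of the fitted forest
importances = pd.DataFrame({'Feature': x.columns, 'Importance': rf.feature_importances_})
print(importances.sort_values('Importance', ascending=False))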
In [45]:
# Collection of the fitted sub-estimators
rf.estimators_
Out[45]:
[DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       presort=False, random_state=557674527, splitter='best'),
 ...,
 DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       presort=False, random_state=1537777518, splitter='best')]

(10 DecisionTreeRegressor entries in total, differing only in random_state)
In [47]:
rf.decision_path
Out[47]:
<bound method BaseForest.decision_path of RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)>
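Note that rf.decision_path without parentheses merely returns the bound method, which is what the repr above shows. To actually trace which nodes a sample passes through, the method has to be called with data (minimal sketch):

# Sparse indicator matrix of the nodes each sample visits, per tree
indicator, n_nodes_ptr = rf.decision_path(X_test[:1])
print(indicator.shape)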

Plotting the true and predicted prices

Put y_test and y_pred_rf into a DataFrame
Plot the first 100 values with seaborn

The order of the elements in a list is fixed

In [57]:
prices = pd.DataFrame({'Real': y_test[:100], 'Predicted': list(y_pred_rf)[:100]})
In [61]:
plt.figure(figsize=(10,8))
sns.kdeplot(prices.Real)
sns.kdeplot(prices.Predicted);

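An alternative view (minimal sketch) is a scatter plot of predicted against true fares, where a perfect model would put every point on the diagonal:

# Predicted vs. true fares; the diagonal marks perfect predictions
plt.figure(figsize=(8, 8))
plt.scatter(prices.Real, prices.Predicted, alpha=0.5)
lims = [0, max(prices.Real.max(), prices.Predicted.max())]
plt.plot(lims, lims, 'r--')
plt.xlabel('Real')
plt.ylabel('Predicted')
plt.show()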
Saving the model

Saving the trained Random Forest model to disk with pickle

In [43]:
import pickle
In [48]:
# Save the trained model to disk
filename = 'finalized_rf.sav'
with open(filename, 'wb') as f:
    pickle.dump(rf, f)
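To reuse the model later, it can be loaded back the same way (a minimal sketch of the standard pickle round trip):

# Load the persisted model and check that it still predicts
with open(filename, 'rb') as f:
    loaded_rf = pickle.load(f)
loaded_rf.predict(X_test[:5])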