This kernel will be an exploratory data analysis on the BigQuery-Geotab Intersection Congestion competition data.
Geotab provides a wide variety of aggregate datasets gathered from commercial vehicle telematics devices. Harnessing the insights from this data has the power to improve safety, optimize operations, and identify opportunities for infrastructure challenges.
We have a regression problem for which we must predict 6 target values: three statistics: the 20th, 50th, and 80th percentiles, for each of two metrics: the total time a vehicle stopped at an intersection, and the distance between the intersection and the first place a vehicle stopped while waiting,for each observation in the test set. The given data consists of aggregated trip logging metrics from commercial vehicles, such as semi-trucks. The data have been grouped by intersection, month, hour of day, direction driven through the intersection, and whether the day was on a weekend or not.

Image Source: https://www.geotab.com/blog/traffic-congestion/
Importing libraries..
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.style as style
style.use('seaborn-bright')
import matplotlib.pyplot as plt
import seaborn as sns
import tqdm
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
print(os.path.join(dirname, filename))
import matplotlib.gridspec as gridspec
Loading data..
train = pd.read_csv('../input/bigquery-geotab-intersection-congestion/train.csv')
test = pd.read_csv('../input/bigquery-geotab-intersection-congestion/test.csv')
mergedData = train.merge(test, 'outer')
train.head()
test.head()
mergedData.info()
Data features:
print(f'Train dataset has {train.shape[0]} rows and {train.shape[1]} columns.')
print(f'Test dataset has {test.shape[0]} rows and {test.shape[1]} columns.')
We have two moderately sized datasets, with the number of observations in the test set being notably higher than that of the train set. The following features are the difference between the two sets:
[col for col in train.columns if col not in test.columns]
The test set contains the same columns from the training set except for the above features. The task is to predict the 20th, 50th, and 80th percentiles of the variables TotalTimeStopped and DistanceToFirstStop. It seems that additional percentile statistics, as well as a TimeFromFirstStop metric have been provided to assist with model building.
Examine the columns that have missing data:
print('Print names of columns in train set with null values, along with number of null values:')
trainNull = [col for col in train.columns if train[col].isnull().any()]
trainNullD = {}
for col in trainNull:
trainNullD.update({col: ['Number missing: ' + str(train[col].isnull().sum()),
'Proportion missing: ' + str(round(train[col].isnull().sum() / train.shape[0],3))]})
for a,b in trainNullD.items():
print(a)
print(b)
print('Print names of columns in test set with null values, along with number of null values:')
testNull = [col for col in test.columns if test[col].isnull().any()]
testNullD = {}
for col in testNull:
testNullD.update({col: ['Number missing: ' + str(test[col].isnull().sum()),
'Proportion missing: ' + str(round(test[col].isnull().sum() / test.shape[0],3))]})
for a,b in testNullD.items():
print(a)
print(b)
The columns with missing data are the same in both train and test sets: EntryStreetName and ExitStreetName. It seems that most columns are without any missing data. For the two that do have missing data, one strategy would be to fill them with their respective modes, or some other constant using the fillna method. But since these columns only have a very small proportion missing, simply removing those observations should also be a safe strategy.
First, let's see how each of the given statistics are distributed.
#Create lists of columns for each group of statistics
TotalTimeStoppedCols = list(train.loc[:,'TotalTimeStopped_p20':'TotalTimeStopped_p80'].columns)
DistanceToFirstStop = list(train.loc[:,'DistanceToFirstStop_p20':'DistanceToFirstStop_p80'].columns)
TimeFromFirstStopCols = list(train.loc[:,'TimeFromFirstStop_p20':'TimeFromFirstStop_p80'].columns)
f1, axes = plt.subplots(3,5, figsize=(20, 10), sharex=False)
for i, col in enumerate(TotalTimeStoppedCols):
sns.distplot(train[col], kde=False, ax=axes[0,i], label = col, color = 'r')
for i, col in enumerate(DistanceToFirstStop):
sns.distplot(train[col], kde=False, ax=axes[1,i], label = col, color = 'g')
for i, col in enumerate(TimeFromFirstStopCols):
sns.distplot(train[col], kde=False, ax=axes[2,i], label = col, color = 'b')
plt.tight_layout()
For each of the three statistics, the number of samples contained in each percentile visibly increases as the histograms are read from left to right, as expected. Let's zoom in on the highest available percentile for each, which encompasses all available data.
#lastPercentiles = ['TotalTimeStopped_p80', 'DistanceToFirstStop_p80', 'TimeFromFirstStop_p80']
f2, axes2 = plt.subplots(1,3, figsize=(20, 8), sharex=False)
sns.distplot(train['TotalTimeStopped_p80'], kde=False, ax=axes2[0], color = 'r' )
sns.distplot(train['DistanceToFirstStop_p80'], kde=False, ax=axes2[1], color = 'g' )
sns.distplot(train['TimeFromFirstStop_p80'], kde=False, ax=axes2[2], color = 'b' )
The data is skewed towards zero for all three statistics. This implies that cars are most likely to pass through the intersection without stopping, out of all possible stopping times, and higher stopping times are increasingly unlikely, which makes sense. TotalTimeStopped by itself should be enough to provide a good idea of how heavy the congestion is. For each congestion data aggregate (each row), a low value of TotalTimeStopped_p80 should thus be representative of normal or good traffic, while a high value would represent slow traffic. As a simple test, the average 80th percentile of total stop time is 24.83 seconds on weekdays, and 18.05 seconds on weekends. This makes sense as most people are more likely to work on weekdays only and rest on weekends, leading to busier streets on weekdays due to commuting to/from work, etc.
mergedData.groupby('Weekend')['TotalTimeStopped_p80'].mean()
From prior driving experience, one can recall that it would take on average more time to make a left/right turn than to drive straight through an intersection. This makes sense since oncoming cars in the opposite lane have the right of way, and one would have to wait for them to pass before turning, leading to blocking/slowing down the cars behind you as well. Let's see if the data agrees with us by creating an 'isSameDirection' variable, which is true if the entry direction and exit direction are the same and false otherwise:
#%%time
train['isSameDirection'] = train['EntryHeading'] == train['ExitHeading']
f3, axes3 = plt.subplots(1,2, figsize=(14, 6))
bar = sns.barplot(x='isSameDirection', y='TotalTimeStopped_p80', data = train, palette = 'rocket',
ax = axes3[0] )
axes3[0].set_title('Barplot of average stop time vs. isSameDirection')
axes3[0].set_xlabel('isSameDirection')
axes3[0].set_ylabel('Avg. of TotalTimeStopped_p80')
strip = sns.stripplot(x='isSameDirection', y='TotalTimeStopped_p80', data = train, palette = 'rocket',
jitter = 0.5, alpha = 0.3, ax = axes3[1])
axt = axes3[1].twinx()
vio = sns.violinplot(x='isSameDirection', y='TotalTimeStopped_p80', data = train, palette = 'rocket',
ax = axt)
axes3[1].set_title('Stripplot of TotalTimeStopped_p80 vs. isSameDirection')
axes3[1].set_xlabel('isSameDirection')
axes3[1].set_ylabel('TotalTimeStopped_p80')
print(train.groupby('isSameDirection')['TotalTimeStopped_p80'].agg('mean'))
print(train['isSameDirection'].value_counts(normalize=True))
It seems that cars approaching intersections drive straight through with a probability of roughly 0.70, and turn in a different direction with a probability of 0.30 (This distribution is very nearly the same for the test set as well).
The barplot supports our earlier guess: that cars travelling through intersections without changing direction would, on average, have lower total stop time.
isSameDirection should be a feature worth keeping for model training. Taking this idea further, it might be worth investigating the same metric for all combinations of entry and exit direction from (N, W, E, S, NW, SW, SE, NE). Let's take a look at the counts of all the entry-exit combinations:
#column 'Path' has extraneous information (i.e. street names), we only want path direction
train['pathDirectionOnly'] = train['EntryHeading'] + '_' + train['ExitHeading']
pathd = pd.DataFrame({'EntryHeading_ExitHeading': train['pathDirectionOnly'].value_counts().index,
'Count': train['pathDirectionOnly'].value_counts()})
f4, axes4 = plt.subplots( figsize=(13, 14))
sns.barplot(x='Count', y='EntryHeading_ExitHeading', data = pathd, ax = axes4, palette='Blues_r')
axes4.set_title('Counts of all entry-exit combinations')
There are a total of 64 entry-exit combinations. The most frequent ones involve no turning (i.e. East -> East: 0 degrees), the next most frequent are simple right/left turns (i.e. East ->North: 90 degrees), followed by turns of abnormal angles (i.e. Northwest -> West: 45 degrees). It may be more informative to convert these 64 entry-exit combinations into the 5 possible angles associated with a turn (0, 45, 90, 135, 180), and check how this feature compares with stop time.

#Define function to get angle between 'EntryHeading' and 'ExitHeading'
def getAngleOfTurn(df):
compassAngle = {'N':0, 'NE':45, 'E':90, 'SE':135, 'S':180,
'SW':225, 'W':270, 'NW':315 }
a = df['EntryHeading']
b = df['ExitHeading']
angle = abs(compassAngle[a] - compassAngle[b])
if angle < 225:
return angle
elif angle == 225:
return 135
elif angle == 270:
return 90
elif angle == 315:
return 45
train['angleOfTurn'] = train.apply(getAngleOfTurn, axis=1)
anglesDf = pd.DataFrame({'angleOfTurn': train.groupby('angleOfTurn')['TotalTimeStopped_p80'].agg('mean').index,
'Count': train.groupby('angleOfTurn')['angleOfTurn'].agg('count'),
'StopTimeMean' : train.groupby('angleOfTurn')['TotalTimeStopped_p80'].agg('mean')})
f5, axes5 = plt.subplots( figsize=(12, 8))
dircount = sns.barplot(x='angleOfTurn', y='Count' , data = anglesDf, palette = 'rocket', ax = axes5)
dircount.set_title('Counts of turning angles and associated avg. stop times')
dircount.set_ylabel('Count')
dircount.set_xlabel('Degrees')
a = axes5.twinx()
sns.pointplot(x = 'angleOfTurn', y = 'StopTimeMean', color = 'coral',data = anglesDf, ax = a)
anglesDf.iloc[:,1:]
The frequency of each turn angle is consistent with our previous investigation of 'isSameDirection' - turns of 0 degrees count for around 0.70 of all turns, and frequencies of all other turn angles (45, 90, 135, and 180) sum to roughly 0.30. Plotting the average 80th percentile of TotalTimeStopped (abbreviated as StopTimeMean) for each of the angle groups on the same axis reveals a linear relationship between the magnitude of the angle turned and the time stopped. Interestingly, there are even some drivers that made complete U-turns of 180 degrees at the intersection, going back the way they came. Unsurprisingly, these turned out to have the highest average stop time. 'angleOfTurn' should be a feature worth keeping.

Another space-relevant component of data given to us is latitude and longitude. Latitude provides information about the distance north or south of the equator and ranges from 0 to 90 (0 at the equator, 90 at the North or South Pole), while longitude provides information about the distance east or west of the Prime Meridian, an imaginary line drawn between the North and South Poles, passing through Greenwich, England, and ranges from 0 to 180.
The Earth’s axis is tilted 23.5° to the perpendicular, meaning that the amount of sunlight that a particular latitude receives changes with the seasons. From April to September, the Northern Hemisphere is tilted toward the Sun, where it receives more energy; the Southern Hemisphere receives this additional energy between October and March, when it is tilted toward the Sun. Source: https://enviroliteracy.org/air-climate-weather/climate/latitude-climate-zones/
With this in mind, it should be worthwhile to investigate the effect of climate on traffic by plotting the relationship between latitude and average stop times, organized by seasons. For our purposes here, we'll treat values of latitude as a categorical variable and round them, since our data encompass only four cities, for each of which the values of latitude are sharply distributed.
f7, axes7 = plt.subplots(1,2, figsize=(14, 4))
kde1 = sns.kdeplot(data = train['Latitude'], shade = True, color = 'lightskyblue', ax = axes7[0])
kde1.set_xlabel('Latitude')
kde1.set_title('KDE of Latitude')
kde2 = sns.countplot(x = 'City', data = train, ax = axes7[1], palette = 'Blues',
order = ['Atlanta', 'Philadelphia', 'Chicago', 'Boston'] )
kde2.set_title('Countplot of City')
kde2.set_ylabel('Count')
Since Boston and Chicago have nearly the same latitude, and climate differences are unlikely to be noticeable over such a small range of latitude difference, we will categorize these two cities together.
#%%time
train['LatitudeRounded'] = round(train['Latitude'])
trainSpring = train.loc[train['Month'].isin([3,4,5])]
trainSummer = train.loc[train['Month'].isin([6,7,8])]
trainFall = train.loc[train['Month'].isin([9,10,11])]
trainWinter = train.loc[train['Month'].isin([12,1,2])]
f6, axes6 = plt.subplots(2, 2, figsize=(12, 8))
##############################################################################
a = sns.stripplot(x='LatitudeRounded', y='TotalTimeStopped_p80', data = trainSpring,
color = 'palegreen', alpha = 0.3, ax=axes6[0,0])
axt1 = axes6[0,0].twinx()
vio1 = sns.violinplot(x='LatitudeRounded', y='TotalTimeStopped_p80', data = trainSpring,
color = 'palegreen', alpha = 0.3, ax = axt1)
a.set_title('Spring')
a.set_ylim(0,700)
a.set_autoscaley_on(False)
axt1.set_ylabel('')
axt1.set_ylim(0,700)
axt1.set_autoscaley_on(False)
##############################################################################
b = sns.stripplot(x='LatitudeRounded', y='TotalTimeStopped_p80', data = trainSummer,
color = 'salmon', alpha = 0.3, ax=axes6[0,1])
axt2 = axes6[0,1].twinx()
vio2 = sns.violinplot(x='LatitudeRounded', y='TotalTimeStopped_p80', data = trainSummer,
color = 'salmon', alpha = 0.3, ax = axt2)
b.set_title('Summer')
b.set_ylim(0,700)
b.set_autoscaley_on(False)
axt2.set_ylabel('')
axt2.set_ylim(0,700)
axt2.set_autoscaley_on(False)
##############################################################################
c = sns.stripplot(x='LatitudeRounded', y='TotalTimeStopped_p80', data = trainFall,
color = 'orange', alpha = 0.3, ax=axes6[1,0])
axt3 = axes6[1,0].twinx()
vio3 = sns.violinplot(x='LatitudeRounded', y='TotalTimeStopped_p80', data = trainFall,
color = 'orange', alpha = 0.3, ax = axt3)
c.set_title('Fall')
c.set_ylim(0,700)
c.set_autoscaley_on(False)
axt3.set_ylabel('')
axt3.set_ylim(0,700)
axt3.set_autoscaley_on(False)
##############################################################################
d = sns.stripplot(x='LatitudeRounded', y='TotalTimeStopped_p80', data = trainWinter,
color = 'lightblue', alpha = 0.3, ax=axes6[1,1])
axt4 = axes6[1,1].twinx()
vio4 = sns.violinplot(x='LatitudeRounded', y='TotalTimeStopped_p80', data = trainWinter,
color = 'lightblue', alpha = 0.3, ax = axt4)
d.set_title('Winter')
d.set_ylim(0,700)
d.set_autoscaley_on(False)
axt4.set_ylabel('')
axt4.set_ylim(0,700)
axt4.set_autoscaley_on(False)
plt.tight_layout()
Unfortunately for our "Spring" dataset, there contains data only from the month of May, with data from March and April missing, and our "Winter" dataset is also missing a month, February (Since data was taken for all months except February, March, and April). Observing each of the remaining seasons, Summer and Fall, it seems that there are differences in stop time among the different locations. However, this cannot simply be attributed to climate differences since there may have been differences in infrastructure etc. among the cities that were the cause instead. In Summer, Latitude 34 corresponds to the highest stop times, while in Fall, Latitude 42 corresponds to the highest stop times. However, this too may be due to indirect effects of season on people's behaviors and is not necessarily evidence of a direct effect of climate on traffic.
It seems that we are unable to distinguish between effects of infrastructure and climate. Let's instead try to investigate the differences in infrastructure among the four cities.
We will revisit our earlier plot of 'Counts of turning angles and associated avg. stop times', but now we will create four different subplots, grouping data by each of the cities.
trainAtlanta = train.loc[train['City'] == 'Atlanta']
trainBoston = train.loc[train['City'] == 'Boston']
trainChicago = train.loc[train['City'] == 'Chicago']
trainPhiladelphia = train.loc[train['City'] == 'Philadelphia']
t = [trainAtlanta, trainBoston, trainChicago, trainPhiladelphia]
n = ['Atlanta', 'Boston', 'Chicago', 'Philadelphia']
I = 0
f8, axes8 = plt.subplots(2,2, figsize=(12, 8))
splots = [axes8[0,0], axes8[0,1], axes8[1,0], axes8[1,1]]
for item in splots:
anglesDf = pd.DataFrame({'angleOfTurn': t[I].groupby('angleOfTurn')['TotalTimeStopped_p80'].agg('mean').index,
'Count': t[I].groupby('angleOfTurn')['angleOfTurn'].agg('count'),
'StopTimeMean' : t[I].groupby('angleOfTurn')['TotalTimeStopped_p80'].agg('mean')})
dircount = sns.barplot(x='angleOfTurn', y='Count' , data = anglesDf, palette = 'rocket', ax = item)
dircount.set_title(n[I])
dircount.set_ylabel('Count')
dircount.set_xlabel('Degrees')
a = item.twinx()
sns.pointplot(x = 'angleOfTurn', y = 'StopTimeMean', color = 'coral',data = anglesDf, ax = a)
I = I+1
plt.tight_layout()
Although quite abstract, information on the frequency of turn angles should still provide some insight on differences in infrastructure among the four cities. For all cities except Boston, the trend remains that turn frequencies have the following decreasing order: 0, 90, 45, 135, and 180 degrees. But there are very clear differences in the proportions of each when comparisons are made among the different cities. The exact reasons for these differences may be un-extractable from this abstract view of the data. It may be due to genuine structural differences among the four cities, or it may simply be a matter of where/when Geotab decided to collect their data, or even a combination of the two.
Also, it seems to be generally true that increasing turn angle means increasing stop time, even when we group the data by city. The few exceptions that can be seen above are likely attributed to these large-angle turns being made at non-busy intersections and/or times, removing the need for the driver to stop and wait due to blockage from other cars.
Now let us explore the times of day and see which times are busiest. At the same time, we'll also take a look at the relationship between season and traffic from a different perspective. Since data for the months of Spring are missing except for its last, May, we will consider May to be part of the season that immediately follows: Summer. And for the purpose of maintaining (approximately) equal amounts of data in each season set, we will shift the last month of Summer "down" into the next, and do the same for Fall. After this process, our new seasons contain the following months... Summer: 5,6,7; Fall: 8,9,10; Winter: 11,12,1.
trainSummer2 = train.loc[train['Month'].isin([5,6,7])]
trainFall2 = train.loc[train['Month'].isin([8,9,10])]
trainWinter2 = train.loc[train['Month'].isin([11,12,1])]
f9, axes9 = plt.subplots(3,1, figsize=(12, 9))
c = sns.countplot(x = 'Hour', data = trainSummer2, hue = 'City', palette = 'rocket', ax = axes9[0])
c.set_title('Summer')
c.set_ylabel('Count')
axt = axes9[0].twinx()
p = sns.pointplot(x='Hour', y='TotalTimeStopped_p80', data = trainSummer2, color = 'salmon',ax = axt )
p.set_ylabel('StopTimeMean')
#####
c = sns.countplot(x = 'Hour', data = trainFall2, hue = 'City', palette = 'YlOrRd', ax = axes9[1])
c.set_title('Fall')
c.set_ylabel('Count')
axt = axes9[1].twinx()
p = sns.pointplot(x='Hour', y='TotalTimeStopped_p80', data = trainFall2, color = 'orange', ax = axt )
p.set_ylabel('StopTimeMean')
#####
c = sns.countplot(x = 'Hour', data = trainWinter2, hue = 'City', palette = 'Blues', ax = axes9[2])
c.set_title('Winter')
c.set_ylabel('Count')
axt = axes9[2].twinx()
p = sns.pointplot(x='Hour', y='TotalTimeStopped_p80', data = trainWinter2, color = 'lightblue', ax = axt )
p.set_ylabel('StopTimeMean')
plt.tight_layout()
For all three seasons, there are two notable peaks in average stop time. The first of these occurs at around 8 AM and the second occurs at around 5 PM. Seem familiar? This pattern suggests the commute to-and-from your typical 9-to-5 day job. It makes sense to observe peaks in traffic activity at these two times. There also seems to be a minimum in the amount of data available, centered at around 4 AM. This is also unsurprising, as most people are asleep at this time and so one wouldn't expect high levels of traffic. There doesn't seem to be a significant difference in stop times among the different seasons.
Let's take a closer look at the data contained in different months.
f11, axes11 = plt.subplots(figsize=(14, 5))
m = sns.countplot(y= 'Month', hue = 'City',data = train, palette = 'Blues_r', ax = axes11,
order = [10,9,12,11,8,7,6,1,5], hue_order=['Philadelphia','Boston','Atlanta', 'Chicago'])
#a = axes11.twinx()
#p = sns.pointplot(x='Month', y='TotalTimeStopped_p80', data = train[train['Month'].isin([6,7,8,9,10,11,12])], color = 'lightblue', ax = a )
m.set_title('Countplot of months')
Upon closer inspection, it seems that there are disproportionately few data for the months of January and May compared to the other months. Before, we observed stop times over the time frame of hours and found some sensible trends. Now let's observe stop times over the time frame of months and see if we can find patterns on this larger time scale. We will exclude January and May since they don't contain much data.
trainP = train.loc[train['City'] == 'Philadelphia']
trainB = train.loc[train['City'] == 'Boston']
trainA = train.loc[train['City'] == 'Atlanta']
trainC = train.loc[train['City'] == 'Chicago']
f = plt.figure(figsize = (14,8))
s = plt.subplot2grid((3,2),(0,0), 1, 3,f)
p = plt.subplot2grid((3,2),(1,0), 1, 1,f)
b = plt.subplot2grid((3,2),(1,1), 1, 1,f)
a = plt.subplot2grid((3,2),(2,0), 1, 1,f)
c = plt.subplot2grid((3,2),(2,1), 1, 1,f)
m=sns.countplot(x = 'Month', data = train[train['Month'].isin([6,7,8,9,10,11,12])],
palette = 'Blues', ax = s)
at = s.twinx()
point = sns.pointplot(x='Month', y='TotalTimeStopped_p80', data = train[train['Month'].isin([6,7,8,9,10,11,12])],
color = 'lightblue', ax = at )
m.set_title('Stop times across months (all cities)')
m.set_ylabel('Count')
point.set_ylabel('StopTimeMean')
#####
temp = [[trainP,trainB,trainA,trainC],
[p,b,a,c],
['Philadelphia', 'Boston', 'Atlanta', 'Chicago']]
for i in range(0,4):
count = sns.countplot(x = 'Month', data = temp[0][i][temp[0][i]['Month'].isin([6,7,8,9,10,11,12])],
palette = 'PuBu', ax = temp[1][i])
count.set_title(temp[2][i])
count.set_ylabel('Count')
at = temp[1][i].twinx()
point = sns.pointplot(x='Month', y='TotalTimeStopped_p80', data = temp[0][i][temp[0][i]['Month'].isin([6,7,8,9,10,11,12])],
color = 'lightblue', ax = at )
point.set_ylabel('StopTimeMean')
plt.tight_layout()
Viewing the relationship between month and stop times for all cities, a minimum for the stop time can be seen in the data for July, after which the average stop time climbs upwards reaching a maximum at around October and starts to decrease once again towards the months of November and December. My guess is that as summer time approaches, a significant portion of people take breaks from work and maybe go on vacation, which would explain the minimum at July. And similar reasoning should also apply as we approach the holiday months near the end of the year, as evidenced by the second dip in stop times. The same general trend can be seen if we observe each of the cities individually, with some variance in where the minimum and maximum are for these specific cases. An indirect effect of season/monthly times on average stop time is thus observed. Direct effects of climate on traffic seem impossible to isolate from the given data alone.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.multioutput import MultiOutputRegressor
import eli5
import math
import statistics
Now we will try to do some modelling with simple linear regression. Reload data to undo any feature engineering/processing done in previous sections. We want to establish a baseline model with just the raw features so that we can compare the model's performance after feature engineering. Task: To predict the 20th, 50th, and 80th percentiles of the variables TotalTimeStopped and DistanceToFirstStop.
train = pd.read_csv('../input/bigquery-geotab-intersection-congestion/train.csv')
test = pd.read_csv('../input/bigquery-geotab-intersection-congestion/test.csv')
mergedData = train.merge(test, 'outer')
Summary of Data features:
LinearRegression can only take numerical inputs, so we'll have to do some basic label encoding for the categorical features.
#Select columns with non-numeric datatypes
objectCols = [col for col in train.columns if train[col].dtype == 'object']
objectCols
for col in objectCols:
le = LabelEncoder()
mergedData[col] = mergedData[col].astype(str)
le.fit(mergedData[col])
traincopy[col] = le.transform(train[col])
testcopy[col] = le.transform(test[col])
featureCols = [col for col in train.columns if col in test.columns]
targetCols = ['TotalTimeStopped_p20', 'TotalTimeStopped_p50', 'TotalTimeStopped_p80',
'DistanceToFirstStop_p20', 'DistanceToFirstStop_p50', 'DistanceToFirstStop_p80']
Let's see how the linear regression model performs, using root mean squared error as the performance metric.
#Take list of feature names, a single target column, and fit LinearRegression model to the target column
#10 times (10 fold cross validation), then return list of the 10 rmse scores
def get_scores(features, target, n_folds=10):
scores = []
kf = KFold(n_splits = n_folds, random_state=42)
for trainindices, CVindices in kf.split(traincopy):
lr = LinearRegression()
X_train = traincopy.iloc[trainindices][features]
y_train = traincopy.iloc[trainindices][target]
X_cv = traincopy.iloc[CVindices][features]
y_cv = traincopy.iloc[CVindices][target]
lr.fit(X_train, y_train)
pred = lr.predict(X_cv)
mse = mean_squared_error(y_cv, pred)
rmse = math.sqrt(mse)
scores.append(rmse)
return scores
Get the mean of (the list of 10 rmse scores returned, from above function), for each of the 6 target columns:
#Get mean of each run of get_scores function (6 target columns -> 6 runs total)
def getMeanScores(features):
meanScores = []
for i in tqdm.tqdm_notebook(targetCols):
scorelist = get_scores(features = features, target = i, n_folds = 10)
mean = statistics.mean(scorelist)
meanScores.append(mean)
print(str(i) + ': ' + str(mean) )
return meanScores
meanScores = getMeanScores(features=featureCols)
print('Total mean: ' + str(statistics.mean(meanScores)))
Now let's see the effect of adding two new features: isSameDirection and angleOfTurn
traincopy['isSameDirection'] = train['EntryHeading'] == train['ExitHeading']
traincopy['isSameDirection'] = traincopy['isSameDirection'].astype(int)
traincopy['angleOfTurn'] = train.apply(getAngleOfTurn, axis=1)
newFeatureCols = featureCols + ['isSameDirection', 'angleOfTurn']
meanScores2 = getMeanScores(features=newFeatureCols)
print('Total mean: ' + str(statistics.mean(meanScores2)))
There is a slight decrease in the average rmse for each of the target columns, which indicates that our new features have had a helpful effect, albeit small.
Linear regression is a classic and is great for exploiting linear relationships between the features and targets. Most of our data here, however, has non-linear relationships with the targets. Additionally, although the simple label encoding done earlier served the purpose of converting categorical to numerical inputs for our model, there is a downside to this approach: It subtly conveys an ordinal aspect of our features to the model, when in reality the features may not be ordinal at all.
Now for a more serious modelling attempt, we will use the more modern lightGBM library which is better suited for dealing with nonlinear relationships and has also become famous in the Kaggle community for its efficiency at handling tabular data with high accuracy, even with large datasets containing both numerical and categorical inputs/outputs.
import lightgbm as lgb
from bayes_opt import BayesianOptimization
#Create 5 train sets (5-fold cv)
kf = KFold(n_splits = 5, random_state=42)
train_set = []
cv_set = []
for trainindices, CVindices in kf.split(traincopy):
##### train sets
this_train_split=[]
for i in range(0,6):
traindf = traincopy.iloc[trainindices][newFeatureCols + [targetCols[i]]]
traindf_lgb = lgb.Dataset(traindf, label = traindf[targetCols[i]])
this_train_split.append(traindf_lgb)
##### CV sets
this_cv_split=[]
for i in range(0,6):
CVdf = traincopy.iloc[CVindices][newFeatureCols + [targetCols[i]]]
CVdf_lgb = lgb.Dataset(CVdf, label = CVdf[targetCols[i]])
this_cv_split.append(CVdf_lgb)
train_set.append(this_train_split)
cv_set.append(this_cv_split)
%%time
params = {'application':'regression', 'num_iterations': 1000, 'metric':'rmse', 'DUSE_GPU':1}
results = {'TotalTimeStopped_p20':[],'TotalTimeStopped_p50':[],'TotalTimeStopped_p80':[],
'DistanceToFirstStop_p20':[],'DistanceToFirstStop_p50':[],'DistanceToFirstStop_p80':[]}
for i in tqdm.tqdm_notebook(range(0,5)):
for j in tqdm.tqdm_notebook(range(0,6)):
a=lgb.train(params, train_set[i][j],valid_sets = cv_set[i][j],
verbose_eval = False, early_stopping_rounds=100)
results[targetCols[j]].append(a.best_score['valid_0']['rmse'])
results
resultsMean = {'TotalTimeStopped_p20':-1,'TotalTimeStopped_p50':-1,'TotalTimeStopped_p80':-1,
'DistanceToFirstStop_p20':-1,'DistanceToFirstStop_p50':-1,'DistanceToFirstStop_p80':-1}
for i in targetCols:
resultsMean[i] = (statistics.mean(results[i]))
for i in resultsMean:
print(i + ': ' +str(resultsMean[i]))
print('Total mean: ' + str(statistics.mean(resultsMean.values())))
# 4.845
LightGBM is substantially more suited for this task than simple linear regression, as shown in the results above. Decent performance was achieved even without any hyper-parameter tuning. Now let's see if we can improve the performance by adjusting a few of the hyperparameters using Bayesian optimization.
def lgb_eval(num_leaves, bagging_fraction, feature_fraction, lambda_l1, lambda_l2):
params = {}
params['application'] = 'regression'
params['num_iterations'] = 1000
params['num_leaves'] = int(round(num_leaves))
#params['max_depth'] = int(round(max_depth))
params['bagging_fraction'] = max(min(bagging_fraction, 1), 0)
params['feature_fraction']= max(min(feature_fraction, 1), 0)
params['lambda_l1'] = max(lambda_l1, 0)
params['lambda_l2'] = max(lambda_l2, 0)
params['metric'] = 'rmse'
params['DUSE_GPU'] = 1
results2 = {'TotalTimeStopped_p20':[],'TotalTimeStopped_p50':[],'TotalTimeStopped_p80':[],
'DistanceToFirstStop_p20':[],'DistanceToFirstStop_p50':[],'DistanceToFirstStop_p80':[]}
for i in tqdm.tqdm_notebook(range(0,5)):
for j in range(0,6):
b=lgb.train(params, train_set=train_set[i][j] , valid_sets = cv_set[i][j],verbose_eval = False,
early_stopping_rounds=100)
results2[targetCols[j]].append(b.best_score['valid_0']['rmse'])
resultsMean2 = {'TotalTimeStopped_p20':-1,'TotalTimeStopped_p50':-1,'TotalTimeStopped_p80':-1,
'DistanceToFirstStop_p20':-1,'DistanceToFirstStop_p50':-1,'DistanceToFirstStop_p80':-1}
for i in targetCols:
resultsMean2[i] = (statistics.mean(results2[i]))
for i in resultsMean2:
print(i + ': ' +str(resultsMean2[i]))
totalMean2 = statistics.mean(resultsMean2.values())
print("Total mean: " + str(totalMean2))
return (1/totalMean2)
# Bayesian optimization will maximize (1/rmse) -> equivalent of minimizing rmse
lgbBO = BayesianOptimization(lgb_eval, {'num_leaves': (24, 45),
'bagging_fraction': (0.8, 1),
'feature_fraction': (0.1, 1),
'lambda_l1': (0, 5),
'lambda_l2': (0, 3),}, random_state=0)
%%time
lgbBO.maximize(n_iter=10)
for i, res in enumerate(lgbBO.res):
print("Iteration {}: \n\t{}".format(i, res))
lgbBO.max
After 15 iterations, Bayesian optimization has found the above parameters to be associated with the best performance, improving the base mean rmse of 4.845 to 4.648 (= 1/'target'). LightGBM supports implementation of ensembling through bootstrap aggregation of a fraction of the data ('bagging_fraction'), and/or a fraction of the feature columns ('feature_fraction') for each iteration. 'lambda_l1' and 'lambda_l2' are regularization parameters to prevent overfitting, and 'num_leaves' is the number of leaves in the gradient boosting tree, which also helps to control under/over fitting. Further improvements could still be made by exploring other hyperparameters and additional feature engineering, and even possibly incorporating external weather data (i.e. monthly rainfall in cities, etc.).