Linear Regression Through Code - Part 1
In an earlier blog post, I covered "What is Regression?" and the basic linear equation. Linear regression is one of the simplest algorithms, yet it has solved many problems historically and remains very powerful for many use cases. Its predictions are highly explainable, which is why it is favoured in the data science community.
When you have one independent variable and one dependent variable, it is called simple linear regression. In practice this is rarely used, as few real problems are that one-dimensional. When multiple independent variables impact the dependent (target) variable, it is called Multi-Linear Regression (MLR), also known as multiple linear regression.
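For reference, the two cases can be written side by side. The notation is the conventional one and is an assumption here, since this post does not define symbols: $y$ is the target, $x_1, \dots, x_p$ are the independent variables, $\beta_0$ is the intercept, $\beta_1, \dots, \beta_p$ are the coefficients, and $\epsilon$ is the error term.

```latex
\begin{aligned}
\text{Simple linear regression:} \quad & y = \beta_0 + \beta_1 x_1 + \epsilon \\
\text{Multi-linear regression:} \quad & y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon
\end{aligned}
```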
This post will walk you through one of the problem statements that is suitable for Multi-Linear Regression and how it can be solved using MLR.
The Problem
This is the well-known problem of predicting the demand for a bike-sharing system in a particular city, one that many beginners work on.
The complete data and code are available here: https://github.com/saigeethamn/DataScience-LinearRegression
Solution:
The solution provided here takes you through the entire process, from understanding the data to validating the model against the test data. As this is one of my first posts on an ML model, I plan to cover the whole model-building process end to end.
In Part 1 (this post), I will only describe the preliminary data understanding, exploratory data analysis, and data preparation parts required for the building of the model. This part would be very similar for most algorithms.
In Part 2, I will walk you through the actual model development, validation against test data, and also the validation of the assumptions of Linear Regression.
The whole solution is in Python and is easy to follow even if you are not familiar with the language.
Understanding Data
The first step is always to get familiar with your data before you decide what type of algorithm can help you. How do you get familiar with the data?
First, examine the sample data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Read the data and do a preliminary inspection
bikes = pd.read_csv("day.csv")
bikes.head()
Next, see the summary of the data
You explore the shape of the data and its data types, along with summary statistics for all the numerical columns, using commands like:
bikes.shape
# (730,16)
bikes.info()
bikes.describe()
Finally, plot the target data
Here the target column that you are trying to predict is the count of bikes that will be hired on any specific day. You can plot the target against the dates, and here's what you see:
# Just viewing the general trend by Time
plt.figure(figsize = [25,5])
plt.plot(bikes['dteday'], bikes['cnt'])
plt.show()
You also need to understand the data through the descriptions in the data dictionary, which is provided in the Appendix at the end of this post.
Once we have a preliminary understanding, we move on to data cleansing.
Data Cleansing
Why is data cleansing necessary? No machine learning algorithm can help if the data is unclean: garbage in, garbage out applies to any model. Sometimes the model will simply fail to execute if you have missing data or null values. At other times the whole model can become useless because outliers distort the learning process.
Hence this is a very important step in any model development.
What does data cleaning involve? Some of the common steps are:
- Drop unnecessary data
- Inspect for null values and take corrective measures
- Transform categorical variables
- Check for outliers and take corrective measures again
Drop unnecessary data
Not all the data we have influences the target variable. Based on domain knowledge, we have to figure out which variables have no relation to the target and remove them.
In fact, we will see later that even variables with very little influence on the target are better removed. Ideally, we want only the most significant variables so that we get actionable insights from the model.
There might be other use cases where we do not need actionable insights but prefer highly accurate predictions. In such cases, we deal with unnecessary data differently and probably more liberally.
In this case, we drop the columns 'instant' and 'dteday' as the 'instant' variable is just an index to rows. Also, we are not doing a time series analysis here and hence do not need to use date.
bikes.drop(['instant','dteday'], axis=1, inplace=True)
Inspect for null values and take corrective measures
This data does not have null values, so there is nothing to do here. You can check for null values with this simple statement:
bikes.isnull().sum()
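Since the bikes data has no nulls, the "corrective measures" never come into play here. As an illustration only, here is a minimal sketch of what they could look like on a made-up frame. The toy values and the choice of median and mode imputation are assumptions for the example, not steps taken in this solution.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a dataset that does contain nulls (values are made up)
toy = pd.DataFrame({
    "temp": [10.0, np.nan, 14.0, 18.0],
    "weathersit": ["clear", "clear", None, "mist"],
    "cnt": [100, 150, 120, np.nan],
})

# Count nulls per column, the same check used on the bikes data
null_counts = toy.isnull().sum()
print(null_counts)

# Drop rows where the target itself is missing
cleaned = toy.dropna(subset=["cnt"]).copy()

# Numeric feature: fill with the column median
cleaned["temp"] = cleaned["temp"].fillna(cleaned["temp"].median())

# Categorical feature: fill with the most frequent value
cleaned["weathersit"] = cleaned["weathersit"].fillna(cleaned["weathersit"].mode()[0])

print(cleaned)
```

Which measure is appropriate depends on how much data is missing and why, so treat these as starting points rather than rules.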
Transform categorical variables
The columns 'season', 'yr', 'mnth', 'weekday', 'weathersit' are all categorical variables. You check the distinct values they hold and convert them to meaningful category names.
For example, 'season' has numbers 1 to 4 indicating the four seasons. These can be converted to spring, summer, fall, and winter based on the data dictionary. The following statement gives you the count of days in each season.
bikes['season'].value_counts()
This shows that there are 180 days of spring, 184 days of summer, and so on in the two years of data we have.
You convert it to meaningful categories with:
bikes['season'] = bikes['season'].map({1:'spring',2:'summer',3:'fall',4:'winter'})
You repeat this for the rest of the categorical variables as well.
Note that using numbers for categories implies a sense of order or importance. The algorithm might treat 1 as smaller than 2, which is smaller than 3, and so on, or it may treat spring as the most important category because it is coded as 1. To avoid imposing this kind of order on nominal variables, you convert them into category names.
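The same mapping pattern carries over to the other categorical columns. The sketch below shows it for 'mnth' and 'weathersit' on a toy frame. The label names are my own shorthand (an assumption), chosen from the descriptions in the data dictionary. 'weekday' is left out because the dictionary does not say which day each code stands for.

```python
import calendar
import pandas as pd

# Toy rows using the numeric codes from the data dictionary
toy = pd.DataFrame({"mnth": [1, 2, 12], "weathersit": [1, 2, 3]})

# Month codes 1..12 -> 'Jan'..'Dec' (the label choice is an assumption)
month_names = {i: calendar.month_abbr[i] for i in range(1, 13)}

# Short weather labels are hypothetical; the dictionary gives longer descriptions
weather_names = {1: "clear", 2: "mist", 3: "light_rain", 4: "heavy_rain"}

toy["mnth"] = toy["mnth"].map(month_names)
toy["weathersit"] = toy["weathersit"].map(weather_names)

# map() turns any code missing from the dictionary into NaN, so verify none slipped through
unmapped = toy.isnull().sum().sum()
print(toy)
```

Checking for NaN after `map` is a cheap way to catch a code that the mapping dictionary forgot.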
Check for Outliers
Finally, you check for outliers in the numerical data. A box plot is a very good way to check for outliers (points that lie too far from the rest). Any points beyond the whiskers of the plot are outliers, i.e. an outlier is any value that lies more than one and a half times the length of the box (the interquartile range) from either end of the box.
You plot and visually inspect to see if there are outliers.
cont_vars = ['temp','atemp','hum','windspeed','cnt']
plt.figure(figsize = [15,8])
# One box plot per continuous variable on a 2x3 grid
for i, var in enumerate(cont_vars, start=1):
    plt.subplot(2,3,i)
    sns.boxplot(y=bikes[var])
plt.show()
You notice that there are hardly any outliers and hence no treatment is necessary.
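If you would rather have numbers than eyeball the plots, the one-and-a-half-times-the-box rule described above can be computed directly. This is a minimal sketch on made-up wind-speed values. The helper name `iqr_outliers` is hypothetical and not part of the original solution.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1  # the length of the box in a box plot
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Toy wind-speed values with one obviously extreme reading
windspeed = pd.Series([10, 11, 12, 12, 13, 13, 14, 15, 40])
print(iqr_outliers(windspeed))
```

Here the quartiles are 12 and 14, so the bounds are 9 and 17 and only the reading of 40 is flagged.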
Exploratory Data Analysis
This is the stage where you really get a proper understanding of the data, visualize it in various ways, and get a feel for what might be the most important variables that affect the target variable. Are there relationships that seem obvious, or is there some pattern that defies your domain understanding?
This helps you create your own hypothesis that you can validate later.
This part is done through univariate and bivariate analysis of all the data on hand.
Univariate Analysis
Here we look at all continuous and categorical variables to understand the spread and the behavior of the potential features independently. If you want to understand the types of variables, please read this post on the different types of variables.
A picture is worth a thousand words and hence we plot graphs for all the variables to understand their characteristics.
We typically use a count plot for the categorical variables and a dist plot for continuous variables.
cat_vars = ['season','yr','mnth','weekday','workingday','holiday','weathersit']
plt.figure(figsize=[15, 10])
plt.subplots_adjust(hspace=0.50,wspace=0.25)
for i,var in enumerate(cat_vars):
    plt.subplot(3,3,i+1)
    sns.countplot(x=bikes[var])
    plt.xticks(rotation=45)
plt.show()
From this, we notice some obvious things: the counts across seasons, months, and days of the week simply reflect the number of days in each. No surprises here, and the same goes for years. The working day and holiday distributions are also as expected, with a huge skew.
The only insight here is that there are 400+ clear days, 200+ misty days, a few days of light rain, and no days with heavy rain in the two years of data we have.
Similarly, you plot a dist plot for each of the continuous variables.
Here too you get a fair idea about the distribution of real temperatures and the temperature felt, humidity, wind speed, and the number of bikes rented. As expected, most of these are close to a normal distribution, with slight variations.
Bivariate Analysis
This is done to understand the relationship between two variables or features and also the relationship of all data with the target variable.
Here, you try to check the relationship between combinations of continuous-continuous, continuous-categorical, and categorical-categorical variables.
A heatmap and a pair-plot are very good visual tools for comparing the values of continuous variables and their relationship with other continuous variables.
Continuous-continuous variables
cont_vars = ['temp','atemp','hum','windspeed','cnt']
plt.figure(figsize = [10, 5])
sns.heatmap(bikes[cont_vars].corr(),annot=True, cmap='Reds')
plt.show()
You can see that temp and atemp are highly correlated and hence one of them can be dropped.
Also, the count of bikes hired is positively correlated to temp and negatively to humidity and windspeed, which is as one would expect.
Similarly, you can plot a pair plot to see how the correlation is spread.
cont_vars = ['temp','atemp','hum','windspeed','cnt']
sns.pairplot(data=bikes, vars = cont_vars)
plt.show()
Conclusions similar to those from the heatmap can be drawn from this too.
The linear relationship between temp and atemp is very visible. Count and temp also seem to show some sort of a linear relationship.
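Reading the strongly correlated pairs off a heatmap works for five variables, but it gets tedious with more. The sketch below lists them programmatically. The helper `correlated_pairs` and the 0.9 threshold are illustrative assumptions, and the data is synthetic, with 'atemp' built to track 'temp' closely.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """List (col_a, col_b, corr) for column pairs with |corr| above threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        # Only look above the diagonal so each pair is reported once
        for j in range(i + 1, len(cols)):
            if abs(corr.iloc[i, j]) > threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs

# Synthetic data: 'atemp' follows 'temp' closely, 'hum' is independent noise
rng = np.random.default_rng(42)
temp = rng.normal(20, 7, 300)
toy = pd.DataFrame({
    "temp": temp,
    "atemp": temp * 1.1 + rng.normal(0, 0.5, 300),
    "hum": rng.normal(60, 14, 300),
})

pairs = correlated_pairs(toy)
print(pairs)
```

Each reported pair is a candidate for dropping one of its two columns, which is the same decision the heatmap suggests for temp and atemp.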
Continuous-Categorical Variables
Box Plots are very handy to understand the relationships between categorical variables and continuous variables.
cat_vars = ['season','yr','mnth','holiday','weekday','workingday','weathersit']
plt.figure(figsize=[15, 10])
plt.subplots_adjust(hspace=0.30,wspace=0.35)
for i,col in enumerate(cat_vars):
    plt.subplot(3,3,i+1)
    sns.boxplot(x = col, y = 'cnt', data = bikes)
    plt.xticks(rotation=45)
plt.show()
Here are a few takeaways from this plot:
- Summer and fall see much higher usage of bikes than spring, with winter in between. Monthly usage shows a similar pattern.
- There has been significant growth in the second year.
- The median usage on non-holidays seems higher, while the spread seems larger on holidays.
- The median usage on all days of the week seems very similar, while the spread varies a bit.
- The weather seems to have a major impact on usage. Even light rain brings down the usage pretty drastically.
Since the target variable here is numeric, I have not done a bivariate analysis of categorical-categorical variables.
When you need to do that, you could use grouped bar charts or stacked bar charts to get some insights.
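As a rough sketch of that idea, a cross-tabulation gives the counts that a grouped or stacked bar chart would display. The toy data below is made up purely for illustration.

```python
import pandas as pd

# Toy categorical data (values are made up for illustration)
toy = pd.DataFrame({
    "season": ["spring", "spring", "summer", "summer", "summer", "fall"],
    "weathersit": ["clear", "mist", "clear", "clear", "mist", "clear"],
})

# Rows: season, columns: weather, cells: number of days in each combination
counts = pd.crosstab(toy["season"], toy["weathersit"])
print(counts)

# counts.plot(kind="bar") would draw a grouped bar chart, and
# counts.plot(kind="bar", stacked=True) a stacked one (both need matplotlib)
```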
Data Preparation
This consists of the following steps:
- Create dummy variables for multi-level categorical variables
- Split the data into train and test sets
- Rescale the features in the train set
Create Dummy Variables
Dummy variables are created so that nominal variables can also contribute to the regression model. If you want to understand the types of variables, please read this post on the different types of variables.
Some of the nominal variables (categorical) are season, month, weekday, weathersit, and year.
Holiday and working day are dichotomous variables and do not need any further treatment.
Using pandas get_dummies to create the dummy variables:
dum_cat_vars = ['season','mnth','weekday','weathersit','yr']
for var in dum_cat_vars:
    # drop_first=True keeps k-1 dummy columns for a variable with k levels
    temp_var = pd.get_dummies(bikes[var], drop_first = True)
    bikes = pd.concat([bikes, temp_var], axis = 1)
    bikes.drop(var, axis=1, inplace = True)
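To see what `get_dummies` with `drop_first=True` actually produces, here is a small standalone sketch on a toy 'season' column. The `dtype=int` argument is my addition, used so the output shows 0/1 regardless of the pandas version. It is not part of the code above.

```python
import pandas as pd

toy = pd.DataFrame({"season": ["spring", "summer", "fall", "winter", "summer"]})

# Categories are ordered alphabetically, so drop_first removes 'fall'
dummies = pd.get_dummies(toy["season"], drop_first=True, dtype=int)
print(dummies)
```

The dropped level becomes the baseline: a 'fall' row is the one where spring, summer, and winter are all 0, so a fourth column would carry no extra information.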
Split the data into Train and Test data
Before you scale the data, you need to split the data into train and test, as you always scale with only train data. You do not want to leak information from test data into the scaling process.
import numpy as np
from sklearn.model_selection import train_test_split
np.random.seed(0)
bikes_train, bikes_test = train_test_split(bikes, train_size = 0.7, random_state = 100)
Scale the numeric features
I have used the MinMaxScaler from sklearn. You could use the StandardScaler too if you want to.
I will discuss the importance of scaling itself in a separate post.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# Apply the scaler to all the columns except the binary columns
# and the dummy variables
num_vars = ['temp', 'atemp', 'hum', 'windspeed','cnt']
bikes_train[num_vars] = scaler.fit_transform(bikes_train[num_vars])
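To make the "scale with only train data" point concrete, the sketch below spells out the min-max arithmetic in plain pandas on made-up numbers. It is an illustration of the idea, not the code used in this solution. With sklearn, the equivalent for the test set is calling `scaler.transform(...)` instead of `fit_transform(...)`.

```python
import pandas as pd

# Toy train and test sets (values are made up)
train = pd.DataFrame({"temp": [10.0, 20.0, 30.0], "hum": [40.0, 60.0, 80.0]})
test = pd.DataFrame({"temp": [15.0, 35.0], "hum": [50.0, 70.0]})

# Learn the scaling parameters from the training set only
train_min = train.min()
train_range = train.max() - train.min()

# Apply the same parameters to both sets
train_scaled = (train - train_min) / train_range
test_scaled = (test - train_min) / train_range

print(train_scaled)
print(test_scaled)
```

Note that a test value can land outside the 0 to 1 range (35 degrees scales to 1.25 here). That is expected, because the test set was not consulted when the minimum and maximum were chosen.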
Before you get into building the model, you have to separate the target variable and the independent variables using the standard notation of X and y.
y_train = bikes_train.pop('cnt')
X_train = bikes_train
With this, you have done all the data cleansing and data preparation to start building your model. The variable/feature selection process, model creation, and validation will all be discussed in the next post.
In the meantime, enjoy getting to grips with these preliminaries, which apply to any modeling process.
Appendix
The Data dictionary:
- instant: record index
- dteday: date
- season: season (1: spring, 2: summer, 3: fall, 4: winter)
- yr: year (0: 2018, 1: 2019)
- mnth: month (1 to 12)
- holiday: whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday: day of the week
- workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
- weathersit:
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp: temperature in Celsius
- atemp: feeling temperature in Celsius
- hum: humidity
- windspeed: wind speed
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered