Simple linear regression | Learn simple linear regression (SLR)

Share on facebook
Share on twitter
Share on linkedin
Share on telegram
Share on whatsapp


Let's start by taking a short statement of the problem.

Problem statement: create a simple linear regression model to predict salary increase using years of experience.

Start by importing the necessary libraries

the libraries needed are pandas, NumPy for working with data frames, matplotlib, seaborn for visualizations and sklearn, statsmodels to construct regression models.

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
from scipy.stats import probplot
import statsmodels.api as sm 
import statsmodels.formula.api as smf 
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

Once we are done with importing libraries, we create a pandas data frame from the CSV file

df = pd.read_csv(“Salary_Data.csv”)

Perform EDA (exploratory data analysis)

The basic steps of EDA are:

  1. Identify the number of features or columns
  2. Identify the characteristics or columns
  3. Identify the size of the data set
  4. Identification of the data types of the characteristics
  5. Checking if the dataset has empty cells
  6. Identify the number of empty cells by characteristics or columns
  • Handling missing values ​​and outliers
  • Coding of categorical variables
  • Univariate graphical analysis, bivariado
  • Normalization and scaling
len(df.columns) # identify the number of features
df.columns # idenfity the features
df.shape # identify the size of of the dataset
df.dtypes # identify the datatypes of the features
df.isnull().values.any() # checking if dataset has empty cells
df.isnull().sum() # identify the number of empty cells

Our dataset has two columns: Years of experience, Salary. And both are of float data type. Have 30 records and we have no null values ​​or outliers in our dataset.

Univariate graphical analysis

For univariate analysis, have Histogram, density graph, box plot O fiddle, Y Normal QQ chart. They help us understand the distribution of data points and the presence of outliers.

A fiddle diagram is a method of plotting numerical data. It is similar to a box plot, with the addition of a grain density diagram rotated on each side.

Python code:

# Histogram
# We can use either plt.hist or sns.histplot
plt.hist(df['YearsExperience'], density=False)
plt.title("Histogram of 'YearsExperience'")
plt.hist(df['Salary'], density=False)
plt.title("Histogram of 'Salary'")

# Density plot
sns.distplot(df['YearsExperience'], kde=True)
plt.title("Density distribution of 'YearsExperience'")
sns.distplot(df['Salary'], kde=True)
plt.title("Density distribution of 'Salary'")

# boxplot or violin plot
# A violin plot is a method of plotting numeric data. It is similar to a box plot, 
# with the addition of a rotated kernel density plot on each side
# plt.boxplot(df['YearsExperience'])
# plt.title("Boxlpot of 'YearsExperience'")
plt.title("Violin plot of 'YearsExperience'")
# plt.boxplot(df['Salary'])
# plt.title("Boxlpot of 'Salary'")
plt.title("Violin plot of 'Salary'")

# Normal Q-Q plot
probplot(df['YearsExperience'], plot=plt)
plt.title("Q-Q plot of 'YearsExperience'")
probplot(df['Salary'], plot=plt)
plt.title("Q-Q plot of 'Salary'")
Univariate graphical representations

From the graphic representations above, we can say that there are no outliers in our data, Y YearsExperience looks like normally distributed, and Salary doesn't look normal. We can verify this using Shapiro Test.

Python code:

# Def a function to run Shapiro test

# Defining our Null, Alternate Hypothesis
Ho = 'Data is Normal'
Ha="Data is not Normal"

# Defining a significance value
alpha = 0.05
def normality_check(df):
    for columnName, columnData in df.iteritems():
        print("Shapiro test for {columnName}".format(columnName=columnName))
        res = stats.shapiro(columnData)
#         print(res)
        pValue = round(res[1], 2)
        # Writing condition
        if pValue > alpha:
            print("pvalue = {pValue} > {alpha}. We fail to reject Null Hypothesis. {Ho}".format(pValue=pValue, alpha=alpha, Ho=Ho))
            print("pvalue = {pValue} <= {alpha}. We reject Null Hypothesis. {Ha}".format(pValue=pValue, alpha=alpha, Ha=Ha))
# Drive code

Our graphics instinct was correct. Years Experience is normally distributed and salary is not normally distributed.

Bivariate display

for numeric data vs numeric data, we can draw the following graphs

  1. Scatter plot
  2. Line graph
  3. Heat map for correlation
  4. Joint plot

Python code for multiple parcels:

# Scatterplot & Line plots
sns.scatterplot(data=df, x="YearsExperience", y="Salary", hue="YearsExperience", alpha=0.6)
plt.title("Scatter plot")
sns.lineplot(data=df, x="YearsExperience", y="Salary")
plt.title("Line plot of YearsExperience, Salary")
plt.title('Line Plot')
Scatter and line plots

# heatmap
plt.figure(figsize=(10, 10))
plt.subplot(1, 2, 1)
sns.heatmap(data=df, cmap="YlGnBu", annot = True)
plt.title("Heatmap using seaborn")
plt.subplot(1, 2, 2)
plt.imshow(df, cmap ="YlGnBu")
plt.title("Heatmap using matplotlib")
Heat map
# Joint plot
sns.jointplot(x = "YearsExperience", y = "Salary", kind = "reg", data = df)
plt.title("Joint plot using sns")
# kind can be hex, kde, scatter, reg, hist. When kind='reg' it shows the best fit line.
Joint plot

Check if there is any correlation between the variables using df.corr ()

print("Correlation: "+ 'n', df.corr()) # 0.978 which is high positive correlation
# Draw a heatmap for correlation matrix
sns.heatmap(df.corr(), annot=True)
Heatmap of the correlation matrix

correlation = 0,98, which is a high positive correlation. This means that the dependent variable increases as the independent variable increases..


As we can see, there is a big difference between the values ​​of the YearsExperience columns, Salary. We can use Normalization to change the values ​​of numeric columns in the dataset to use a common scale, without distorting differences in value ranges or losing information.

We use sklearn.preprocessing.Normalize to normalize our data. Returns values ​​between 0 Y 1.

# Create new columns for the normalized values
df['Norm_YearsExp'] = preprocessing.normalize(df[['YearsExperience']], axis=0)
df['Norm_Salary'] = preprocessing.normalize(df[['Salary']], axis=0)

Linear regression using scikit-learn

LinearRegression(): LinearRegression conforms to a linear model with coefficients β = (β1,…, βp) to minimize the residual sum of squares between the observed targets in the data set and the targets predicted by the linear approximation.

def regression(df):
#     defining the independent and dependent features
    x = df.iloc[:, 1:2]
    y = df.iloc[:, 0:1] 
    # print(x,y)

    # Instantiating the LinearRegression object
    regressor = LinearRegression()
    # Training the model,y)

    # Checking the coefficients for the prediction of each of the predictor
    print('n'+"Coeff of the predictor: ",regressor.coef_)
    # Checking the intercept
    print("Intercept: ",regressor.intercept_)

    # Predicting the output
    y_pred = regressor.predict(x)
#     print(y_pred)

    # Checking the MSE
    print("Mean squared error(MSE): %.2f" % mean_squared_error(y, y_pred))
    # Checking the R2 value
    print("Coefficient of determination: %.3f" % r2_score(y, y_pred)) # Evaluates the performance of the model # says much percentage of data points are falling on the best fit line
    # visualizing the results.
    plt.figure(figsize=(18, 10))
    # Scatter plot of input and output values
    plt.scatter(x, y, color="teal")
    # plot of the input and predicted output values
    plt.plot(x, regressor.predict(x), color="Red", linewidth=2 )
    plt.title('Simple Linear Regression')
# Driver code
regression(df[['Salary', 'YearsExperience']]) # 0.957 accuracy
regression(df[['Norm_Salary', 'Norm_YearsExp']]) # 0.957 accuracy

We achieve a precision of 95,7% con scikit-learn, but there is not much scope to understand the detailed information about the relevance of the features of this model. So let's build a model using statsmodels.api, statsmodels.formula.api

Linear regression using statsmodel.formula.api (smf)

Predictors in statsmodels.formula.api must be listed individually. And in this method, a constant is automatically added to the data.

def smf_ols(df):
    # defining the independent and dependent features
    x = df.iloc[:, 1:2]
    y = df.iloc[:, 0:1] 
#     print(x)
    # train the model
    model = smf.ols('y~x', data=df).fit()
    # print model summary
    # Predict y
    y_pred = model.predict(x)
#     print(type(y), type(y_pred))
#     print(y, y_pred)

    y_lst = y.Salary.values.tolist()
#     y_lst = y.iloc[:, -1:].values.tolist()
    y_pred_lst = y_pred.tolist()
#     print(y_lst)
    data = [y_lst, y_pred_lst]
#     print(data)
    res = pd.DataFrame({'Actuals':data[0], 'Predicted':data[1]})
#     print(res)
    plt.scatter(x=res['Actuals'], y=res['Predicted'])

# Driver code
smf_ols(df[['Salary', 'YearsExperience']]) # 0.957 accuracy
# smf_ols(df[['Norm_Salary', 'Norm_YearsExp']]) # 0.957 accuracy
Bar chart of actual values ​​vs. predicted values

Regression using statsmodels.api

It is no longer necessary to list predictors individually.

statsmodels.regression.linear_model.OLS (even, exog)

  • endog is the dependent variable
  • exog is the independent variable. An intercept is not included by default and must be added by the user (using add_constant).
# Create a helper function
def OLS_model(df):
    # defining the independent and dependent features
    x = df.iloc[:, 1:2]
    y = df.iloc[:, 0:1] 
    # Add a constant term to the predictor
    x = sm.add_constant(x)
#     print(x)
    model = sm.OLS(y, x)
    # Train the model
    results =
    # print('n'+"Confidence interval:"+'n', results.conf_int(alpha=0.05, cols=None)) #Returns the confidence interval of the fitted parameters. The default alpha=0.05 returns a 95% confidence interval.
    print('n'"Model parameters:"+'n',results.params)
    # print the overall summary of the model result
# Driver code
OLS_model(df[['Salary', 'YearsExperience']]) # 0.957 accuracy
OLS_model(df[['Norm_Salary', 'Norm_YearsExp']]) # 0.957 accuracy

We achieve a precision of 95,7%, which is pretty good 🙂

What does the model summary table say? 😕

It is always important to understand certain terms in the summary table of the regression model so that we can know the performance of our model and the relevance of the input variables.

OLS Regression Results Summary

Some important parameters to consider are the value of R squared, Adj. R-squared value, F statistic, prob (F statistic), intercept coefficient and input variables, p> | t |.

  • R-Squared is the coefficient of determination. A statistical measure that says that a large part of the data points are on the line of best fit. A value of R squared closer to 1 for a model to fit well.
  • Adj. R-squared penalizes the value of R-squared if we keep adding the new features that do not contribute to the prediction of the model. Si Adj. R squared value <R-squared value, is a sign that we have irrelevant predictors in the model.
  • The F statistic or F test helps us accept or reject the null hypothesis. Compare the intercept-only model with our model with features. The null hypothesis is “all regression coefficients are equal to zero and that means that both models are equal”. The alternative hypothesis is ‘intercepting the only model is worse than our model, which means that our added coefficients improved the performance of the model. If prob (F statistic) <0.05 and the F statistic is a high value, we reject the null hypothesis. It means that there is a good relationship between the input and output variables.
  • coef displays the estimated coefficients of the corresponding input characteristics
  • T-test talks about the relationship between the output and each of the input variables individually. The null hypothesis is 'the coefficient of an input characteristic is 0'. The alternative hypothesis is 'the coefficient of an input characteristic is not 0'. If pvalue 0.05.

Good, now we know how to draw important inferences from the model summary table, so now let's look at the parameters of our model and evaluate our model.

In our case, the value of R squared (0,957) is close to Adj. The value of R squared (0,955) is a good sign that the input characteristics are contributing to the predictor model.

The F statistic is a high number and p (F statistic) is almost 0, which means that our model is better than the single intersection model.

The p-value of the t-test for the input variable is less than 0.05, so there is a good relationship between the input variable and the output variable.

Therefore, we conclude by saying that our model is working well ✔😊

In this blog, We learned the basics of Simple Linear Regression (SLR), building a linear model using different Python libraries and making inferences from the summary table of OLS statistics models.


Interpreting the summary table from the OLS statistics model

Visualizations: Histogram, Density graph, violin weft, box plot, Normal QQ chart, Scatter plot, line graph, heat map, joint plot

See the full notebook of my GitHub repository.

I hope this is an informative blog for beginners. Please, vote for if you find this useful 🙌 Your comments are greatly appreciated. Happy learning !! 😎

The media shown in this article is not the property of DataPeaker and is used at the author's discretion.

Subscribe to our Newsletter

We will not send you SPAM mail. We hate it as much as you.