Regression algorithms | 5 regression algorithms you should know

In machine learning, we use various types of algorithms to let machines learn the relationships within the data provided and make predictions based on the patterns or rules identified in the data set. Regression is the machine learning technique in which the model predicts the output as a continuous numeric value.

[Image: stock market / forex trading graph]

Source: https://www.hindish.com

Regression analysis is widely used in finance, investing and other fields to discover the relationship between a single dependent variable (the target) and several independent variables. Predicting house prices, stock prices or an employee's salary are some of the most common regression problems.

The algorithms that we are going to cover are:

1. Linear regression

2. Decision tree

3. Support vector regression

4. LASSO regression

5. Random forest

1. Linear regression

Linear regression is a machine learning algorithm used for supervised learning. It performs the task of predicting a dependent variable (target) as a function of the given independent variables. In other words, this regression technique finds a linear relationship between the dependent variable and the given independent variables, hence the name linear regression.

[Image: linear regression line of best fit]

In the figure above, the independent variable is on the X axis and the output is on the Y axis. The regression line is the line that best fits the model, and our main objective in this algorithm is to find that best-fitting line.

Pros:

  • Linear regression is easy to implement.
  • Less complex than many other algorithms.
  • Linear regression can overfit, but this can be mitigated with dimensionality reduction, regularization and cross-validation techniques (see the sketch after the cons list below).

Cons:

  • Outliers seriously affect this algorithm.
  • It oversimplifies many real-world problems by assuming a linear relationship between the variables, so it is not recommended for many practical use cases.
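
As a hedged sketch of the regularization and cross-validation mentioned in the pros above (illustrative only; the data, the Ridge penalty and the alpha value are assumptions, not part of the original example):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Illustrative data: 100 samples, 5 features, a linear target plus noise
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = X @ np.array([1.0, 2.0, 0.0, 0.0, 3.0]) + 0.1 * rng.randn(100)

# Ridge adds an L2 penalty on the coefficients, which limits overfitting
ridge = Ridge(alpha=1.0)

# 5-fold cross-validation gives a more honest estimate of generalization (R^2 here)
scores = cross_val_score(ridge, X, y, cv=5)
print(scores.mean())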

Implementation

import numpy as np
from sklearn.linear_model import LinearRegression

# Training data: two features per sample
X = np.array([[2, 1], [3, 2], [4, 2], [5, 3]])
# Target generated as y = 1 * x_0 + 2 * x_1 + 3
y = np.dot(X, np.array([1, 2])) + 3

# Fit the model and predict for a new sample
lr = LinearRegression().fit(X, y)
lr.predict(np.array([[1, 5]]))

Output
array([14.])
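
Continuing from the snippet above, you can check what the model actually learned; the fitted coefficients and intercept should come out close to the values used to generate y (roughly [1., 2.] and 3.):

print(lr.coef_)       # coefficients of x_0 and x_1
print(lr.intercept_)  # intercept term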

2. Decision tree

Decision tree models can be applied to data that contain both numerical and categorical features. Decision trees are good at capturing non-linear interactions between the features and the target variable. Decision trees roughly mirror human-level decision making, so it is very intuitive to understand the data.

[Image: decision tree example]

Source: https://dinhanhthi.com

For instance, if we are predicting how many hours a child plays in particular weather conditions, the decision tree looks something like the one in the picture above.

In summary, a decision tree is a tree in which each node represents a feature, each branch represents a decision and each leaf represents an outcome (a numerical value in the case of regression).
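
To see that node/branch/leaf structure concretely, scikit-learn can print a fitted tree as text. This is a minimal sketch on made-up data (the feature, values and thresholds are assumptions for illustration):

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy data: hours played as a function of temperature
X = np.array([[10], [15], [20], [25], [30], [35]])
y = np.array([1.0, 2.0, 4.0, 5.0, 3.0, 1.5])

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)

# Each internal node is a feature test, each leaf holds a predicted value
print(export_text(tree, feature_names=["temperature"]))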

Pros:

  • Easy to understand and interpret, visually intuitive.
  • Can work with both numerical and categorical features.
  • Requires little data pre-processing: no need for one-hot encoding, dummy variables, etc.

Cons:

  • Tends to overfit.
  • A small change in the data can produce a big change in the tree structure, which causes instability.

Implementation

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Create a noisy sine-wave data set
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))   # add noise to every 5th sample

# Fit regression model
regr = DecisionTreeRegressor(max_depth=2)
regr.fit(X, y)

# Predict
X_test = np.arange(0.0, 5.0, 1)[:, np.newaxis]
result = regr.predict(X_test)
print(result)

Output:
[ 0.05236068  0.71382568  0.71382568  0.71382568 -0.86864256]

3. Support vector regression

You have probably heard of SVM, that is, the Support Vector Machine. SVR (Support Vector Regression) uses the same idea as SVM, but here it tries to predict actual continuous values. This algorithm uses hyperplanes to segregate the data. In case this separation is not possible, it uses the kernel trick, where the dimensionality is increased so that the data points become separable by a hyperplane.

[Image: support vector regression hyperplane and boundary lines]

Source: https://www.medium.com

In the figure above, the blue line is the hyperplane and the red lines are the boundary lines.

All data points lie within the boundary lines (red lines). The main objective of SVR is basically to consider only the points that lie within the boundary lines.
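
As a rough sketch of how those ideas map onto scikit-learn's SVR (the data and parameter values here are assumptions for illustration): the kernel argument controls the implicit higher-dimensional mapping, and epsilon controls the width of the boundary tube around the hyperplane.

import numpy as np
from sklearn.svm import SVR

# Illustrative data: a noisy sine wave
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(40, 1), axis=0)
y = np.sin(X).ravel()

# kernel='rbf' applies the kernel trick; epsilon is the half-width of the tube
# within which errors are ignored (points inside it are not penalized)
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print(svr.predict([[2.5]]))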

Pros:

  • Robust to outliers.
  • Excellent generalization capability.
  • High prediction accuracy.

Cons:

  • Not suitable for large data sets.
  • It does not perform well when the data set is very noisy.

Implementation

from sklearn.svm import SVR
import numpy as np

# Create a noisy sine-wave data set
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))   # add noise to every 5th sample

# Fit regression model
svr = SVR().fit(X, y)

# Predict
X_test = np.arange(0.0, 5.0, 1)[:, np.newaxis]
svr.predict(X_test)
Output:
array([-0.07840308,  0.78077042,  0.81326895,  0.08638149, -0.6928019 ])

4. LASSO regression

  • LASSO stands for Least Absolute Shrinkage and Selection Operator. Shrinkage is basically defined as a constraint on the attributes or parameters of the model.
  • The algorithm works by finding and applying a constraint on the model attributes that causes the regression coefficients of some variables to shrink towards zero.
  • Variables with a regression coefficient of zero are excluded from the model.
  • Therefore, LASSO regression is basically a method of variable selection and shrinkage, and it helps determine which of the predictors are most important (see the sketch below).
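
To make the shrinkage idea concrete, here is a small hedged sketch (the data is made up for illustration) showing how LASSO drives the coefficients of uninformative features to, or very close to, zero:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
# Only the first two features actually influence the target
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.1 * rng.randn(100)

lasso = Lasso(alpha=0.1).fit(X, y)

# Coefficients of the irrelevant features are shrunk to (or near) zero,
# which is how LASSO performs variable selection
print(lasso.coef_)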

Pros:

  • Performs automatic variable selection by shrinking the coefficients of unimportant features to zero.
  • The shrinkage penalty helps reduce overfitting.

Cons:

  • LASSO will select only one feature from a group of correlated features.
  • Selected features may be highly skewed.

Implementation

from sklearn import linear_model
import numpy as np

# Create a noisy sine-wave data set
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))   # add noise to every 5th sample

# Fit regression model
lassoReg = linear_model.Lasso(alpha=0.1)
lassoReg.fit(X, y)

# Predict
X_test = np.arange(0.0, 5.0, 1)[:, np.newaxis]
lassoReg.predict(X_test)
Output:
array([ 0.78305084,  0.49957596,  0.21610108, -0.0673738 , -0.35084868])

5. Random forest regression

Random forests are an ensemble (combination) of decision trees. It is a supervised learning algorithm used for both classification and regression. The input data is passed through multiple decision trees: the algorithm builds a number of decision trees at training time and outputs the class that is the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees.
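
As a rough sketch of that averaging idea (the data and n_estimators value are assumptions for illustration), the forest's prediction for a new sample is simply the mean of the predictions of its individual trees:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_features=4, n_informative=2, random_state=0)

# 100 trees, each trained on a bootstrap sample of the data
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# The forest's prediction is the mean of the individual trees' predictions
x_new = np.array([[0.0, 1.0, 0.0, 1.0]])
per_tree = [tree.predict(x_new)[0] for tree in forest.estimators_]
print(np.mean(per_tree), forest.predict(x_new)[0])  # the two values agree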

[Image: random forest regression diagram]

Source: https://levelup.gitconnected.com

Pros:

  • Good at learning complex and non-linear relationships.
  • Very easy to interpret and understand.

Cons:

  • They are prone to overfitting.
  • Using larger forests for higher performance slows down prediction and also requires more memory.

Implementation

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

# Synthetic data: 4 features, 2 of them informative
X, y = make_regression(n_features=4, n_informative=2, random_state=0, shuffle=False)

# Fit the ensemble and predict for a new sample
rfr = RandomForestRegressor(max_depth=3)
rfr.fit(X, y)
print(rfr.predict([[0, 1, 0, 1]]))

Output:
[33.2470716]
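
Continuing from the snippet above, the fitted forest also exposes per-feature importances, which is one way to see what the model relies on (with n_informative=2 and shuffle=False, the first two features should dominate):

print(rfr.feature_importances_)  # importances sum to 1; higher = more influential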

Final notes

These are some of the most popular regression algorithms; there are many more, as well as more advanced ones, that are worth exploring. You can also look into classification algorithms to increase your machine learning knowledge.

Thanks for reading if you got here 🙂

Let's connect on LinkedIn.

The media shown in this article is not the property of Analytics Vidhya and is used at the author's discretion.
