Feature selection techniques in machine learning


Introduction

When building a machine learning model in practice, it is rare for all the variables in the dataset to be useful. Adding redundant variables reduces the generalization ability of the model and can also reduce the overall accuracy of a classifier. Moreover, adding more and more variables to a model increases its overall complexity.

According to the law of parsimony, or Occam's razor, the best explanation of a problem is the one that involves the fewest possible assumptions. Feature selection therefore becomes an indispensable part of building machine learning models.

Goal

The goal of feature selection in machine learning is to find the best set of features that allows us to build useful models of the phenomenon being studied.

Feature selection techniques in machine learning can be broadly classified into the following categories:

Supervised techniques: these can be used on labeled data and serve to identify the relevant features that increase the efficiency of supervised models such as classification and regression.

Unsupervised techniques: these can be used on unlabeled data.

From a taxonomic point of view, these techniques are classified into:

A. Filter methods

B. Wrapper methods

C. Embedded methods

D. Hybrid methods

In this article, we will discuss some popular feature selection techniques in machine learning.

A. Filter methods

Filter methods capture the intrinsic properties of the features, measured via univariate statistics, rather than cross-validation performance. These methods are faster and less computationally expensive than wrapper methods. When dealing with high-dimensional data, it is computationally cheaper to use filter methods.

Let's analyze some of these techniques:

Information gain

Information gain calculates the reduction in entropy obtained from transforming a dataset. It can be used for feature selection by evaluating the information gain of each variable with respect to the target variable.

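As a rough sketch, scikit-learn's mutual_info_classif can be used to estimate the information gain of each feature with respect to the target; the wine dataset below is only an example:

```python
# Rank features by information gain (mutual information with the target)
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.feature_selection import mutual_info_classif

X, y = load_wine(return_X_y=True, as_frame=True)  # example dataset

info_gain = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
print(info_gain.sort_values(ascending=False))
```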

Chi-square test

The chi-square test is used for categorical features in a dataset. We compute the chi-square statistic between each feature and the target, and select the desired number of features with the best chi-square scores. To correctly apply chi-square to test the relationship between the features in the dataset and the target variable, the following conditions must be met: the variables must be categorical, sampled independently, and the values must have an expected frequency greater than 5.

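A minimal sketch using scikit-learn's SelectKBest with the chi2 score function; bear in mind that chi-square expects non-negative, ideally categorical or count-valued, features:

```python
# Keep the k features with the best chi-square scores
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_wine(return_X_y=True, as_frame=True)

selector = SelectKBest(score_func=chi2, k=5)   # keep the 5 best-scoring features
selector.fit(X, y)
print(X.columns[selector.get_support()])       # names of the selected features
```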

Fisher score

The Fisher score is one of the most widely used supervised feature selection methods. The algorithm returns the ranks of the variables based on the Fisher score in descending order. We can then select the variables as appropriate.

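Packages such as scikit-feature ship a ready-made Fisher score; the sketch below instead computes it by hand (the fisher_score helper is a hypothetical name) as the ratio of between-class to within-class variance for each feature:

```python
# Hand-rolled Fisher score per feature
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True, as_frame=True)

def fisher_score(X, y):
    """Between-class variance divided by within-class variance, per feature."""
    overall_mean = X.mean()
    numerator = pd.Series(0.0, index=X.columns)
    denominator = pd.Series(0.0, index=X.columns)
    for cls in np.unique(y):
        X_cls = X[y == cls]
        n_cls = len(X_cls)
        numerator += n_cls * (X_cls.mean() - overall_mean) ** 2
        denominator += n_cls * X_cls.var(ddof=0)
    return numerator / denominator

print(fisher_score(X, y).sort_values(ascending=False))
```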

Correlation coefficient

Correlation is a measure of the linear relationship between two or more variables. Through correlation, we can predict one variable from another. The logic behind using correlation for feature selection is that good variables are highly correlated with the target. Furthermore, variables should be correlated with the target but uncorrelated with each other.

If two variables are correlated, we can predict one from the other. Therefore, if two features are correlated, the model really only needs one of them, since the second adds no additional information. We will use the Pearson correlation here.


We need to set an absolute value, say 0.5, as the threshold for selecting variables. If we find that predictor variables are correlated with each other, we can discard the one that has the lower correlation coefficient with the target variable. We can also compute multiple correlation coefficients to check whether more than two variables are correlated with each other; this phenomenon is known as multicollinearity.
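One possible way to implement this with pandas: compute the absolute Pearson correlation between predictors and drop one variable from each pair whose correlation exceeds the 0.5 threshold mentioned above:

```python
# Drop one feature out of each highly correlated pair of predictors
import numpy as np
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True, as_frame=True)

corr = X.corr().abs()                                            # |Pearson correlation| between predictors
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))  # upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.5).any()]
X_reduced = X.drop(columns=to_drop)
print("Dropped:", to_drop)
```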

Variance threshold

The variance threshold is a simple baseline approach to feature selection. It removes all features whose variance does not reach a given threshold. By default, it removes all zero-variance features, that is, features that have the same value in all samples. We assume that features with higher variance may contain more useful information, but note that we are not taking into account the relationship between feature variables, or between feature and target variables, which is one of the drawbacks of filter methods.

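A minimal sketch with scikit-learn's VarianceThreshold, using the default threshold of zero:

```python
# Remove zero-variance features
from sklearn.datasets import load_wine
from sklearn.feature_selection import VarianceThreshold

X, y = load_wine(return_X_y=True, as_frame=True)

selector = VarianceThreshold(threshold=0.0)   # drop features whose variance is zero
selector.fit(X)
print(selector.get_support())                 # boolean mask of retained features
```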

get_support returns a Boolean vector where True means that the variable does not have zero variance.

Mean absolute difference (MAD)

‘The mean absolute difference (MAD) computes the absolute difference from the mean value. The main difference between the variance and the MAD measures is the absence of the square in the latter. The MAD, like the variance, is also scale-variant.’ [1] This means that the higher the MAD, the higher the discriminatory power.

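A short sketch computing the MAD of each feature with pandas:

```python
# Mean absolute difference of each feature from its mean value
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True, as_frame=True)

mad = (X - X.mean()).abs().mean()     # mean of |x - mean(x)| per feature
print(mad.sort_values(ascending=False))
```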

Dispersion ratio

‘Another measure of dispersion applies the arithmetic mean (AM) and the geometric mean (GM). For a given (positive) feature X_i on n patterns, the AM and GM are given by

AM_i = (X_i1 + X_i2 + … + X_in) / n and GM_i = (X_i1 · X_i2 · … · X_in)^(1/n),

respectively; since AM_i ≥ GM_i, with equality if and only if X_i1 = X_i2 = … = X_in, the ratio

R_i = AM_i / GM_i

can be used as a measure of dispersion. Higher dispersion implies a higher value of R_i, and thus a more relevant feature. Conversely, when all the samples of a feature have (roughly) the same value, R_i is close to 1, indicating a feature of low relevance.’ [1]

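A small sketch computing the ratio R_i = AM_i / GM_i per feature; a tiny constant is added because the geometric mean requires strictly positive values:

```python
# Dispersion ratio: arithmetic mean divided by geometric mean, per feature
import numpy as np
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True, as_frame=True)
X = X + 1e-9                              # ensure strictly positive values

am = X.mean()                             # arithmetic mean
gm = np.exp(np.log(X).mean())             # geometric mean via log-mean-exp
dispersion_ratio = am / gm
print(dispersion_ratio.sort_values(ascending=False))
```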

B. Wrapper methods

Wrappers require some method to search the space of all possible subsets of features, assessing the quality of each subset by training and evaluating a classifier on it. The feature selection process is based on a specific machine learning algorithm that we try to fit to a given dataset. It follows a greedy search approach, evaluating candidate combinations of features against the evaluation criterion. Wrapper methods usually result in better predictive accuracy than filter methods.

Let's analyze some of these techniques:

Forward feature selection

This is an iterative method in which we start with the single variable that performs best against the target. Next, we select another variable that gives the best performance in combination with the first selected variable. This process continues until the preset criterion is reached.

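One way to do this is scikit-learn's SequentialFeatureSelector (available from version 0.24; mlxtend offers a similar class). The estimator, number of features, and dataset below are arbitrary choices for illustration:

```python
# Forward sequential feature selection
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_wine(return_X_y=True, as_frame=True)

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),   # model used to score each candidate subset
    n_features_to_select=5,
    direction="forward",
    cv=5,
)
sfs.fit(X, y)
print(X.columns[sfs.get_support()])      # the selected features
```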

Backward feature elimination

This method works in the exact opposite way to forward feature selection. Here, we start with all the available features and build a model. Then, we remove the variable whose removal yields the best value of the evaluation measure. This process continues until the preset criterion is reached.

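Reusing the imports and data from the forward-selection sketch above, only the direction argument changes:

```python
# Backward elimination: start from the full set and remove features one at a time
sfs_backward = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=5,
    direction="backward",
    cv=5,
).fit(X, y)
print(X.columns[sfs_backward.get_support()])
```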

This method, together with the one discussed above, is also known as sequential feature selection.

Exhaustive feature selection

This is the most robust feature selection method covered so far. It is a brute-force evaluation of every feature subset: it tries all possible combinations of the variables and returns the best performing subset.

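scikit-learn does not ship an exhaustive selector, but the mlxtend package provides one (assuming it is installed); a rough sketch, keeping the search space small because the cost grows combinatorially:

```python
# Exhaustive search over feature subsets with mlxtend
from mlxtend.feature_selection import ExhaustiveFeatureSelector
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression

X, y = load_wine(return_X_y=True, as_frame=True)

efs = ExhaustiveFeatureSelector(
    LogisticRegression(max_iter=5000),
    min_features=1,
    max_features=3,          # limit subset size to keep the search tractable
    scoring="accuracy",
    cv=3,
)
efs = efs.fit(X, y)
print(efs.best_idx_, efs.best_score_)    # indices and CV score of the best subset
```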

Recursive feature elimination

‘Given an external estimator that assigns weights to features (for example, the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features, and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute.

Then, the least important features are pruned from the current set of features. This procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.’ [2]

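A minimal sketch with scikit-learn's RFE class, using a logistic regression as the external estimator:

```python
# Recursive feature elimination
from sklearn.datasets import load_wine
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_wine(return_X_y=True, as_frame=True)

rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5, step=1)
rfe.fit(X, y)
print(X.columns[rfe.support_])   # selected features
print(rfe.ranking_)              # 1 = selected, higher = eliminated earlier
```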

C. Embedded methods

These methods combine the benefits of wrapper and filter methods by including feature interactions while maintaining a reasonable computational cost. Embedded methods are iterative in the sense that they take care of each iteration of the model training process and carefully extract the features that contribute the most to training in a particular iteration.

Let's analyze some of these techniques:

LASSO regularization (L1)

Regularization consists of adding a penalty to the different parameters of a machine learning model to reduce its freedom, that is, to avoid overfitting. In linear model regularization, the penalty is applied to the coefficients that multiply each of the predictors. Among the different types of regularization, LASSO, or L1, has the property of shrinking some of the coefficients to zero. Such features can therefore be removed from the model.

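A possible sketch using an L1-penalised logistic regression inside SelectFromModel; the features are standardized first because penalised models are sensitive to scale:

```python
# L1-regularised model as a feature selector: zero coefficients drop out
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True, as_frame=True)
X_scaled = StandardScaler().fit_transform(X)

# liblinear supports the L1 penalty; C controls the strength of the penalty
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(lasso).fit(X_scaled, y)
print(X.columns[selector.get_support()])   # features with non-zero coefficients
```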

Random forest importance

Random forest is a kind of bagging algorithm that aggregates a specified number of decision trees. The tree-based strategies used by random forests naturally rank features by how well they improve node purity, in other words, by the decrease in impurity (Gini impurity) over all trees. The nodes with the greatest decrease in impurity occur at the beginning of the trees, while the nodes with the least decrease in impurity occur at the end. Therefore, by pruning the trees below a particular node, we can create a subset of the most important features.

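A short sketch that fits a random forest and ranks the features by their impurity-based importances:

```python
# Rank features by impurity-based importance from a random forest
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True, as_frame=True)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```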

Conclusion

We have discussed some techniques for feature selection. We have purposely left out feature extraction techniques such as principal component analysis, singular value decomposition, linear discriminant analysis, etc. These methods help reduce the dimensionality of the data, or reduce the number of variables, while preserving the variance of the data.

Apart from the methods discussed above, there are many other feature selection methods. There are also hybrid methods that use both filter and wrapper techniques. If you want to explore feature selection further, in my opinion an excellent comprehensive read would be ‘Feature Selection for Data and Pattern Recognition’ by Urszula Stańczyk and Lakhmi C. Jain.

References

[1] Artur J. Ferreira, Mário A. T. Figueiredo, ‘Efficient Feature Selection Filters for High-Dimensional Data’.

[2] https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
