Feature Selection — Overview of Everything You Need to Know

We will survey existing feature selection methods for supervised and unsupervised learning tasks. Each method is illustrated with an open-source Python implementation so that you can quickly test the proposed algorithms. However, this is not an exhaustive list: over the past 20 years many algorithms have been created, and here you will find only the most basic ones. For a more in-depth exploration, check out this overview.

Feature Selection analogy. Source: Pinterest

The correct selection of features for data analysis allows you to:

  • improve the quality of supervised and unsupervised machine learning models,
  • assess the importance of features, which is necessary for interpreting model results.

Supervised and unsupervised models

There are supervised selection algorithms that determine the appropriate features for better performance on supervised learning tasks (for example, classification and regression problems). These algorithms require access to labeled data. For unlabeled data, there are also a number of feature selection methods that evaluate all features based on various criteria: variance, entropy, ability to preserve local similarity, and so on. Relevant features discovered with unsupervised heuristics can also be used in supervised models, because they may reveal patterns in the data other than the correlation of features with the target variable.

Feature selection methods are usually divided into four categories: filters, wrappers, embedded methods, and hybrid methods.

Wrappers

With this approach, we estimate the usefulness of a subset of features based on the final result of the applied learning algorithm (for example, the gain in accuracy when solving a classification problem). Any learning algorithm can be used in this combination of a search strategy and model evaluation.


Existing selection strategies:

  • Forward selection: start with an empty feature set and then iteratively add features that provide the best gain in model quality.

Implementation: these algorithms are implemented in the mlxtend package, here is an example of use (a minimal sketch also follows this list).

  • RFE (Recursive Feature Elimination): a “greedy” search algorithm that selects features by recursively considering smaller and smaller feature sets. It ranks the features according to the order in which they are eliminated.
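Below is a minimal sketch of both wrapper strategies, assuming the mlxtend and scikit-learn packages are installed; the dataset, the estimator, and the number of selected features are arbitrary choices made for illustration.

    from mlxtend.feature_selection import SequentialFeatureSelector as SFS
    from sklearn.datasets import load_wine
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    X, y = load_wine(return_X_y=True)
    X = StandardScaler().fit_transform(X)      # scaling helps the linear model converge
    clf = LogisticRegression(max_iter=5000)

    # Forward selection with mlxtend: start from an empty set and greedily add
    # the feature that improves cross-validated accuracy the most.
    sfs = SFS(clf, k_features=5, forward=True, floating=False,
              scoring='accuracy', cv=5)
    sfs.fit(X, y)
    print(sfs.k_feature_idx_)                  # indices of the selected features

    # RFE with scikit-learn: start from all features and recursively drop
    # the least important ones according to the fitted model's coefficients.
    rfe = RFE(clf, n_features_to_select=5)
    rfe.fit(X, y)
    print(rfe.ranking_)                        # rank 1 marks the retained features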

Embedded methods

This group includes algorithms that train the model and select features at the same time. This is usually done with an L1 regularizer (sparsity regularizer) or with a constraint that suppresses some of the features.

  • SMLR (Sparse Multinomial Logistic Regression): this algorithm implements l1 regularization via ARD (Automatic Relevance Determination) on top of classical multinomial logistic regression. The regularization determines the importance of each feature and zeroes out those that are useless for prediction.
  • ARD zeroes out the weight of some features, thereby helping to identify the relevant dimensions.

Other examples of regularization algorithms: Lasso (implements l1 regularization), ridge regression (implements l2 regularization), and Elastic Net (implements both l1 and l2 regularization). If you depict these methods graphically, you can see that Lasso regression constrains the coefficients to a square-shaped region, ridge regression outlines a circle, and Elastic Net occupies an intermediate position.

Penalty regions for l1 (Lasso), l2 (ridge), and Elastic Net regularization. Source: https://scikit-learn.org/stable/auto_examples/linear_model/plot_sgd_penalties.html

A comprehensive description of these algorithms is provided here.
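As a minimal sketch of L1-based selection with scikit-learn (the dataset and the regularization strength C are arbitrary choices; in practice C is tuned by cross-validation):

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X = StandardScaler().fit_transform(X)      # L1 penalties are scale-sensitive

    # The L1 penalty drives the coefficients of uninformative features to exactly zero,
    # so feature selection happens as a side effect of training.
    l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
    selector = SelectFromModel(l1_model).fit(X, y)

    print(np.where(selector.get_support())[0])  # indices of the surviving features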

Filters

With this approach, we evaluate the importance of features only on the basis of their inherent characteristics, without involving learning algorithms. These methods are faster and less computationally expensive than wrapper methods. If there is not enough data to model the statistical correlation between features, then filters can give worse results than wrappers. Unlike wrappers, such methods are less prone to overfitting. They are widely used to work with high-dimensional data when wrapping methods are too computationally intensive.

Supervised methods

Relief

  • This method randomly selects samples from the dataset and updates the significance of each feature based on the difference between the selected instance and its two nearest neighbors, one from the same class and one from the opposite class. If a feature takes different values for the two nearest neighbors of the same class, its importance decreases; if, on the contrary, the feature takes different values for objects of different classes, its importance increases.
  • The weight of a feature decreases if its value differs for the closest objects of the same class more than for the closest objects from different classes; otherwise, the weight increases.
    The extended ReliefF algorithm uses feature weighting and searches over several nearest neighbors.
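A minimal sketch, assuming the skrebate (scikit-rebate) package is installed; the dataset and the number of neighbors are arbitrary choices for illustration.

    from skrebate import ReliefF
    from sklearn.datasets import load_wine

    # ReliefF expects plain numpy arrays for both features and labels.
    X, y = load_wine(return_X_y=True)

    fs = ReliefF(n_features_to_select=5, n_neighbors=20)
    fs.fit(X, y)

    # Higher weights correspond to more relevant features.
    print(fs.feature_importances_)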

Fisher score

  • Typically used in binary classification problems. The Fisher ratio (FiR) is defined as the distance between the mean values of a feature for each class, divided by their variances.
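A minimal numpy sketch of this ratio in its common binary form (the exact formula shown in the original figure may differ slightly; the dataset is an arbitrary binary example):

    import numpy as np
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True)   # 30 numeric features, 2 classes

    # Fisher ratio per feature: squared distance between the class means
    # divided by the sum of the within-class variances.
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    var0, var1 = X[y == 0].var(axis=0), X[y == 1].var(axis=0)
    fisher = (mu0 - mu1) ** 2 / (var0 + var1)

    # Features with the largest scores separate the two classes best.
    print(np.argsort(fisher)[::-1][:5])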

Chi-squared score

  • Checks whether there is a significant difference between the observed and expected frequencies of two categorical variables. It thereby tests the null hypothesis that there is no relationship between the two variables.
Chi-square test of independence.
  • To correctly apply the chi-square test to check the relationship between various features in the dataset and the target variable, the following conditions must be met: the variables must be categorical and independent, and the expected frequencies must be greater than 5. The latter condition ensures that the CDF (cumulative distribution function) of the test statistic can be approximated by the chi-square distribution. Read more here.
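A minimal sketch using the chi2 scoring function from scikit-learn (the dataset and k are arbitrary; chi2 requires non-negative feature values):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, chi2

    X, y = load_iris(return_X_y=True)

    # Keep the two features with the highest chi-square statistic w.r.t. the target.
    selector = SelectKBest(chi2, k=2).fit(X, y)

    print(selector.scores_)                    # chi-square statistic per feature
    print(selector.get_support(indices=True))  # indices of the selected features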

CFS

CFS (Correlation-based Feature Selection): the rationale for this method can be stated as follows:

Features are relevant if their values change systematically depending on membership in a particular category.

Thus, a good subset contains features that are highly correlated with the target variable while not being correlated with each other. The merit of a subset of k features is calculated as follows:

Merit_{S_k} = k · r̄_{cf} / √(k + k(k − 1) · r̄_{ff})

Here r̄_{cf} is the mean of all correlations between a feature and the class, r̄_{ff} is the mean of all pairwise correlations between features, and k is the number of features in the subset. The CFS criterion is then defined as follows:

CFS = max_{S_k} [ (r_{cf1} + r_{cf2} + … + r_{cfk}) / √(k + 2(r_{f1f2} + … + r_{fifj} + … + r_{fkf1})) ]
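A minimal numpy sketch of the merit computation for one fixed feature subset (the search over subsets that CFS performs on top of this score is omitted; the dataset and the subset are arbitrary):

    import numpy as np
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True)

    def cfs_merit(X, y, subset):
        """Merit of a feature subset: k * r_cf / sqrt(k + k*(k - 1) * r_ff)."""
        k = len(subset)
        # Mean absolute feature-class correlation.
        r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
        # Mean absolute feature-feature correlation (zero for a single-feature subset).
        pairs = [(i, j) for i in subset for j in subset if i < j]
        r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                        for i, j in pairs]) if pairs else 0.0
        return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

    print(cfs_merit(X, y, [0, 1, 2]))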

FCBF

  • FCBF (Fast Correlation-Based Filter): this method is faster and more efficient than ReliefF and CFS, and is therefore often used for high-dimensional input data. It is a typical relevance–redundancy approach: first, Symmetric Uncertainty is computed for every feature (twice the mutual information I(X, Y) between feature X and target Y, divided by the sum of their entropies), then the features are sorted by this criterion, and finally the redundant ones are removed.
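A minimal sketch of Symmetric Uncertainty for discrete variables (the subsequent ranking and redundancy-removal steps of FCBF itself are omitted; the toy arrays are arbitrary):

    import numpy as np
    from scipy.stats import entropy
    from sklearn.metrics import mutual_info_score

    def symmetric_uncertainty(x, y):
        # SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)) for discrete variables.
        mi = mutual_info_score(x, y)
        h_x = entropy(np.bincount(x))   # entropy() normalizes the counts itself
        h_y = entropy(np.bincount(y))
        return 2.0 * mi / (h_x + h_y)

    # Toy example: a three-valued feature and a binary target.
    x = np.array([0, 0, 1, 1, 2, 2])
    y = np.array([0, 0, 1, 1, 1, 1])
    print(symmetric_uncertainty(x, y))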

Unsupervised methods

Variance

  • It has been shown that estimating the variance of a feature can be an effective way to select features. As a rule, features with almost zero variance are not significant and can be removed.
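A minimal sketch with VarianceThreshold from scikit-learn (the dataset and the threshold value are arbitrary and dataset-dependent):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import VarianceThreshold

    X, _ = load_iris(return_X_y=True)

    # Drop every feature whose variance falls below the threshold.
    selector = VarianceThreshold(threshold=0.2)
    X_reduced = selector.fit_transform(X)

    print(selector.variances_)   # variance of each original feature
    print(X_reduced.shape)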

Average absolute difference

  • We calculate the average absolute difference between the values of the feature and its mean value (implementation).
  • Features with higher values of this measure tend to have higher predictive power.
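A minimal numpy sketch on an arbitrary dataset:

    import numpy as np
    from sklearn.datasets import load_iris

    X, _ = load_iris(return_X_y=True)

    # Mean absolute difference of each feature from its own mean value.
    mad = np.mean(np.abs(X - X.mean(axis=0)), axis=0)

    # Rank the features from the highest to the lowest MAD.
    print(np.argsort(mad)[::-1])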

Dispersion ratio

  • The arithmetic mean of a feature divided by its geometric mean. A higher ratio corresponds to a more relevant feature (implementation).
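A minimal numpy/scipy sketch on an arbitrary dataset (the geometric mean requires strictly positive feature values, which holds for the iris measurements):

    import numpy as np
    from scipy.stats import gmean
    from sklearn.datasets import load_iris

    X, _ = load_iris(return_X_y=True)

    # Arithmetic mean divided by geometric mean, per feature.
    ratio = X.mean(axis=0) / gmean(X, axis=0)

    # A ratio close to 1 means the feature barely varies; larger ratios are more relevant.
    print(np.argsort(ratio)[::-1])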

Laplacian Score

  • It is based on the observation that data points from the same class are often located close to each other, so the importance of a feature can be assessed by its ability to reflect this proximity. The method consists of embedding the data in a nearest-neighbor graph using an arbitrary distance measure and then computing a weight matrix. The Laplacian criterion is then calculated for each feature, with the property that the smallest values correspond to the most important dimensions. In practice, however, a clustering algorithm (such as k-means) is usually applied afterwards to select the most effective group of features.
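A minimal sketch of the score itself, following the usual formulation but with simple 0/1 edge weights in place of a heat kernel (the dataset and the number of neighbors are arbitrary):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.neighbors import kneighbors_graph

    X, _ = load_iris(return_X_y=True)

    # Build a k-nearest-neighbor graph and symmetrize it to obtain the weight matrix W.
    W = kneighbors_graph(X, n_neighbors=5, mode='connectivity').toarray()
    W = np.maximum(W, W.T)

    d = W.sum(axis=1)        # node degrees
    D = np.diag(d)           # degree matrix
    L = D - W                # graph Laplacian

    scores = []
    for r in range(X.shape[1]):
        f = X[:, r]
        f_tilde = f - (f @ d) / d.sum()   # center the feature w.r.t. the degrees
        scores.append((f_tilde @ L @ f_tilde) / (f_tilde @ D @ f_tilde))

    # Smaller Laplacian scores indicate more important features.
    print(np.argsort(scores))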

Laplacian Score combined with distance-based entropy

  • The algorithm builds on the Laplacian Score but replaces k-means clustering with distance-based entropy, which demonstrates higher stability on high-dimensional datasets (implementation).

MCFS

  • MCFS (Multi-Cluster Feature Selection): spectral analysis is performed to measure the correlation between different features. The eigenvectors of the graph Laplacian are used to cluster the data and to score features. Their calculation is described in this work.

LFSBSS

  • The LFSBSS (localized feature selection), weighted k-means, SPEC, and Apriori algorithms are discussed in this paper and implemented in this package.

Hybrid methods

Another way to implement feature selection is a hybrid of filters and wrappers combined in a two-phase process: first, features are filtered by statistical properties, and then wrapper methods are applied.
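A minimal sketch of such a two-phase scheme with scikit-learn (the dataset, the filter criterion, the wrapper, and the number of features kept at each phase are all arbitrary choices):

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import RFE, SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X = StandardScaler().fit_transform(X)

    # Phase 1 (filter): keep the 15 features with the best ANOVA F-score.
    filt = SelectKBest(f_classif, k=15)
    X_filtered = filt.fit_transform(X, y)

    # Phase 2 (wrapper): recursively eliminate features with a learning algorithm.
    rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5)
    rfe.fit(X_filtered, y)

    # Map back to the indices of the originally selected features.
    print(filt.get_support(indices=True)[rfe.support_])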

Other sources

A great deal of literature considers the problem of feature selection, and here we have only scratched the surface of the full body of research.

A complete list of other feature selection algorithms that I have not mentioned has been implemented in the scikit-feature package.

Relevant features can also be determined using PLS (partial least squares), as described in this article, or with linear dimensionality reduction methods, as shown here.

Written by

Bioinformatician at Oncobox Inc. (@oncobox). Research Associate at Moscow Institute of Physics and Technology (@mipt_eng).
