Genetic algorithms and their use cases in Machine Learning

Contents

This article was published as part of the Data Science Blogathon

Genetic algorithms are search algorithms inspired by Darwin's theory of evolution in nature.

  • By simulating the process of natural selection, reproduction and mutation, genetic algorithms can produce high-quality solutions to various problems, including search and optimization.

  • Through the effective use of the Theory of Evolution, genetic algorithms can overcome the problems faced by traditional algorithms.

According to Darwin's theory of evolution, an evolution maintains a population of individuals that vary from each other (variation). Those who are better adapted to their environment have a better chance of survival, reproduce and pass on their traits to the next generation (survival of the fittest).

‌How do genetic algorithms work?

‌Before entering into the operation of a genetic algorithm, Let's dive into the basic terminology of genetic algorithms.

Chromosome / individual

A chromosome is a collection of genes. For instance, a chromosome can be represented as a binary string where each bit is a gene.

644781623127595061-8406334

Image source: practical genetic algorithm with Python, Eyal Wirsansky

population

Since an individual is represented as a chromosome, a population is a collection of such chromosomes.

779521623127686493-5364456

Image source: practical genetic algorithm with Python, Eyal Wirsansky

Fitness function

In each iteration, individuals are assessed based on their fitness scores that are calculated using the fitness function.

People who achieve a better fitness score represent better solutions and are more likely to be chosen to cross over and pass on to the next generation.

For instance, whether genetic algorithms are used for feature selection, then the precision of the model with those selected characteristics would be the adequacy function if it were a classification problem.

selection

After calculating the fitness of each individual in the population, a selection process is used to determine which of the individuals in the population will be able to reproduce and create the offspring that will form the next generation.

Types of selection methods available,

  • Roulette wheel selection
  • Tournament selection
  • Range-based selection

Transversal

Generally, two individuals of the current generation are chosen and their genes are exchanged between two individuals to create a new individual that represents the offspring. This process is also called mating or crossing..

Types of cross methods available,

  • Crossing a point
  • Two-point crossover
  • Uniform crossover
505071623128127345-5633004
Crossing a pointImage source: practical genetic algorithms with Python, Eyal Wirsansky

Mutation

Mutation is a random change on a chromosome to introduce new patterns on a chromosome. For instance, flip a bit in a binary string.

Types of mutation methods available,

  • Flip bit mutation
  • Gaussian mutation
  • Exchange mutation
550041623128158975-5413371

General workflow of a simple genetic algorithm

244111623128199688-4801325

Image source: practical genetic algorithms with Python, Eyal Wirsansky

How are genetic algorithms different from traditional algorithms?

  • ‌A search space is a set of all possible solutions to the problem. Traditional algorithms keep only one set in a search space, whereas genetic algorithms use multiple sets in a search space (feature selection using RFE versus genetic algorithms).

  • Traditional algorithms require more information to perform a search, whereas genetic algorithms only require an objective function to calculate the fitness of an individual.

  • Traditional algorithms cannot work in parallel, while genetic algorithms can work in parallel (the calculation of the fitness of individuals is independent).

  • A big difference in genetic algorithms is that instead of directly operating on candidate solutions, genetic algorithms operate on their representations (or encoding), often called chromosomes.

  • ‌Traditional algorithms can only produce a solution in the end, while in genetic algorithms multiple optimal solutions of different generations can be obtained.

  • ‌Traditional algorithms are no more likely to produce global optimal solutions, genetic algorithms are not guaranteed to produce global optimal solutions as well, but they are more likely to produce global optimal solutions due to genetic operators such as crossover and mutation.

  • Traditional algorithms are deterministic in nature, while genetic algorithms are probabilistic and stochastic in nature.

  • Real world problems are multimodal (contain multiple optimal solutions locally), traditional algorithms do not handle these problems well while genetic algorithms, con la configuración de parameters correcta, can handle these problems very well due to the large solution space.

Advantages of genetic algorithms

  • Parallelism

  • Global optimization

  • A larger set of space for solutions

  • Requires less information

  • Provides multiple optimal solutions

  • Probabilist in nature

  • Genetic representations using chromosomes

Disadvantages of genetic algorithms

‌ Use cases of genetic algorithms in machine learning

TPOT(Tree-based pipeline optimization) is an Auto-ML framework that uses genetic algorithms to optimize machine learning pipelines using the genetic algorithm framework called DEAP (Distributed evolutionary algorithms in Python).

Selection of characteristics using genetic algorithms

Using genetic algorithms, let's perform feature selection on a wine dataset available in sklearn.

import random
import pandas as pd
from sklearn.base import clone
from deap.algorithms import eaSimple
from deap import base, creator, tools
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_wine()
features = pd.DataFrame(
    data.data, columns=data.feature_names
)
target = pd.Series(data.target)

# split the data into train and test
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, stratify=target
)

model = RandomForestClassifier()

Divida los datos en características y destino y luego divida las características y el destino para training y prueba.

def compute_fitness_score(individual):
    """
    Select the features from the individual, train
    and compute the accuracy_score.
    
    Example:
    individual = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1]
    The 1 represents the presence of features and
    0 represents the absence of features
    
    """
    column_support = pd.Series(individual).astype(bool)
    global x_train, y_train, x_test, y_test, model
    
    x_train_ = x_train[x_train.columns[column_support]]
    x_test_ = x_test[x_test.columns[column_support]]

    model.fit(x_train_, y_train)
    y_pred = model.predict(x_test_)
    score = accuracy_score(y_test, y_pred)
    
    return score,

Define the fitness score. compute_fitness_score takes an individual as input, for instance, let's consider the following individual [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1], on this list, 1 represents the presence of that particular characteristic and 0 represents the absence of that characteristic. The columns are selected based on the individual and then the model is fitted and the precision score is calculated. This accuracy score is the physical fitness score of an individual.

n_genes = x_train.shape[1]
n_generations = 10
n_population = 10
crossover_probability = 0.6
mutation_probability = 0.2


def setup_toolbox():
    # container for individuals
    creator.create('FitnessMax', base.Fitness, weights=(1.0,))
    creator.create('Individual', list, fitness=creator.FitnessMax)

    toolbox = base.Toolbox()
    toolbox.register(
        'individual_generator_function',
        random.randint, 0, 1
    )
    # method to populate individual
    toolbox.register(
        'individual_generator',
        tools.initRepeat,
        creator.Individual,
        toolbox.individual_generator_function,
        n_genes
    )
    # method to create population
    toolbox.register(
        'population_generator',
        tools.initRepeat,
        list,
        toolbox.individual_generator
    )
    # fitness calculation
    toolbox.register(
        'evaluate', compute_fitness_score
    )
    # selection
    toolbox.register(
        'select', tools.selTournament, tournsize=3
    )
    # crossover
    toolbox.register('mate', tools.cxOnePoint)
    # mutation
    toolbox.register(
        'mutate',
        tools.mutFlipBit,
        indpb=mutation_probability
    )
    return toolbox

n_genes represent the number of genes in an individual that is equal to the number of characteristics, n_generations represents the number of generations that is equal to 10 and also n_population which represents the population number. The crossover probability is set to 0,6 and the mutation probability is set to 0,2.

Toolbox is a class in a deap framework that provides us with the necessary utilities to define individuals, populations and much more.

The steps performed in the setup_toolbox method are,

  • First, we define an individual who has an aptitude score associated with him
  • Then, we define a function that generates genes of an individual, in our case 0 O 1 in an individual
  • We define a function that generates individuals from a population, for instance, [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1] is an individual in a population.
  • Then, we define a function that generates a population, for instance, [[0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1],

    [0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1]]is a population of size 2.

  • Then, we define a function for evaluation, in our case we use the compute_fitness_score function.

  • We choose the tournament selection method for the selection process, a crossover point for the crossover process and a shift bit mutation for the mutation process.

# setup deap toolbox
toolbox = setup_toolbox()

# create a population
population = toolbox.population_generator(n_population)

# A simple evolutionary algorithm
final_population, logbook = eaSimple(
    population, 
    toolbox, 
    crossover_probability, 
    mutation_probability, 
    n_generations
)

We first generate a population of individuals of size 10. Then we call the eaSimple method in depth, that performs a simple genetic algorithm.

The arguments of the eaSimple function,

  • The first argument is the population
  • toolbox is the instance of the Toolbox class from deap
  • The probability of crossing denotes the probability that a pair of individuals will cross
  • Mutation probability denotes the probability that an individual will mutate
  • n_generations denotes the number of generations to evolve.

The eaSimple algorithm works as follows,

  1. The fitness of individuals in a population is calculated
  2. Select the fittest individual using the tournament selection method.
  3. Crossing of individuals using the one-point crossing method
  4. Mutate individuals using the flip bit mutation
  5. Recalculate the fitness of individuals.
  6. Follow the steps 2 a 5 until reaching the desired number of generations.

Thus, can perform feature selection using a genetic algorithm.

References

[1] Practical Genetic Algorithms with Python, Eyal Wirsansky

Thanks!

The media shown in this article is not the property of DataPeaker and is used at the author's discretion.

Subscribe to our Newsletter

We will not send you SPAM mail. We hate it as much as you.