Bootstrap Sampling in Machine Learning


Introduction

Have you ever struggled to improve your rank in a machine learning hackathon on DataHack or Kaggle? You've tried all your favorite tricks and techniques, but your score refuses to budge. I've been there, and it is quite a frustrating experience!

This is especially common during your first days in the field. We tend to rely on the familiar techniques we have learned, like linear regression, logistic regression, etc. (depending on the problem statement).

And then along comes Bootstrap Sampling. It's a powerful concept that propelled my rank into the upper echelons of these hackathon leaderboards. And it was quite a learning experience!


Bootstrap sampling is a technique that I feel every data scientist, aspiring or established, must learn.

So, in this article, we will learn everything you need to know about bootstrap sampling: what it is, why it is necessary, how it works, and where it fits into the machine learning picture. We will also implement bootstrap sampling in Python.

What is Bootstrap sampling?

Here is a formal definition of Bootstrap Sampling:

In statistics, Bootstrap Sampling is a method that involves repeatedly drawing sample data with replacement from a data source to estimate a population parameter.

Wait, that's too complex. Let's break it down and understand the key terms:

  • Sampling: In statistics, sampling is the process of selecting a subset of items from a large collection of items (the population) to estimate a certain characteristic of the entire population.
  • Sampling with replacement: It means that a data point is returned to the pool after it is drawn, so it can appear more than once in a sample and can reappear in future samples.
  • Parameter estimation: It is a method of estimating parameters of the population using samples. A parameter is a measurable characteristic associated with a population. For instance, the average height of residents in a city, the average red blood cell count, etc.

With that knowledge, go ahead and re-read the definition above. It will make a lot more sense now!
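The difference between sampling with and without replacement, and the idea of estimating a parameter from a sample, can be seen in a short sketch. The six height values and the seed below are invented purely for illustration, and NumPy's random generator is just one convenient way to draw the samples:

```python
import numpy as np

# Hypothetical data: heights (in cm) of six people, invented for illustration.
heights = np.array([150, 155, 160, 165, 170, 175])

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

# With replacement: the same value may be drawn more than once.
with_replacement = rng.choice(heights, size=6, replace=True)

# Without replacement: each value can be drawn at most once.
without_replacement = rng.choice(heights, size=6, replace=False)

# Parameter estimation: the sample mean serves as an estimate of the population mean.
estimate = with_replacement.mean()

print(with_replacement, without_replacement, estimate)
```

Drawing without replacement at full size simply reshuffles the data, while drawing with replacement usually repeats some values and leaves others out.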

Why do we need Bootstrap sampling?

This is a fundamental question that I have seen machine learning enthusiasts grapple with. What's the point of Bootstrap Sampling? Where can you use it? Let me take an example to explain this.

Let's say we want to find the mean height of all students in a school (which has a total population of 1,000). So, how can we perform this task?

One method is to measure the height of all students and then calculate the average height. I have illustrated this process below:

[Figure: measuring the height of every student to calculate the mean]

However, this would be a tedious task. Just think about it: we would have to individually measure the heights of 1,000 students and then calculate the mean height. It would take days! We need a smarter approach here.

This is where Bootstrap Sampling comes in.

Instead of measuring the heights of all the students, we can draw a random sample of 5 students and measure their heights. We would repeat this process 20 times and then we would average the height data collected from 100 students (5 x 20). This average height would be an estimate of the average height of all students in the school.

Pretty straightforward, right? This is the basic idea of Bootstrap Sampling.


Therefore, when we have to estimate a parameter of a large population, we can turn to Bootstrap Sampling.
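The school example above can be sketched in a few lines. This is only an illustration under stated assumptions: the 1,000 heights are simulated from a normal distribution with a mean of 160 cm and a standard deviation of 10 cm, and the names `population`, `sample_means` and `estimated_mean` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, only for reproducibility

# Hypothetical school: 1,000 simulated heights in cm (mean 160, std 10 are assumptions).
population = rng.normal(loc=160.0, scale=10.0, size=1000)

# Draw 20 random samples of 5 students each and record each sample's mean.
sample_means = [
    rng.choice(population, size=5, replace=True).mean()
    for _ in range(20)
]

# Average over the 20 sample means (5 x 20 = 100 measurements in total).
estimated_mean = float(np.mean(sample_means))
true_mean = float(population.mean())

print(f"estimated: {estimated_mean:.2f}, true: {true_mean:.2f}")
```

The estimate comes from only 100 measurements, so it will not match the true mean exactly, but it should land close to it.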

Bootstrap Sampling in Machine Learning

Bootstrap sampling is used in a machine learning ensemble algorithm called bootstrap aggregation (also called bagging). It helps prevent overfitting and improves the stability of machine learning algorithms.

In bagging, a certain number of subsets of the same size are drawn from a data set with replacement. Then, a machine learning algorithm is applied to each of these subsets and the outputs are aggregated as illustrated below:

[Figure: bagging]
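To make the bagging procedure concrete, here is a minimal sketch using only NumPy. Several choices are assumptions made for illustration: the data is synthetic, the base learner is a straight-line fit via `np.polyfit`, each subset is as large as the data set itself, and the outputs are aggregated by simple averaging.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

# Hypothetical noisy data scattered around the line y = 2x + 1.
x = np.linspace(0.0, 10.0, 200)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

n_models = 25  # number of bootstrap subsets (an assumed value)
predictions = []
for _ in range(n_models):
    # Draw a subset of row indices with replacement (same size as the data).
    idx = rng.integers(0, x.size, size=x.size)
    # Fit the base learner (a straight line) on this subset only.
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    # Store this model's predictions for every original x value.
    predictions.append(slope * x + intercept)

# Aggregate the outputs: average the predictions of all models.
bagged_prediction = np.mean(predictions, axis=0)
```

For classification, the aggregation step would typically be a majority vote rather than an average.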

You can read more about ensemble learning here:

Implement Bootstrap Sampling in Python

Time to put our learning to the test and implement the Bootstrap Sampling concept in Python.

In this section, we will try to estimate the population mean with the help of bootstrap sampling. Let's import the necessary libraries:
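A minimal sketch of the imports, assuming NumPy alone is enough for the steps that follow:

```python
import numpy as np  # arrays, random number generation, and np.mean
```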

Then, we will create a Gaussian distribution (population) of 10,000 elements with a population mean of 500:
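One way this step might look is sketched below. The standard deviation of 1.0, the seed, and the names `x` and `population_mean` are assumptions, since the text only specifies the size and the mean. The exact value printed depends on the random draw, so it will not match the quoted output digit for digit.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # seed is an assumption, only for reproducibility

# Gaussian population of 10,000 elements centered on 500.
# scale=1.0 (the standard deviation) is an assumption; the text does not state it.
x = rng.normal(loc=500.0, scale=1.0, size=10_000)

# The population mean should be very close to 500.
population_mean = np.mean(x)
print(population_mean)
```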

Output: 500.00889503613934

Now, we will draw 40 samples of size 5 from the distribution (population) and calculate the mean of each sample:

Let's check the average of the mean values of the 40 samples:
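A sketch of the sampling loop, under the same assumptions as before (a standard deviation of 1.0 and an arbitrary seed). The population is rebuilt here so that the block runs on its own, and the list name `sample_mean` is the one used in the next step:

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # seed is an assumption, only for reproducibility

# Population: 10,000 Gaussian values with mean 500 (scale=1.0 is assumed).
x = rng.normal(loc=500.0, scale=1.0, size=10_000)

sample_mean = []
for _ in range(40):
    # Draw 5 elements at random from the population, with replacement.
    sample = rng.choice(x, size=5, replace=True)
    # Record the mean of this sample.
    sample_mean.append(np.mean(sample))
```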

np.mean(sample_mean)

Output: 500.024133172629

It turns out to be quite close to the population mean! This is why Bootstrap Sampling is such a useful technique in statistics and machine learning.

Summarizing what we have learned

In this article, we learned about the usefulness of Bootstrap Sampling in statistics and machine learning. We also implemented it in Python and verified its effectiveness.

Here are some of the key benefits of bootstrapping:

  • The parameter estimated by bootstrap sampling is comparable to the actual population parameter
  • Since bootstrapping needs only a few samples, the computational requirement is much lower
  • In Random Forest, a bootstrap sample size of even 20% gives pretty good performance, as shown below:

[Figure: Random Forest performance versus bootstrap sample size]

Model performance reaches its peak when the data provided is less than a 0.2 fraction of the original data set.
