Reservoir Sampling


This article was published as part of the Data Science Blogathon.

Introduction

Big Data refers to a combination of structured and unstructured data that can be measured in petabytes or exabytes. It is usually characterized by the 3Vs: the volume of data, the variety of data types, and the velocity at which the data is processed.

These three characteristics make big data difficult to manage. Handling it is therefore expensive, requiring heavy investment in server storage, sophisticated analysis machines, and data mining methodologies. Many organizations find this cumbersome, both technically and financially, and are looking at how similar results can be achieved with much less sophistication. One approach is to turn big data into small data consisting of usable chunks. The figure below [1] shows a comparison.

[Figure: comparison of big data and small data, from BBVA OpenMind [1]]

Let's explore a simple statistical technique that can be used to create a usable piece of data from big data. The sample, which is basically a subset of the population, should be selected so that it adequately represents the population. This can be verified using statistical tests.

Introduction to reservoir sampling

The key idea behind reservoir sampling is to create a 'reservoir' from a large ocean of data. Let 'N' be the size of the population and 'n' the size of the sample. Each element of the population should have the same probability of being present in the sample, namely n/N; for example, a reservoir of n = 10,000 rows drawn from a population of N = 1,000,000 rows gives every row a 1% chance of being selected. With this key idea, we have to create a subsample. It should be noted that when we create a sample, the distributions must be identical not only across rows but also across columns.

As usual, we focus only on the rows, but it is also important to maintain the distribution of the columns. The columns are the features from which the training algorithm learns. Therefore, we also have to run statistical tests on each feature to make sure its distribution in the sample is identical to that in the population.

The algorithm is as follows: initialize the reservoir with the first 'n' elements of the population of size 'N'. Then read each subsequent row of the data set (i > n). At each iteration, draw a uniform random index j in [1, i]; if j <= n, the i-th element replaces the j-th element of the reservoir. In this way, each new element enters the reservoir with probability n/i, which gradually decreases as i grows.


import random

def reservoir_sample(S, n):
    # Fill the reservoir with the first n elements of the stream S.
    R = S[:n]
    # For each subsequent element (1-based index i), draw j uniformly
    # from [1, i] and, if j <= n, replace the j-th reservoir element.
    # The replacement probability n / i decreases gradually.
    for i in range(n + 1, len(S) + 1):
        j = random.randint(1, i)
        if j <= n:
            R[j - 1] = S[i - 1]
    return R
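
As a quick sanity check (a minimal sketch using the reservoir_sample function above; the stream size, reservoir size, and trial count are illustrative and not from the article), we can verify empirically that early and late elements of the stream end up in the reservoir with roughly the same probability n / N:

from collections import Counter

N, n, trials = 10_000, 100, 300
counts = Counter()
for _ in range(trials):
    counts.update(reservoir_sample(list(range(N)), n))

# Average inclusion frequency of the first and last 100 stream elements;
# both should be close to n / N = 0.01.
early = sum(counts[i] for i in range(100)) / (100 * trials)
late = sum(counts[i] for i in range(N - 100, N)) / (100 * trials)
print(early, late)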

Statistical tests

As mentioned before, we must make sure that every column (feature) of the reservoir is distributed identically to the corresponding column of the population. We will use the Kolmogorov-Smirnov test for continuous features and Pearson's chi-square test for categorical features.

The Kolmogorov-Smirnov test verifies whether the cumulative distribution functions (CDFs) of the population and the sample are the same. We compare the CDF of the population, F_N(x), with that of the sample, F_n(x), using the statistic:

D_n = sup_x | F_N(x) - F_n(x) |

If the distributions are identical, D_n -> 0 as n -> N. This test should be performed for every continuous feature of the data set.
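
As a rough sketch of how this test might be run in Python (the DataFrame and column names below are hypothetical, and SciPy's two-sample version of the test is used to compare the reservoir against the population):

import numpy as np
import pandas as pd
from scipy import stats

# Illustrative synthetic data: a continuous "age" feature; in practice
# df_reservoir would come from the reservoir sampling step above.
rng = np.random.default_rng(0)
df_population = pd.DataFrame({"age": rng.normal(40, 12, size=100_000)})
df_reservoir = df_population.sample(n=1_000, random_state=0)

# Two-sample Kolmogorov-Smirnov test on the continuous feature.
ks_stat, p_value = stats.ks_2samp(df_population["age"], df_reservoir["age"])

# A large p-value (e.g. > 0.05) means we cannot reject the hypothesis
# that the reservoir follows the same distribution as the population.
print(f"KS statistic = {ks_stat:.4f}, p-value = {p_value:.4f}")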

For categorical features, we can perform Pearson's chi-square test. Let O_i be the observed count of category 'i' in the sample and let n be the number of sampled rows. Let E_i be the expected count of category 'i'. Then E_i = n * p_i, where p_i is the probability of belonging to category 'i' in the population. The chi-square value is then given by the following relation:

chi-square = Σ_i (O_i - E_i)^2 / E_i

If chi-square = 0, the observed and expected counts are identical. If the p-value of the test is greater than the significance level, we fail to reject the hypothesis that the sample follows the population distribution; that is, the feature is distributed in the reservoir as it is in the population.
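
A similar sketch for a categorical feature (again with hypothetical data and names; the expected counts are built as E_i = n * p_i from the population proportions, and SciPy's chisquare function computes the statistic above):

import numpy as np
import pandas as pd
from scipy import stats

# Illustrative synthetic data: a categorical "segment" feature; in practice
# df_reservoir would come from the reservoir sampling step above.
rng = np.random.default_rng(1)
df_population = pd.DataFrame(
    {"segment": rng.choice(["A", "B", "C"], size=100_000, p=[0.5, 0.3, 0.2])}
)
df_reservoir = df_population.sample(n=1_000, random_state=1)

# Observed counts O_i in the sample and expected counts E_i = n * p_i.
pop_props = df_population["segment"].value_counts(normalize=True)
observed = df_reservoir["segment"].value_counts().reindex(pop_props.index, fill_value=0)
expected = pop_props * len(df_reservoir)

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)

# A large p-value (e.g. > 0.05) means the categorical feature in the
# reservoir is consistent with the population distribution.
print(f"chi-square = {chi2:.4f}, p-value = {p_value:.4f}")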

Final notes

Reservoir sampling can be used to create a useful piece of data from big data, as long as both tests, Kolmogorov-Smirnov and Pearson's chi-square, pass. The recent buzz is, of course, all about big data. Centralized models such as big data architectures come with great difficulties. To decentralize things and thus make the work modular, we have to create small chunks of useful data and then extract meaningful information from them. I think more effort should be put in this direction, instead of investing in architecture to support big data.

References

1. https://www.bbvaopenmind.com/en/technology/digital-world/small-data-vs-big-data-back-to-the-basics/
