This article was published as part of the Data Science Blogathon.
Introduction
Big Data refers to a combination of structured and unstructured data that can be measured in petabytes or exabytes. It is commonly characterized by the 3Vs: the volume of data, the variety of data types, and the velocity at which the data is processed.
These three characteristics make big data difficult to manage. Big data is therefore expensive, requiring investment in large amounts of server storage, sophisticated analysis machines, and data mining methodologies. Many organizations find this cumbersome, both technically and financially, and are therefore asking how similar results can be achieved with far less sophistication. Hence the effort to turn big data into small data, consisting of usable chunks of data. The following figure [1] shows a comparison.
Let's explore a simple statistical technique that can be used to create a usable piece of data from big data. The sample, which is essentially a subset of the population, should be selected so that it adequately represents the population. This can be verified using statistical tests.
Introduction to reservoir sampling
The key idea behind reservoir sampling is to create a 'reservoir' from a large ocean of data. Let 'N' be the size of the population and 'n' the size of the sample. Every element of the population has the same probability of being present in the sample, and that probability is (n/N). With this key idea, we have to create a subsample. Note that when we create a sample, the distributions must be identical not only across rows but also across columns.
As usual, we focus only on the rows, but it is also important to maintain the distribution of the columns. The columns are the features from which the algorithm learns during training. Therefore, we also have to run statistical tests on each feature to ensure its distribution is identical.
The algorithm is as follows: initialize the reservoir with the first 'n' elements of the population of size 'N'. Then read each subsequent row of the data set (i > n). At each iteration, draw a random index j uniformly from [1, i]; if j <= n, the j-th reservoir element is replaced by the current element. In this way, reservoir elements are replaced with a gradually decreasing probability (n/i).
for i = 1 to n:
    R[i] = S[i]
for i = n+1 to N:
    j = U ~ [1, i]        # uniform random integer in [1, i]
    if j <= n:
        R[j] = S[i]
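The pseudocode above can be sketched in Python as follows. This is a minimal, self-contained version; the function name `reservoir_sample` and the use of a seeded `random.Random` for reproducibility are my own choices, not from the original article.

```python
import random

def reservoir_sample(stream, n, seed=None):
    """Return a uniform random sample of size n from an iterable stream.

    Each element ends up in the reservoir with probability n/N,
    where N is the stream length (which need not be known in advance).
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= n:
            # Fill the reservoir with the first n elements.
            reservoir.append(item)
        else:
            # Replace a reservoir element with probability n/i.
            j = rng.randint(1, i)
            if j <= n:
                reservoir[j - 1] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), 10, seed=42)
print(sample)  # 10 elements drawn uniformly from the stream
```

Because the stream is consumed in a single pass and only `n` elements are kept in memory, this works even when the population is far too large to load at once.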
Statistical tests
As mentioned before, we must make sure that every column (feature) of the reservoir is distributed identically to the population. We will use the Kolmogorov-Smirnov test for continuous features and Pearson's chi-square test for categorical features.
The Kolmogorov-Smirnov test is used to verify whether the cumulative distribution functions (CDFs) of the population and the sample are the same. We compare the CDF of the population, F_N(x), with that of the sample, F_n(x). The test statistic is the largest absolute difference between the two:

D_n = sup_x | F_N(x) - F_n(x) |
As n -> N, D_n -> 0 if the distributions are identical. This test should be performed for every continuous feature of the data set.
For categorical features, we can perform Pearson's chi-square test. Let O_i be the observed count of category 'i' and N the number of samples. Let E_i be the expected count of category 'i'. Then E_i = N p_i, where p_i is the probability of belonging to category 'i'. The chi-square statistic is then given by the following relation:

chi-square = Σ_i (O_i - E_i)² / E_i
If chi-square = 0, the observed and expected counts are identical. If the p-value of the test is greater than the significance level, we fail to reject the null hypothesis; that is, the sample's category distribution is consistent with the population's.
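This check can be sketched with SciPy's `scipy.stats.chisquare`, which takes observed counts O_i and expected counts E_i. The three categories, their proportions, and the sample size below are hypothetical values chosen for illustration.

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical categorical feature with three categories.
categories = np.array(["A", "B", "C"])
p = np.array([0.5, 0.3, 0.2])   # population proportions p_i

rng = np.random.default_rng(0)
sample = rng.choice(categories, size=1_000, p=p)

observed = np.array([(sample == c).sum() for c in categories])  # O_i
expected = 1_000 * p                                            # E_i = N * p_i

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.3f}, p-value = {p_value:.3f}")
# A p-value above the significance level means the sample's category
# counts are consistent with the population proportions.
```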
Final notes
Reservoir sampling can be used to create a useful piece of data from big data, provided that the two tests, Kolmogorov-Smirnov and Pearson's chi-square, pass. The recent buzz is, of course, around big data. Centralized models such as big data architecture come with great difficulties. To decentralize things, and thus make the work modular, we have to create small pieces of useful data and then extract meaningful information from them. I think more effort should be made in this direction, instead of investing in architecture to support big data.
References
1. https://www.bbvaopenmind.com/en/technology/digital-world/small-data-vs-big-data-back-to-the-basics/