The choice of data processing and analysis techniques will decisively influence the outcome. Power and scalability should be addressed in the same way as the system's capabilities to collect outliers, detect fraudulent transactions, or perform security checks. The most difficult task, however, is reducing analytics latency when working with a complete big data set, where terabytes of data must be processed in seconds.
Requirements related to response time, the condition of the data to be analyzed, and the workload are the issues that will determine the best option among data processing and analysis techniques.
Batch processing: for batches of big data
Apache Hadoop is a distributed computing framework based on the Google MapReduce model for processing large amounts of data in parallel. The Hadoop Distributed File System (HDFS) is the underlying file system of a Hadoop cluster, and it works more efficiently with a small number of large files than with a larger number of smaller ones.
A job in the Hadoop world typically takes minutes to hours to complete. It could therefore be said that Hadoop is not the most suitable option when a company needs to perform real-time analysis, but rather in cases where it is feasible to settle for offline analytics.
Recently, Hadoop has evolved to adapt to new business needs. Today's companies demand:
- Minimized response time.
- Maximum precision in decision-making.
Hadoop has been renewed, improving its manageability thanks to a novelty known as streaming. One of the main objectives of Hadoop streaming is to decouple Hadoop from the MapReduce paradigm in order to accommodate other parallel computing models, such as MPI (Message Passing Interface) and Spark. With these new streaming-oriented data processing and analysis techniques, many of the limitations of the batch model are overcome. Although batch processing may be considered too rigid for certain functions (something that should not surprise us, given that its origins date back more than four decades), it is still the most suitable, in terms of cost for results, for operations such as:
- Calculating the market value of assets, which does not need to be done more than once a day.
- Monthly calculation of the cost of workers' phone bills.
- Generation of reports related to tax issues.
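The batch operations above follow the MapReduce pattern that Hadoop popularized. As a minimal sketch (not Hadoop itself, just an in-memory Python illustration of the same map and reduce phases; the function names are mine), a word count over a batch of records could look like this:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in every input line,
    as a Hadoop Streaming mapper would do over stdin."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Reducer: sum the counts for each word. Hadoop sorts pairs by key
    between the phases, which sorted() emulates here."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)
```

In a real Hadoop Streaming job, the two functions would live in separate mapper and reducer scripts reading from stdin; the framework handles the sorting and the distribution across the cluster.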
Stream processing: continuous data flows
This type of data processing and analysis technique focuses on implementing a data flow model in which data associated with time series (facts) flows continuously through a network of transformation entities that make up the system. It is known as streaming or stream processing.
Unlike real-time data processing and analysis techniques, stream processing imposes no mandatory time limits. As an example, a system that counts the words contained in each tweet for 99.9% of processed tweets is a valid stream processing system. Nor is there any obligation regarding the time period in which the output for each input must be generated. The only limitations are:
- Sufficient memory must be available to store queued entries.
- The long-term system throughput must be faster than, or at least equal to, the data input rate over the same period. Otherwise, the system's storage requirements would grow without limit.
This type of data processing and analysis technique is not intended to analyze a complete big data set, so in general it does not have that capacity, with few exceptions.
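The tweet word-count example mentioned above can be sketched in a few lines of Python (an illustrative toy, not a production stream processor): counts are updated incrementally as each item arrives, and a result snapshot is available at any moment, with no deadline attached to any individual input.

```python
from collections import Counter

def stream_word_count(tweet_stream):
    """Consume tweets as they flow in and keep a running word count.
    Yields a snapshot of the counts after each tweet, so output is
    produced continuously rather than after the whole set is read."""
    counts = Counter()
    for tweet in tweet_stream:
        counts.update(word.lower() for word in tweet.split())
        yield dict(counts)
```

Because the generator processes one item at a time, memory is bounded by the size of the running counts, not by the length of the stream, which is exactly the constraint the two limitations above describe.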
Real-time data processing and analysis techniques
When data is processed in real time, the level of online analytical processing achieved is extremely high, with margins of less than a second. This is exactly why real-time systems do not usually employ special mechanisms for atomicity and durability: they are only responsible for processing the input as soon as possible.
The question is what happens if an input is lost. When this occurs, the system ignores the loss and continues processing and analyzing without stopping. Depending on the environment, this may not be an obstacle, for example in e-commerce, but it can be in the security surveillance system of a bank or a military installation. It is never good to lose information, but even technology has its limits and, when working in real time, the system cannot halt operations to fix something that happened seconds ago. The data keeps coming in and the system must do everything feasible to keep processing it.
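This best-effort behavior can be modeled with a bounded inbox that never blocks: when input arrives faster than it can be handled, the oldest queued events are silently discarded and processing continues. The sketch below is a simplified single-threaded illustration of that trade-off, with names of my own choosing.

```python
from collections import deque

def ingest(events, capacity=3):
    """Model a real-time system's input path: a fixed-size inbox that
    never blocks the producer. When the inbox is full, appending a new
    event silently evicts the oldest one (the 'lost input' the text
    describes), and the system simply keeps going."""
    inbox = deque(maxlen=capacity)  # deque with maxlen drops, never blocks
    dropped = 0
    for event in events:
        if len(inbox) == capacity:
            dropped += 1  # ignore the loss and continue
        inbox.append(event)
    return list(inbox), dropped
```

Whether the `dropped` counter is acceptable depends on the environment, as noted above: tolerable for a web storefront, potentially unacceptable for a surveillance system.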
In any case, real-time data processing and analysis techniques deserve serious consideration before implementation, given that:
- They are not so simple to implement using common software systems.
- Their cost is higher than that of streaming alternatives.
- Depending on the purpose for which they are to be used, it may be preferable to opt for an intermediate option between streaming and real time, such as the one used by Amazon on its web portal, which guarantees a response within one hundred to two hundred milliseconds for 99% of all requests.
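A guarantee stated as "no more than X milliseconds for 99% of requests" is a 99th-percentile (p99) latency target. Checking measured latencies against such a target is a small computation; the sketch below uses the nearest-rank percentile definition, which is one common convention (the function name and the 200 ms threshold in the usage comment are illustrative, not Amazon's actual figures).

```python
import math

def p99_latency_ms(latencies):
    """Return the 99th-percentile latency using the nearest-rank method:
    the smallest measured value that at least 99% of samples fall at
    or below."""
    if not latencies:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies)
    rank = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return ordered[rank]

# Usage: an SLA of the kind described above holds when
# p99_latency_ms(samples) <= 200, for example.
```

Percentile targets are preferred over averages for this kind of guarantee because a mean can look healthy while a small fraction of requests is extremely slow.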