The choice of data processing and analysis technique will decisively influence the outcome. Power and scalability should be weighed in the same way as the system's ability to collect outliers, detect fraudulent transactions, or perform security checks. The most difficult task, despite this, is reducing analytics latency when working on a complete big data set, where terabytes of data sometimes have to be processed in seconds.
Requirements related to response time, the condition of the data to be analyzed, and the workload are the issues that will determine which data processing and analysis technique is the best option.

Batch processing: for batches of big data
Apache Hadoop is a distributed computing framework based on Google's MapReduce model, a programming model designed to process and generate large data sets efficiently and in parallel. MapReduce breaks the work down into smaller tasks, distributes them among the nodes of a cluster, lets each node process its part, and then combines the results. The Hadoop Distributed File System (HDFS) is the underlying file system of a Hadoop cluster; designed to run on commodity hardware with a master-slave architecture, it works more efficiently with a small number of large big data files than with a larger number of smaller ones.
A job in the Hadoop world typically takes minutes to hours to complete. Hadoop is therefore not the most suitable option when a company needs real-time analysis, but rather in cases where it is feasible to settle for offline analytics.
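To make the MapReduce model described above concrete, here is a minimal, self-contained simulation in plain Python (no Hadoop needed): a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase collapses each group into one result. The record format and function names are illustrative, not part of any Hadoop API.

```python
from collections import defaultdict

# Input: raw records, e.g. "account_id,amount" lines from a nightly dump.
records = ["acc1,120.0", "acc2,75.5", "acc1,30.0", "acc2,4.5"]

def map_phase(record):
    # Map: parse each record into a (key, value) pair.
    account, amount = record.split(",")
    yield account, float(amount)

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: collapse each group of values into a single result.
    return key, sum(values)

pairs = (pair for record in records for pair in map_phase(record))
totals = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(totals)  # {'acc1': 150.0, 'acc2': 80.0}
```

In a real cluster, the map and reduce phases run on many nodes at once and the shuffle moves data across the network, which is where the minutes-to-hours job latency comes from.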
Recently, Hadoop has evolved to adapt to new business needs. Today's companies demand:
- Minimized response time.
- Maximum precision in decision-making.
Hadoop has been renewed, improving its manageability thanks to a novelty known as streaming (a minimal sketch follows the list below). One of the main objectives of Hadoop streaming is to decouple Hadoop from the MapReduce paradigm so that it can accommodate other parallel computing models, such as MPI (Message Passing Interface) and Spark. With the addition of streaming, these data processing and analysis techniques overcome many of the limitations of the batch model. Although batch may be considered too rigid for certain functions, something that should not surprise us if we take into account that its origins date back more than four decades, it is still the most suitable, in terms of cost versus result, for operations such as:
- Calculating the market value of assets, which does not need to be checked more than once a day.
- Monthly calculation of the cost of workers' phone bills.
- Generation of reports related to tax issues.
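The list above shows where batch pays off; the sketch below shows how Hadoop streaming opens that batch model to any language. Hadoop Streaming runs arbitrary executables as mapper and reducer, communicating through stdin/stdout with tab-separated key-value pairs. The script name, input/output paths, and jar location in the comment are assumptions for illustration.

```python
#!/usr/bin/env python3
"""Word count via Hadoop Streaming (a sketch; paths are hypothetical):

hadoop jar hadoop-streaming.jar \
    -input /data/logs -output /data/wordcounts \
    -mapper "python3 wordcount.py map" \
    -reducer "python3 wordcount.py reduce" \
    -file wordcount.py
"""
import sys

def mapper():
    # Emit one "word<TAB>1" pair per word; Hadoop sorts by key
    # between the map and reduce phases.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Keys arrive sorted, so all counts for a word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, _, n = line.rstrip("\n").partition("\t")
        if word != current and current is not None:
            print(f"{current}\t{count}")
            count = 0
        current = word
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

You can simulate the whole pipeline locally with `cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce`, since the shell's `sort` plays the role of Hadoop's shuffle.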
Stream processing
This type of data processing and analysis technique, also known as streaming or stream processing, focuses on implementing a data flow model in which data associated with time series (facts) flow continuously through a network of transformation entities that make up the system.
There are no mandatory time limitations in stream processing, contrary to what happens with real-time data processing and analysis techniques. As an example, a system that counts the words included in each tweet for 99.9% of processed tweets is a valid stream processing system. Nor is there any obligation regarding the time period in which to generate the output for each input received. The only limitations are the following (a sketch follows the list):
- Sufficient memory must be available to store queued inputs.
- The system's long-term throughput rate should be faster than, or at least equal to, the data input rate over the same period. If this were not the case, the system's storage requirements would grow without limit.
This type of data processing and analysis technique is not intended to analyze a complete big data set, so in general, with few exceptions, it does not have that capacity.
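A minimal sketch of those two constraints, assuming a producer/consumer setup in Python: the bounded queue makes the memory limitation explicit, and the consumer must keep up with the arrival rate or the queue fills. The tweet samples and the `maxsize` value are illustrative.

```python
import queue
import threading
from collections import Counter

def stream_word_count(source, counts):
    # Consume tweets as they arrive; per-item work must keep up with
    # the arrival rate, or the bounded queue below eventually fills.
    while True:
        tweet = source.get()
        if tweet is None:          # sentinel: end of stream
            break
        counts.update(tweet.split())

# Bounded queue: makes the memory limitation explicit. If the producer
# outpaces the consumer for too long, put() blocks (or input is dropped).
inbox = queue.Queue(maxsize=10_000)
counts = Counter()
worker = threading.Thread(target=stream_word_count, args=(inbox, counts))
worker.start()

for tweet in ["big data moves fast", "fast data needs stream processing"]:
    inbox.put(tweet)               # blocks when the queue is full
inbox.put(None)
worker.join()
print(counts.most_common(3))
```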
Real-time data processing and analysis techniques
When data is processed in real time, the level of online analytical processing achieved is extremely high and the response margin is less than a second. This is exactly why real-time systems do not usually use special atomicity and durability mechanisms: they are only responsible for processing the input as soon as possible.
The question is what can happen if they lose an input. When this happens, they ignore the loss and continue to process and analyze without stopping. Depending on the environment, this is not an obstacle (in an e-commerce site, for example), but it can be in the security surveillance system of a bank or a military installation. It is never good to lose information, but even technology has a limit and, when working in real time, the system cannot halt operations to fix something that already happened seconds ago. The data keeps coming in and the system must do everything feasible to continue processing it.
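A minimal sketch of that trade-off, under the assumption that each event carries an arrival timestamp: inputs older than a fixed latency budget are simply dropped so the loop stays current, with no rollback or replay. The budget value and event fields are illustrative.

```python
import time

# Hypothetical per-event latency budget; a real-time system favors
# staying current over recovering every single input.
BUDGET_SECONDS = 0.2

def handle(event):
    # Placeholder for the actual analysis step.
    print("processed:", event["id"])

def real_time_loop(events):
    dropped = 0
    for event in events:
        age = time.time() - event["arrived_at"]
        if age > BUDGET_SECONDS:
            dropped += 1          # ignore the loss and keep going:
            continue              # no rollback, no replay, no stopping
        handle(event)
    print(f"dropped {dropped} stale event(s)")

now = time.time()
real_time_loop([
    {"id": 1, "arrived_at": now},
    {"id": 2, "arrived_at": now - 5},   # too old: will be dropped
    {"id": 3, "arrived_at": now},
])
```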
In any case, real-time data processing and analysis techniques deserve serious consideration before implementation, given that:
- They are not so simple to implement using common software systems.
- Their cost is higher than that of streaming alternatives.
- Depending on the purpose for which they are to be used, it may be preferable to opt for an intermediate option between streaming and real time, such as the one used by Amazon on its web portal, which guarantees a result within one hundred to two hundred milliseconds for 99% of all requests.
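To illustrate what a guarantee like Amazon's means in practice, here is a minimal sketch that checks a 99th-percentile latency target over a batch of request timings. The sample data and the 200 ms threshold are assumptions for illustration.

```python
import random

def p99(latencies_ms):
    # 99th percentile by sorting; good enough for a sketch.
    s = sorted(latencies_ms)
    return s[min(len(s) - 1, int(0.99 * len(s)))]

# Hypothetical sample: mostly fast requests with a few slow outliers.
random.seed(42)
samples = ([random.uniform(20, 150) for _ in range(995)]
           + [random.uniform(300, 800) for _ in range(5)])
target_ms = 200
print(f"p99 = {p99(samples):.0f} ms; SLA met: {p99(samples) <= target_ms}")
```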


