What is a data lake and what is it for?

One of the great technological challenges that companies face is, undoubtedly, data growth. Who hasn't heard of terabytes, petabytes and exabytes? Today these are common terms in almost every sector, especially when it comes to storage capacity.

One thing is clear: the emergence of new Internet technologies has led companies to collect and store enormous amounts of information about both existing and potential customers. And, given this volume of data, it is essential to have a system that keeps it safe, such as a Data Lake.

What is a data lake?

According to Amazon Web Services, a Data Lake is a:

Centralized repository that makes it possible to store all structured and unstructured data at any scale. You can store your data as is, without having to structure it first, and run different types of analysis, from dashboards and visualizations to big data processing, real-time analytics and machine learning, to make better decisions.

The term Data Lake was coined by James Dixon, Chief Technology Officer of Pentaho, and refers to the raw nature of the data in this system, in contrast to the clean, processed data stored in traditional data warehouses or Data Marts.

According to Dixon, “If you think of a Data Mart as a store of bottled water, cleansed, packaged and structured for easy consumption, a Data Lake is a large body of water in a more natural state. Its contents stream in from a source to fill the lake, and various users of the lake can come to examine, dive in or take samples.”

Data Lakes are generally set up on a cost-effective, scalable cluster of commodity hardware, so data can be dumped into the lake in case it is needed later, without having to worry about storage capacity. These clusters can run on premises or in the cloud.
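
As a rough illustration of dumping data into the lake "as is", the sketch below lands a raw payload, unchanged, in a folder organized by source and date. A temporary local directory stands in for the cluster storage; the function name `land_raw`, the `raw/<source>/<date>/` layout, the sample payload and the date are all assumptions made for the example, not part of any particular product.

```python
from datetime import date
from pathlib import Path
import tempfile


def land_raw(lake_root, source, payload, day, name):
    """Write the payload unchanged under <root>/raw/<source>/<YYYY-MM-DD>/."""
    target_dir = Path(lake_root) / "raw" / source / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(payload)  # no parsing, no validation: stored as is
    return target


# A temporary directory plays the role of the lake's storage (assumption).
with tempfile.TemporaryDirectory() as root:
    path = land_raw(
        root,
        source="clickstream",            # hypothetical source name
        payload=b'{"page": "/home"}\n',  # hypothetical raw event
        day=date(2024, 1, 15),           # hypothetical arrival date
        name="events.json",
    )
    relative = path.relative_to(root).as_posix()
    stored = path.read_bytes()

print(relative)
```

Because nothing is transformed on the way in, the decision about how to interpret the file can be postponed until someone actually needs it.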

Why use a data lake

According to the Aberdeen study Angling for Insight in Today's Data Lake, organizations that successfully generate business value from their data will outperform their competitors. In fact, companies that implemented a data lake outperformed their peers by 9% in organic revenue growth.

These companies were able to perform new types of analysis, such as machine learning, on new sources stored in the data lake, such as log files, clickstream data, social media and internet-connected devices.

This helped them identify and act on opportunities for business growth more quickly by attracting and retaining customers, boosting productivity, proactively maintaining devices and making informed decisions.

5 advantages of a data lake

Among the main benefits of a Data Lake are the following:

  1. It centralizes all data in one place, whatever its origin. Once stored in its respective information silo, the data can be processed with Big Data tools. With such disparate information, some data may need special treatment for security reasons, but this can be handled within the system.
  2. The original source of the data may be out of date or no longer available, but its content can still be valuable for analysis. A data lake lets you keep this information.
  3. All data that reaches the system can be normalized and enriched.
  4. The data is prepared only when it is needed and according to the needs of the moment, which significantly reduces cost and time.
  5. Any authorized user can access and enrich the information from anywhere, which makes it easier for the organization to gather the data it needs to make decisions.
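
Points 3 and 4 can be sketched in a few lines: the raw records stay untouched, and normalization and enrichment happen only when an analysis needs them. The sample records, the field names and the `REGION_BY_COUNTRY` lookup table are hypothetical, chosen only to illustrate the idea.

```python
import json

# Hypothetical raw records, stored exactly as they arrived from two sources.
RAW_RECORDS = [
    '{"user": "Ana", "country": "es", "amount": "19.90"}',
    '{"USER": "bob", "Country": "US", "amount": 5}',
]

# Hypothetical lookup table used for enrichment.
REGION_BY_COUNTRY = {"ES": "EMEA", "US": "AMER"}


def normalize(raw_line):
    """Parse one raw JSON line and bring keys and values to a common shape."""
    record = {key.lower(): value for key, value in json.loads(raw_line).items()}
    return {
        "user": str(record["user"]).strip().lower(),
        "country": str(record["country"]).upper(),
        "amount": float(record["amount"]),
    }


def enrich(record):
    """Return a copy with a derived field; the raw data is never modified."""
    enriched = dict(record)
    enriched["region"] = REGION_BY_COUNTRY.get(record["country"], "UNKNOWN")
    return enriched


# Preparation happens here, at the moment the data is needed.
prepared = [enrich(normalize(line)) for line in RAW_RECORDS]

for row in prepared:
    print(row)
```

A different analysis could apply a different `normalize` or `enrich` step to the same raw records, which is what keeps preparation tied to the needs of the moment.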

Data Lake vs. Data Warehouse

When talking about data storage, another related concept usually comes up: the Data Warehouse. It is a database optimized for analyzing relational data from transactional systems and line-of-business applications.

Although both paradigms focus on data storage, there are some differences between a data lake and a data warehouse:

  • Data structure: a data warehouse only collects structured data, while a data lake collects both structured and unstructured data.
  • Purpose of the data: in a Data Lake the purpose may or may not be defined in advance, while in a Data Warehouse it is always defined and there is no room for improvisation.
  • Flexibility: in a Data Lake it is easier to make changes because it imposes no fixed structure, while in a Data Warehouse changes are more complex because other processes are involved.
  • Schema: data lakes apply the schema on read, while data warehouses apply the schema on write.
  • Users: in a Data Lake the data is managed by analysts, while in a Data Warehouse any user with access can manage the data.
  • Accessibility: a Data Lake offers broad and easy access to data, while in a Data Warehouse access is more expensive and complex.
  • Storage: a Data Lake has a low cost and can be expanded in the cloud, while a Data Warehouse is generally more expensive.
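
The schema difference listed above can be illustrated with a minimal sketch. The `Warehouse` and `Lake` classes, the `SCHEMA` and the sample payloads are hypothetical and far simpler than any real system; they only show where the schema is applied.

```python
import json

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"device_id": str, "temperature": float}


def conforms(record, schema):
    """Return True if the record has exactly the schema's fields and types."""
    return set(record) == set(schema) and all(
        isinstance(record[name], kind) for name, kind in schema.items()
    )


class Warehouse:
    """Schema on write: records are validated before they are stored."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def write(self, record):
        if not conforms(record, self.schema):
            raise ValueError(f"record does not match schema: {record}")
        self.rows.append(record)


class Lake:
    """Schema on read: anything is stored; a schema is applied when reading."""

    def __init__(self):
        self.raw = []

    def write(self, payload):
        self.raw.append(payload)  # stored as is, no validation

    def read(self, schema):
        for payload in self.raw:
            try:
                record = json.loads(payload)
            except json.JSONDecodeError:
                continue  # not JSON: skipped by this particular read
            if not isinstance(record, dict):
                continue
            try:
                yield {name: kind(record[name]) for name, kind in schema.items()}
            except (KeyError, TypeError, ValueError):
                continue  # does not fit the schema requested by this read


lake = Lake()
lake.write('{"device_id": "a1", "temperature": "21.5", "extra": true}')
lake.write("plain text log line")
rows_from_lake = list(lake.read(SCHEMA))

warehouse = Warehouse(SCHEMA)
warehouse.write({"device_id": "a1", "temperature": 21.5})
```

The lake accepts both payloads and decides what they mean only when `read` is called, while the warehouse rejects at write time anything that does not already match its schema.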

Ultimately, both systems are intended for organizations that base their decisions on data and want to implement more personalized, customer-centric strategies and communications.

Azure Data Lake

Azure Data Lake is Microsoft's hyperscale repository for big data analytics workloads in the cloud. The service is designed for the cloud, is compatible with HDFS (Hadoop Distributed File System) and is built to scale without limits, with high performance and enterprise-grade capabilities.

Azure Data Lake addresses many of the productivity and scalability challenges that prevent organizations from maximizing the value of their data assets, with a service that is ready to meet their current and future business needs.

Among the different services included in Azure Data Lake are the following:

  • Data Lake Analytics: a cloud analytics job service with no fixed limits that lets you develop and run massively parallel data transformation and processing programs in U-SQL, R, Python and .NET.
  • HDInsight: an Apache Spark and Hadoop cloud service for companies that provides open source analytics clusters for Spark, Hive, MapReduce, HBase, Storm, Kafka and R Server, backed by a 99.9% service level agreement.
  • Data Lake Store: a cloud data repository with no fixed limits for big data analytics that scales massively and is built on the open HDFS standard.