Data Lake Analytics: What is it and Why Does it Matter?

Traditional data processing is fast becoming a legacy practice in the face of the ever-expanding scope of big, edge, and real-time data use cases that are increasingly business-critical. Today, Big Data, cloud, and edge computing technologies have transformed many slow, limited, manual data management practices into digital ones. The complexity of managing large volumes of structured, semi-structured, and unstructured data has to be automated and simplified as much as possible. Big Data challenges are here to stay, and where data is generated and processed, and how fast it is growing, are all changing rapidly. Organizations have to embrace Big Data and data analytics capabilities or risk becoming irrelevant to their customers.

Technologies such as data warehouses and data lakes help manage Big Data. As data lakes have moved from Hadoop and proprietary on-premise environments to the cloud, they have helped overcome the limitations of data warehouses and can work alongside them for a more complete solution.

Microsoft’s Azure Data Lake Analytics (ADLA) is one data lake solution that works in a distributed, cloud-based data processing architecture to help organizations manage their Big Data workloads. What is data without analytics? Azure data and analytics together make up a winning solution for organizational decision support needs.

What is Data Lake Analytics?

Data stored in a data warehouse is designed and fit for specific purposes; data stored in a data lake is kept raw so it can serve any purpose, including purposes not yet defined. Data warehouses store processed and refined data, whereas data lakes store raw, unprocessed data. The analytics differ accordingly: with data warehouse analytics, the data has already been processed for a specific purpose; with data lake analytics, raw data is processed for a particular use, often as input to a data warehouse.
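This "raw until someone needs it" distinction is often described as schema-on-read, in contrast to the warehouse's schema-on-write. A minimal Python sketch of schema-on-read, where the schema is applied only when a consumer queries the raw records (the records and field names here are hypothetical):

```python
import json
import io

# Hypothetical raw events landing in a data lake: heterogeneous and unvalidated.
raw_lake = io.StringIO(
    '{"user": "ana", "amount": "19.99", "ts": "2023-01-05"}\n'
    '{"user": "bo", "amount": 5, "extra_field": true}\n'
)

def read_with_schema(lines):
    """Schema-on-read: types and columns are imposed only at query time."""
    for line in lines:
        record = json.loads(line)
        yield {
            "user": str(record["user"]),
            "amount": float(record["amount"]),  # coerce to the consumer's type
        }

rows = list(read_with_schema(raw_lake))
```

A warehouse would instead validate and coerce these fields once, at load time, and reject records that do not fit its fixed schema.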

Data lake analytics is a concept that has been around since the inception of Hadoop, an open-source solution for storing and processing Big Data. Hadoop consists of a distributed file system (HDFS); a layer for managing resources and scheduling tasks (YARN); a parallel processing model that maps input data and reduces it to output results (MapReduce); and a standard set of Java libraries that support the other modules (Hadoop Common). Around this core, Hadoop offers many tools and applications to collect, store, process, analyze, and manage Big Data. Hadoop and data lake analytics are complementary components of data lake architectures: Hadoop is a platform for building data lakes. Although Hadoop is a primary platform for data lakes today, it could be replaced as the technology evolves.

Think of the architecture in simple terms: Hadoop is the platform, a data lake is built on the platform, data lake analytics extracts data for any purpose, and a data warehouse can be one of those purposes.
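The MapReduce processing model at the heart of classic Hadoop can be illustrated with a toy word count in pure Python; the real framework distributes these same three phases (map, shuffle, reduce) across a cluster:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map phase: emit a (word, 1) pair for each word in a line of input.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle phase: group values by key, as the framework does
    # between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: aggregate the grouped values for one key.
    return key, sum(values)

lines = ["big data big lake", "data lake"]
grouped = shuffle(chain.from_iterable(map(mapper, lines)))
counts = dict(reducer(k, v) for k, v in grouped.items())
```

Tools like Hive and Pig, mentioned below, compile higher-level queries down to jobs shaped like this.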

Azure analytics services enable faster Big Data analytics. A data lake analytics platform initially consisted of three key components:

  • A distributed file system – often called object storage.
  • Data processing and analytics tools – in Hadoop’s case, Hive, Pig, Mahout, and Impala.
  • A management layer for the overall platform – in Hadoop’s case, YARN.

Unlike the Hadoop data lake analytics platform, which once dominated but is now fading, the three primary data lake analytics platforms today are public cloud services rather than largely on-premise platforms. Although Hadoop can still be deployed in the cloud, anyone doing so is no longer starting greenfield and has to evaluate the public cloud offerings – at least for the underlying object stores listed below.

  • Azure Data Lake Analytics (ADLA).
  • Amazon Web Services (AWS) data lake analytics.
  • Google Data Lake Analytics (GDLA).

In all cases, there are equivalent sets of data processing and analytics tools and a core underlying data management system. For Hadoop, that system is the Hadoop Distributed File System (HDFS); in the cloud, the equivalents are object stores:

  • Azure Data Lake Store (ADLS).
  • AWS Simple Storage Service (S3).
  • Google Cloud Store (GCS).

In many cases, you can still use YARN, Hive, Pig, and other Hadoop tools on these object stores instead of HDFS. This yields a valuable combination: the underlying data storage is standardized, while the organization retains the flexibility to use a wide range of data analytics tools.
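This decoupling of standardized storage from interchangeable tools works because engines such as Hive and Spark dispatch on the storage URI scheme while the query logic stays the same. A small Python illustration (the account, bucket, and path names are hypothetical):

```python
from urllib.parse import urlparse

# Hypothetical dataset locations: the processing logic is identical;
# only the storage scheme changes when moving from HDFS to an object store.
locations = [
    "hdfs://namenode:8020/warehouse/events/2023/01/",               # on-premise Hadoop
    "abfss://lake@myaccount.dfs.core.windows.net/events/2023/01/",  # ADLS Gen2
    "s3://my-bucket/events/2023/01/",                               # AWS S3
    "gs://my-bucket/events/2023/01/",                               # Google Cloud Storage
]

def storage_backend(uri):
    """The scheme selects the storage connector; the analytics code is unchanged."""
    return urlparse(uri).scheme

backends = [storage_backend(uri) for uri in locations]
```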

Data lake analytics discovers and creates relationships, answers business questions, charts new innovations in science and engineering, predicts outcomes, and automates and enables decisions. Data from any source is given factual meaning, and information and knowledge are discovered from it, improving the organization’s ability to make fast, timely decisions in support of its customer-facing activities. Overall, data analytics – especially Big Data analytics and edge computing – are essential capabilities organizations need to take advantage of today, driving both automated and human decisions more effectively and accurately.

Creating Value With Big Data Platforms

Big Data technologies extract, analyze, transform, and load amounts of data that are too large for traditional data processing software, enabling statistical decision support across an organization. Data extracted from various sources is used to understand market conditions, gather social media intelligence, improve customer acquisition and retention, provide historical insights, and serve other business intelligence uses. The more data is collected and transformed into decisions, the more valuable it becomes to an organization.

But what makes each of the S3, ADLS, and GCS platforms valuable is the ability to use the data integration, management, and analytics tools from AWS, Azure, and Google, plus the equivalent third-party offerings drawn to these platforms by the gravity of the big three cloud service providers.

What is missing from these platforms is the ability to purchase a virtual data lake analytics service that spans multiple cloud providers and on-premise environments. Further, within each cloud provider’s offering, the emphasis on leaving raw data in its natural state until a specific group or project needs it, coupled with the technical nature of the groups using data lake analytics, has de-emphasized integration functionality. This gap can be addressed with purposefully integrated architectures that feed enterprise data warehouses for specific purposes.

With the integration of Machine Learning (ML), Artificial Intelligence (AI), and Business Intelligence (BI) into an overall Big Data platform solution, the capabilities and necessities of Azure Big Data Analytics become more apparent and powerful for the organization. Creating and realizing value begins with keeping the end goal in mind for the solution being built using Big Data technologies.

Key Capabilities of Azure Data Lake Analytics

Data lakes have key capabilities: extracting data from various sources, storing large amounts of data, transforming data, providing security and governance, and offering analytical services and data lake analytics tools. The Azure Data Lake Analytics architecture has the following benefits:

  • HDFS compatibility, optimized for performance and high throughput.
  • Unlimited data size – Binary Large Object (BLOB) storage for text and binary data.
  • Fault tolerance and rapid response to system failures.
  • High availability and disaster recovery.
  • Enablement of Hadoop in the cloud.
  • Integration with Azure Active Directory for role-based access control.
  • Hive and Spark support.

Add to this the capabilities of Microsoft Azure Data Lake Analytics, which include U-SQL. U-SQL, created by Microsoft primarily for Azure, is a Big Data query and processing language that combines the constructs and capabilities of SQL and C#. It is a straightforward language that includes rich types and expressions. Besides working on unstructured data, U-SQL provides a general metadata catalog in relational database form. The U-SQL catalog works like Hive’s and supports database schemas, tables, indexes, views, functions, procedures, and .NET assemblies. Besides U-SQL, R, .NET, and Python are also supported with Azure Data Lake Analytics.
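A U-SQL script typically follows an EXTRACT, transform, OUTPUT shape. As a rough analogy only (this is not U-SQL), the same pattern can be sketched in pure Python over an in-memory CSV, with hypothetical column names:

```python
import csv
import io

# Hypothetical raw input file sitting in the lake.
raw = io.StringIO("region,sales\neast,100\nwest,250\neast,50\n")

# EXTRACT: read raw files into rows with declared column types,
# as U-SQL's EXTRACT statement does.
rows = [{"region": r["region"], "sales": int(r["sales"])} for r in csv.DictReader(raw)]

# Transform: aggregate, as a U-SQL SELECT ... GROUP BY rowset expression would.
totals = {}
for row in rows:
    totals[row["region"]] = totals.get(row["region"], 0) + row["sales"]

# OUTPUT: write the result set to a destination, as U-SQL's OUTPUT statement does.
out = io.StringIO()
csv.writer(out).writerows(sorted(totals.items()))
```

In real U-SQL, the extract and transform steps could also call inline C# expressions, which is the hybrid the article describes.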

Beyond the power of U-SQL, other key capabilities of Microsoft’s data lake analytics include:

  • Faster development using U-SQL.
  • Compatibility with all Azure data services.
  • Cost-effectiveness.
  • Dynamic scaling.
  • Smart optimization.

Capabilities of Microsoft Azure data lake analytics also include complementary services such as:

  • Cosmos DB – Multi-model NoSQL database service.
  • Azure SQL Data Warehouse – Cloud enterprise data warehouse.
  • Azure SQL Database – Managed relational database service.
  • Azure Data Factory – Extract, Transform, Load (ETL/ELT) and data integration service.
  • Azure Analysis Services – Managed analytics engine for model building.

An organization’s success relies on its assets and the capabilities of those assets. Organizations have to acquire the ability to manage their Big Data and then turn that knowledge into a strategic capability. The Azure Data Lake Analytics capabilities listed above can be enabled in ways unique to an organization to create competitive advantage. Amazon and Google offer analogous architectures, functionality, and diverse third-party offerings that build out extensive ecosystems for modern Big Data and analytics use cases. Organizations should assess their Strengths, Weaknesses, Opportunities, and Threats (SWOT) and develop strategic, tactical, and operational plans for success with Big Data capabilities.

Conclusion

Many organizations struggle to understand their customers’ needs. They rely on the expert opinions of their employees, take surveys, and use other means. Today, one of the most effective approaches is to use data from every available source to analyze any business process, enabling effective, efficient, and economical decisions by anyone in the organization. Omnichannel engagements and data collected from all sources have to be analyzed. Azure data analytics and its supporting technologies can help with the complex task of combining Big Data and the organization’s experts to make better customer decisions.

Recently, a significant focus has been the open-source Delta Lake initiative, which builds a transactional layer that can span multiple data lakes. Because it is built on Spark, it also adds the ability to handle streaming data analytics, not just batch analytics. This is the approach taken by Databricks with Delta Lake.

An equivalent answer to the functionality gap in current data analytics platforms is to make the cloud data warehouse a better downstream destination for the analytics performed within the data lake.

This has been the approach of cloud data warehouse vendors like Actian, which integrate their data integration products to create a flexible, schema-on-the-fly front end to the cloud data warehouse. This effectively does the same thing as a delta lake, but focuses on operational analytics use cases for data lake analytics rather than on research-style projects upstream of day-to-day workloads and business processes.

The Actian Data Platform can help organizations build an outcome-based architecture that extracts the power of data lake analytics for timely organizational decision support.