Data Lake

What is Data Lake Storage?

What is data lake storage, and why does it matter? A data lake is a repository, built by an organization's own IT department or by a public cloud provider, for storing, processing and maintaining data in any format and from any source, such as video, newsfeeds, your applications, web scraping, IoT, data marts, data warehouses or mobile devices. In 2010, James Dixon, then CTO of Pentaho, contrasted data marts with data lakes. Data marts and data warehouses store data and support analysis based on known schema attributes. A data lake, on the other hand, can be interrogated on any detail contained in the acquired data. Data lake storage lets you store almost any type and size of data and search it later without knowing in advance what the search will find or exactly what format the data will be in.

Data Lake Storage in a Modern Data Architecture

The design and management of data storage have historically been the most costly and challenging aspects of IT. As the variety of data types and sources has increased, especially with most organizations now delivering their services digitally over the internet, the resulting complexity has driven the modernization of data architecture (people, processes and tools). Until a few years ago, data had to fit a rigid schema and was therefore highly structured; today, much of it is semi-structured or unstructured and often arrives with no consistent format.

Twenty-five years ago, 1TB of data storage required three large racks of disk drives, each the size of a small washing machine. Today, data lake storage puts petabytes of data at your disposal, either physically in a small desktop enclosure or, more likely, virtualized in the cloud. Is that good news, or a security and management nightmare? What information does the organization want to extract from its stored data when it is analyzed? That information helps enterprises serve customers with exceptional products, but understanding what data the organization has, how, where and when it was acquired, and who can access it are key architectural considerations.

Best practices for modern data lake storage architecture are:

  • Know what you have by using catalogs (think of a library card system), with each record made up of metadata that briefly describes a piece of data in the lake: its source, date of acquisition and other attributes that simplify data queries and archival.
  • Use audit software and active governance to track what you have, why you have it, whether the way you hold or received it is legal, who is using it, and when you can delete it.
  • Design and govern access control lists (ACLs) and other security practices for each data lake (see the section on Microsoft Azure Data Lake Storage for more).
  • Plan for encryption. Cloud data lakes encrypt data as part of initial intake, and using or transferring that data in an encrypted state requires specialized software skills and changes to application and service designs, not only for the data lake owners but for any partner or customer that shares information and security tokens. Where the tokens are stored and who has access to them is a design priority in a modern data lake storage architecture.
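
The first two practices, cataloging and auditing, can be illustrated with a minimal sketch. It assumes an in-memory catalog rather than a real catalog service, and every name in it (CatalogEntry, Catalog, retention_days, the example paths) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass(frozen=True)
class CatalogEntry:
    """Metadata describing one object in the lake (all fields are illustrative)."""
    path: str            # where the object lives in the lake
    source: str          # system the data came from
    acquired_on: date    # date of acquisition
    data_format: str     # e.g. "json", "mp4", "csv"
    retention_days: int  # how long the organization may keep the object
    tags: frozenset = field(default_factory=frozenset)

    def delete_after(self) -> date:
        """First date on which the object is due for deletion."""
        return self.acquired_on + timedelta(days=self.retention_days)


class Catalog:
    """In-memory stand-in for a real catalog service."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.path] = entry

    def find(self, *, source=None, tag=None):
        """Return entries matching every filter that was supplied."""
        return [
            e for e in self._entries.values()
            if (source is None or e.source == source)
            and (tag is None or tag in e.tags)
        ]

    def due_for_deletion(self, today: date):
        """Audit helper: entries whose retention period has elapsed."""
        return [e for e in self._entries.values() if e.delete_after() <= today]


catalog = Catalog()
catalog.register(CatalogEntry(
    path="raw/iot/sensor-0001.json", source="iot", acquired_on=date(2023, 1, 10),
    data_format="json", retention_days=365, tags=frozenset({"telemetry"}),
))
catalog.register(CatalogEntry(
    path="raw/web/clicks-2024-03.csv", source="web", acquired_on=date(2024, 3, 1),
    data_format="csv", retention_days=730, tags=frozenset({"clickstream"}),
))

# Query the catalog instead of scanning the lake itself.
iot_paths = [e.path for e in catalog.find(source="iot")]
expired_paths = [e.path for e in catalog.due_for_deletion(date(2024, 6, 1))]
```

Keeping source, acquisition date and retention period on every record is what lets questions such as "who supplied this?" and "when can we delete it?" be answered from metadata alone.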

What is the Difference Between a Data Lake and a Data Warehouse?

Data warehouse storage was the original strategy: you knew what you had, what it looked like, and which specific data each application, database, data mart and other source system delivered to it or needed to retrieve from it. Because data warehouses focused on aggregating structured data from departmental operational databases, they were themselves very structured. And while they could be one or two orders of magnitude larger than the largest database they pulled data from, even the aggregate datasets were no more than tens of terabytes, if that. Over time, new types of data came to require historical aggregation: web clickstreams, archived documents, video surveillance footage and other data types and sources. Data warehouses seemed ill-suited to the task because they could not absorb the massive data volumes associated with these non-traditional sources. Other departmental data repositories, meanwhile, were too narrowly defined in function: document management systems worked only for documents, video surveillance systems only for video storage, and so forth. The quest for a centralized yet multi-faceted data repository that would not run out of room led to the introduction of virtual storage (VMware, NetApp, etc.) and facilitated the creation of cloud data storage and data lake options.

To understand data lakes, you need to return to the early 1990s, when Bill Inmon and Ralph Kimball popularized the term data warehouse and set out the rules and schemas that would control data warehouse architecture designs for decades to come.

The Wikipedia definition of a data warehouse highlights its use and weakness: “central repositories of integrated data from one or more disparate sources. They store current and historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons.”

The following table highlights the main differences between data warehouse storage and data lake storage:

Attributes compared: Data Warehouse and Data Lake

Data traits
  Data Warehouse:
    • Relational transactional systems, operational databases, and line-of-business applications
    • All sources are known before placement into the warehouse
    • Best used for data containing personally identifiable information (PII)
  Data Lake:
    • Supports non-relational and relational data from IoT devices, websites, mobile apps, social media, and corporate applications
    • All data is accepted if it can pass the security checks required to enter the lake
    • No transactional data support comparable to that of a warehouse

Data usage
  Data Warehouse: Reports, transaction management, business intelligence, dashboards
  Data Lake: Analysis and modelling, artificial intelligence, profiling

Cost, speed, reliability
  Data Warehouse: Fastest query results, using higher-cost storage
  Data Lake: Query results faster than with other storage options, but the lake can become a swamp if not correctly managed, negating its performance capabilities

Data quality
  Data Warehouse: Highly curated data that serves as the central version of the truth
  Data Lake: Any data, which may or may not be curated (i.e. raw data)

Users of data
  Data Warehouse: Business analysts, line-of-business power users
  Data Lake: Data scientists, data developers, and business analysts (using curated data)

Skills needed to use data
  Data Warehouse: Data engineers for architecture, EDW standup and ongoing management; database administrators to create scripts and handle users, configuration and tuning
  Data Lake: Data engineers for architecture, lake standup and ongoing management; developers, data analysts and modelers to profile, process and analyze data

Challenges
  Data Warehouse: Difficult to change schemas or reports without changing the structure of the data warehouse
  Data Lake:
    • It can become a swamp of data as you accept things you do not need.
    • More challenging to secure.
    • Complex; requires significant technical support to use.
    • Easier to break regulatory rules.

The truth is that you will need and use both data warehouses and data lakes. Standard, fast and repeatable queries against a known and well-defined dataset benefit from the capabilities of a data warehouse. Analytics and modelling that draw on disparate data sources will require a data lake. Consider, too, that a 2017 Aberdeen survey found that businesses using data lakes outperformed their competitors by 9% in organic revenue growth. There are caveats for creating and using data lakes, but the benefits outweigh the risks.
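
The difference in how the two are queried can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any product: an in-memory SQLite table stands in for a warehouse whose schema is fixed before loading, and a list of raw JSON strings stands in for a lake whose structure is applied only when the data is read. The table name, fields and the scan helper are all hypothetical.

```python
import json
import sqlite3

# Warehouse style: the schema is fixed before any row is loaded.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
warehouse.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 120.0), (2, "US", 80.0), (3, "EU", 45.5)],
)
# A standard, repeatable query against a known dataset.
eu_total = warehouse.execute(
    "SELECT SUM(amount) FROM orders WHERE region = 'EU'"
).fetchone()[0]

# Lake style: raw records are kept as they arrived; structure is applied at read time.
raw_events = [
    '{"type": "order", "region": "EU", "amount": 120.0}',
    '{"type": "click", "page": "/pricing", "device": "mobile"}',
    '{"type": "sensor", "device": "thermostat-7", "celsius": 21.4}',
    '{"type": "click", "page": "/docs", "device": "desktop"}',
]


def scan(records, predicate):
    """Parse each raw record and keep those the caller's predicate accepts."""
    parsed = (json.loads(r) for r in records)
    return [rec for rec in parsed if predicate(rec)]


# A question nobody planned for when the data was stored.
mobile_activity = scan(raw_events, lambda rec: rec.get("device") == "mobile")
```

The warehouse query is fast and predictable because the shape of the data was agreed in advance; the lake query tolerates records of any shape at the price of parsing and filtering at read time.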

Microsoft Azure Data Lake Storage

Microsoft Azure Data Lake Storage Gen1 (ADLS Gen1) was Microsoft's response to customers who needed a way to store information in a variety of formats for analytical purposes. ADLS Gen1 provided:

  • Elastic, scalable storage.
  • Integration with Azure HDInsight, which provides Apache Hadoop, Spark, HBase and Storm clusters.
  • Built-in resilience (though ADLS Gen1 did not provide this to the extent of Azure Blob storage or other Azure data storage options).
  • No limit on the type of data placed into Azure data lake storage.
  • Encrypted Master Key or Data Block Key storage within ADLS Master Key Vault.
  • Easy integration with most other Azure offerings.
  • Analytics software based on Apache YARN with on-demand processing power.
  • Azure Active Directory integration for file access, supporting OAuth 2.0, multi-factor authentication, access control lists, role-based access control, and POSIX-style permissions.
  • Automated event management to trigger analytics or other programmatic activities.

Microsoft Azure Data Lake Storage has no upfront costs; instead, it lets you pay less than you usually would for large amounts of storage while reducing the read and write transaction costs of that data. ADLS is pay-as-you-go, and because of that flexibility, usage needs to be monitored so that costs stay in line with the benefits of ADLS.

Microsoft Azure Data Lake Storage Gen2

In early 2019, Microsoft released Azure Data Lake Storage Gen2 (ADLS Gen2), which links unlimited storage to powerful analytic software capable of running searches in parallel regardless of data type. ADLS Gen2 is particularly useful for analyzing BLOB (binary large object) or video files combined with other data types. ADLS Gen2 has all the features of ADLS Gen1 plus:

  • Azure Active Directory (AAD) integration.
  • A hierarchical namespace that organizes files into directories and subdirectories, as a file system does.
  • Read-access geo-redundant storage to improve business continuity.
  • Hot, Cool and Archive blob access tiers, so that storage cost can be matched to how often data is accessed.
  • Storage costs reduced by as much as 50% compared with ADLS Gen1 or Azure Blob.
  • A simpler transition from ADLS Gen1 to ADLS Gen2, enabled by a switch in an ADLS Gen2 control menu.
  • Vastly increased query and data load performance, achieved by using metadata to track every instance and attribute of information (think of how finding a book in a library became easier once book catalogs were automated).
  • Security at the directory and file level through POSIX-compliant permissions, access control lists, role-based access control (RBAC) and other best-practice methods.
  • Integrated encryption for data at rest or in transit, linked to customer-managed keys or to keys maintained in Azure Key Vault.
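
To make the directory- and file-level security point concrete, here is a minimal sketch of how a POSIX-style permission check works in general. It is deliberately simplified: it models only the classic owner, group and other classes, whereas real access control lists carry further entry types that are omitted here. It does not call any Azure API, and the Node type and is_allowed function are hypothetical names.

```python
from dataclasses import dataclass

READ, WRITE, EXECUTE = 4, 2, 1  # classic POSIX permission bits


@dataclass(frozen=True)
class Node:
    """A file or directory with an owner, a group and a three-digit octal mode."""
    owner: str
    group: str
    mode: int  # e.g. 0o750 -> owner rwx, group r-x, other ---


def is_allowed(node: Node, user: str, user_groups: set, wanted: int) -> bool:
    """Return True if `user` holds every permission bit in `wanted` on `node`."""
    if user == node.owner:
        bits = (node.mode >> 6) & 0o7  # owner class
    elif node.group in user_groups:
        bits = (node.mode >> 3) & 0o7  # group class
    else:
        bits = node.mode & 0o7         # everyone else
    return (bits & wanted) == wanted


reports = Node(owner="alice", group="analysts", mode=0o750)

owner_can_write = is_allowed(reports, "alice", {"analysts"}, READ | WRITE)
analyst_can_read = is_allowed(reports, "bob", {"analysts"}, READ)
analyst_can_write = is_allowed(reports, "bob", {"analysts"}, WRITE)
outsider_can_read = is_allowed(reports, "carol", {"marketing"}, READ)
```

With mode 0o750, the owner may read and write, members of the analysts group may read but not write, and everyone else is refused.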

Planning for Microsoft Azure Data Lake Storage Gen2

There are numerous data acquisition and ingestion methods, and a variety of uses serving a global customer community. The challenge is deciding whether to maintain a single data lake that can fulfil any analytic request or to create an environment with multiple data lakes.

The cost of ADLS Gen2 is a combination of storage and transaction costs. Guidance can be found in Microsoft's pricing documentation or by asking Microsoft Azure technical support. Many Azure services, such as Azure Stream Analytics, IoT Hub, Power BI and Azure Data Factory, now integrate with Azure Data Lake Storage Gen2.
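
As a rough illustration of how storage and transaction costs combine, the sketch below estimates a monthly bill. The unit prices are placeholders invented for the example, not Microsoft's prices, and the PriceSheet and monthly_cost names are hypothetical; substitute figures from the current price list before relying on any result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceSheet:
    """Placeholder unit prices; replace with figures from your provider's price list."""
    per_gb_month: float    # storage price per GB per month
    per_10k_writes: float  # price per 10,000 write transactions
    per_10k_reads: float   # price per 10,000 read transactions


def monthly_cost(prices: PriceSheet, stored_gb: float, writes: int, reads: int) -> float:
    """Storage cost plus transaction cost for one month."""
    storage = stored_gb * prices.per_gb_month
    transactions = (
        (writes / 10_000) * prices.per_10k_writes
        + (reads / 10_000) * prices.per_10k_reads
    )
    return storage + transactions


# Invented numbers, for illustration only.
example_prices = PriceSheet(per_gb_month=0.02, per_10k_writes=0.05, per_10k_reads=0.004)
estimate = monthly_cost(
    example_prices, stored_gb=50_000, writes=2_000_000, reads=30_000_000
)
```

Separating the two terms makes it easy to see whether a workload is dominated by the volume stored or by the number of reads and writes, which is the figure to monitor under a pay-as-you-go model.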

Data security is paramount, and ADLS Gen2 is ISO compliant and supports most firewalls and network configurations, as described in Microsoft guidance material. Another crucial data management best practice is ensuring that data remains accessible regardless of any continuity event. Data stored in ADLS Gen2 is replicated three times, and resilience can be improved by choosing among the following options, described on Microsoft's Azure Storage redundancy webpage:

  • Locally-redundant storage (LRS).
  • Zone-redundant storage (ZRS).
  • Geo-redundant storage (GRS).
  • Read-access geo-redundant storage (RA-GRS).
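
One way to reason about the four options is to state the failures your data must survive and pick the least protective option that still qualifies. The sketch below does that. The properties assigned to each option are a simplified summary of durability only (not availability during failover) and should be checked against Microsoft's redundancy documentation; the Redundancy type and minimal_option function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Redundancy:
    name: str
    survives_datacenter_loss: bool  # copies exist outside a single datacenter
    survives_region_loss: bool      # copies exist in a secondary region
    secondary_readable: bool        # secondary copy can be read directly


# Ordered from least to most protection (simplified summary).
OPTIONS = [
    Redundancy("LRS", False, False, False),
    Redundancy("ZRS", True, False, False),
    Redundancy("GRS", True, True, False),
    Redundancy("RA-GRS", True, True, True),
]


def minimal_option(need_datacenter=False, need_region=False, need_secondary_reads=False):
    """Return the first (least protective) option that meets every stated need."""
    for option in OPTIONS:
        if need_datacenter and not option.survives_datacenter_loss:
            continue
        if need_region and not option.survives_region_loss:
            continue
        if need_secondary_reads and not option.secondary_readable:
            continue
        return option.name
    raise ValueError("no redundancy option satisfies the stated needs")


default_choice = minimal_option()
zone_safe_choice = minimal_option(need_datacenter=True)
region_safe_choice = minimal_option(need_region=True)
readable_secondary_choice = minimal_option(need_secondary_reads=True)
```

Stating the requirement first and deriving the option second avoids paying for geo-replication when local or zone redundancy already meets the continuity target.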

Google and AWS Data Lake Storage

While this article has focused on Azure, Google and AWS offer excellent alternatives.

Google Cloud data lake offers a scalable solution based on Google Cloud Storage. There are two data ingestion services: Dataflow, for automated data transfer and provisioning, and Cloud Data Fusion, which fully manages your data ingestion and governance. To facilitate fast analytics, Google's data lake uses Dataproc to run ETL and open-source workloads on Apache Spark as part of a modernized data architecture. The primary analysis tool is BigQuery, used for machine learning (ML) or for researching petabytes of data via ANSI SQL.

AWS data lake storage offerings, like those of Google and Microsoft Azure, include managed services and a range of cloud storage and analytic tool options. Amazon S3 (Simple Storage Service) provides the core elastic storage repository for AWS data lakes, and it is widely used as an external cloud data repository not just for those data lakes but also as a data staging and ingestion platform for almost any cloud data warehouse. Using a console approach, users can build data lakes on the fly, integrating data from several sources into one S3 cloud location. AWS data lakes fully support AWS Lambda. Data lakes require a powerful search engine to find information, and this is provided by Amazon OpenSearch Service. Security, authentication and governance management are handled by Amazon Cognito. Data transformation and analysis are carried out with AWS Glue and Amazon Athena.

Data warehouses serve one function: fast management and analysis of columnar or otherwise well-understood data. Data lakes are cloud storage options for many kinds of data, including data drawn from data warehouses, tagged with metadata for ease of management. The choice of which data lake to pick is, unfortunately, not clear-cut and depends on the requirements of your organization. Best practice suggests that you pilot the alternatives or work through a thorough set of use-case scenarios to ensure the solution fits your digital and analytic needs.

Actian is a Fully Managed Data Platform

It is designed to deliver high performance and scale across all dimensions (data volume, concurrent users, and query complexity) at a fraction of the cost of alternative solutions. Actian Data Platform can be deployed on-premises as well as on multiple clouds, including AWS, Azure, and Google Cloud, enabling you to migrate or offload applications and data to the cloud at your own pace.