Data Lakehouse

A data lakehouse combines the raw-data storage repository of a data lake with an integrated data warehouse for analytic processing. Traditionally, these are separate systems; the data lakehouse unifies them, using metadata and a data catalog to describe the data sets and their interrelationships.

Why is the Data Lakehouse Important?

Before the emergence of the data lakehouse architecture, data lakes and data warehouses existed in separate silos; data had to be moved and transformed from data lakes to data warehouses using sometimes complex data pipelines. Users had difficulty finding the data they needed, leading to underutilization of data warehouses and the data lakes that fed them. Integrating the raw data repository and the data warehouse into a unified data lakehouse increases data utilization, so the business gains significantly more value from its data assets.

The data lakehouse is also an answer to data lakes that are neglected and forgotten until they turn into data swamps. Many organizations created Hadoop data lakes in their heyday, only to lose skilled administrators when the excitement about the concept wore off, leading to the lakes' demise.

What are the Components of a Data Lakehouse?

Storage

A data lakehouse stores structured data as tables in its data warehouse layer, alongside semi-structured formats such as JSON. Unstructured data such as videos, audio files, and text documents is stored as flat files in file systems. These can be traditional on-premises file systems or cloud object stores such as AWS S3.
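
As a minimal sketch of these storage tiers, the PySpark snippet below writes the same data set as a Parquet table and as JSON files; the "sales" data, column names, and local paths are illustrative assumptions, not part of any specific product.

```python
# Sketch of lakehouse storage tiers, assuming a local Spark installation.
# The data set and /tmp paths are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-storage").getOrCreate()

# Structured data lands as tables (here, Parquet files on disk).
sales = spark.createDataFrame(
    [(1, "2024-01-05", 250.0), (2, "2024-01-06", 99.5)],
    ["order_id", "order_date", "amount"],
)
sales.write.mode("overwrite").parquet("/tmp/lakehouse/tables/sales")

# Semi-structured data can be kept as JSON alongside the tables.
sales.write.mode("overwrite").json("/tmp/lakehouse/raw/sales_json")

# Unstructured files (video, audio, documents) remain flat files on a
# file system or an object store such as AWS S3.
```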

The Data Catalog

The data catalog stores metadata that describes each data set's format, labels, lineage, and more. The catalog helps users find the data they need, thanks to searchable descriptions.
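
To make this concrete, here is a hedged sketch of catalog entries as a simple in-memory data structure; the field names and search logic are illustrative assumptions, not the design of any particular catalog product.

```python
# Sketch of catalog metadata with a searchable description field.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    description: str
    data_format: str                                   # e.g. "parquet", "json"
    lineage: list[str] = field(default_factory=list)   # upstream data sets

catalog = [
    CatalogEntry("sales", "Daily order totals by region", "parquet",
                 lineage=["raw_orders"]),
    CatalogEntry("raw_orders", "Orders as received from the web shop", "json"),
]

def search(term: str) -> list[CatalogEntry]:
    """Find entries whose name or description mentions the term."""
    term = term.lower()
    return [e for e in catalog
            if term in e.name.lower() or term in e.description.lower()]

print([e.name for e in search("orders")])  # ['raw_orders']
```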

Data Connectors

Data connectors provide the means to access all the data types in the data lakehouse. Engines such as Apache Spark can read multiple data formats through a single, uniform interface.
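
As a rough illustration, the snippet below uses Spark's uniform reader API to load the Parquet and JSON data written in the storage sketch above; the paths assume that earlier example has run.

```python
# Sketch of Spark's uniform reader interface across formats.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-connectors").getOrCreate()

# The same read API handles different underlying formats.
parquet_df = spark.read.format("parquet").load("/tmp/lakehouse/tables/sales")
json_df = spark.read.format("json").load("/tmp/lakehouse/raw/sales_json")

parquet_df.show()
json_df.show()
```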

Application Programming Interfaces – APIs

Applications, utilities, and business intelligence (BI) tools use APIs to access data in the data lakehouse.
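
As a hedged sketch of what such API access can look like, the snippet below posts a SQL query to a hypothetical REST endpoint; the URL, payload shape, and token are assumptions for illustration, and no specific vendor API is implied.

```python
# Sketch of API access to a lakehouse; endpoint and payload are hypothetical.
import requests

response = requests.post(
    "https://lakehouse.example.com/api/v1/query",
    json={"sql": "SELECT region, SUM(amount) FROM sales GROUP BY region"},
    headers={"Authorization": "Bearer <token>"},  # placeholder credential
    timeout=30,
)
response.raise_for_status()
for row in response.json()["rows"]:
    print(row)
```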

Data Lakehouse Consumers

Thanks to the quality of the metadata contained in the data lakehouse, citizen data analysts can easily run BI queries to generate reports and populate visual dashboards. The data is easier to find and load into the data warehouse for analysis, and related data sets are linked, so analysts can explore them without the aid of data professionals.
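
A minimal sketch of such a self-service query follows, assuming the "sales" table from the storage example has been registered as a temporary view; the query itself is illustrative.

```python
# Sketch of a self-service BI query over cataloged lakehouse data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-bi").getOrCreate()
spark.read.parquet("/tmp/lakehouse/tables/sales").createOrReplaceTempView("sales")

# An analyst can answer questions with plain SQL, no pipeline work needed.
monthly = spark.sql("""
    SELECT substr(order_date, 1, 7) AS month, SUM(amount) AS revenue
    FROM sales
    GROUP BY substr(order_date, 1, 7)
    ORDER BY month
""")
monthly.show()
```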

Data Integrity Controls

Untrustworthy data can either be excluded from the data lakehouse or flagged as low quality in its metadata description. Referential integrity controls in the data warehouse, which enforce primary and foreign key constraints, help maintain the coherence of data relationships. Data held in file systems can be scanned to catch logical corruptions that can creep in.
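
The snippet below sketches what a foreign-key check amounts to, using two small in-memory tables; in practice the warehouse enforces these constraints itself, and the tables here are illustrative.

```python
# Sketch of a referential-integrity check: orders must point at known customers.
orders = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": 99},  # no matching customer: an orphan row
]
customers = [{"customer_id": 10, "name": "Acme"}]

known_customers = {c["customer_id"] for c in customers}
orphans = [o for o in orders if o["customer_id"] not in known_customers]

# Orphaned rows can be excluded or flagged as low quality in the metadata.
for o in orphans:
    print(f"order {o['order_id']} references unknown customer {o['customer_id']}")
```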

Data Governance

The data lakehouse construct supports data governance initiatives by recording who is responsible for the data, tracking its quality and freshness, and rating how authoritative it is. Proactive data governance controls data sprawl by steering users toward trustworthy data.
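
As a sketch of what those governance attributes might look like alongside the catalog, the snippet below records an owner, a freshness date, and an authority rating; the field names and rating scale are illustrative assumptions.

```python
# Sketch of governance metadata attached to a data set.
from dataclasses import dataclass
from datetime import date

@dataclass
class GovernanceRecord:
    data_set: str
    owner: str             # who is responsible for the data
    last_refreshed: date   # freshness
    authority: str         # e.g. "authoritative", "reference", "draft"

sales_gov = GovernanceRecord(
    data_set="sales",
    owner="data-engineering@example.com",
    last_refreshed=date(2024, 1, 6),
    authority="authoritative",
)
print(sales_gov)
```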

Data Quality

Low-quality data is worse than no data because it can lead to misleading insights. High-quality data has no gaps, uses uniform formats, and is verified. Maintaining data quality is a fundamental responsibility of a data steward.
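
The sketch below shows what basic checks for gaps and uniform formats can look like, assuming records arrive as dictionaries; the fields and rules are illustrative.

```python
# Sketch of record-level quality checks: gaps and inconsistent formats.
import re

DATE_FORMAT = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # expected: YYYY-MM-DD

def quality_issues(record: dict) -> list[str]:
    issues = []
    if record.get("amount") is None:
        issues.append("missing amount")           # a gap in the data
    if not DATE_FORMAT.match(record.get("order_date", "")):
        issues.append("non-uniform date format")  # inconsistent format
    return issues

print(quality_issues({"order_date": "05/01/2024"}))
# ['missing amount', 'non-uniform date format']
```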

Benefits of a Data Lakehouse

The data lakehouse concept is growing in popularity for reasons that include the following:

  • Well-documented and easy-to-find data is more likely to be used in analysis and decision-making.
  • Because lakehouse data carries quality and lineage metadata, users have grounds to trust it.
  • Relationships between different data sets are spelled out in a data lakehouse, making the data more likely to be consumed.
  • Compliance, data governance, and data stewardship policies are enforced, increasing trust and reducing risk.
  • Security can be strengthened with role-based access controls and authentication of data lakehouse users.
  • Administration costs are lower for a single unified repository than for multiple distributed siloed data stores.
  • The data lakehouse encourages self-service analytics because data is described and cataloged.
  • API access makes the data lakehouse accessible to machine learning (ML) models.

About the Actian Data Platform

The Actian Data Platform's deployment flexibility allows data to be managed and analyzed on-premises and across multiple public cloud platforms.