Data Lakehouse

A data lakehouse combines the storage repository of a data lake with an integrated data warehouse for analytic processing. Metadata and a data catalog, which describe the data sets and their interrelationships, link the data lake and the data warehouse.

Why is the Data Lakehouse Important?

Before the development of the data lakehouse, data lakes and data warehouses existed in silos. Without metadata and a data catalog, users had difficulty finding the data they needed, which led to the underutilization of data warehouses and of the data lakes that fed them. Data engineers moved data from lakes to warehouses using complex extract, transform, load (ETL) pipelines. Unifying the lake and the warehouse in a data lakehouse makes data easier to find and use, so the business gets more value from its data.

Who are the Users?

The primary users of a data lakehouse are data engineers and data scientists. Thanks to quality metadata, data analysts can also use it because they can more easily find data for analysis.

What are the Key Elements of a Data Lakehouse?

Storage

Data lakehouse storage holds structured data in database tables and semi-structured data in formats such as JSON strings. Flat files store unstructured data such as video, audio, and text streams.

Data Catalog

The data catalog stores metadata that describes the data format, labels, lineage, and more.
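
As a rough illustration of what such metadata might look like, the following Python sketch models a catalog entry with a format, labels, and lineage. The class name `CatalogEntry`, its fields, the helper functions, and the sample data set names are all hypothetical and are not taken from any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one data set in the catalog."""
    name: str
    data_format: str                                   # e.g. "table", "json", "flat-file"
    labels: list[str] = field(default_factory=list)    # free-form tags used for search
    lineage: list[str] = field(default_factory=list)   # names of direct upstream data sets


def find_by_label(catalog: list[CatalogEntry], label: str) -> list[str]:
    """Return the names of all data sets tagged with the given label."""
    return [entry.name for entry in catalog if label in entry.labels]


def upstream_of(catalog: list[CatalogEntry], name: str) -> set[str]:
    """Follow lineage links to collect every data set that `name` derives from."""
    by_name = {entry.name: entry for entry in catalog}
    seen: set[str] = set()
    stack = list(by_name[name].lineage) if name in by_name else []
    while stack:
        parent = stack.pop()
        if parent in seen:
            continue
        seen.add(parent)
        if parent in by_name:
            stack.extend(by_name[parent].lineage)
    return seen


# Illustrative catalog contents (made-up data set names).
catalog = [
    CatalogEntry("raw_clicks", "json", labels=["web", "raw"]),
    CatalogEntry("sessions", "table", labels=["web"], lineage=["raw_clicks"]),
    CatalogEntry("weekly_report", "table", labels=["finance"], lineage=["sessions"]),
]
```

With entries like these, `find_by_label(catalog, "web")` lets a user discover related data sets, and `upstream_of(catalog, "weekly_report")` traces a report back to its sources.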

Data Connectors

Data connectors provide access to all of the data lakehouse's data sources.

APIs

Applications, utilities, and business intelligence (BI) tools use application programming interfaces (APIs) to access data in the data lakehouse.

Establishing Data Integrity

The data warehouse uses primary and foreign keys to maintain the coherence of data relationships so that when you make changes to data in one place, those changes are reflected in other related records. Data contained in a file system relies instead on data cleansing, validation, and transformation rules, which state, for example, whether NULL values are valid. Validation scans can catch logical data corruption.
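
The two mechanisms above can be sketched in Python. The first part uses the standard-library `sqlite3` module as a stand-in for a warehouse to show a foreign key propagating a change and rejecting an orphaned row; the second part is a minimal NULL-validation rule for file-based records. The table names, the `NULLABLE` rule set, and the `null_violations` helper are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL"
    "   REFERENCES customers(id) ON UPDATE CASCADE)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Acme')")
conn.execute("INSERT INTO orders VALUES (100, 1)")

# Changing the key in one place is reflected in the related order row.
conn.execute("UPDATE customers SET id = 2 WHERE id = 1")
order_customer = conn.execute(
    "SELECT customer_id FROM orders WHERE id = 100"
).fetchone()[0]

# An order that points at a non-existent customer is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (101, 999)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Hypothetical rule for file-based records: only these fields may be NULL (None).
NULLABLE = {"middle_name"}


def null_violations(record: dict) -> list[str]:
    """Return the fields that are None but are not allowed to be."""
    return sorted(k for k, v in record.items() if v is None and k not in NULLABLE)
```

A validation scan over files could call `null_violations` on each parsed record and flag any record for which the returned list is not empty.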

Data Governance

The data lakehouse aids data governance by recording who is responsible for the data, tracking the freshness of the data, and rating how authoritative the data is.
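
A minimal sketch of these three governance attributes (owner, freshness, and authority) might look like the following. The `GovernanceRecord` class, the 1-to-5 authority scale, and the sample data sets are hypothetical; a real platform would define its own fields and rating scheme.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class GovernanceRecord:
    """Hypothetical governance metadata kept alongside a data set."""
    dataset: str
    owner: str                 # who is responsible for the data
    last_refreshed: datetime   # when the data was last loaded or updated
    authority: int             # assumed scale: 1 (unverified) to 5 (system of record)


def is_stale(record: GovernanceRecord, max_age: timedelta, now: datetime) -> bool:
    """True when the data set has not been refreshed within `max_age`."""
    return now - record.last_refreshed > max_age


def most_authoritative(records: list[GovernanceRecord]) -> GovernanceRecord:
    """Pick the record with the highest authority rating."""
    return max(records, key=lambda r: r.authority)


# Fixed reference time and made-up records so the example is repeatable.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
records = [
    GovernanceRecord("crm_export", "sales-ops",
                     datetime(2024, 1, 9, tzinfo=timezone.utc), 3),
    GovernanceRecord("billing_ledger", "finance",
                     datetime(2023, 12, 1, tzinfo=timezone.utc), 5),
]
stale = [r.dataset for r in records if is_stale(r, timedelta(days=7), now)]
```

Here the freshness check and the authority rating are kept separate on purpose: a data set can be the most authoritative source and still be overdue for a refresh.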

Data Quality

Data quality ensures that users can trust data. Data quality measures how well a dataset meets criteria for accuracy, completeness, validity, consistency, uniqueness, timeliness, and fitness for purpose.
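
Two of these criteria, completeness and uniqueness, are simple to express as ratios. The sketch below is one possible way to compute them over a list of records; the function names, the exact definitions, and the sample rows are assumptions for illustration, and other criteria such as accuracy or timeliness need reference data or timestamps that are not modeled here.

```python
def completeness(rows: list[dict], column: str) -> float:
    """Fraction of rows where `column` is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) is not None)
    return filled / len(rows)


def uniqueness(rows: list[dict], column: str) -> float:
    """Fraction of the non-null values in `column` that are distinct."""
    values = [row[column] for row in rows if row.get(column) is not None]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


# Made-up sample: one missing email and one duplicated email.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "a@example.com"},
    {"id": 4, "email": "b@example.com"},
]
```

For this sample, three of the four rows have an email, and two of the three non-null emails are distinct, so both scores fall below 1.0 for `email` while the `id` column scores 1.0 on both measures.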

Data Lakehouse Benefits

A data lakehouse provides many benefits, including the following:

  • The business gets more value from data documented using metadata because users can find and use it.
  • A data lakehouse is more accessible than a data lake because it provides context about how different sets of data are related.
  • It promotes stronger data governance, which improves compliance and reduces risk.
  • Role-based access controls (RBAC) help protect data in the data lakehouse.
  • Administration is centralized, in contrast to federated, distributed data stores.
  • The data lakehouse encourages self-service analytics thanks to the integrated data catalog.
  • Unlike a data lake, the data lakehouse has a data catalog that documents how different data sets interrelate.
  • Machine learning (ML) can often make better predictions using a data lakehouse that stores complete data sets.
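
The role-based access controls mentioned in the list above can be sketched as a mapping from roles to permitted actions. The role names echo the user groups described earlier, but the permission sets and the `is_allowed` helper are purely hypothetical and do not reflect how any specific product implements RBAC.

```python
# Hypothetical mapping of roles to the actions each role may perform.
ROLE_PERMISSIONS = {
    "data_engineer": {"read", "write", "manage_catalog"},
    "data_scientist": {"read", "write"},
    "data_analyst": {"read"},
}


def is_allowed(role: str, action: str) -> bool:
    """Check whether `role` may perform `action`; unknown roles get no access."""
    return action in ROLE_PERMISSIONS.get(role, set())
```

Denying unknown roles by default keeps the check safe when a new role is introduced before its permissions are defined.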

Actian and the Data Lakehouse

The Actian Data Platform makes it easier to create a high-performance data lakehouse. Its integrated columnar, vectorized database uses a parallel query capability that is superior to that of a traditional data warehouse.

The Actian Data Platform supports hybrid and multi-cloud with on-premises, AWS, Azure, and Google Cloud deployments. The vectorized database can access data stored in file systems using its Spark connector and can access multiple distributed database instances in a single query.

Built-in data integration features can profile data, automate data preparation steps, and support streamed data sources. The data integration capabilities offered by the Actian Data Platform work with popular data storage structures, including S3 buckets, Google Drive folders, and Azure Blob Storage.