What is a Data Lake?
A data lake is a repository for storing and processing structured, semi-structured, and unstructured data in its native format. Businesses create and receive large volumes of data, and a data lake lets them retain that data in full fidelity for analysis.
Why Use a Data Lake?
Businesses need a central place to store all the potentially valuable data collected by operational systems, weblogs, and edge systems. A data lake complements data warehouses, which are used to support regular reporting requirements. A data lake has a broader scope, providing a repository for data that is used infrequently or that supports occasional projects such as audits and the exploration of new business initiatives.
Managing an extensive database in-house can be challenging, so vast quantities of data are often kept in a cloud store such as Amazon S3, where they remain readily available.
Not all data formats can be accommodated within a relational database management system, so a data lake provides a repository for all data formats in a single file system or cloud store. The Actian Data Platform provides a standard SQL-based front end for a data lake, using extensions that allow external data to be accessed through the same APIs that are used for internally stored tables.
Data Lake Characteristics
Data lakes can be characterized by the following attributes:
- They store structured, semi-structured, and unstructured data.
- They can be centralized onto a single file system or cloud storage platform.
- They hold large quantities of data, typically 100 terabytes or more.
- They often use HDFS, HBase, or Spark interfaces.
- They may be accessed using Spark extensions by data management platforms such as the Actian Data Platform.
- They simplify data management.
- They prepare data for Artificial Intelligence and Machine Learning applications.
- They speed up analytics projects.
- They improve security and governance.
Data Lake Architecture
Data lakes tend to be big, so they need a scalable file system. One way to ensure that is to use a file system shared by multiple servers in a clustered configuration connected using a high-speed network. The Hadoop HDFS is highly scalable and offers high availability by maintaining redundant copies of data to protect against device failure. A master node distributes data across multiple servers to spread resources evenly across the cluster of servers.
Traditional Hadoop clusters used to be on-prem only. As the cost per terabyte of cloud-based storage has dropped and administration costs have fallen, the cloud has become the preferred place to build a data lake architecture. Cloud storage supports high-volume ingestion and consumption at a greater scale than home-grown clusters. Infrastructure-as-a-Service (IaaS) offerings are available from all the leading public clouds: Google Cloud, Microsoft Azure, and Amazon Web Services.
The Actian Data Platform supports Hadoop data formats such as Avro, ORC, Parquet, XML, and JSON.
Best Practices of Data Lake Implementation
Data lakes help lower the total cost of ownership and simplify data management by consolidating multiple disparate data silos.
Data does not have to be left in its raw form in the data lake as it can be pre-processed or transformed to prepare it for Artificial Intelligence (AI) and Machine Learning (ML) applications.
Most definitions of a data lake describe it as a singular data store. Data is usually generated in many places in a business and should be consolidated and federated across a small number of pools to keep management costs down. The reason for doing this is that it takes time and money to move data around, especially when you consider egress costs from a cloud platform.
A pragmatic approach would be to assess where data is generated and select a handful of places where it can be pooled. The data pools can then be virtually consolidated into a single, comprehensive view. Practical examples would be one pool per cloud platform and one per geography, so you might have pools for Asia, US, and EMEA data centers and aggregations on Azure, AWS, and Google Cloud.
Analytic engines can be deployed at each pool so you can look at fresh data with low latency and use the federated global view for functions such as monthly or annual reporting.
The Actian Data Platform is available for every cloud platform and on-prem to enable you to perform data analysis wherever it makes sense to locate your data lake. Only a subset of the data needs to reside within the Actian Data Platform. Infrequently used data sets can still be accessed through the Spark connector and have SQL predicates pushed down to the source dataset to improve query efficiency.
To provide transparency, data in the lake needs to be tagged with metadata that identifies the source system, data quality, age, and frequency of updates to aid in ascertaining how trustworthy it is. This metadata can be cataloged to aid with data provenance. Cataloging data sets improves governance and auditability of the data for regulatory purposes.
The Hadoop HDFS architecture is particularly well suited for a data lake use case due to its durability and clustered architecture used to distribute tasks across nodes.
Data Lake vs. Data Warehouse
The data lake concept is more recent than the data warehouse, which rose to popularity in the 1990s. The data lake is an evolution of the big data concept. Data lakes and data warehouses complement each other and can co-exist. Data lakes are designed to be a repository for full-fidelity data in all forms. In contrast, a data warehouse is designed to make prepared data available to users for interactive querying, BI tool dashboards, visualization, and AI and ML tools.
Data management platforms such as the Actian Data Platform provide a hybrid data warehouse that can access external data stored in a data lake.
Actian and Spark
Existing Hadoop-based data lakes can be accessed using SQL with Hive and Apache Impala, but it is possible to gain orders of magnitude better performance by loading the data into the Actian Data Platform. Additionally, the Actian Data Platform provides the ability to query Hadoop data in place using external tables. The external tables feature lets you read from and write to data sources stored outside of the Actian Data Platform. The data source must be one that Apache Spark is able to read from and write to, such as files stored in formats like Parquet, ORC, JSON, or tables in external database systems.
The CREATE EXTERNAL TABLE statement creates an Actian Data Platform table that points at existing data files in locations outside of the Actian Data Platform. This feature eliminates the need to import the data into a new table when the data files are already in a known location and in the desired file format.
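For illustration only, here is a minimal sketch of such a definition using the syntax described later in this section; the S3 path, the table name sales_ext, and its columns are hypothetical:
CREATE EXTERNAL TABLE sales_ext (
    sale_id INT4 NOT NULL,
    amount FLOAT8,
    region VARCHAR(10)
) USING SPARK
WITH REFERENCE='s3a://<bucket>/sales.parquet',
FORMAT='parquet';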
After the data file structure is mapped to the Actian Data Platform format using the CREATE EXTERNAL TABLE statement, you can do the following (a sketch of these operations appears below):
- Select, join, or sort external table data.
- Create views for external tables.
- Insert data into external tables.
- Import and store the data into the Actian Data Platform database.
The data is queried from its original locations, and the Actian Data Platform leaves the data files in place when you drop the table.
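As a hedged sketch of the operations listed above, continuing with the hypothetical sales_ext external table and assuming a hypothetical native table named regions (the import step is shown with CREATE TABLE AS SELECT; an INSERT ... SELECT into an existing table would work as well):
-- Select, join, and sort external table data
SELECT s.region, SUM(s.amount) AS total_sales
FROM sales_ext s JOIN regions r ON s.region = r.region_code
GROUP BY s.region
ORDER BY total_sales DESC;
-- Create a view over the external table
CREATE VIEW emea_sales AS SELECT * FROM sales_ext WHERE region = 'EMEA';
-- Import and store the data in a native Actian Data Platform table
CREATE TABLE sales_local AS SELECT * FROM sales_ext;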
The Actian Data Platform leverages Apache Spark’s extensive connectivity for external tables functionality.
The Actian Data Platform receives queries operating on external tables from the user or client app and rewrites them into JSON requests for external data. These requests are then translated into Spark jobs, which are launched. These jobs typically issue SparkSQL queries such as “INSERT INTO avalanche_table SELECT * FROM external_resource” for reading external data or “INSERT INTO external_resource SELECT * FROM avalanche_table” for writing to external systems. Finally, these jobs push and pull data into and out of the Actian Data Platform.
You can even insert new records into an external table.
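For example, a minimal sketch using the hypothetical sales_ext table; the write is carried out by Spark against the underlying data source, so that source must support writes:
INSERT INTO sales_ext VALUES (1001, 42.50, 'EMEA');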
The syntax for defining an external table is:
CREATE EXTERNAL TABLE table_name (column_name data_type {,column_name data_type})
USING SPARK
WITH REFERENCE='reference'
    [,FORMAT='format']
    [,OPTIONS=('key'=value {,'key'='value'})]
For more information, see CREATE EXTERNAL TABLE in the SQL Language Guide.
Reading and Writing to an External Table
After external tables are defined with the CREATE EXTERNAL TABLE syntax, they behave like regular Actian Data Platform tables. You can issue queries such as the following:
SELECT * FROM test_table_csv;
INSERT INTO my_table_orc SELECT some_column FROM other_table;
Predicate Pushdown from the Actian Data Platform to Spark
If a SQL query is issued against an EXTERNAL TABLE using Spark, predicates from the WHERE clause may be evaluated by Spark before the data is transferred, reducing the number of tuples sent to the Actian Data Platform. Only predicates that can be translated into a column value range are supported. For logical connections of simple predicates, IN and AND are supported on a single column and across columns; OR is supported only on a single column.
Where the whole complex predicate contains a single OR spanning different columns, nothing is pushed down to Spark. In this case, Spark transfers all tuples to the Actian Data Platform, and the filtering is done solely on the Actian Data Platform side.
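As an illustrative sketch using the hypothetical sales_ext table rather than an example from the product documentation, the first query below combines range predicates on single columns with AND, so the filter can be pushed down to Spark; the second combines predicates on different columns with OR, so all tuples are transferred and filtered on the Actian Data Platform side:
-- Pushed down: range predicates on single columns combined with AND
SELECT * FROM sales_ext WHERE amount > 1000 AND region = 'EMEA';
-- Not pushed down: OR spans different columns
SELECT * FROM sales_ext WHERE amount > 1000 OR region = 'EMEA';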
CREATE EXTERNAL TABLE Examples
- Define an external table for a CSV data source residing in Amazon S3:
CREATE EXTERNAL TABLE ext_csv (col1 INT4 NOT NULL, col2 VARCHAR(20) NOT NULL)
USING SPARK
WITH REFERENCE='s3a://<bucket>/file.csv'
OPTIONS=('DELIMITER' = '|', 'HEADER' = 'TRUE', 'SCHEMA'='col1 INT4 NOT NULL, col2 VARCHAR(20) NOT NULL');
- Define an external table for a CSV data source residing in Amazon S3. The CSV data does not have a header row:
CREATE EXTERNAL TABLE nation_s3 (
    n_nationkey INTEGER NOT NULL,
    n_name CHAR(25) NOT NULL,
    n_regionkey INTEGER NOT NULL
) USING SPARK
WITH REFERENCE='s3a://<bucket>/nation.csv',
FORMAT='csv',
OPTIONS=('HEADER'='FALSE', 'DELIMITER'='|', 'SCHEMA'='n_nationkey INT NOT NULL, n_name STRING NOT NULL, n_regionkey INT NOT NULL');
- Define an external table for an ORC data source from Azure:
CREATE EXTERNAL TABLE my_table_orc(a INT8 NOT NULL)
USING SPARK
WITH REFERENCE='abfs://loadtest@avalanchetest.dfs.core.windows.net/my_table.orc';
- Define an external table for an AVRO data source from Azure:
CREATE EXTERNAL TABLE tweets (
    username VARCHAR(20),
    tweet VARCHAR(100),
    timestamp VARCHAR(50)
) USING SPARK
WITH REFERENCE='abfs://loadtest@avalanchetest.dfs.core.windows.net/twitter.avro',
FORMAT='com.databricks.spark.avro';
- Define an external table for a JSON data source using the following JSON file:
[{ "symbol": "MSFT", "company": "Microsoft Corporation", "stock_tstamp": "2020-01-31T21:00:00+00:00", "price": 170.777, "volume": 36142690, "sector": "TECH" }, { "symbol": "AAPL", "company": "Apple Inc.", "stock_tstamp": "2020-01-31T21:00:00+00:00", "price": 309.51, "volume": 49897096, "sector": "TECH" }, { "symbol": "GOOG", "company": "Alphabet Inc.", "stock_tstamp": "2020-01-31T21:00:00+00:00", "price": 1434.23, "volume": 2417214, "sector": "TECH" }, { "symbol": "AMZN", "company": "Amazon.com, Inc.", "stock_tstamp": "2020-01-31T21:00:00+00:00", "price": 2008.72, "volume": 15567283, "sector": "TECH"}]
Example SQL to create an external table to reference that JSON file stored on an Azure ADL:
CREATE EXTERNAL TABLE techstocks (
    symbol VARCHAR(4),
    company VARCHAR(20),
    stock_tstamp TIMESTAMP,
    price FLOAT,
    volume INTEGER,
    sector CHAR(5)
) USING SPARK
WITH REFERENCE='abfs://stockquotes@eastusstockdata.dfs.core.windows.net/tech1.json',
FORMAT='json',
OPTIONS=('multiline'='true', 'SCHEMA'='symbol string, company string, stock_tstamp string, price double, volume integer, sector string');
- Define an external table for a CSV data source residing in Google Cloud Storage. The CSV data has no header row:
CREATE EXTERNAL TABLE stations ( n_stationkey INTEGER NOT NULL, n_stationname CHAR(25) NOT NULL, n_locationkey INTEGER NOT NULL ) USING SPARK WITH REFERENCE='gs://<path>/stations.csv',FORMAT='csv', OPTIONS=('HEADER'='FALSE','DELIMITER'='|','SCHEMA'='n_stationkey INT NOT NULL, n_stationname STRING NOT NULL, n_locationkey INT NOT NULL');
- Remove all rows where the text column contains a NUL (\u0000) character. The filtering is done entirely on the Spark side, with possible predicate pushdown to the data source:
CREATE EXTERNAL TABLE filter_test(id INT, text VARCHAR(20))
USING SPARK
WITH REFERENCE='test.csv',
FORMAT='csv',
OPTIONS=('header'='false', 'schema'='id Integer, text String', 'filter'='text NOT LIKE "%\u0000%"');
- Create an external table test_ext referencing a table test in another Actian Data Platform instance:
CREATE EXTERNAL TABLE test_ext(id Int, text VARCHAR(20))
USING SPARK
WITH REFERENCE='dummy',
FORMAT='vector',
OPTIONS=('host'='localhost', 'port'='VW7', 'database'='testdb', 'table'='test', 'user'='actian', 'password'='actian', 'staging'='select id, replace(text, "e", "i") as text from THIS_TABLE');
Staging SparkSQL creates an intermediate table that does string replacement. THIS_TABLE is a designated keyword.
- Map schema by simply unnesting a nested parquet file:
CREATE EXTERNAL TABLE unnested(id INT, name VARCHAR(20), surname VARCHAR(20))
USING SPARK
WITH REFERENCE='nested.parquet',
FORMAT='parquet',
OPTIONS=('SCHEMA'='id integer, fullname struct(name string, surname string)', 'STAGING'='select id, fullname.name as name, fullname.surname as surname from THIS_TABLE');
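As a closing usage sketch (not taken from the product documentation), an external table such as techstocks defined above can be queried like any other table, and dropping it removes only the table definition while leaving the underlying JSON file in place:
SELECT symbol, price FROM techstocks WHERE price > 500 ORDER BY price DESC;
DROP TABLE techstocks;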
Create your data lake environment and evaluate the capabilities of the Actian Data Platform for yourself.