Data Lake

Data Lake vs. Data Warehouse


What is data lake storage? Is it the same as data warehouse storage? A data lake and a data warehouse may sound like they have the same uses — after all, lakes and warehouses both store things. While both are data repositories and can work synchronously together, each has a different purpose when collecting data for your business. Knowing their different functions, benefits, appropriate uses, and how they can be used together will help you decide what solutions to add to your data platform and help support your larger data governance framework. Let’s dive in! 

What is a Data Lake? 

A data lake is a data repository that accumulates large volumes of structured, semi-structured, and unstructured data in its full-fidelity native format, including flat files, log files, and spreadsheet exports. Think about how a lake collects and holds water from various sources, all mixed together and unorganized.

Data lake products are good options for organizations that must collect large volumes of diverse data for processing. However, unstructured data stored in a data lake must be converted into a structured format before it can be analyzed or used productively. 
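To make that conversion step concrete, here is a minimal sketch in plain Python, using an invented log-line format, of the kind of transform that turns raw lake files into analyzable records:

```python
import re

# Hypothetical raw web-server log lines, as they might land in a data lake.
raw_lines = [
    '2024-03-01T12:00:00 GET /products 200',
    '2024-03-01T12:00:05 POST /cart 302',
]

# A simple schema applied at transform time: timestamp, method, path, status.
LOG_PATTERN = re.compile(r'^(\S+) (\S+) (\S+) (\d{3})$')

def structure(line):
    """Parse one raw line into a structured record, or None if it doesn't match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    ts, method, path, status = m.groups()
    return {'timestamp': ts, 'method': method, 'path': path, 'status': int(status)}

# Keep only the lines that conform to the schema; the rest stay raw in the lake.
records = [r for r in map(structure, raw_lines) if r is not None]
```

In a real pipeline this step would typically run in a framework like Apache Spark rather than a standalone script, but the principle is the same: the schema is imposed during the transform, not when the raw files were first collected.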

What is a Data Warehouse?

A data warehouse is a relational database system containing data from multiple sources that is structured for easy analysis and decision support. Think about how inventory in a warehouse is carefully organized and cataloged in a way that makes it easier to find and access.  

A data warehouse is designed to make prepared data available to users for Business Intelligence (BI) dashboards, visualization, Artificial Intelligence (AI), and Machine Learning (ML) tools. However, unlike a data lake, a data warehouse architecture isn’t as effective at storing large volumes of data, especially in high fidelity.   

Data Lake vs. Data Warehouse: Which to Choose

Deciding whether to connect your data pipeline to a lake or a warehouse can be a nuanced decision that depends on your business’s data needs. And in some cases, you may find that your organization needs to utilize both, which we’ll discuss later. Here’s a breakdown of the main differences surrounding data warehouses vs. data lakes to help you decide.  

| | Data Warehouses | Data Lakes |
| --- | --- | --- |
| Primary Function | Analysis | Storage |
| Types of Data | Structured data that needs immediate analysis | Structured, unstructured, and semi-structured data that needs retention, future analysis, or audits |
| Data Structure | Hierarchical dimensions and tables | Flat architecture |
| Processing Method | ETL (Extract, Transform, Load) | ELT (Extract, Load, Transform) |
| Ideal Data Formats | Columnar (Parquet, ORC, Delta) | Apache Parquet, Apache Avro, Apache Iceberg, CSV, JSON |
| Languages & Tools | SQL | SQL, U-SQL, Scala, Python, R, .NET, plus frameworks such as Apache Spark, TensorFlow, scikit-learn, MLlib, MXNet, and more |
| Schema | Schema-on-Write | Schema-on-Read |
| Scalability | Low | High |
| Cloud-Based Available? | Yes | Yes |
| Who Uses Them? | Data Engineers, Data Scientists, Business Analysts, Metadata Architects | Chief Data Officers, Data Scientists, Business Analysts, Managers, Data Developers |
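The schema row is one of the most practical differences in the comparison above. The sketch below, plain Python with hypothetical records, contrasts schema-on-write, where records are typed and validated as they are loaded, with schema-on-read, where raw text is stored as-is and a schema is applied only at query time:

```python
import json

# Schema-on-write (warehouse-style): type and validate records as they are loaded.
# A record with a missing column or an uncastable value fails at load time.
def write_validated(store, record, schema):
    typed = {col: cast(record[col]) for col, cast in schema.items()}
    store.append(typed)

# Schema-on-read (lake-style): store raw text now, impose a schema only when querying.
# Extra fields are simply ignored; bad data surfaces at read time instead.
def read_with_schema(raw_store, schema):
    for line in raw_store:
        rec = json.loads(line)
        yield {col: cast(rec[col]) for col, cast in schema.items()}

schema = {'symbol': str, 'price': float}

# Warehouse path: the string price is coerced to a float on the way in.
warehouse = []
write_validated(warehouse, {'symbol': 'MSFT', 'price': '170.77'}, schema)

# Lake path: the raw JSON line keeps its extra field until a schema is applied.
lake = ['{"symbol": "AAPL", "price": 309.51, "extra_field": true}']
rows = list(read_with_schema(lake, schema))
```

The trade-off this illustrates: schema-on-write catches bad data early and keeps every stored record query-ready, while schema-on-read lets you ingest anything immediately and defer (or vary) the interpretation.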

As you can see, there are many differences between a data warehouse vs. a data lake — not only in how organizations utilize them but also in how they function. While data lakes and data warehouses may seem interchangeable, which one you choose for your business needs can depend on your current data systems, architecture, and capabilities. 

Benefits

Data warehouses and data lakes each have their own benefits for companies that value their data. Review the benefits to help you further understand which is the best choice for your business needs. 

Data Warehouse Benefits

  • Because data warehouses are designed for data analysis, they provide more effective business insights that can improve decision-making at management levels. 
  • Speed is another significant benefit of using data warehouses. Workers spend less time combing through organized, structured data than unstructured data, allowing them to gain insights faster. 
  • Data warehouses also support data democratization, speeding data distribution to data stewards across your organization. Stakeholders who rely on your data don’t have to wait for someone to compile it into a polished report before they can take action.
  • Another benefit of storing all your data in a single location is that it allows IT and cybersecurity professionals to shore up the defenses of a single repository instead of needing to protect multiple storage locations. 
  • While there are benefits to an on-site data warehouse, a cloud-based data warehouse allows you to scale it up or down based on your business needs instead of switching servers when your regular data volumes change. 
  • One of the main functions of a data warehouse is to automatically standardize data, which improves data quality. 

Data Lake Benefits

  • Data lakes allow data collection from multiple sources in their native formats instead of creating architectures for different formats.  
  • You can store petabytes of data in data lakes. 
  • As with data warehouses, IT and cybersecurity specialists only need to defend one data repository against threats and attacks. 
  • Also like data warehouses, data lakes allow for better data democratization.  
  • Data lakes, such as Hadoop-based data lakes, let you apply multiple schemas to the same data or work with it without any schema at all. 
  • Unlike traditional data storage, data lakes are compatible with AI and machine learning analysis tools. 

How Data Lakes & Data Warehouses Work Together 

In short, data warehouses are designed primarily for analyzing structured data, while data lakes are designed primarily for storing unstructured data. The two can be integrated, so businesses can store and analyze data seamlessly. Using products like the Actian Data Platform, you can query and write Hadoop data using the external tables feature. The data source must be read- and write-compatible with Apache Spark, like Parquet, ORC, or JSON file formats or tables in external database systems.  

The CREATE EXTERNAL TABLE statement creates a table in the Actian Data Platform that points to existing data files in locations outside of the platform. That way, you don’t need to import the data into a new table when the data files are already in a known location in the desired file format. 

After the data file structure is mapped to the Actian Data Platform format using the CREATE EXTERNAL TABLE statement, you can: 

  • Select, join, or sort external table data. 
  • Create views for external tables. 
  • Insert data into external tables. 
  • Import and store the data into an Actian Data Platform database. 

The data is queried from its original locations, and the Actian Data Platform leaves the data files in place when you drop the table. 

The Actian Data Platform receives queries operating on external tables from the user or client app and rewrites them into JSON requests for external data. These requests are then translated into Spark jobs and launched. These jobs typically issue SparkSQL queries such as "INSERT INTO avalanche_table SELECT * FROM external_resource" to read external data, or "INSERT INTO external_resource SELECT * FROM avalanche_table" to write to external systems. Finally, these jobs push and pull data in and out of the Actian Data Platform. 

You can even insert new records into an external table. The syntax for defining an external table is: 

CREATE EXTERNAL TABLE table_name (column_name data_type {,column_name data_type}) USING SPARK WITH REFERENCE='reference' [,FORMAT='format'] [,OPTIONS=('key'=value {,'key'='value'})] 

For more information, see CREATE EXTERNAL TABLE in the SQL Language Guide. 

After external tables are defined with the CREATE EXTERNAL TABLE syntax, they behave like regular tables. You can issue queries such as the following: 

SELECT * FROM test_table_csv;

INSERT INTO my_table_orc SELECT some_column FROM other_table;

Predicate Pushdown From the Actian Data Platform to Spark 

If an SQL query is issued to an EXTERNAL TABLE using Spark, predicates from the WHERE clause may already be evaluated by Spark, reducing the number of tuples sent. Only predicates that can be translated into a column value range are supported. For logical connections of simple predicates, IN and AND are supported on a single column and across columns. OR is supported only on a single column. 

Where the whole complex predicate contains a single OR spanning different columns, nothing is pushed down to Spark. In this case, Spark transfers all tuples to the Actian Data Platform, and the filtering is done solely on the Actian side. 
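As an illustration of these rules (a sketch, not Actian's actual implementation), here is a small Python model with an invented predicate representation that decides whether a WHERE clause could be pushed down to Spark:

```python
# Hypothetical predicate trees: ('cmp', column, op, value) leaves,
# combined by ('and', left, right) or ('or', left, right) nodes.

def columns(pred):
    """Set of column names referenced anywhere in a predicate tree."""
    if pred[0] == 'cmp':
        return {pred[1]}
    return columns(pred[1]) | columns(pred[2])

def pushable(pred):
    """True if the whole predicate may be evaluated by Spark under the rules
    above: comparisons translatable to a value range, AND on one or many
    columns, OR only when it stays on a single column."""
    if pred[0] == 'cmp':
        return pred[2] in ('<', '<=', '=', '>=', '>')
    if pred[0] == 'and':
        return pushable(pred[1]) and pushable(pred[2])
    if pred[0] == 'or':
        return len(columns(pred)) == 1 and pushable(pred[1]) and pushable(pred[2])
    return False

# price > 100 AND sector = 'TECH'  -> AND across columns: pushed down
p1 = ('and', ('cmp', 'price', '>', 100), ('cmp', 'sector', '=', 'TECH'))
# price > 100 OR sector = 'TECH'   -> OR spans two columns: filtered on the Actian side
p2 = ('or', ('cmp', 'price', '>', 100), ('cmp', 'sector', '=', 'TECH'))
# price < 50 OR price > 100        -> OR on a single column: pushed down
p3 = ('or', ('cmp', 'price', '<', 50), ('cmp', 'price', '>', 100))
```

In this simplified model the decision is all-or-nothing, matching the behavior described above: one cross-column OR anywhere in the predicate prevents any pushdown.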

Here’s an example of how to use CREATE EXTERNAL TABLE: 

  1. Define an external table for a CSV data source residing in Amazon S3: 

CREATE EXTERNAL TABLE ext_csv (col1 INT4 NOT NULL, col2 VARCHAR(20) NOT NULL) USING SPARK WITH REFERENCE='s3a://<bucket>/file.csv' OPTIONS=('DELIMITER' = '|', 'HEADER' = 'TRUE','SCHEMA'='col1 INT4 NOT NULL, col2 VARCHAR(20) NOT NULL'); 

  2. Define an external table for a CSV data source residing in Amazon S3. The CSV data does not have a header row: 

CREATE EXTERNAL TABLE nation_s3 ( n_nationkey INTEGER NOT NULL, n_name CHAR(25) NOT NULL, n_regionkey INTEGER NOT NULL ) USING SPARK WITH REFERENCE='s3a://<bucket>/nation.csv',FORMAT='csv', OPTIONS=('HEADER'='FALSE','DELIMITER'='|','SCHEMA'='n_nationkey INT NOT NULL, n_name STRING NOT NULL, n_regionkey INT NOT NULL'); 

  3. Define an external table for an ORC data source from Azure: 

CREATE EXTERNAL TABLE my_table_orc(a INT8 NOT NULL) USING SPARK WITH REFERENCE='abfs://loadtest@avalanchetest.dfs.core.windows.net/my_table.orc'; 

  4. Define an external table for an AVRO data source from Azure: 

CREATE EXTERNAL TABLE tweets (username VARCHAR(20), tweet VARCHAR(100), timestamp VARCHAR(50)) USING SPARK WITH REFERENCE='abfs://loadtest@avalanchetest.dfs.core.windows.net/twitter.avro', FORMAT='com.databricks.spark.avro'; 

  5. Define an external table for a JSON data source using the following JSON file: 

[
  { "symbol": "MSFT", "company": "Microsoft Corporation", "stock_tstamp": "2020-01-31T21:00:00+00:00", "price": 170.777, "volume": 36142690, "sector": "TECH" },
  { "symbol": "AAPL", "company": "Apple Inc.", "stock_tstamp": "2020-01-31T21:00:00+00:00", "price": 309.51, "volume": 49897096, "sector": "TECH" },
  { "symbol": "GOOG", "company": "Alphabet Inc.", "stock_tstamp": "2020-01-31T21:00:00+00:00", "price": 1434.23, "volume": 2417214, "sector": "TECH" },
  { "symbol": "AMZN", "company": "Amazon.com, Inc.", "stock_tstamp": "2020-01-31T21:00:00+00:00", "price": 2008.72, "volume": 15567283, "sector": "TECH" }
]

Example SQL to create an external table referencing that JSON file stored in Azure Data Lake Storage: 

CREATE EXTERNAL TABLE techstocks ( symbol VARCHAR(4), company VARCHAR(20), stock_tstamp TIMESTAMP, price FLOAT, volume INTEGER, sector CHAR(5) ) USING SPARK WITH REFERENCE='abfs://stockquotes@eastusstockdata.dfs.core.windows.net/tech1.json', FORMAT='json', OPTIONS=( 'multiline'='true', 'SCHEMA'= 'symbol string, company string, stock_tstamp string, price double, volume integer, sector string'); 

  6. Define an external table for a CSV data source in Google Cloud Storage. The CSV data has no header row: 

CREATE EXTERNAL TABLE stations ( n_stationkey INTEGER NOT NULL, n_stationname CHAR(25) NOT NULL, n_locationkey INTEGER NOT NULL ) USING SPARK WITH REFERENCE='gs://<path>/stations.csv',FORMAT='csv', OPTIONS=('HEADER'='FALSE','DELIMITER'='|','SCHEMA'='n_stationkey INT NOT NULL, n_stationname STRING NOT NULL, n_locationkey INT NOT NULL'); 

Data Lake & Data Warehouse Use Cases 

Are you still unsure whether data lake or data warehouse software is more appropriate for your business? Are you unconvinced that you need either? Here are some industry-specific use cases to show you how organizations around the world employ these solutions to streamline their professional workflows.  

Data Lake Use Cases 

  1. Retail professionals can store purchase data, information on products added to carts and not purchased, and browsing patterns. They can also use natural language processing (NLP) to store company feedback without sending out surveys.
  2. Healthcare industry members, like your doctor, can collect several types of IoT data from multiple health monitoring devices and store them in their native formats for later analysis.
  3. If you work in the banking or financial industry, you can collect data like monetary records and transaction histories from customers worldwide in one centralized, high-capacity location.
  4. Optimizations in the transportation industry rely on many kinds of complex data, sometimes collected from across the country, and data lakes can be the best solution for storing the sheer volume of data that needs to be analyzed.
  5. Many manufacturers must coordinate material sourcing, manage production lines, plan shipping logistics, and more. Data lakes can store data from all these parts of the business, even those that are off-site, in one easy-to-access location.

Data Warehouse Use Cases 

  1. Retailers can analyze customer purchase data in data warehouses to gain insights into buying patterns and other customer preferences to improve their shopping experiences, inform remarketing efforts, and build customized campaigns. 
  2. After IoT device data in several formats is stored in data lakes, it can be organized and streamlined in data warehouses. From there, it can be distributed to doctors to give them a fuller picture of ongoing conditions and more accurately recommend treatments. 
  3. Once financial data is gathered from individuals, banks can analyze it in data warehouses to assess a person’s loan worthiness, trigger overdraft fees, or recommend credit card limits — all based on personal buying habits backed by hard data. 
  4. Running traffic or flight data through a data warehouse can streamline it for predictive analysis, which can be crucial for optimizing routes and schedules in the transportation industry. 
  5. With so many moving parts involved in running a manufacturing business, it can be hard to see the big picture in order to make optimizations. Running data through a data warehouse can help managers see the bigger picture of how effectively each part of the manufacturing and shipping process operates and give insights into areas that need improvement.