Modern companies run on data, and much of it. A data warehouse provides a single place to aggregate data from all your IT systems where you can analyze it and develop the insights you need to be competitive. This guide to data warehouses will explain what a data warehouse is, why you need it, how it’s used and the benefits you can achieve.
Data Warehouse Definition
A “data warehouse” is a repository of historical data that is organized by subject to support decision makers in an organization. Data warehouses are systems used to store data from one or more disparate sources in a centralized place where it can be accessed for reporting and data analytics. The data in the data warehouse may be current or historical, and may be in its original raw data form or processed/summarized.
The data in a data warehouse is imported from source systems (such as ERP, CRM or Finance platforms) and gathered in the warehouse where it can be used across the enterprise for creating analytical reports and to support business decision-making. The general process used to aggregate and transform data for warehousing is referred to as “extract, transform and load,” or ETL for short. What this means is a company takes a copy of data from source systems, leaving the original data intact and in place – avoiding disruption to transactional processes that may be occurring.
Once data is loaded into the data warehouse, it is further refined and processed to remove data quality issues, integrate interdependent data sources and organize it for ease of consumption. Data warehouses also often contain pre-processed summaries of data and snapshots of data from different points in time that are used to assist in analysis. Where transactional systems are most concerned with maintaining the current state of data (and do so by overwriting values when data is updated), warehouses maintain the history of how a company’s data changes and evolves. This is particularly important when conducting trend analysis and other business analytics intended to answer questions about “why” something happened within a company.
Why Do Companies Need a Data Warehouse?
With companies embarking on digital transformation of their business operations, almost every business process is becoming dependent on a multitude of IT systems and the data those systems record and maintain. For companies to operate efficiently and achieve their goals, employees from top-level executives to individual contributors need efficient access to data and analytics that provide actionable insights into how a company is operating, areas of risk or concern and opportunities for competitive advantage. The information these people need comes from many different source systems, but for it to be easily accessible for analytics, companies need the data aggregated in a common place. That is a data warehouse.
- The need for information insights that span multiple source systems.
- Provide a long-term archive for transactional data, so source systems can be purged to maintain high performance.
- To provide a place where reporting and analytics can occur without creating an additional load on operational systems.
The need for integrated information insights is by far the biggest driver for investing in data warehouse systems, because it represents both strategic business needs (leading to competitive advantage) and tactical/operational needs for supporting day-to-day business operations. Data warehouses are costly IT investments, both to install initially and to operate. As such, they are viewed as a long-term investment and during time become part of the underlying fabric of a company’s IT ecosystem. The data stored in a data warehouse is often sourced from across the enterprise and enables users from disparate business functions to leverage data resources that extend far beyond their direct area of control or influence.
Data Warehouse Concepts
There are many different variations on data warehouses and related concepts that can be confusing for someone new to the subject. Here are some of the most commonly related data warehouse concepts to help demystify the data warehouse topic:
- Enterprise Data Warehouse (EDW) – EDW is a data warehouse designed to support an entire company, not just a single function. Enterprise data warehouses are common in large companies and serve the important role of bridging across business units, locations and fragmented IT systems. Even if your company has multiple transactional systems, an EDW can be used to gather all your data in one place for centralized analytics and reporting.
- Operational Data Store (ODS) – An operational data store is the part of your overall warehouse that contains the aggregated raw data from your transactional and operational systems before it is translated and summarized. Companies often maintain an ODS separate from their data marts as a means of providing analysts access to the underlying data used to generate reports.
- Data Mart – Data marts are a simplified view of data in a warehouse that is focused on a single subject or functional area. A single department within an organization often builds and controls data marts and may (or may not) be integrated with the enterprise data warehouse. Companies that lack a full-feature data warehouse may have some data marts instead. Data stored in data marts is typically organized and filtered to support the specific needs of the sponsoring department. Most data marts are refreshed nightly from source systems, so the data they contain may be as much as 24 hours old.
- OLAP – Online analytical processing are specialty systems designed to support data-mining activities. They apply multiple layers of complex algorithms to roll-up, drill-down and slice and dice raw data into business insights. OLAP systems typically execute multiple times during a day, generating analytics that is a few hours old (compared to data marts which often have latency of a full day).
- Business Intelligence (BI) – Business Intelligence is a broad term used to describe a set of techniques and tools for the acquisition and transformation of raw data into meaningful and useful information for business analysis purposes. Data warehouses often form the core of a company’s business intelligence capabilities in addition to analytics and reporting tools.
- Data Mining – Data mining is the process of discovering patterns in large data sets. Modern data mining often involves a combination of machine learning, artificial intelligence, statistics and data warehousing. Companies mine data to harvest actionable business insights that lead to competitive advantage.
- ETL – Export, Transform, Load, or ETL, for short is the process used to move data from transactional source systems into the data warehouse where it can be further refined and consumed. The types and extent of data transformation often determines the level of data quality in the data warehouse.
- Data Cleansing – Data cleansing is the set of activities that are undertaken to address quality issues in raw source data. Data combined from different sources not only inherits the quality issues from the source data (such as inaccuracies, missing data, incomplete records and broken relationships), but also is likely to include gaps, redundancies and conflicts between data sources. Data cleansing addresses these issues before the data goes into the data warehouse.
- Metadata – Metadata is data about your data, such as the size, format, source, descriptions, relationships and data classification. Metadata is important in a data warehouse, because it helps users easily find and understand data that has been moved from its original context.
- Data Dictionary – The data dictionary is a set of reference data about the data objects, elements and attributes stored in your data warehouse. It provides users a means of understanding the content and context of data beyond simple labels and field descriptions.
- Data Governance – Data governance is the set of processes and controls that ensures data is created and maintained in accordance with company standards, policies and business rules and, as data is transformed, it maintains adherence to data definitions and integrity constraints defined in the data model.
Data Warehouse Architecture
There are two types of architectures that are important to understand in a data warehouse. The system architecture of the various technical components that are collectively the data warehouse solution and the data architecture of the information stored in the data warehouse.
The system architecture of a data warehouse is generally aligned to stages of transformation that take place as raw data is refined into actionable and consumable information insights for users. It can be thought of as similar to a manufacturing workflow, transforming raw materials into consumable, finished goods with multiple stages of refinement throughout the process.
- Data Sources – These are the various source systems that provide data into the data warehouse.
- Staging Area or Data Lake – The staging area is used to store raw data imported from each of the different data sources, so it can be processed without causing performance impacts on transactional systems.
- Integration – The integration layer used to connect the disparate data from various sources together, establishing key relationships, resolving duplicates and storing the data in an operational data store (ODS).
- Warehouse Database – Data is moved from the operational data store into the core warehouse database where it is catalogued and arranged in hierarchical groups called dimensions and into facts and aggregate facts. This hierarchical organization makes it easier to find data in the warehouse and improves processing performance.
- Operational Data Warehouse – An operational data warehouse (ODW) addresses the need for operational analytics that provide the foundation for real-time analytics. Characteristics of an ODW include currency, high-performance, the need to handle updates efficiently and the flexibility to be deployed on-premises and in the cloud.
- Data Marts – Data marts are views, or subsets, of the data that are summarized and curated for a specific audience. The goals of data marts are to make accessing data in the data warehouse easier and to apply a level of access control to the data – ensuring only those people authorized to use it can.
- Users – These are the many individuals, business processes and systems that access data from the data warehouse. This includes both human users and other systems, such as Artificial Intelligence (AI) and decision support systems.
The data architecture of a data warehouse refers to how data is organized within the warehouse. Data architecture is a more complex topic, because the transformations that occur within the data warehouse system often involve changes from one data architecture to another. The data in most source systems is organized as relational database schemas following a set of principles called data normalization introduced by Edgar Codd in 1970. While normalized relational data architectures work well for transactional systems that are only concerned with current data, data warehouses transform the relational data into what is called a star schema, which aggregates data based on content and how it is consumed. Data marts (the views within a data warehouse consumed by users) may transform the data architecture again, into either business process or simplified subject area views that align to how the data is used for decision making.
What is a Data Warehouse System?
The data warehouse is a stable, read-only database that combines information from separate systems into one easy-to-access location. It is a layer on top of other databases that is specifically designed for supporting analytics. The term “data warehouse system” is used to refer to the set of components that work together to provide the overall data-warehousing capability to an organization. Many commercial software providers sell data warehouse systems, and are available in both on-premises and cloud-hosted options. A data warehouse system is typically comprised of one or more databases, tools for performing extract, transform and load (ETL) from source systems, capabilities for managing data schemas and the data dictionary, with tools for publishing data to data marts and consuming systems.
Data Warehouse Examples
There are many examples of companies using data warehouses today. Some common examples include:
Sales Data Warehouse
A specialized data warehouse containing data about a company’s sales and marketing activities. It would likely include content, such as customer data, sales transactions, marketing campaigns, customer sentiment/feedback and data about the competition. Sales and marketing teams use a sales data warehouse directly for lead generation and customer targeting as well as other business functions, such as product management developing new offerings and finance teams performing revenue-growth projections.
Manufacturing Data Warehouse
Companies that manage complex manufacturing processes, logistics operations and outsourced supply chains often have a dedicated data warehouse to aggregate all their manufacturing operations data in a common place for reporting. This enables them to perform robust analytics, looking, for example, for quality issues and performance improvement opportunities without disrupting manufacturing processes that must use transactional systems. Companies with outsourced supply chains also often use a manufacturing data warehouses as a place to aggregate data from multiple suppliers without developing numerous point-to-point integrations between ERP systems.
Enterprise Data Warehouse
Large organizations often choose to centralize their data archives into a single enterprise data warehouse that contains data from IT systems and processes across the organization. Although enterprise data warehouses can be expensive to build and operate, they provide the greatest opportunity for identifying actionable business insights that span business functions and organizational boundaries. An enterprise data warehouse can also provide a company with the capability for long-term data retention that may be necessary for regulatory compliance. Data can be stored in the EDW even after source systems have been retired and decommissioned.
Data Warehouse vs. Database
Databases are an important component of your data warehouse, but the two terms are not interchangeable. A database is the generic term for a storage system where you record data, which is used for many purposes, including, for example, transaction processing, supporting application functionality and enabling reporting. Databases include, for example, OLTP (application databases), OLAP (used in data warehouses), XML, CSV files, text files and spreadsheets. Most databases are constrained in use to a specific application, business process or purpose. A database designed to handle transactions isn’t structured to do analytics well.
In comparison, a data warehouse is a specialized set of capabilities for extracting data from transactional systems and storing them in a specific type of database that it organized and optimized to support data analysis and reporting. A typical data warehouse includes multiple databases that store data at different levels of transformation, including source databases, operational data stores, the core data warehouse database and specialized data marts that present filtered views of the data to users.
Benefits of a Data Warehouse
Companies implement data warehouses for many reasons, ranging from the need for strategic business insights to addressing tactical challenges of data aggregation and retention. In general, data warehouses provide the following benefits:
- Integrate Data from Multiple Sources into a Single Data Model – This benefit is of importance for companies that leverage pre-packaged and SaaS software offerings that have their own unique data model that cannot be customized to align to a company standard. The data warehouse provides a place where data-model incongruencies across source systems can be reconciled to assemble the big picture of a company’s data assets.
- Provide a Unified Place for Accessing Data – It can be both expensive and time-consuming for users to access data from the wide variety of source systems in use across a company. A data warehouse provides the opportunity to aggregate data in a common place where it can be organized and presented to users for easy use.
- Maintain Data History Without Slowing Source Systems – As business workflows operate, they are continuously producing new data. During time, the volume of new data created begins to slow transactional systems. By moving data to a data warehouse, the source systems can be purged of old data to maintain transactional processing efficiency.
- Enable a Centralized View Across the Enterprise – Most departments within an organization are territorial about the data they create, wanting to control and restrict who can access it and how it is used. While there are cases where this is a good practice (such as proprietary trade secrets), much of the data produced across a company has the potential for creating value in other parts of the enterprise. The data warehouse provides a common place where functionally siloed data can be gathered to create the big picture of enterprise data, which often yields insights that help business leaders with investment decisions and strategic planning.
- Cleanse and Reconcile Ambiguous and Duplicate Data – It is very rare for data from different source systems to fit together cleanly and seamlessly. More often, when you gather data, there are data conflicts, gaps, redundancies and missing pieces of information that must be fixed for the combined data set to be used effectively for analysis. The differences may be intentionally based on business need, but when used in the context of reporting can cause confusion. The data warehouse provides an opportunity for companies to apply business rules to resolve data-quality issues without having to change the source systems.
- Centralized Monitoring and Control to Enforce Data Access and Use Policies – Data governance and controlling who in the organization is authorized to access and use data is an important part of maximizing the value of a company’s data assets. Because a data warehouse provides a centralized repository of data from across the company, it also provides the ideal place to implement data-access policies.
- Avoid the Performance Impacts of Querying Transactional Systems for Reporting – Analytics queries tend to be very complex and consume considerable system-processing resources. Running analytics and reports against transactional systems can cause user interfaces to experience latency issues and business workflows to become slow. Data warehouses provide a separate environment where analytics queries can safely run without impacting the performance of source databases or the applications that rely on them.
- Organize Data so it Makes Sense to Business Users – Transactional systems have data structures optimized for the performance of the applications and business workflows that they support – not to be easily understood by human users. Data warehouses and data marts enable a company’s data to be re-organized, cataloged and described in ways that business users can understand and makes the data they need to make decisions easier to find.