Managing Data Across Distributed Data Warehouses
Actian Corporation
July 17, 2019
Data integration, like the digital transformation initiatives it supports, is a journey and not a destination. Every company is somewhere on a path from the past to a desired state of future integration they’d like to achieve. If your company has existed for a number of years, then you likely have multiple databases, data marts and data warehouses, developed for independent business functions, that now must be integrated to provide the holistic perspective that digitally transformed business processes require.
You may have the ambition to merge all of your data into a single data warehouse (a potentially multi-year effort); you might also decide to keep your legacy systems; or you could choose to restructure your data in a new way and distribute it across regional data warehouses. There is a commonality to any of these scenarios – you will be managing distributed queries for quite a while (if not indefinitely).
IT leaders are increasingly accepting the necessity of distributed queries, but, simultaneously, are becoming more concerned about performance implications for both operational systems and analytics that must leverage data from distributed warehouses. Cloud-based integration platforms and hybrid data warehouses are providing an answer to some of these challenges.
Why Are Distributed Queries Problematic?
Distributed queries that span multiple data warehouses are a performance challenge because of the latency caused by remote joins, iterative operations and large data sets that (in addition to being processed within a database) must also traverse network infrastructure. As analytics become more complex and the underlying data sets increase (both common situations), the query requires more time to process.
This isn’t a big deal for batch-processing operations, but the use cases for integrated data are increasingly focused on providing real-time insights. This may be part of a transactional process an agent is performing (such as checking order history, stocking or compiling a 360-degree view of the customer, etc.) or it may be real-time monitoring and analytics to drive operational decision-making/process-tuning. Either way, end users have a low tolerance for data-processing delays and expect the data to be correct, robust and current (real-time).
How to Address the Distributed Queries Challenge?
IT teams have a few available options to address the performance issues of distributed queries.
- Merge distributed data warehouses into a single instance – While this seems like the most efficient solution, it often is not a cost-effective alternative due to legacy infrastructure investments, migration costs and business disruption.
- Separate the queries and perform aggregation processing in the application layer – Many small companies have used this approach, relying on either Web services or client applications to combine data from distributed sources. While it avoids the distributed query problems, application infrastructure typically has less processing capacity and speed than database infrastructure and as a result, desired performance gains are rarely achieved.
- Add an operational data warehouse aggregation layer to the solution architecture – This approach is proving to be the most effective method to achieve high-performance data processing at enterprise scale. Leave the data in the source systems and replicate the info you need for queries into a specialized data warehouse layer that is purposely built for aggregation and servicing real-time query requests. These systems often can be implemented with minimal modification to application code and can be scaled using cloud services to support even large enterprise data sets.
Distributed data warehouses are likely to be a part of the IT ecosystem of companies for many years. Making informed decisions about how to manage data across these warehouses and support real-time, distributed-query operations is essential to helping your company move from basic digital transformation towards real-time, data-driven decision-making and enterprise business agility.
Actian cloud-based data management platforms can help. In addition to providing a hybrid cloud-based integration platform based on Actian DataConnect, Actian also enables organizations to deploy data warehouses across cloud platforms and on-premises.
Subscribe to the Actian Blog
Subscribe to Actian’s blog to get data insights delivered right to you.
- Stay in the know – Get the latest in data analytics pushed directly to your inbox
- Never miss a post – You’ll receive automatic email updates to let you know when new posts are live
- It’s all up to you – Change your delivery preferences to suit your needs