ETL is an acronym for extract, transform and load. The ETL process gets raw data out of source systems, refines it and loads it into a target data warehouse, where it can be used for business decision-making.
What Are ETL Tools?
Data engineers and data professionals use ETL tools to populate a data warehouse with data of sufficient quality that it can be depended upon for decision-making. ETL tools provide the means to simplify and manage the ETL process, making it possible to scale data movement to data warehouses through automation.
ETL tools ease connectivity to data sources and provide functions to filter, merge and fill data gaps using a staging database. The output of this staging and data-cleansing process is loaded into the target data warehouse. ETL tools also offer end-to-end monitoring of the data transfer and transformation process, along with scheduling capabilities for ongoing operational management. Most tools focus on the content and format of the data, relying on third-party file transfer technology to move data in streams or batches.
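As a minimal sketch of the kind of staging transform an ETL tool generates (all schema, table and column names here are hypothetical), a cleansing step might filter incomplete rows, merge two source feeds and fill gaps before the load:

```sql
-- Hypothetical staging step: filter bad rows, merge two feeds,
-- and fill data gaps before loading the target warehouse.
INSERT INTO warehouse.customer (customer_id, email, country_code)
SELECT s.customer_id,
       COALESCE(s.email, c.email) AS email,        -- fill gaps from the CRM feed
       UPPER(TRIM(s.country_code)) AS country_code -- normalize formatting
FROM staging.customer_feed s
LEFT JOIN staging.crm_feed c
       ON c.customer_id = s.customer_id            -- merge the two sources
WHERE s.customer_id IS NOT NULL;                   -- filter incomplete rows
```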
Types of ETL Tools
ETL tools can be segmented into the following categories, although many tools cover multiple segments:
- Batch ETL tools – that schedule data transforms and transfers overnight or in micro-batches (a simplified incremental-load sketch follows this list).
- Real-time ETL tools – that support data streaming or data replication through CDC (Change Data Capture).
- On-premise ETL tools – that provide downloadable design studio tools to ease development.
- Cloud-based ETL tools – that offer deployment across multiple cloud platforms.
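The batch and real-time styles can be illustrated with a simplified incremental load. True CDC tools read the database transaction log; the timestamp-watermark approach sketched below (the table names and the etl_watermark bookkeeping table are hypothetical) is a common micro-batch approximation:

```sql
-- Micro-batch increment: pull only the rows changed since the last run.
-- Real CDC reads the transaction log instead of comparing timestamps.
INSERT INTO warehouse.orders
SELECT o.*
FROM source.orders o
WHERE o.updated_at > (SELECT last_loaded_at
                      FROM etl_watermark
                      WHERE table_name = 'orders');

-- Advance the watermark so the next micro-batch starts where this one ended.
UPDATE etl_watermark
SET last_loaded_at = CURRENT_TIMESTAMP
WHERE table_name = 'orders';
```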
Is SQL an ETL Tool?
There is a class of ETL known as ELT, which loads raw data into the target database and transforms it within that same database. Structured Query Language (SQL) can be used for some ETL functions but lacks the monitoring and management capabilities of major ETL tools. SQL can perform data transformation using built-in functions, and it can filter, merge and sort data. If the data warehouse technology supports external data, the loading step can sometimes be bypassed entirely. However, querying external data typically carries a serious performance penalty because the data is not stored in the warehouse's optimized native format.
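As a sketch of the ELT pattern (schema and table names are illustrative), the raw data is loaded first and the transform then runs inside the target warehouse itself:

```sql
-- ELT: raw.sales has already been loaded; the warehouse does the transform.
CREATE TABLE reporting.daily_sales AS
SELECT sale_date,
       region,
       SUM(amount) AS total_amount -- aggregate inside the warehouse
FROM raw.sales
WHERE amount > 0                   -- filter out reversals and bad rows
GROUP BY sale_date, region;
```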
Many commercial databases offer distributed SQL capabilities, so you can create remote tables with a statement such as CREATE TABLE remote_table AS SELECT * FROM local_table, for example. Data can be moved across remote nodes using INSERT INTO local_table SELECT * FROM remote_table.
Is SSIS an ETL Tool?
Microsoft SSIS (SQL Server Integration Services) is an ETL tool with the ability to build workflows to support data pipelines for SQL Server data warehouses. SSIS includes a graphical designer used to develop ETL packages that include procedural logic and error handling. SSIS is very much geared towards the SQL Server platform, so it should not be considered a general-purpose ETL tool that spans platforms beyond Windows or works with non-Microsoft databases.
Data Management in Cloud Computing
The main difference between on-premise systems and cloud computing is that in the cloud you are using someone else’s server and storage resources over a secure internet connection. With a local setup in your own data center, you can build an application platform from bare metal: you choose the operating system (OS), decide whether to use virtualization software, and select direct-attached or network-attached storage. Everything is connected using high-speed gigabit network connections.
Data management on-premise is straightforward because your data and servers are in a single location with low-latency connections. You have purchased the hardware, so you don’t pay for usage through a metered subscription for CPU and storage. The downside of this approach is that you must buy more hardware when you exhaust the available capacity, and you usually buy bigger systems than you need because you must cater for usage peaks.
Data management in cloud computing environments has some distinct advantages. Because you are using a pay-as-you-go subscription pricing model, you don’t have to lay out capital budgets for expansion; you can buy additional storage as needed. Another major advantage is that cloud systems are increasingly software-defined, so you are no longer forced to size for peak load; you can expand and shrink your storage footprint as needed. If you are a retailer and your business is seasonal, you can size your storage and compute to match seasonal processing cycles.
Storage properties can be very different in the cloud. Every provider offers tiered storage, so you can choose to pay for expensive, high-speed SSD-based storage, or if performance is not as critical, you can use traditional hard disks to save money. Database technology in the cloud is becoming increasingly serverless, so you get to enjoy elastic compute and storage that is abstracted from physical server and storage device constraints. You simply choose the different storage and compute classes that meet your application’s needs.
High availability is also different in the cloud because you pick a cloud data center near where you generate and process your data. For high availability, you can split your storage across multiple storage devices to protect against device failure. To protect against data center failure due to disasters such as fire, flood or earthquake, you can designate a standby data center in a different geography.
Network latency is a significant consideration when you are operating in the cloud. Connections between cloud data centers will not be as fast as those within a given data center, so it is advisable to perform data analysis in the same cloud region that hosts your data lake. Public cloud providers usually levy egress charges based on the volume of data you move, which is an additional reason to process data where it is created.
Cloud providers have their own data management ecosystems, such as Google BigQuery, Azure Synapse and Amazon Redshift, which provide compelling reasons to lock yourself into their platform. However, most businesses don’t want to single-source critical technology, preferring to be able to procure the best value when needed. For this reason, data management solutions that span multiple cloud platforms and can run on-premise offer maximum flexibility. The Actian Data Platform offers this flexibility and is designed to deliver high performance and scale across data volumes, concurrent users and query complexity.
Cloud Data Management Mistakes to Avoid
Cloud data management mistakes to avoid include:
- Avoid vendor lock-in by selecting a data management solution that spans clouds and offers on-prem options. Choosing Redshift, for example, makes it hard to migrate to cloud platforms beyond AWS.
- Don’t put your data in a different cloud region from where you process that data, risking high egress charges. Sometimes it is more cost-effective to ship bulk data by truck than over internet connections.
- Don’t use a cloud data warehouse that is incompatible with your on-premise technology to keep training costs down and migration options open. Actian offers the same database engines in the cloud as on-premise.
- Don’t fragment your data. Try to consolidate data onto as few platforms as possible. If you are collecting data at the network edge for an IoT app, consolidating to three or four data centers helps control fragmentation.
- Systems that couple storage to compute can be wasteful, so look for data management solutions that let you scale compute and storage independently. Actian and Snowflake exploit decoupled compute and storage capabilities in cloud platforms.
Look for best-in-class infrastructure, including the latest hardware and GPUs, broad application support, solid security, expert support, and a reasonable, easy-to-understand cost structure.
The Evolution of ETL
- In the 1970s, databases were loaded using custom code or entered by humans doing manual data entry.
- In the 1980s, batch loaders imported flat files to databases such as DB2, Ingres and Oracle.
- In the 1990s, data warehouses began to use a formal ETL process.
- In the 2000s, ETL became more formalized, and newer dedicated ETL tools emerged.
- In the 2010s, cloud computing and SaaS data warehouses rose to prominence.