ETL

What is ETL?

ETL is short for Extract, Transform, and Load. Data engineers and other data professionals use the term to describe the multi-step process of getting raw data into a form that supports valuable data analytics and, ultimately, better business decisions.

ETL Meaning

The best way to explain what ETL means is to take each letter in turn and describe its role in the ETL process.

The Extraction Step

The first phase of the ETL process involves extracting raw operational data from many source systems. If the dataset you are loading does not need to be refreshed more than once per day and is not too big, you can extract a daily copy and overwrite the table in your target database. Relational databases provide a SQL statement called TRUNCATE TABLE, which resets the table's high-water mark to zero and instantly empties it. This is far more efficient than a SQL DELETE operation, which must process each row and consumes many more CPU cycles. This approach is known as a full extraction.
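A minimal sketch of a full extraction in Python is shown below. It assumes psycopg2-style DB-API connections, a hypothetical daily_sales table, and an invented column list; the %s parameter style is also an assumption.

```python
# Full-extraction sketch: overwrite the target copy of a small table with a
# fresh daily snapshot. Connections, table name, and columns are placeholders.

def full_extract_and_reload(source_conn, target_conn, table="daily_sales"):
    """Empty the target table, then reload it with today's copy of the source."""
    with source_conn.cursor() as src, target_conn.cursor() as tgt:
        # TRUNCATE resets the high-water mark instead of deleting row by row.
        tgt.execute(f"TRUNCATE TABLE {table}")

        src.execute(f"SELECT id, region, amount, sold_at FROM {table}")
        rows = src.fetchall()

        tgt.executemany(
            f"INSERT INTO {table} (id, region, amount, sold_at) "
            "VALUES (%s, %s, %s, %s)",
            rows,
        )
    target_conn.commit()
```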

If the source data needs to be cleaned up or transformed, or if it is very large, it is better to extract only the most recent data, clean it, and load it. This approach is known as a partial extraction.

The most sophisticated extraction is known as a partial extraction with update notification. This approach is used when your target database needs near real-time updates of source data changes. In a stock trading system, for example, the source system can push changes to several subscribing target systems. Suppose the source application is a mobile phone network tracking thousands of call transactions per second. In that case, the extraction can be performed in micro-batches that poll the source system for changes every 5 seconds or so and apply those changes to the target system in near real-time.
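The micro-batch pattern can be sketched as a simple polling loop that remembers a high-water mark, as below. The trades table, its updated_at column, and the psycopg2-style connections are assumptions for illustration.

```python
import time
from datetime import datetime

# Micro-batch polling sketch: every few seconds, pull only the rows that changed
# since the last successful extraction and apply them to the target.

def poll_changes(source_conn, target_conn, interval_seconds=5):
    watermark = datetime(1970, 1, 1)   # timestamp of the last change already applied
    while True:
        with source_conn.cursor() as src:
            src.execute(
                "SELECT id, symbol, price, updated_at FROM trades "
                "WHERE updated_at > %s",
                (watermark,),
            )
            changes = src.fetchall()

        if changes:
            with target_conn.cursor() as tgt:
                tgt.executemany(
                    "INSERT INTO trades (id, symbol, price, updated_at) "
                    "VALUES (%s, %s, %s, %s)",
                    changes,
                )
            target_conn.commit()
            watermark = max(row[3] for row in changes)   # advance the high-water mark

        time.sleep(interval_seconds)
```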

Many database systems support triggers that fire when an event such as an insert, delete or update operation has occurred. These triggers can be programmed to update a log of changes for downstream systems to extract data from. Using a messaging system such as Amazon SNS enables you to send notifications directly to your target systems to inform them of updates to the source system. This kind of asynchronous relationship between source and target data systems ensures the operational source systems never get held up because a target system cannot accommodate the update for any reason.
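A trigger or change-log reader might publish such a notification with boto3, roughly as sketched below. The topic ARN, payload fields, and function name are hypothetical; only the standard SNS publish call is assumed.

```python
import json
import boto3

# Sketch of a change notification published to Amazon SNS so downstream target
# systems can subscribe asynchronously instead of blocking the source system.

sns = boto3.client("sns")

def notify_change(record_id, operation):
    """Tell subscribers that a source row was inserted, updated, or deleted."""
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:source-changes",  # placeholder ARN
        Message=json.dumps({"record_id": record_id, "operation": operation}),
        MessageAttributes={
            "operation": {"DataType": "String", "StringValue": operation}
        },
    )
```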

The relationship between source and target systems can be a simple point-to-point connection, which is easy to set up and maintain. Over time, however, businesses create hundreds of such one-to-one connections, which become too burdensome to maintain and administer. At this point, a vended ETL tool that consolidates the management of all the connections becomes a great investment. A very important aspect of extraction is remembering the data source, so systems can catalog it to maintain and audit the trail of where data came from. Data provenance (also referred to as “data lineage”) is metadata paired with records that details their origin and the changes made to them along the way. Tracking sources in this way helps control duplication and informs whether data can be trusted as reliable.

Extracting data can be as simple as importing a CSV version of a spreadsheet or as complex as using a web service API. Most business applications provide APIs to make extraction possible using an application or ETL tool. The format of the source data often dictates how it can be extracted. A JSON stream, for example, requires the use of a web services API. A flat file can be read sequentially. A database source usually has an export utility to unload data into a flat file. In a mainframe system, for example, a VSAM dataset that is normally accessed by key can be unloaded to a sequential flat file for extraction.
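Two of these extraction paths might look like the following sketch: a flat CSV export read sequentially and a JSON payload fetched from a web service API. The file name, URL, and bearer token are placeholders.

```python
import csv
import requests

def extract_csv(path="customers_export.csv"):
    """Read a flat-file export sequentially; each row becomes a dict."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def extract_api(url="https://example.com/api/v1/orders", token="..."):
    """Pull JSON records from a web service API."""
    response = requests.get(
        url, headers={"Authorization": f"Bearer {token}"}, timeout=30
    )
    response.raise_for_status()
    return response.json()
```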

The Transform Step

Raw data from an operational source usually needs some manipulation and cleansing to make it suitable for general consumption. Data transformation often takes place in an intermediate or staging database. Raw source data can contain excess data that needs to be filtered out. It can have gaps that can be filled with default values or interpolated data. Duplicate data can be aggregated or removed depending on the intended use in the target database. If related data is being extracted from multiple sources, it can be merged to create a single record. Data values can also be normalized. For example, different sources of customer data may record the same values differently: one system might spell out the state name, as in “New York”, while others use an abbreviation such as “NY”. Standardizing on the abbreviation can be done during data transformation. Transformation can also include data quality operations, such as verifying values across multiple sources before passing them downstream as trusted values.
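A minimal transformation sketch using pandas is shown below. The column names and the abbreviation map are hypothetical; it standardizes state values, fills gaps with a default, removes duplicates, and merges two sources into one record per customer.

```python
import pandas as pd

STATE_ABBREVIATIONS = {"New York": "NY", "California": "CA"}

def transform(crm: pd.DataFrame, billing: pd.DataFrame) -> pd.DataFrame:
    crm = crm.copy()
    crm["state"] = crm["state"].replace(STATE_ABBREVIATIONS)   # normalize values
    crm["segment"] = crm["segment"].fillna("unknown")          # fill gaps with a default
    crm = crm.drop_duplicates(subset="customer_id")            # remove duplicate records
    # Merge related data from a second source into a single record per customer.
    return crm.merge(billing, on="customer_id", how="left")
```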

The Load Step

Data loading into the target system is the final step in the ETL process. The data load can be done using a bulk loading utility, data streaming, asynchronous messaging, or a custom program. Existing data can be overwritten entirely, augmented, or partially updated with fresh values. Updates can be performed in batches, such as overnight runs, periodically in micro-batches, or continuously as a data stream.
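A hedged sketch of a partial-update load (an "upsert") follows, assuming a PostgreSQL target, a psycopg2-style connection, and a hypothetical customers table; fresh rows are inserted and existing rows get updated values.

```python
# Upsert sketch: insert new rows and partially update existing ones in one pass.

UPSERT_SQL = """
    INSERT INTO customers (customer_id, name, state, segment)
    VALUES (%s, %s, %s, %s)
    ON CONFLICT (customer_id)
    DO UPDATE SET name = EXCLUDED.name,
                  state = EXCLUDED.state,
                  segment = EXCLUDED.segment
"""

def load(target_conn, rows):
    """Bulk apply a batch of (customer_id, name, state, segment) tuples."""
    with target_conn.cursor() as cur:
        cur.executemany(UPSERT_SQL, rows)
    target_conn.commit()
```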

ETL History

In the 1970s, databases were loaded using custom code or entered by data entry clerks. Volumes were low by today’s standards, and error rates could be high due to human error. Databases were primarily used for running operational systems; any analysis was done by people looking at printouts.

In the 1980s, minicomputers and personal computers could be used to create flat files for loading into analytic systems using batch loading programs. The emergence of spreadsheets for data analytics led to the standardization of delimited formats such as CSV for sharing data across computers. Networks were slow, so data was often transported on magnetic tape and floppy disks. Data analysis was performed using statistical methods such as X11 and Box-Jenkins for multivariate regression analysis, and data was reported using programs written in 3GL languages such as FORTRAN.

The 1990s saw the rise of the data warehouse, which brought with it the ETL process. Decision support systems were built on databases running on various versions of Unix and Linux. Enterprise or centralized data warehousing functions could be distributed to more localized systems, such as business-unit or region-specific data marts. Data integration was typically point-to-point.

The 2000s saw ETL standardized by vendors offering data integration tools that helped organizations scale the movement of data between systems and provided desktop graphical user interfaces (GUIs) for creating and managing ETL processes.

In the 2010s, internet-based data transfer and SaaS applications became very popular, as did the internet-scale publish-subscribe offerings on public cloud platforms. The level of sophistication of ETL and data integration models grew with the advent of hub-and-spoke-based data models.

Today, ETL is ubiquitous, and the market is consolidating ETL and database management functions into integrated cloud-based services.

How Does ETL Work?

We can use a Customer360 application as an example to illustrate how ETL works. Account managers and salespeople use the target database to sell services to new customers, cross-sell new services to existing customers and renew existing subscriptions.

Before calling the customer or prospective customer, the team needs to know as much about them as possible to show they care enough to have done their research in advance. The target system needs to show how long they have been a customer, how much they have spent, and which services they have today and have had in the past. Satisfaction measures such as survey ratings and open support cases are also critical to know about, as it's hard to sell to unhappy customers.

The source data needs to come from systems such as Salesforce, ServiceNow and SAP. Salesforce data needs to be extracted by exporting a CSV file based on a customer-specific report. ServiceNow provides a REST API that an ETL tool can interrogate from a command script. SAP can create a CSV file that can be imported into a BI tool’s repository, which is often an embedded edition of SQL Server. All the data sources extracted by the ETL process can be used to create a customer 360 dashboard for the sales or account management teams.
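A rough sketch of how such a consolidation might look in code is shown below, assuming a hypothetical Salesforce CSV export and a ServiceNow-style REST endpoint that wraps records in a "result" list; the file name, URL, field names, and credentials are all placeholders.

```python
import csv
import requests

# Customer 360 sketch: combine a CRM CSV export with open support-case counts
# pulled from a REST endpoint, keyed by a shared account id.

def build_customer_360(csv_path="salesforce_accounts.csv",
                       cases_url="https://example.service-now.com/api/now/table/incident",
                       auth=("user", "password")):
    with open(csv_path, newline="") as f:
        accounts = {row["account_id"]: row for row in csv.DictReader(f)}

    cases = requests.get(cases_url, auth=auth, timeout=30).json().get("result", [])
    for case in cases:
        account = accounts.get(case.get("account_id"))
        if account is not None:
            account["open_cases"] = account.get("open_cases", 0) + 1

    return list(accounts.values())   # one consolidated record per customer
```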

Why is ETL Important?

The number of business data sources and data types increases yearly, and the number of connections needed to these systems increases with it. The cost of administering and maintaining one-off point-to-point connections quickly becomes prohibitive. Having a tool that lets you make plug-and-play connections to any data source and manages data transformations and load operations is essential. Once you have an integration running, it is vital to be able to schedule data flows and monitor them with good visibility. Without a formal ETL process, an organization would spend all of its available time and budget keeping connections running rather than on new projects that increase revenues for the business.

What is the Difference Between ETL and ELT?

The notion of Extract, Load, Transform, or ELT arose more recently than the traditional ETL. The two main ways they differ are:

  • ETL loads transformed or clean data into the target data warehouse. ELT, on the other hand, transforms raw data in the data warehouse.
  • ETL uses an intermediate data store and server to transform data. ELT performs data transforms within the target data warehouse.

ETL can be seen as more cumbersome because it uses additional servers and data stores, which makes ELT appear to be a more streamlined data pipeline. Data loading and unloading can be slow, so performing the transformation on the same server that holds the data warehouse can be faster.

If the target data warehouse needs key personal data scrubbed, then doing the filtering and scrubbing in a staging system makes ETL attractive.
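The contrast can be sketched as follows, with hypothetical tables and psycopg2-style connections: the ETL variant scrubs sensitive data before loading, while the ELT variant loads raw rows and then transforms them inside the warehouse with SQL.

```python
def etl_load(staging_rows, target_conn):
    """ETL: scrub personal data in staging, then load only the clean values."""
    scrubbed = [(r["customer_id"], r["name"], r["state"]) for r in staging_rows]  # ssn dropped
    with target_conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO customers (customer_id, name, state) VALUES (%s, %s, %s)",
            scrubbed,
        )
    target_conn.commit()

def elt_load(raw_rows, warehouse_conn):
    """ELT: load raw rows first, then transform inside the warehouse."""
    with warehouse_conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO raw_customers (customer_id, name, state, ssn) "
            "VALUES (%(customer_id)s, %(name)s, %(state)s, %(ssn)s)",
            raw_rows,                                   # list of dicts, loaded as-is
        )
        cur.execute(
            "CREATE TABLE customers AS "
            "SELECT customer_id, name, state FROM raw_customers"   # sensitive column left behind
        )
    warehouse_conn.commit()
```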

How ETL is Being Used

The ETL process can be managed at different levels of sophistication:

  • The base level is simply building a point-to-point connection between each source and target system. This approach does not scale well, is cumbersome to maintain and is inflexible.
  • Centralizing the management of ETL is more scalable as it provides one place to look and uses best practices to build new integrations.
  • The advanced approach is to use a vended data integration solution that has the notion of an integration bus or hub that maintains metadata about data sources to comply and scale with the organization’s changing needs.

What is an ETL Pipeline?

An ETL pipeline comprises the chain of processes used to get data from raw sources into the target data warehouse. A pipeline can execute as a batch, in micro-batches, or as a real-time stream that applies changes as they occur in the source system.

ETL Challenges

ETL faces many challenges, with complexity being the primary one. Many ETL flows contain multiple steps that are hard to test and maintain because they support even more complex systems. A business can have hundreds of ETL flows that need to be monitored, administered, and maintained. IT systems are not static: the technology stacks supporting applications need ongoing software patches, and the applications themselves quickly become obsolete and need to be updated or migrated to maintain stability. Because ETL spans multiple systems, its complexity is compounded.

The volume and diversity of the data sources a business connects to keep growing. Sources can include structured and semi-structured data, real-time feeds, flat files, CSVs, S3 buckets, and streaming sources. Internet of Things (IoT) devices, website visit logs, social media sentiment, and video are creating new classes of content to mine for insights.

Data warehouses collect source data from production systems that need to be highly available, so those systems offer few maintenance windows for system testing.

ETL flows can cross multiple business silos, making it hard to understand the proper flow of business information across business units. Complex ETL processes can become unreliable or even lose data. Data can become stuck when a data format or value causes exception records to flow into error logs. All such logs need to be monitored because these exceptions can lead to unexpected results in the target data warehouse.

Connecting disparate systems that were designed as monoliths can lead to side effects in downstream data as changes are made to individual systems.

Actian ETL Tools

There is a wide variety of ETL tools. Some were conceived to focus solely on solving ETL challenges, while many tools, such as Actian DataConnect, address the broader data integration market.

Actian’s DataFlow is designed to operate as parallel streams, greatly improving the throughput of ETL operations such as data filters and transforms. The DataFlow expression builder helps you construct expressions for Derive Fields, Filter Rows, Data Quality Analyzer, Filter Existing Rows, and Join nodes. This dialog provides a list of available fields, operators, and functions you can use to build expressions.

In rare cases, databases are architected to support data portability. For example, the Actian Zen database file format is the same on Unix and Windows, making the extraction step a little easier, especially when dealing with hundreds of IoT log data sets.

Actian DataConnect is an example of a comprehensive ETL solution. Actian DataConnect can be used to design and manage integrations deployed in the cloud, on-prem, or hybrid environments. DataConnect supports a broad set of data types and scales to large data volumes. A desktop user interface lets users visually create or modify integration maps, artifacts, rules, and job schedules, without coding or scripting. The emphasis is on reuse and adaptability so businesses can cost-effectively integrate diverse data and applications.

Actian X helps businesses with ETL by avoiding custom-coded development efforts that meet only a single business requirement. As integration requirements grow, problems with custom code begin to mount: it cannot be reused, it is difficult to maintain due to poor or no documentation, and the original developer is often no longer available. DataConnect for Actian X enables an organized approach to incrementally adding data integration features to existing applications.

The Actian Data Platform is a data warehouse with built-in native integration capabilities. It has direct loaders to pull in bulk data from popular data sources such as S3, ADLS and Hadoop. The Actian Data Platform also includes over 200 connectors and templates for easily sourcing and moving data from SaaS applications through ODBC, JDBC, .NET and Python.

Best Practices

The following advice comes from the Actian DataConnect documentation on best practices for data integration which apply equally to the ETL subset:

Architect the Solution – When starting a new integration project, first identify the business problems you need to solve. In keeping with agile methodology, create a requirements document, but invest only as much time as the team needs to understand the objectives and the required outcomes. Make sure to update this requirements document as you progress.

Design for Reuse – There are several ways to create organized and easy-to-understand integrations. Doing so gives team members more clarity by providing familiar, recognizable patterns as the project size or the number of projects increases. During the design phase, set up collaborative workspaces, use naming conventions, map business processes to the data flows, create templates, and reuse connections across ETL data pipelines.

Creating templates is also beneficial as it provides a place to document ETL steps visually.

Common techniques to be considered include making scripts reusable through the use of parameter passing, using constants, using parallelism where possible and validating data. In addition, consider time zones and localization.
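One way to apply the parameter-passing advice is sketched below; the argument names, default constant, and the empty pipeline body are illustrative only.

```python
import argparse

# Reuse sketch: one ETL script can serve many pipelines when source, target, and
# time zone are passed as parameters instead of being hard-coded.

DEFAULT_BATCH_SIZE = 1_000   # shared constant rather than a magic number

def run_pipeline(source_uri: str, target_uri: str, timezone: str, batch_size: int) -> None:
    print(f"Extracting from {source_uri} -> {target_uri} "
          f"({batch_size} rows per batch, timestamps in {timezone})")
    # extract / transform / load steps would go here

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Parameter-driven ETL job")
    parser.add_argument("--source", required=True)
    parser.add_argument("--target", required=True)
    parser.add_argument("--timezone", default="UTC")
    parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
    args = parser.parse_args()
    run_pipeline(args.source, args.target, args.timezone, args.batch_size)
```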

It is important to test ETL processes using sample data during development and, even more important, to do a system test in parallel to the existing system until you are confident that you can switch to the new process.

During production, use a log manager to monitor for exceptions so that you can catch any unintended side effects of system changes to your ETL pipeline.

Visit the Actian website to learn more about our complete data products and solutions suite.