Data pipelines form a multistep process that moves and refines data from source systems, including data lakes, to a target data platform. The steps run in series, and each step can be parallelized internally to accelerate data movement through the pipeline. Data pipelines automate the movement, transformation, and cleansing of data from a data source on its journey to the destination data repository.
Why Use Data Pipelines?
Data pipelines provide end-to-end visibility and control over the flow of data. This provides opportunities to create reusable automation elements and enhances data provenance. Using standardized processes and tools also aids data governance efforts.
Difference Between ETL Pipelines and Data Pipelines
A data pipeline with a broad scope can contain extract, transform, and load (ETL) steps. An ETL process typically ends in a database or data warehouse, whereas a data pipeline can also end at an intermediate refinement stage, such as a data lake.
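To make the distinction concrete, below is a minimal ETL sketch in Python. The function names and the orders.csv source are hypothetical placeholders, not part of any specific product; the only point is that each stage's output feeds the next.

```python
import pandas as pd

def extract_orders(csv_path: str) -> pd.DataFrame:
    """Extract: read raw order records from a source file (assumed CSV)."""
    return pd.read_csv(csv_path)

def transform_orders(raw: pd.DataFrame) -> pd.DataFrame:
    """Transform: normalize column names and keep only completed orders."""
    cleaned = raw.rename(columns=str.lower)
    return cleaned[cleaned["status"] == "completed"]

def load_orders(df: pd.DataFrame, target_path: str) -> None:
    """Load: write the refined data to the target repository (a file here)."""
    df.to_parquet(target_path, index=False)

if __name__ == "__main__":
    # A simple linear pipeline: extract -> transform -> load.
    load_orders(transform_orders(extract_orders("orders.csv")), "orders.parquet")
```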
Benefits of Using Data Pipelines
Below are some of the benefits of using data pipelines:
- Supports a systematic approach that can be automated.
- Components of the data flow can be reused to lower ongoing development costs.
- Data sources can be traced to support data provenance.
- End-to-end visibility of a data flow helps to catalog data sources and consumers.
- Automated processes are consistent; manual and ad hoc workflows are more error-prone.
- Data pipelines can be nested for complex use cases.
- They improve data quality as processes mature.
- Decision confidence increases when using data sourced from robust pipelines.
Data Pipelines in the Actian Data Platform
The Actian Data Platform can build and schedule data pipelines and includes hundreds of prebuilt connectors to sources such as Marketo, Salesforce, and ServiceNow. The platform uses a vectorized columnar database that outperforms alternatives by 7.9x. Built-in data integration technology supports data pipelines and provides a graphical designer for laying out pipelines that connect, profile, transform, and load data. Pipeline steps can be scheduled and run in parallel.
Try the Actian Data Platform for 30 days using the free trial at: https://www.actian.com/free-trial/
Data Pipeline Steps
The data pipeline consists of multiple steps that commonly include the following functions:
Access
For effective analysis, organizations must gather data from multiple data sources across their business, including transactional systems, log files, sales and marketing systems, third-party datasets, and more. Once the business has decided what questions its analytics should answer, it needs to identify the source data repositories that contain the required data. Data integration technology provides the mechanisms to connect to these varied data sources.
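As a small illustration of connecting to varied sources, the sketch below pulls records over ODBC and from a REST API. The DSN, credentials, table name, and the api.example.com endpoint are placeholder assumptions, not a specific product's configuration.

```python
import pandas as pd
import pyodbc      # ODBC access to a transactional system
import requests    # REST access to a SaaS source

# Placeholder connection details -- replace with real credentials.
conn = pyodbc.connect("DSN=sales_dw;UID=user;PWD=secret")
transactions = pd.read_sql("SELECT * FROM transactions", conn)

# A hypothetical marketing API endpoint returning JSON records.
response = requests.get("https://api.example.com/v1/campaigns", timeout=30)
campaigns = pd.DataFrame(response.json())
```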
Data Profiling
Every data source requires profiling to determine the best utility and access method for extraction. Organizations access data through APIs such as ODBC, or through data exports from proprietary repositories into a comma-delimited format. Profiling tools help determine data volumes, cardinality, and the best data format for each field.
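A lightweight profiling pass can be sketched with pandas, as below. The customers.csv file and its columns are hypothetical inputs; the summary covers the volume, null counts, and cardinality mentioned above.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize inferred type, null counts, and cardinality per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.count(),
        "nulls": df.isna().sum(),
        "cardinality": df.nunique(),
    })

sample = pd.read_csv("customers.csv")
print(f"rows: {len(sample)}")   # data volume
print(profile(sample))          # per-column profile
```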
Data Ingestion
Data, which may still be spread across multiple files, must now be moved into a common repository. Structured and semi-structured files can have different formats, such as record-based, document-based, and XML. It is important to make document formats as uniform as possible to avoid making data pipeline processing overly complex.
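A minimal sketch of pulling record-based, document-based, and XML files into one uniform staging format is shown below; the orders.* source files and the staging path are illustrative assumptions.

```python
import pandas as pd

# Read three differently formatted source files into DataFrames.
csv_records = pd.read_csv("orders.csv")                # record-based
json_docs   = pd.read_json("orders.json", lines=True)  # document-based (JSON lines)
xml_records = pd.read_xml("orders.xml")                # XML

# Align column names so later pipeline steps see one uniform shape.
frames = [f.rename(columns=str.lower) for f in (csv_records, json_docs, xml_records)]
combined = pd.concat(frames, ignore_index=True)
combined.to_parquet("staging/orders_raw.parquet", index=False)
```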
Data Cleansing
This step addresses data gaps and inconsistencies. Data is first sorted by timestamp before processing. Data filtering removes records that are not relevant to the analysis. Gaps in the data can be filled using appropriate default values or calculated by extrapolating or interpolating adjacent data values. Out-of-range and null values can skew analysis results, so such outliers must be carefully addressed to ensure the integrity of the data analysis.
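The cleansing operations above can be sketched with pandas. The sensor readings file, the column names, and the -40 to 125 valid range are illustrative assumptions.

```python
import pandas as pd

readings = pd.read_parquet("staging/sensor_readings.parquet")

# Sort by timestamp first; interpolation assumes ordered data.
readings = readings.sort_values("timestamp")

# Filter out records that are not relevant to the analysis.
readings = readings[readings["sensor_type"] == "temperature"]

# Fill gaps by interpolating between adjacent values; use a default
# (here the median) for any remaining leading gaps.
readings["value"] = readings["value"].interpolate(method="linear")
readings["value"] = readings["value"].fillna(readings["value"].median())

# Drop out-of-range outliers so they cannot skew the analysis.
readings = readings[readings["value"].between(-40, 125)]
```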
Formatting
Any remaining data variances, such as date formats, need to be standardized for easy loading into the analytics platform's repository. A common formatting challenge is US state names, where one system spells them out while another uses abbreviations. Data integration tools usually provide functions to convert such formats, or data engineers can use simple Python routines available on GitHub.
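A simple Python routine for the state-name case might look like the sketch below. The lookup table is deliberately partial, and the customers file, column names, and date format choice are assumptions.

```python
import pandas as pd

# Partial lookup table; a full 50-state mapping would be used in practice.
STATE_ABBREVIATIONS = {
    "California": "CA",
    "New York": "NY",
    "Texas": "TX",
}

def standardize_state(value: str) -> str:
    """Map spelled-out state names to abbreviations; pass abbreviations through."""
    return STATE_ABBREVIATIONS.get(value, value)

customers = pd.read_parquet("staging/customers.parquet")
customers["state"] = customers["state"].map(standardize_state)

# Standardize date formats to ISO 8601 for consistent loading.
customers["signup_date"] = pd.to_datetime(customers["signup_date"]).dt.strftime("%Y-%m-%d")
```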
Combining
Where datasets are split among multiple files, they must be merged and duplicate values reconciled.
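A brief sketch of merging split files and reconciling duplicates follows; the regional files, the customer_id key, and the "keep the latest record" rule are illustrative assumptions.

```python
import pandas as pd

# Merge customer records split across regional files.
east = pd.read_parquet("staging/customers_east.parquet")
west = pd.read_parquet("staging/customers_west.parquet")
combined = pd.concat([east, west], ignore_index=True)

# Reconcile duplicates: keep the most recently updated record per customer key.
combined = (combined.sort_values("updated_at")
                    .drop_duplicates(subset="customer_id", keep="last"))
```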
Loading
The data can now be audited and loaded into the target data platform. Fast loaders can parallelize the loading process. The input data needs to conform to the logical data model, including any referential integrity constraints, and to the database schema design.
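As a sketch of an audit-then-load step, the example below checks key uniqueness before appending rows over a generic SQLAlchemy connection. The connection string and table name are placeholders, and a production pipeline would typically use the target platform's own bulk or parallel loader instead.

```python
import pandas as pd
from sqlalchemy import create_engine

final = pd.read_parquet("staging/customers_clean.parquet")

# Simple audit before loading: row count and key-uniqueness checks.
assert final["customer_id"].is_unique, "duplicate keys violate the schema constraint"
print(f"loading {len(final)} rows")

# Placeholder connection string; swap in the target platform's loader in practice.
engine = create_engine("postgresql://user:secret@analytics-db/warehouse")
final.to_sql("dim_customer", engine, if_exists="append", index=False, chunksize=10_000)
```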
Pipeline Parallelism
Parallelization happens within a pipeline step, because a step must complete before the next one can begin. Data moves down the pipeline one stage at a time until it reaches its destination. The slowest sub-task limits pipeline parallelism, since its result must be merged with the output of all the other sub-tasks before the step can finish.
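A minimal sketch of parallelism within a single step is shown below, assuming the data has already been partitioned into hypothetical staging files. The merge waits for the slowest sub-task, which is what limits the speedup.

```python
from concurrent.futures import ProcessPoolExecutor

import pandas as pd

def cleanse_partition(path: str) -> pd.DataFrame:
    """Cleanse one partition of the ingested data (one sub-task of the step)."""
    df = pd.read_parquet(path)
    return df.dropna(subset=["customer_id"])

if __name__ == "__main__":
    partitions = [f"staging/part-{i}.parquet" for i in range(4)]

    # Sub-tasks of one step run in parallel; only the merged result of all
    # sub-tasks moves on to the next pipeline stage.
    with ProcessPoolExecutor() as pool:
        cleaned = pd.concat(pool.map(cleanse_partition, partitions), ignore_index=True)
```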