Efficient ETL in an Analytical Database?
Actian Corporation
June 27, 2016
Recently I worked on a POC that required some non-standard thinking. The challenge was that the customer’s use case needed not only high-performance SQL analytics but also a healthy amount of ETL (Extract, Transform, and Load). More specifically, the requirement was for ELT (or even ETLT if we want to be absolutely precise).
Why “might” this have been an issue? Well, analytical databases and ETL-style processing typically don’t play well together; the latter tends to be row oriented, while the typical analytical database definitely prefers to deal with data in a “chunky” fashion. Analytical databases can usually load data in bulk at very high speed but tend to offer only modest row-by-row throughput.
Another typical characteristic is the use of table-level write locking – serialising write transactions to one at a time. This is generally accepted because the use cases for analytical databases tend to be about queries rather than any kind of transaction processing. However, when some form of ETL is required, this is perhaps even more problematic than the row-by-row throughput, because it requires the designer and the loading tool to be aware of this characteristic. The designer often has to “jump through hoops” to figure out how to get the data into the analytical database in a way that other team members can understand and that the tool can deliver.
I’m setting the scene here for the “big reveal” that the Actian Vector processing databases do not suffer from these drawbacks. They deliver high-end analytical capabilities while also offering “OLTP capabilities” in the manner of HTAP (Hybrid Transactional/Analytical Processing) technologies.
Note the quotes around “OLTP capabilities” – just to be clear, we at Actian wouldn’t position these as high-performance OLTP databases; we’re just saying that the capabilities (row-level locking and concurrent table modifications) are there even though the database is a columnar, in-memory, vector processing engine.
However they are viewed, it was these capabilities that allowed us to achieve the customer’s goals – albeit with a little cajoling. In the rest of this post, I’ll describe the steps we went through and the results we achieved. If you’re not currently a user of either Actian Vector or Actian Vector in Hadoop (VectorH), you might just skip to the end; if you are using the technology, read on.
Configuring for ETL
So, coming back to the use case, this customer’s requirement was to load large volumes of data from different sources in parallel into the same tables. We said above that we offer “OLTP capabilities”; however, the out-of-the-box configuration is more aligned to one bulk update per table at a time, so we needed to alter the configuration to deal with multiple concurrent bulk modifications.
At their core, Actian databases have a columnar architecture, and in all cases the underlying column store is modified in a single transaction. The concurrent update capability comes from some clever technology that buffers updates in memory in a seamless and ACID-compliant way. The default configuration assumes a small memory model and so routes large-scale changes directly to the column store, while smaller updates are routed to the in-memory buffer. The maintenance operations performed on the in-memory buffer – such as flushing changes to the column store – are triggered by resource thresholds set in the configuration.
It’s here, with the default configuration, that you can face a challenge: situations arise where large-scale updates sent directly to the column store can clash with the maintenance routine of the in-memory buffer. To make this work well we need to adjust the configuration to cater for the fact that there is – almost certainly – more memory available than the default configuration assumes. Perhaps the installer could set these values automatically, but with a large installed base it’s safer to keep the default behaviour consistent between versions.
So we needed to do two things: first, route all changes through the in-memory buffer, and second, configure the in-memory buffer to be large enough to cater for the amount of data we were going to load. We might also have done a third thing, which is to make the maintenance routines manual and bake the commands that trigger them into the ETL processes themselves, giving those processes complete control over what happens when.
Routing all changes through the in-memory buffer is done using the insertmode setting. Changing this means that bulk operations that would normally go straight to the column store now go through the in-memory buffer, allowing multiple bulk operations to run concurrently.
Sizing the in-memory buffer is simply a matter of adjusting the threshold values to match the amount of memory available or, as suggested above, putting the maintenance process entirely under the control of the ETL jobs (the sketch after the table below shows how these settings might look).
The table below describes the configuration options that affect the process:
| Option | Meaning |
| --- | --- |
| update_propagation | Whether automatic maintenance is enabled. |
| max_global_update_memory | The total amount of memory that can be used by the in-memory buffer. |
| max_update_memory_per_transaction | As above, but per transaction. |
| max_table_update_ratio | The percentage of a table held in the buffer above which the maintenance process is initiated. |
| min_propagate_table_count | The minimum row count a table must have to be considered by the maintenance process. |
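To make this concrete, here is a sketch of what the relevant entries might look like in the vectorwise.conf configuration file. The option names come from the table above, but the section headings and example values are assumptions on my part, so check the knowledge base article for the exact syntax and sensible values for your installation.

# Hypothetical vectorwise.conf excerpt; section placement and values are
# illustrative assumptions only
[engine]
# route all modifications, including bulk operations, through the in-memory buffer
insertmode=row
[memory]
# hand control of maintenance to the ETL processes rather than running it automatically
update_propagation=false
# example sizes only; match these to the memory actually available
max_global_update_memory=0.25
max_update_memory_per_transaction=0.125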
To trigger the maintenance process manually, execute:
modify <table> to combine
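For example, an ETL job that has automatic maintenance switched off might finish each batch of loads with something along these lines (fact_sales is just a placeholder table name):

-- flush the changes buffered in memory for this table to the column store
-- once the batch of concurrent loads has committed
modify fact_sales to combine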
If you want to see more technical details of how to implement this processing, a knowledge base article is available here:
Results
The initial load of the customer’s data – with the default configuration – took around 13 minutes. With some tuning of the memory parameters so that the maintenance routine was invoked less often, this came down to just over 9 minutes. Switching to all in-memory (still a single stream at this point) moved the needle to just under 9 minutes. This was an interesting aspect of the testing: routing everything through the in-memory buffer did not slow down the process; in fact it improved the time, albeit by a small margin.
Once the load was going via the in-memory buffer, it could be run in parallel streams. The final result was being able to load the data in just over a minute via eight parallel streams. This was a nice result given that the customer’s existing – OLTP-based – system took over 90 minutes to load the same data with ten parallel streams.
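For illustration only, the kind of parallelism involved looks conceptually like the sketch below, which uses the vwload bulk loader to drive eight concurrent streams into one table. The table, database, and file names are placeholders, vwload options vary by release, and this is not the actual load script used in the POC, so treat it purely as a sketch.

# illustrative only: eight concurrent load streams into the same table
for i in 1 2 3 4 5 6 7 8
do
  vwload --table fact_sales pocdb /staging/extract_part_${i}.csv &
done
wait  # each stream commits independently through the in-memory buffer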
Conclusion
Analytical databases typically face challenges when loading data via traditional ETL tools and methods – being characterised by low row-by-row processing speed and, most notably, table-level write locking.
Actian’s vector processing databases have innovative technology that allows them to avoid these problems and offer “OLTP capabilities”. While stopping short of targeting OLTP use cases, these capabilities allow Actian’s databases to run high-performance loads concurrently and thereby provide good performance for ETL workloads.