What is Parquet?
Apache Parquet is an open-source file format widely used for big data, particularly in Hadoop clusters. Parquet stores data in a compressed, columnar format, and it is well suited to efficiently storing nested data structures. Engines such as Apache Spark provide high-level APIs for working with Parquet, with support for Java, Scala, Python, and R.
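For a concrete picture, here is a minimal sketch in Python using the pyarrow library (the file name and column names are illustrative) that writes a small table, including a nested list column, and reads back a single column:

import pyarrow as pa
import pyarrow.parquet as pq

# A small table with a flat column and a nested (list) column.
table = pa.table({
    "sensor_id": [1, 2, 3],
    "readings": [[0.1, 0.2], [0.3], [0.4, 0.5, 0.6]],  # nested values
})

pq.write_table(table, "sensors.parquet")

# Because storage is columnar, a reader can fetch only the columns it needs.
subset = pq.read_table("sensors.parquet", columns=["sensor_id"])
print(subset)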
Storage
A Parquet file consists of row groups, each holding a subset of the rows. Within a row group, the values for each column are stored together as a column chunk, and each column chunk is divided into pages. Column chunks carry metadata that includes the number of values, their size, and statistics such as null counts and minimum/maximum values.
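As a sketch of how this layout is exposed (again using pyarrow and the illustrative file from above), you can walk the row groups and column chunks and print their statistics:

import pyarrow.parquet as pq

meta = pq.ParquetFile("sensors.parquet").metadata
print("row groups:", meta.num_row_groups)

for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)  # one column chunk
        stats = chunk.statistics       # may be None if not written
        print(chunk.path_in_schema,
              "values:", chunk.num_values,
              "compressed bytes:", chunk.total_compressed_size,
              "nulls:", stats.null_count if stats else "n/a",
              "min/max:", (stats.min, stats.max) if stats and stats.has_min_max else "n/a")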
Parquet Data Types
Parquet supports a small set of primitive data types; richer logical types are built on top of them. The primitive types include the following (a schema sketch follows the list):
- BOOLEAN: 1-bit boolean
- INT32: 32-bit signed ints
- INT64: 64-bit signed ints
- INT96: 96-bit signed ints (a legacy type, historically used for timestamps)
- FLOAT: IEEE 32-bit floating point values
- DOUBLE: IEEE 64-bit floating point values
- BYTE_ARRAY: arbitrarily long byte arrays
- FIXED_LEN_BYTE_ARRAY: fixed length byte arrays
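As a sketch of how these physical types surface in practice, the Arrow types below map onto them when writing with pyarrow (the field names are illustrative; INT96 is omitted because writers emit it only for backward-compatible timestamps):

import pyarrow as pa

schema = pa.schema([
    ("active", pa.bool_()),     # BOOLEAN
    ("count", pa.int32()),      # INT32
    ("total", pa.int64()),      # INT64
    ("ratio", pa.float32()),    # FLOAT
    ("measure", pa.float64()),  # DOUBLE
    ("payload", pa.binary()),   # BYTE_ARRAY
    ("digest", pa.binary(16)),  # FIXED_LEN_BYTE_ARRAY(16)
])
print(schema)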
Compression in Parquet
Parquet is designed for large datasets, so compression is an important feature of the format. Different compression codecs can be applied per column, as the sketch after this list shows. Commonly used codecs include:
- GZIP: is suited to long-term static storage such as data archival. GZIP provides greater compression than Snappy but consumes more CPU resources.
- Snappy: is a good choice for hot data that is accessed frequently. Snappy decompresses faster than GZIP and, unlike GZIP, is splittable.
- LZO: is a good choice if your application requires high decompression speeds.
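To illustrate per-column codec selection, here is a hedged sketch with pyarrow (the column names and data are illustrative; note that codec availability varies by implementation, and pyarrow, for example, does not write LZO):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "archive_note": ["alpha", "beta", "gamma"],  # cold data: favor ratio
    "hot_metric": [1.0, 2.0, 3.0],               # hot data: favor speed
})

# Choose a compression codec per column.
pq.write_table(
    table,
    "mixed.parquet",
    compression={"archive_note": "gzip", "hot_metric": "snappy"},
)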
Hadoop
Hadoop provides an open-source platform that scales horizontally to accommodate big data workloads, and Parquet was developed primarily for Hadoop environments. Many organizations built Hadoop clusters in the heyday of the big data movement. Unfortunately, Hadoop tuning skills have become scarce, so organizations are moving their Parquet data to more modern architectures. Hadoop ecosystem tools such as Pig, Hive, and Spark all read and write Parquet, with Spark offering a particularly convenient API, as the sketch below shows. Modern data warehousing systems such as the Actian Data Platform use a Spark connector to access Parquet data wherever it resides, on-premises or in the cloud.
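For instance, a minimal PySpark sketch (the path is a placeholder; the column name borrows from the examples below) that reads Parquet and runs a simple query:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

# Spark reads Parquet natively; the path is a placeholder.
df = spark.read.parquet("hdfs:///data/pemdata/*.parquet")
df.select("speed1").distinct().show()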
Actian Data Platform
Using a Spark connector, the Actian Data Platform can access Parquet as an external object. Predicates can be pushed down to Parquet to improve access speeds, and external Parquet data is easy to load because requests can be wrapped in SQL. Loading Parquet data into internal tables provides the fastest access, thanks to vector processing that loads columnar data into the CPU cache across a server or cluster to maximize parallel query processing. Benchmarks have shown the Actian Data Platform can scale an order of magnitude beyond Hive.
In the examples below, a Parquet external data source is mapped as an external table before loading its data into an internal table for faster application access.
Loading Parquet Data From Google Cloud Storage:
DROP TABLE IF EXISTS pemdata_gs;
CREATE EXTERNAL TABLE pemdata_gs (
timeperiod VARCHAR(20),
flow1 VARCHAR(20),
flow2 VARCHAR(20),
occupancy1 VARCHAR(20),
speed1 VARCHAR(20)
) USING SPARK
WITH
reference='gs://avpemdata/part*.parquet',
format='parquet';
DROP TABLE IF EXISTS pemdata;
CREATE TABLE pemdata (
timeperiod TIMESTAMP,
flow1 VARCHAR(20),
flow2 VARCHAR(20),
occupancy1 VARCHAR(20),
speed1 VARCHAR(20)
);
INSERT INTO pemdata SELECT * FROM pemdata_gs;
Loading Parquet Data From Azure Blob Storage:
DROP TABLE IF EXISTS pemdata_adl;
CREATE EXTERNAL TABLE pemdata_adl (
timeperiod VARCHAR(20),
flow1 VARCHAR(20),
flow2 VARCHAR(20),
occupancy1 VARCHAR(20),
speed1 VARCHAR(20)
) USING SPARK
WITH
reference='abfs://parquetdata@mydata.dfs.core.windows.net/part*.parquet',
format='parquet';
DROP TABLE IF EXISTS pemdata;
CREATE TABLE pemdata (
timeperiod TIMESTAMP,
flow1 VARCHAR(20),
flow2 VARCHAR(20),
occupancy1 VARCHAR(20),
speed1 VARCHAR(20)
);
INSERT INTO pemdata SELECT * FROM pemdata_adl;
Loading Parquet Data From AWS S3 Storage:
DROP TABLE IF EXISTS pemdata_s3;
CREATE EXTERNAL TABLE pemdata_s3 (
timeperiod VARCHAR(20),
flow1 VARCHAR(20),
flow2 VARCHAR(20),
occupancy1 VARCHAR(20),
speed1 VARCHAR(20)
) USING SPARK
WITH
reference='s3a://avpemdata/part*.parquet',
format='parquet';
DROP TABLE IF EXISTS pemdata;
CREATE TABLE pemdata (
timeperiod TIMESTAMP,
flow1 VARCHAR(20),
flow2 VARCHAR(20),
occupancy1 VARCHAR(20),
speed1 VARCHAR(20)
);
INSERT INTO pemdata SELECT * FROM pemdata_s3;