Resilient Distributed Dataset

The Apache Spark open-source project uses a data structure called a Resilient Distributed Dataset (RDD): an immutable collection of objects partitioned across the nodes of a cluster. RDDs can be processed in parallel across the cluster and are designed to be fault-tolerant, meaning they can automatically recover from node failures. RDDs are operated on through a low-level Application Programming Interface (API) that offers transformations and actions.
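
A quick illustration of that split between transformations and actions, as a minimal Scala sketch (the dataset and application name are arbitrary examples, and a local Spark installation is assumed):

```scala
import org.apache.spark.sql.SparkSession

object RddBasics {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; a cluster deployment would pass a real master URL.
    val spark = SparkSession.builder().appName("RddBasics").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Create an RDD from an in-memory collection, split into four partitions.
    val numbers = sc.parallelize(1 to 100, numSlices = 4)

    // Transformations (filter, map) are lazy: they only record lineage.
    val evenSquares = numbers.filter(_ % 2 == 0).map(n => n * n)

    // Actions (count, take) trigger the actual parallel computation.
    println(s"count = ${evenSquares.count()}")
    println(s"first five = ${evenSquares.take(5).mkString(", ")}")

    spark.stop()
  }
}
```

The lineage recorded by transformations is also what makes recovery possible: if a partition is lost along with a node, Spark recomputes just that partition from its lineage.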

Why is a Resilient Distributed Dataset Important?

Resilient Distributed Datasets are critical because they provide a robust framework for distributed data processing with built-in capabilities like fault tolerance, parallel processing, and flexibility in handling various types of data. They form the backbone of many enterprise applications and ML workloads that require reliable and efficient processing of large datasets.

Applications for Resilient Distributed Datasets

RDDs were introduced in the Spark system to support iterative algorithms and interactive data mining tools. Below are some examples of how RDDs are used in the real world*.

Video Streaming Analytics

A video streaming company used RDDs to provide usage analytics. Customer queries selected subsets of grouped data matching search criteria, which were then summarized with aggregations such as averages, percentiles, and COUNT DISTINCT. The filtered data was loaded once into an RDD so that each subsequent transformation could be applied to the whole RDD.
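
The paper does not reproduce the company's code, but the load-once, aggregate-many pattern it describes looks roughly like the following Scala sketch, using hypothetical (userId, videoId, watchSeconds) records:

```scala
import org.apache.spark.sql.SparkSession

object StreamingUsage {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StreamingUsage").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical viewing events: (userId, videoId, watchSeconds).
    val events = sc.parallelize(Seq(
      ("u1", "v1", 120L), ("u2", "v1", 300L),
      ("u1", "v2", 45L),  ("u3", "v1", 300L)
    ))

    // Filter to the subset matching the search criteria once, then cache it
    // so every subsequent aggregation reuses the same in-memory RDD.
    val matching = events.filter { case (_, videoId, _) => videoId == "v1" }.cache()

    // Several aggregations over the one cached subset.
    val avgWatch = matching.map(_._3.toDouble).mean()            // average
    val distinctViewers = matching.map(_._1).distinct().count()  // COUNT DISTINCT
    println(s"average = $avgWatch seconds, distinct viewers = $distinctViewers")

    spark.stop()
  }
}
```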

Congestion Prediction

A University of California, Berkeley study used an RDD to parallelize a learning algorithm for inferring road traffic congestion from sporadic automobile GPS measurements. Using a traffic model, the system estimates congestion by inferring the expected time to travel across individual road links.

Social Media Spam Classification

The Monarch project at Berkeley used Spark to identify link spam in Twitter messages. The team implemented a logistic regression classifier on top of Spark, using reduceByKey to sum the gradient vectors in parallel across the cluster.
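
Monarch's classifier itself is not included in the paper, but the gradient-summing pattern it describes can be sketched in Scala as follows, with a hypothetical three-feature dataset. Keying every per-record gradient by the same constant lets reduceByKey sum the vectors in parallel (with a single key this is equivalent to a plain reduce):

```scala
import org.apache.spark.sql.SparkSession

object SpamClassifier {
  // Hypothetical labeled point: label is +1.0 (spam) or -1.0 (not spam).
  case class Point(features: Array[Double], label: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SpamClassifier").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val dims = 3
    val data = sc.parallelize(Seq(
      Point(Array(1.0, 0.5, 0.2), 1.0),
      Point(Array(0.1, 0.9, 0.4), -1.0),
      Point(Array(0.8, 0.3, 0.7), 1.0)
    )).cache() // reused on every iteration, so keep it in memory

    var w = Array.fill(dims)(0.0)
    for (_ <- 1 to 10) {
      // Per-record logistic-loss gradients, keyed by a constant and
      // summed element-wise across the cluster with reduceByKey.
      val gradient = data
        .map { p =>
          val margin = p.features.zip(w).map { case (x, wi) => x * wi }.sum
          val scale = (1.0 / (1.0 + math.exp(-p.label * margin)) - 1.0) * p.label
          ("g", p.features.map(_ * scale))
        }
        .reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })
        .values
        .first()

      // Gradient-descent step with a fixed learning rate.
      w = w.zip(gradient).map { case (wi, gi) => wi - 0.1 * gi }
    }
    println(s"weights = ${w.mkString(", ")}")
    spark.stop()
  }
}
```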

RDD Built-in Functions

The following provides a sample of the functions available for data loaded into an RDD; a short sketch after the list shows several of them in use:

  • union(): Return the union of this RDD and another one.
  • aggregate(): Aggregate the elements of each partition, and then the results for all the partitions, using given combine functions.
  • persist(): Persist this RDD.
  • cartesian(): Return the Cartesian product of this RDD and another one.
  • collect(): Return an array that contains all of the elements in this RDD.
  • The RDD creation date and time.
  • count(): Return the number of elements in the RDD.
  • countByValue(): Return the count of each unique value in this RDD as a map of (value, count) pairs.
  • distinct(): Return a new RDD containing the distinct elements in this RDD.
  • filter(): Return a new RDD containing only the elements that satisfy a predicate.
  • first(): Return the first element in this RDD.
  • fold(): Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function.
  • groupBy(): Return an RDD of grouped items.
  • iterator(): Internal method to this RDD; it will read from the cache if applicable, or otherwise compute it.
  • map(): Return a new RDD by applying a function to all elements of this RDD.
  • mapPartitions(): Return a new RDD by applying a function to each partition of this RDD.
  • sample(): Return a sampled subset of this RDD.
  • saveAsTextFile(): Save this RDD as a text file, using string representations of elements.
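
A short Scala sketch showing several of these functions (count, distinct, countByValue, filter, first, map, and union) against a small hypothetical dataset:

```scala
import org.apache.spark.sql.SparkSession

object RddFunctions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddFunctions").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("spark", "rdd", "spark", "cluster", "rdd", "spark"))

    println(words.count())                      // number of elements: 6
    println(words.distinct().collect().toList)  // unique elements
    println(words.countByValue())               // Map(spark -> 3, rdd -> 2, cluster -> 1)
    println(words.filter(_.length > 3).first()) // first element passing the predicate

    val lengths = words.map(_.length)           // new RDD via a function on every element
    println(words.union(lengths.map(_.toString)).count()) // union of two RDDs: 12

    spark.stop()
  }
}
```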

Using RDDs with Actian Data Platform

The Actian Data Platform uses RDDs via its built-in Spark connector. The Vector database can access data in any of the 50+ formats the Spark connector supports. Data in formats such as Parquet and ORC is accessed through external tables, and predicates can be pushed down into those external tables to provide selective access.
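
The Actian connector's DDL is not shown here, but predicate pushdown itself can be illustrated generically with Spark's DataFrame API: a filter over a Parquet dataset (the path below is hypothetical) is pushed into the scan, so non-matching data is skipped rather than read and discarded.

```scala
import org.apache.spark.sql.SparkSession

object PushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PushdownSketch").master("local[*]").getOrCreate()

    // Hypothetical path; any Parquet dataset with a `region` column works.
    val sales = spark.read.parquet("/data/sales.parquet")

    // The comparison is pushed down into the Parquet scan.
    val east = sales.filter(sales("region") === "east")

    // The physical plan typically lists the pushed predicate, e.g.
    // PushedFilters: [IsNotNull(region), EqualTo(region,east)].
    east.explain()
    println(east.count())

    spark.stop()
  }
}
```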

*Source: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Electrical Engineering and Computer Sciences, University of California at Berkeley.