Repartitioning RDDs in Spark

Data skew often makes it necessary to repartition data. Repartitioning lets us distribute the data more evenly across the available partitions.

Some Spark operations default to a small number of partitions, which results in partitions that hold too many elements (and so take too much memory) and don't offer adequate parallelism. Repartitioning of RDDs can be accomplished with the partitionBy, coalesce, repartition, and repartitionAndSortWithinPartitions transformations.
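Before repartitioning, it helps to see how many elements each partition actually holds. The following is a minimal sketch, assuming a local Spark setup; the object name PartitionSizesSketch, the helper partitionSizes, and the sample data are hypothetical and only serve as illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PartitionSizesSketch {
  // Hypothetical helper: counts the elements in every partition of an RDD.
  def partitionSizes[T](rdd: RDD[T]): Array[Int] =
    rdd.mapPartitions(it => Iterator(it.size)).collect()

  // Returns the partition sizes before and after spreading the data out.
  def run(sc: SparkContext): (Array[Int], Array[Int]) = {
    // Only 2 partitions for 100 elements: few, large partitions.
    val few = sc.parallelize(1 to 100, 2)
    // Spread the same data over 4 partitions (this triggers a shuffle).
    val more = few.repartition(4)
    (partitionSizes(few), partitionSizes(more))
  }

  def main(args: Array[String]): Unit = {
    // Master URL and application name are illustrative assumptions.
    val conf = new SparkConf().setMaster("local[2]").setAppName("partition-sizes-sketch")
    val sc = new SparkContext(conf)
    try {
      val (before, after) = run(sc)
      println(before.mkString(", "))
      println(after.mkString(", "))
    } finally sc.stop()
  }
}
```

Very uneven numbers in the output are a sign of skew, and a short list of large numbers suggests there are too few partitions for the available cores.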

Repartitioning with partitionBy

partitionBy is available only on pair RDDs. It accepts only one parameter: the desired Partitioner object. If the partitioner is the same as the one used before, partitioning is preserved and the RDD remains the same. Otherwise, a shuffle is scheduled and a new RDD is created.
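The behavior described above can be sketched as follows. This is an illustrative example rather than a recommended setup: the object name PartitionBySketch, the sample pairs, and the choice of HashPartitioner with 4 partitions are assumptions made for the demonstration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionBySketch {
  // Returns (partition count after partitionBy,
  //          whether a second call with an equal partitioner returned the same RDD).
  def run(sc: SparkContext): (Int, Boolean) = {
    // A small pair RDD with made-up data; it has no partitioner yet.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("a", 4)), 2)

    // The partitioner differs from the current one (none), so a shuffle is scheduled.
    val byHash = pairs.partitionBy(new HashPartitioner(4))

    // An equal partitioner: partitioning is preserved and no new RDD is created.
    val again = byHash.partitionBy(new HashPartitioner(4))

    (byHash.getNumPartitions, again eq byHash)
  }

  def main(args: Array[String]): Unit = {
    // Master URL and application name are illustrative assumptions.
    val conf = new SparkConf().setMaster("local[2]").setAppName("partitionBy-sketch")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

Two HashPartitioner instances with the same number of partitions compare as equal, which is why the second call can skip the shuffle.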

Repartitioning with coalesce and repartition

coalesce is used to either reduce or increase the number of partitions. The basic method signature is coalesce(numPartitions: Int, shuffle: Boolean = false). The optional second parameter specifies whether a shuffle should be performed (false by default). To increase the number of partitions, you must set the shuffle parameter to true; without a shuffle, a request for more partitions leaves the partition count unchanged. The repartitioning algorithm balances new partitions so that each is based on roughly the same number of parent partitions, matching the preferred locality (machines) as much as possible while also trying to balance partitions across the machines. The repartition transformation is just a coalesce with shuffle set to true.
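A short sketch of how the shuffle parameter affects the resulting partition count. The object name CoalesceSketch and the partition counts (8, 2, 16) are arbitrary choices for illustration, and the example assumes a local Spark setup.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceSketch {
  // Returns the partition counts for: shrinking without a shuffle,
  // growing without a shuffle, growing with a shuffle, and repartition.
  def run(sc: SparkContext): (Int, Int, Int, Int) = {
    val rdd = sc.parallelize(1 to 1000, 8)

    // Reducing the partition count works without a shuffle.
    val fewer = rdd.coalesce(2)

    // Asking for more partitions without a shuffle keeps the current count.
    val notGrown = rdd.coalesce(16)

    // With shuffle = true the partition count can grow.
    val grown = rdd.coalesce(16, shuffle = true)

    // repartition is coalesce with shuffle set to true.
    val viaRepartition = rdd.repartition(16)

    (fewer.getNumPartitions, notGrown.getNumPartitions,
      grown.getNumPartitions, viaRepartition.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    // Master URL and application name are illustrative assumptions.
    val conf = new SparkConf().setMaster("local[2]").setAppName("coalesce-sketch")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

As a rule of thumb, coalesce without a shuffle is the cheaper option when you only need fewer partitions, while repartition is the one to reach for when the data must be spread out or rebalanced.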

Repartitioning with repartitionAndSortWithinPartitions

The final transformation for repartitioning RDDs is repartitionAndSortWithinPartitions. It's available only on sortable RDDs (pair RDDs with sortable keys), which are covered later, but we mention it here for the sake of completeness.

It also accepts a Partitioner object and, as its name suggests, sorts the elements by key within each partition. This offers better performance than calling repartition and then sorting manually, because part of the sorting can be done during the shuffle. A shuffle is always performed when using repartitionAndSortWithinPartitions.
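A minimal sketch of this transformation on a pair RDD with integer keys. The object name RepartitionAndSortSketch, the sample data, and the HashPartitioner with 2 partitions are assumptions chosen for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object RepartitionAndSortSketch {
  // Returns the keys held by each partition, in the order Spark stored them.
  def run(sc: SparkContext): Array[Array[Int]] = {
    // Pair RDD with sortable (Int) keys, deliberately out of order.
    val pairs = sc.parallelize(
      Seq((5, "e"), (1, "a"), (4, "d"), (2, "b"), (3, "c"), (6, "f")), 3)

    // Repartition by key hash and sort by key inside each new partition.
    val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))

    // glom turns each partition into an array so its contents can be inspected.
    sorted.glom().map(_.map(_._1)).collect()
  }

  def main(args: Array[String]): Unit = {
    // Master URL and application name are illustrative assumptions.
    val conf = new SparkConf().setMaster("local[2]").setAppName("repartition-sort-sketch")
    val sc = new SparkContext(conf)
    try run(sc).foreach(keys => println(keys.mkString(", "))) finally sc.stop()
  }
}
```

The keys come out in ascending order within each partition, but there is no ordering guarantee across partitions; a total order over the whole RDD would need sortByKey instead.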
