Spark - RangePartitioner rangeBounds 生成源码分析 & 实践

2023-09-28 14:46:04

引言

Spark 的 RangePartitioner 是一个常用的分区器，它根据 key 的范围对数据进行分区。RangePartitioner 的构造函数需要两个参数：numPartitions 和 rangeBounds。numPartitions 指定了分区的数量，而 rangeBounds 是一个数组，它指定了每个分区的范围。

rangeBounds 的生成

RangePartitioner 的 rangeBounds 是通过采样数据来生成的。采样数据的过程如下：

从数据集中随机选择一个子集。
对子集中的数据进行排序。
将排序后的数据分成 numPartitions 个相等大小的子集。
每个子集的最小值和最大值就是 rangeBounds 的一个元素。

源码分析

RangePartitioner 的源码位于 spark-core/src/main/scala/org/apache/spark/rdd/Partitioner.scala。rangeBounds 的生成代码位于 computeBoundaries() 方法中。computeBoundaries() 方法首先从数据集中随机选择一个子集，然后对子集中的数据进行排序。接下来，computeBoundaries() 方法将排序后的数据分成 numPartitions 个相等大小的子集。最后，computeBoundaries() 方法计算每个子集的最小值和最大值，并将其作为 rangeBounds 的一个元素。