Spark Checkpoint

Hanzhong

2023-09-25

In SparkContext, every action goes through runJob, which calls rdd.doCheckpoint() on the target RDD once the job has completed.

 /**
  * Run a function on a given set of partitions in an RDD and pass the results to the given
  * handler function. This is the main entry point for all actions in Spark.
  *
  * @param rdd target RDD to run tasks on
  * @param func a function to run on each partition of the RDD
  * @param partitions set of partitions to run on; some jobs may not want to compute on all
  * partitions of the target RDD, e.g. for operations like `first()`
  * @param resultHandler callback to pass each result to
  */
 def runJob[T, U: ClassTag](
     rdd: RDD[T],
     func: (TaskContext, Iterator[T]) => U,
     partitions: Seq[Int],
     resultHandler: (Int, U) => Unit): Unit = {
   // ... the job itself runs here (body elided) ...
   // job completed
   rdd.doCheckpoint()
 }
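
Seen from user code, this means that rdd.checkpoint() only marks the RDD, and the data is written when the first action on it finishes. The following is a minimal sketch of that behaviour, assuming Spark is on the classpath and runs in local mode. The object name CheckpointDemo, the app name, and the temporary checkpoint directory are illustrative choices, not part of the Spark source quoted above.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointDemo {
  // Returns (isCheckpointed before the action, isCheckpointed after it, element count).
  def run(): (Boolean, Boolean, Long) = {
    val conf = new SparkConf().setAppName("checkpoint-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Illustrative temp directory; on a cluster this would normally be a
      // reliable store such as HDFS.
      sc.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)

      val squares = sc.parallelize(1 to 10).map(x => x * x)
      squares.checkpoint()                // only marks the RDD, nothing is written yet
      val before = squares.isCheckpointed // false: no job has run

      val n = squares.count()             // action -> runJob -> rdd.doCheckpoint()
      val after = squares.isCheckpointed  // true: written after the job completed

      (before, after, n)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Because the checkpoint is written by a separate pass after the job, the RDD is computed a second time unless it was cached first.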

The checkpoint logic itself lives in RDD.doCheckpoint():

 /**
  * Performs the checkpointing of this RDD by saving this. It is called after a job using this RDD
  * has completed (therefore the RDD has been materialized and potentially stored in memory).
  * doCheckpoint() is called recursively on the parent RDDs.
  */
 private[spark] def doCheckpoint(): Unit = {
   RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
     if (!doCheckpointCalled) {
       doCheckpointCalled = true
       if (checkpointData.isDefined) {
         if (checkpointAllMarkedAncestors) {
           // TODO We can collect all the RDDs that needs to be checkpointed, and then checkpoint
           // them in parallel.
           // Checkpoint parents first because our lineage will be truncated after we
           // checkpoint ourselves
           dependencies.foreach(_.rdd.doCheckpoint())
         }
         checkpointData.get.checkpoint()
       } else {
         dependencies.foreach(_.rdd.doCheckpoint())
       }
     }
   }
 }
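
To make the traversal order easier to follow, here is a toy model of the same recursion in plain Scala, with no Spark dependency. It is a sketch, not Spark code: Node, marked, saved, and CheckpointWalk are hypothetical names, marked stands in for checkpointData.isDefined, appending to saved stands in for checkpointData.get.checkpoint(), and the RDDOperationScope wrapper is left out.

```scala
import scala.collection.mutable.ListBuffer

// Toy stand-in for an RDD: a name, its parents, and whether checkpoint() was requested.
final class Node(val name: String, val parents: Seq[Node], val marked: Boolean) {
  private var doCheckpointCalled = false

  def doCheckpoint(allMarkedAncestors: Boolean, saved: ListBuffer[String]): Unit = {
    if (!doCheckpointCalled) {
      doCheckpointCalled = true
      if (marked) {
        // Parents go first, because saving this node truncates its lineage.
        if (allMarkedAncestors) {
          parents.foreach(_.doCheckpoint(allMarkedAncestors, saved))
        }
        saved += name
      } else {
        // Unmarked node: keep walking up the lineage.
        parents.foreach(_.doCheckpoint(allMarkedAncestors, saved))
      }
    }
  }
}

object CheckpointWalk {
  // Lineage a -> b -> c, where a and c are marked and the job runs on c.
  def run(allMarkedAncestors: Boolean): List[String] = {
    val a = new Node("a", Nil, marked = true)
    val b = new Node("b", Seq(a), marked = false)
    val c = new Node("c", Seq(b), marked = true)
    val saved = ListBuffer.empty[String]
    c.doCheckpoint(allMarkedAncestors, saved)
    saved.toList
  }
}
```

In this model, CheckpointWalk.run(allMarkedAncestors = false) returns List("c"): the walk stops at the first marked node, so the marked ancestor a is not saved by this job. With allMarkedAncestors = true it returns List("a", "c"), with the ancestor saved before the node whose lineage is about to be cut.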