Friday, February 9, 2018

Lineage of RDD


RDD tracking 

Every RDD keeps track of:

  1. Where it came from (its source data or parent RDD)
  2. All the transformations it took to reach its current state

These steps are called the lineage (or DAG) of an RDD.
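
To make the idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: the class `ToyRDD`, its methods, and the file name `input.txt` are all made up for illustration. Each object stores only its parent and the step that produced it, which is enough to walk back to the source.

```python
class ToyRDD:
    """Toy stand-in for an RDD (not Spark's API): stores its parent and one step."""

    def __init__(self, parent=None, step="source", source=None):
        self.parent = parent   # where this dataset came from (None for the root)
        self.step = step       # name of the transformation that produced it
        self.source = source   # description of the input, set only on the root

    def map(self, label):
        # Record the transformation; nothing is computed here.
        return ToyRDD(parent=self, step=f"map({label})")

    def filter(self, label):
        return ToyRDD(parent=self, step=f"filter({label})")

    def lineage(self):
        """Walk back to the source and return the steps in execution order."""
        steps, node = [], self
        while node is not None:
            steps.append(node.step if node.parent else f"source({node.source})")
            node = node.parent
        return list(reversed(steps))


# "input.txt" is a hypothetical file name used only as a label.
words = ToyRDD(source="input.txt")
result = words.map("to_upper").filter("non_empty")
print(result.lineage())
# ['source(input.txt)', 'map(to_upper)', 'filter(non_empty)']
```

In real Spark, an RDD's recorded lineage can be inspected with its `toDebugString` method.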



Data Visualization

  • In Spark, a job is associated with a chain of RDD dependencies organized in a directed acyclic graph (DAG).
  • The DAG is a dependency graph in which every RDD knows its parent RDD(s) and the transformation that produced it.


Note: Transformations are only recorded in memory; none of them is
applied until we access the results.
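
The note above can be illustrated with a small plain-Python sketch. Again this is a toy model under stated assumptions, not Spark itself: `LazyDataset`, `double`, and the `calls` list are hypothetical names. The `calls` list shows that the function passed to `map` does not run until `collect()` asks for the results.

```python
class LazyDataset:
    """Toy model of lazy evaluation (not Spark's API)."""

    def __init__(self, data, steps=()):
        self.data = data     # the source values
        self.steps = steps   # recorded transformations, not yet applied

    def map(self, fn):
        # Only remember the function; do not call it.
        return LazyDataset(self.data, self.steps + (fn,))

    def collect(self):
        # Accessing the results is what triggers the recorded steps.
        out = list(self.data)
        for fn in self.steps:
            out = [fn(x) for x in out]
        return out


calls = []  # every invocation of double() is logged here


def double(x):
    calls.append(x)
    return x * 2


doubled = LazyDataset([1, 2, 3]).map(double)
calls_before = len(calls)    # 0: the transformation was only recorded
values = doubled.collect()   # [2, 4, 6]
calls_after = len(calls)     # 3: the work happened during collect()
print(calls_before, values, calls_after)
```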


Advantage of Lineage

  • Allows RDDs to be reconstructed when nodes crash.
  • Reconstruction starts from the source file and reapplies all the stored transformations to recreate the RDD.
  • Allows RDDs to be lazily instantiated (materialized) only when the results are accessed.
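
The recovery idea in the list above can be sketched in plain Python. This is a simplified illustration, not how Spark is implemented: `RecoverableDataset`, `simulate_node_crash`, and `load_lines` are hypothetical names, and the "crash" is modeled simply by discarding the materialized result.

```python
class RecoverableDataset:
    """Toy model of lineage-based recovery (not Spark's API)."""

    def __init__(self, load_source, steps=()):
        self.load_source = load_source   # function that re-reads the source data
        self.steps = steps               # stored transformations, in order
        self.cache = None                # materialized result, if any

    def map(self, fn):
        return RecoverableDataset(self.load_source, self.steps + (fn,))

    def collect(self):
        if self.cache is None:           # first access, or the result was lost
            data = self.load_source()    # start again from the source
            for fn in self.steps:        # reapply every stored transformation
                data = [fn(x) for x in data]
            self.cache = data
        return self.cache

    def simulate_node_crash(self):
        # Pretend the node holding the materialized data went away.
        self.cache = None


def load_lines():
    # Hypothetical source; stands in for reading the source file.
    return ["spark", "rdd", "lineage"]


lengths = RecoverableDataset(load_lines).map(str.upper).map(len)
first = lengths.collect()        # materialized on first access
lengths.simulate_node_crash()    # the materialized result is gone
rebuilt = lengths.collect()      # recreated from source + stored steps
print(first, rebuilt)
```

Because only the source and the list of steps are kept, the lost result can be rebuilt without having stored a second copy of the data.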



