Friday, February 9, 2018

Lineage of RDD


RDD tracking 

Every RDD keeps track of:

  1. Where it came from (its parent RDD or source data)
  2. All the transformations it took to reach its current state

These steps are called the lineage (or DAG) of an RDD.
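
To make the idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: the class name ToyRDD, its methods, and the file name data.txt are all made up for illustration. Each object only remembers its parent and the step that produced it, which is enough to read the lineage back.

```python
# Toy model of lineage tracking. ToyRDD is a hypothetical class, not Spark's API.
class ToyRDD:
    def __init__(self, name, parent=None, source=None):
        self.name = name      # description of the step that produced this RDD
        self.parent = parent  # the RDD this one was derived from (None for a source)
        self.source = source  # where the data came from (only set on the root)

    def map(self, label):
        # Record the transformation instead of running it.
        return ToyRDD(f"map({label})", parent=self)

    def filter(self, label):
        # Same idea: only the description and the parent link are stored.
        return ToyRDD(f"filter({label})", parent=self)

    def lineage(self):
        # Walk the parent links back to the source, then report oldest step first.
        steps, node = [], self
        while node is not None:
            steps.append(node.name)
            node = node.parent
        return list(reversed(steps))


lines = ToyRDD("textFile(data.txt)", source="data.txt")
lengths = lines.map("len").filter("> 3")
print(lengths.lineage())
# ['textFile(data.txt)', 'map(len)', 'filter(> 3)']
```

In real Spark, the comparable information can be inspected with the RDD's toDebugString() method.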



Data Visualization

  • In Spark, a job is associated with a chain of RDD dependencies organized in a directed acyclic graph (DAG)
  • The DAG is a dependency graph in which every RDD knows its parent RDD and the transformation that produced it


Note: Transformations are only recorded in memory. None of them is
applied until we access the results (that is, until an action is called).
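
The note above can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's implementation: LazyRDD, double, and the calls list are hypothetical names used only to show that nothing runs until an action asks for the results.

```python
# Toy model of lazy evaluation. LazyRDD is a hypothetical class, not Spark's API.
class LazyRDD:
    def __init__(self, compute):
        self._compute = compute  # zero-argument function that produces the data

    def map(self, fn):
        # Build a new RDD whose compute step wraps the parent's; nothing runs yet.
        return LazyRDD(lambda: [fn(x) for x in self._compute()])

    def collect(self):
        # An "action": only now is the chain of transformations executed.
        return self._compute()


calls = []  # records every element the mapped function actually processes


def double(x):
    calls.append(x)
    return x * 2


numbers = LazyRDD(lambda: [1, 2, 3])
doubled = numbers.map(double)
print(calls)              # [] because no transformation has run yet
print(doubled.collect())  # [2, 4, 6]
print(calls)              # [1, 2, 3] now that an action forced evaluation
```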


Advantage of Lineage

  • Allows RDDs to be reconstructed when nodes crash.
  • To rebuild a lost RDD, we start from the source file, apply all the stored transformations again, and recreate the RDD.
  • Allows RDDs to be lazily instantiated (materialized) when the results are accessed.
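
The recovery idea can be sketched in plain Python as well. This is a toy model, not how Spark is implemented: RecoverableRDD and its lose() method are hypothetical, and an in-memory list stands in for the source file. Losing the materialized data simulates a node crash, and the next access replays the stored transformations from the source.

```python
# Toy model of lineage-based recovery. RecoverableRDD is hypothetical, not Spark's API.
class RecoverableRDD:
    def __init__(self, parent=None, fn=None, source=None):
        self.parent = parent  # parent RDD, or None for the source
        self.fn = fn          # stored transformation applied to the parent's data
        self.source = source  # raw input, standing in for the source file
        self.cached = None    # materialized data held by a (simulated) node

    def map(self, fn):
        return RecoverableRDD(parent=self, fn=lambda data: [fn(x) for x in data])

    def filter(self, pred):
        return RecoverableRDD(parent=self, fn=lambda data: [x for x in data if pred(x)])

    def collect(self):
        # Materialize lazily: reuse cached data if present, otherwise replay lineage.
        if self.cached is None:
            if self.parent is None:
                self.cached = list(self.source)
            else:
                self.cached = self.fn(self.parent.collect())
        return self.cached

    def lose(self):
        # Simulate a node crash that wipes this RDD's materialized data.
        self.cached = None


source = RecoverableRDD(source=["a", "bb", "ccc", "dddd"])
result = source.map(len).filter(lambda n: n > 1)

before = result.collect()  # [2, 3, 4]
result.lose()              # the final RDD's data is lost...
result.parent.lose()       # ...and so is the intermediate RDD's data
after = result.collect()   # rebuilt by replaying the lineage from the source
print(before == after)     # True
```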



