Friday, February 9, 2018

Lineage of RDD

RDD tracking 

Every RDD keeps track of:

  1. Where it came from
  2. All the transformations it took to reach its current state

These steps are called the lineage (or DAG) of an RDD.
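The bookkeeping described above can be sketched in plain Python. This is only a toy model: `ToyRDD`, its `lineage` method, and the file name `logs.txt` are hypothetical names made up for illustration, not Spark's API. Each object remembers its parent and the transformation that produced it, so the full chain can be read back by walking the parent links.

```python
# Toy model only: ToyRDD is a hypothetical class, not part of Spark's API.
class ToyRDD:
    def __init__(self, source=None, parent=None, op=None):
        self.source = source   # where the data came from (set on the root only)
        self.parent = parent   # the ToyRDD this one was derived from
        self.op = op           # name of the transformation that produced it

    def map(self, fn):
        # Record the step; no data is touched in this sketch.
        return ToyRDD(parent=self, op=f"map({fn.__name__})")

    def filter(self, fn):
        return ToyRDD(parent=self, op=f"filter({fn.__name__})")

    def lineage(self):
        """Walk the parent links back to the source and list the steps in order."""
        steps, node = [], self
        while node.parent is not None:
            steps.append(node.op)
            node = node.parent
        steps.append(f"source({node.source})")
        return list(reversed(steps))


def to_upper(line):
    return line.upper()


def is_error(line):
    return "ERROR" in line


errors = ToyRDD(source="logs.txt").map(to_upper).filter(is_error)
print(errors.lineage())
# ['source(logs.txt)', 'map(to_upper)', 'filter(is_error)']
```

In Spark itself, the recorded lineage of an RDD can be printed with its `toDebugString` method.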

Data Visualization

  • In Spark, a job is associated with a chain of RDD dependencies organized in a directed acyclic graph (DAG)
  • In this dependency graph, every RDD knows its parent RDD and the transformation that produced it

Note: Transformations are lazy. Spark only records them in memory, and none
of them is applied until we access the results through an action.
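A minimal sketch of this lazy behaviour, again in plain Python rather than Spark: `LazyRDD` is a hypothetical class whose `map` only records the function, while `collect` plays the role of the action that finally runs it. The `calls` list is there purely to observe when the function actually executes.

```python
# Toy model only: LazyRDD is a hypothetical class, not Spark's API.
class LazyRDD:
    def __init__(self, data, steps=()):
        self._data = data
        self._steps = steps          # recorded transformations, not yet run

    def map(self, fn):
        # Record the function and return a new object; do no work yet.
        return LazyRDD(self._data, self._steps + (fn,))

    def collect(self):
        # The "action": only now are the recorded functions applied.
        result = []
        for item in self._data:
            for fn in self._steps:
                item = fn(item)
            result.append(item)
        return result


calls = []


def double(x):
    calls.append(x)      # lets us observe when the function actually runs
    return x * 2


doubled = LazyRDD([1, 2, 3]).map(double)
calls_before_action = len(calls)   # nothing has run yet
result = doubled.collect()
calls_after_action = len(calls)    # one call per element

print(calls_before_action, result, calls_after_action)
# 0 [2, 4, 6] 3
```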

Advantage of Lineage

  • Allows RDDs to be reconstructed when nodes crash
  • To rebuild a lost RDD, Spark starts from the source file and re-applies all the stored transformations
  • Allows RDDs to be lazily instantiated (materialized) only when the results are accessed
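The recovery idea can be illustrated with a small plain-Python sketch. Everything here is an assumption made for illustration: `RecoverableRDD`, `simulate_node_crash`, and `load_numbers` are hypothetical names, and the "crash" is simply a dropped in-memory result. The point is only that the source plus the stored transformations are enough to rebuild the data.

```python
# Toy model only: RecoverableRDD is a hypothetical class, not Spark's API.
class RecoverableRDD:
    def __init__(self, load_source, steps=()):
        self._load_source = load_source   # callable that re-reads the source
        self._steps = steps               # stored (kind, fn) transformations
        self._materialized = None         # computed result; may be lost

    def map(self, fn):
        return RecoverableRDD(self._load_source, self._steps + (("map", fn),))

    def filter(self, fn):
        return RecoverableRDD(self._load_source, self._steps + (("filter", fn),))

    def collect(self):
        # Materialize lazily: replay the lineage only if no result is held.
        if self._materialized is None:
            data = self._load_source()
            for kind, fn in self._steps:
                if kind == "map":
                    data = [fn(x) for x in data]
                else:
                    data = [x for x in data if fn(x)]
            self._materialized = data
        return list(self._materialized)

    def simulate_node_crash(self):
        # Pretend the node holding the computed result went away.
        self._materialized = None


def load_numbers():
    # Stands in for reading the source file.
    return [1, 2, 3, 4, 5, 6]


even_squares = (
    RecoverableRDD(load_numbers)
    .filter(lambda x: x % 2 == 0)
    .map(lambda x: x * x)
)

first = even_squares.collect()
even_squares.simulate_node_crash()
rebuilt = even_squares.collect()   # recomputed from source + stored steps

print(first, rebuilt)
# [4, 16, 36] [4, 16, 36]
```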
