Web Snippets: Lineage of RDD

Friday, February 9, 2018

Lineage of RDD

RDD tracking

Every RDD keeps track of:

where it came from (its parent RDD or the source data)
all the transformations it took to reach its current state

These steps are called the lineage (or DAG) of an RDD.

Data Visualization

In Spark, a job is associated with a chain of RDD dependencies organized in a directed acyclic graph (DAG).
This is a dependency graph in which every RDD knows its parent RDD and the transformation that produced it.

Note: Transformations are only recorded in memory as part of the lineage. None of them
are applied until we access the results, for example through an action such as collect or count.
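
To make the parent links and the lazy behaviour concrete, here is a minimal pure-Python sketch. It is not Spark's API: the ToyRDD class, its from_list, map, filter, lineage and collect methods, and the double helper are all hypothetical names invented for illustration. The sketch assumes a single linear chain of parents, whereas a real DAG can have several parents per RDD.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: it stores its parent and the step that produced it."""

    def __init__(self, parent=None, op="source", compute=None):
        self.parent = parent      # the ToyRDD this one was derived from (None for the source)
        self.op = op              # label of the transformation that produced this ToyRDD
        self._compute = compute   # function from the parent's rows to this ToyRDD's rows

    @classmethod
    def from_list(cls, rows):
        # The source has no parent, so its compute function ignores its argument.
        return cls(op="source", compute=lambda _: list(rows))

    def map(self, f):
        # Transformation: only records the step, nothing runs yet.
        return ToyRDD(self, "map", lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        # Transformation: only records the step, nothing runs yet.
        return ToyRDD(self, "filter", lambda rows: [r for r in rows if pred(r)])

    def lineage(self):
        """Walk the parent links back to the source and return the steps in order."""
        steps, node = [], self
        while node is not None:
            steps.append(node.op)
            node = node.parent
        return steps[::-1]

    def collect(self):
        """Action: only here are the recorded transformations actually applied."""
        parent_rows = self.parent.collect() if self.parent is not None else None
        return self._compute(parent_rows)


calls = []  # records every time the map function really executes


def double(x):
    calls.append(x)
    return x * 2


rdd = ToyRDD.from_list([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)

print(rdd.lineage())   # ['source', 'map', 'filter']
print(len(calls))      # 0, because no transformation has run yet
print(rdd.collect())   # [6, 8]
print(len(calls))      # 4, because the map ran only when collect() was called
```

In real Spark, the recorded lineage of an RDD can be inspected with its toDebugString method.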

Advantage of Lineage

Allows RDDs to be reconstructed when nodes crash.
We start from the source file, apply all the stored transformations, and recreate the RDD.
Allows RDDs to be lazily instantiated (materialized) only when the results are accessed.
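
The recovery idea above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not how Spark is implemented: the replay function, the LineageBackedData class, and its rows and lose methods are hypothetical names, the "source file" is an in-memory list, and a node crash is simulated by discarding the materialized rows.

```python
def replay(source, steps):
    """Re-apply the recorded steps, in order, starting from the source data."""
    rows = list(source)
    for kind, fn in steps:
        if kind == "map":
            rows = [fn(r) for r in rows]
        elif kind == "filter":
            rows = [r for r in rows if fn(r)]
        else:
            raise ValueError(f"unknown step: {kind}")
    return rows


class LineageBackedData:
    """Hypothetical dataset that keeps its source and steps, not just its rows."""

    def __init__(self, source, steps):
        self.source = source
        self.steps = steps
        self._rows = None          # not materialized until someone asks for results

    def rows(self):
        if self._rows is None:     # first access, or the rows were lost
            self._rows = replay(self.source, self.steps)
        return self._rows

    def lose(self):
        """Simulate a node crash: the materialized rows are gone, the lineage is not."""
        self._rows = None


source_file = ["spark", "rdd", "lineage", "dag"]   # stands in for the source file
steps = [("map", str.upper), ("filter", lambda w: len(w) > 3)]

data = LineageBackedData(source_file, steps)
before = data.rows()       # materialized lazily on first access
data.lose()                # simulated crash
after = data.rows()        # rebuilt from the source plus the stored steps

print(before)              # ['SPARK', 'LINEAGE']
print(before == after)     # True
```

Because only the source and the list of steps need to survive, the rows themselves never have to be replicated to be recoverable.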
