RDD tracking
Every RDD keeps track of:
- Where it came from (its source or parent RDD)
- All the transformations it took to reach its current state
These steps are called the lineage (or DAG) of an RDD.
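The bookkeeping described above can be sketched in a few lines of plain Python. This is a toy model, not Spark's implementation: the `ToyRDD` class, its `transform` and `lineage` methods, and the step descriptions are all hypothetical names made up for illustration.

```python
class ToyRDD:
    """Toy stand-in for an RDD: remembers its parent and the step that produced it."""

    def __init__(self, description, parent=None):
        self.description = description  # what produced this RDD
        self.parent = parent            # None for the source RDD

    def transform(self, description):
        # A transformation returns a new ToyRDD that points back at this one.
        return ToyRDD(description, parent=self)

    def lineage(self):
        # Walk the parent pointers back to the source, then report source-first.
        steps = []
        node = self
        while node is not None:
            steps.append(node.description)
            node = node.parent
        return list(reversed(steps))


source = ToyRDD("read input file")
words = source.transform("flatMap: split lines into words")
long_words = words.transform("filter: keep words longer than 3 chars")

print(long_words.lineage())
```

Real Spark RDDs expose a comparable view of their lineage through the `toDebugString` method.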
Data Visualization
- In Spark, a job is associated with a chain of RDD dependencies organized as a directed acyclic graph (DAG).
- In this dependency graph, every RDD knows its parent RDD (or RDDs) and the transformation that produced it from that parent.
Note: Transformations are only recorded in memory; none of them is applied until we access the results through an action.
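To make the note concrete, here is a minimal sketch of lazy evaluation in plain Python. It is an assumption-laden toy, not Spark's API: `LazyRDD` and the helper `double` are hypothetical, and only the method names `map`, `filter`, and `collect` are borrowed from Spark. The `calls` list shows that no work happens until `collect()` is called.

```python
class LazyRDD:
    """Toy model: transformations are recorded and run only when collect() is called."""

    def __init__(self, source, ops=()):
        self.source = source   # the input data (a plain list here)
        self.ops = tuple(ops)  # recorded (kind, function) pairs, in order

    def map(self, fn):
        # Record the step; do not apply it yet.
        return LazyRDD(self.source, self.ops + (("map", fn),))

    def filter(self, fn):
        # Record the step; do not apply it yet.
        return LazyRDD(self.source, self.ops + (("filter", fn),))

    def collect(self):
        # The "action": only now are the recorded steps applied, in order.
        data = list(self.source)
        for kind, fn in self.ops:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


calls = []


def double(x):
    calls.append(x)  # record that the function really ran
    return x * 2


rdd = LazyRDD([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # nothing has been computed yet
result = rdd.collect()            # triggers the computation
print(calls_before_action, result, len(calls))
```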
Advantages of Lineage
- Allows RDDs to be reconstructed when nodes crash.
- We start from the source file, apply all the stored transformations, and recreate the RDD.
- Allows RDDs to be lazily instantiated (materialized) only when the results are accessed.
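The recovery idea above can be illustrated with a small plain-Python sketch. Everything here is hypothetical: `build_from_lineage`, the sample `source_lines`, and the two recorded steps are made up for illustration. Note one simplification: this toy replays the whole dataset, whereas Spark recomputes only the partitions that were lost.

```python
def build_from_lineage(source_lines, lineage):
    """Replay every recorded transformation, in order, starting from the source."""
    data = list(source_lines)
    for step in lineage:
        data = step(data)
    return data


# Hypothetical source data standing in for the input file.
source_lines = ["spark keeps lineage", "lineage enables recovery"]

# The recorded lineage: one function per transformation.
lineage = [
    lambda lines: [w for line in lines for w in line.split()],  # split into words
    lambda words: [w.upper() for w in words],                   # upper-case each word
]

materialized = build_from_lineage(source_lines, lineage)  # first computation
expected = list(materialized)

materialized = None  # simulate losing the data when a node crashes

recovered = build_from_lineage(source_lines, lineage)  # replay from the source
print(recovered == expected)
```

Because the source and the list of steps are all that is needed, nothing has to be stored about the lost data itself.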