Friday, February 9, 2018

Spark Architecture

Steps explained

1) We write the program in Scala, Java, or Python and submit the application to the Spark cluster
2) The Spark cluster is made up of multiple machines
3) One of these machines is assigned as the coordinator

4) The coordinator receives the application and breaks it into discrete tasks
5) These tasks, along with the data they work on, are assigned to worker machines to be executed

6) Every RDD has the potential to hold millions of records
7) Each worker machine may be assigned the task of performing the same transformation on a different set of data
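The steps above can be sketched as a toy simulation (plain Python, not real Spark; `coordinator` and `transform` are hypothetical names): a coordinator breaks the data into partitions, and every worker runs the same transformation on a different slice.

```python
from concurrent.futures import ThreadPoolExecutor

def transform(partition):
    # Every worker applies the same transformation to a different slice of data
    return [x * 2 for x in partition]

def coordinator(data, num_workers=4):
    # Break the job into discrete tasks: one partition of the data per task
    size = (len(data) + num_workers - 1) // num_workers
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    # Schedule each task on a worker and gather the results in order
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = pool.map(transform, partitions)
    return [x for part in results for x in part]

print(coordinator(list(range(8))))  # [0, 2, 4, 6, 8, 10, 12, 14]
```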

The architecture consists of 2 main components:

Driver: the coordinator process that executes the user program
  • Analyses the program and breaks it into discrete tasks that can be performed in a distributed manner
  • Checks the nodes in the cluster and assigns or schedules tasks on them (executors)
  • Runs in its own Java process
  • Only one driver (coordinator) per Spark application
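What "breaks it into discrete tasks" might look like can be sketched in plain Python (a toy model, not Spark's actual scheduler; `plan_tasks` is a hypothetical name): the driver emits the same chain of transformations once per data partition, so each task can run independently on a different executor.

```python
def plan_tasks(transformations, num_partitions):
    # Toy sketch of the driver's planning step: one discrete,
    # independently runnable task per partition of the data
    return [
        {"partition": pid, "ops": list(transformations)}
        for pid in range(num_partitions)
    ]

tasks = plan_tasks(["map", "filter"], num_partitions=3)
print(len(tasks))       # 3 tasks, one per partition
print(tasks[0]["ops"])  # ['map', 'filter']
```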

Executor: the worker process responsible for running individual tasks
  • Receives the tasks it has to execute from the driver program and runs them on the different nodes in the cluster
  • Provides in-memory storage for RDDs so that tasks run close to the data
  • Each executor runs in its own Java process
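The "in-memory storage for RDDs" point can be sketched as a toy executor (hypothetical class, not Spark's real block manager) that caches computed partitions so repeated tasks on the same data are served from memory instead of being recomputed.

```python
class Executor:
    """Toy sketch of an executor with in-memory RDD partition storage."""

    def __init__(self):
        self.block_store = {}   # (rdd_id, partition_id) -> cached partition

    def run_task(self, rdd_id, partition_id, data, fn):
        key = (rdd_id, partition_id)
        if key not in self.block_store:
            # Compute the partition and keep it in memory, so later tasks
            # on the same data run close to it instead of recomputing
            self.block_store[key] = fn(data)
        return self.block_store[key]

ex = Executor()
ex.run_task("rdd1", 0, [1, 2, 3], lambda p: [x + 1 for x in p])
# Second call with the same key is served straight from the block store
print(ex.run_task("rdd1", 0, [1, 2, 3], lambda p: [x + 1 for x in p]))  # [2, 3, 4]
```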

Cluster Manager

  • Launches the driver and executor processes
  • Pluggable; the built-in one is called the standalone cluster manager
    Ex: YARN can be plugged in

The driver and executors together make up the Spark application.
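The cluster manager's role can be sketched as a toy class (hypothetical names, not a real API): its job here is only to launch the processes, and the driver plus its executors together form one application.

```python
class StandaloneClusterManager:
    """Toy sketch of a pluggable cluster manager's launch step."""

    def launch_application(self, num_executors):
        # Launch one driver and the requested number of executors;
        # together they make up a single Spark application
        driver = {"role": "driver"}
        executors = [{"role": "executor", "id": i} for i in range(num_executors)]
        return {"driver": driver, "executors": executors}

app = StandaloneClusterManager().launch_application(num_executors=2)
print(len(app["executors"]))  # 2
```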

Receivers

  • Tasks which collect input data from different sources
  • Spark allocates a receiver for each input source
  • Special tasks that run on the executors


  1. Receivers run as tasks within executors
  2. They collect data from the input source and save it as RDDs in memory, so that Spark can replicate the collected data to another executor for fault tolerance
  3. If we have 2 sources, then 2 executors would be assigned for the streaming data
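The receiver behaviour above can be sketched as a toy model (hypothetical classes, not Spark Streaming's real API): one receiver per input source, each storing records in its executor's memory and replicating them to a second executor for fault tolerance.

```python
class ExecutorStore:
    """Toy in-memory block store on one executor."""
    def __init__(self):
        self.blocks = []

    def store(self, record):
        self.blocks.append(record)

class Receiver:
    """Toy receiver: one per input source, running inside an executor."""
    def __init__(self, source, executor, replica):
        self.source, self.executor, self.replica = source, executor, replica

    def collect(self):
        for record in self.source:
            self.executor.store(record)   # keep in this executor's memory
            self.replica.store(record)    # replicate for fault tolerance

sources = [["a1", "a2"], ["b1"]]          # two input sources -> two receivers
executors = [ExecutorStore() for _ in range(3)]
for i, src in enumerate(sources):
    Receiver(src, executors[i], executors[(i + 1) % 3]).collect()

print(executors[1].blocks)  # ['a1', 'a2', 'b1'] (replica of source 1 + its own source 2)
```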