Tuesday, January 21, 2020

Spark Executors and Drivers

  • Our code, which defines which data sources to consume from and which sinks to write to, is what drives all our execution
  • Responsible for creating the SparkContext
  • It understands how to communicate with the cluster manager (YARN, Mesos, or standalone)

  • Requests resources in the form of worker nodes. Each worker node contains one or more executors
  • The driver acts as a scheduler
  • An executor is a JVM process, which can spawn different tasks
  • These tasks perform the specific work the driver assigns to them
  • Each executor reserves a portion of its JVM memory as cache. This is known as the Spark cache
Note: Spark can also store data off-heap, i.e. outside the JVM
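A small sketch of the executor cache and the off-heap note above (the object name `CacheDemo` is illustrative; the off-heap settings are real Spark properties, but the sizes here are arbitrary):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheDemo {
  def cachedCount(spark: SparkSession): Long = {
    val onHeap = spark.range(0, 1000).toDF("id")
    onHeap.cache()                         // stored in the executor's JVM heap (Spark cache)

    val offHeap = spark.range(0, 1000).toDF("id")
    offHeap.persist(StorageLevel.OFF_HEAP) // stored outside the JVM heap

    onHeap.count() + offHeap.count()
  }

  def main(args: Array[String]): Unit = {
    // Off-heap storage must be explicitly enabled and sized
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("cache-demo")
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.memory.offHeap.size", "64m")
      .getOrCreate()
    println(cachedCount(spark)) // 2000
    spark.stop()
  }
}
```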


  • Requests resources from a scheduler (YARN)
  • Once the program gets the resources it needs in the form of executors, it is the driver's job to schedule the execution of the tasks that it sends to the executors
     There are 2 types of scheduler:

          CLUSTER MANAGER (YARN)
    • Responsible for responding to resource requests for executors

          SPARK SCHEDULER (Driver)
    • Responsible for translating transformations and actions into a DAG
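A minimal sketch of how lazy transformations build up a DAG that only an action executes (the object and method names here are made up for illustration):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object DagDemo {
  // filter and map are transformations: they only extend the lineage (DAG),
  // nothing runs yet. reduce is an action: it makes the Spark scheduler cut
  // the DAG into stages and tasks and ship those tasks to the executors.
  def doubledEvenSum(sc: SparkContext, n: Int): Int =
    sc.parallelize(1 to n)
      .filter(_ % 2 == 0)
      .map(_ * 2)
      .reduce(_ + _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("dag-demo").getOrCreate()
    println(doubledEvenSum(spark.sparkContext, 10)) // (2+4+6+8+10)*2 = 60
    spark.stop()
  }
}
```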

There are 2 types of cluster modes

          CLIENT MODE
  • When submitted in client mode, the machine (the client host) that you submit your application from actually hosts the driver code
  • A request is made to YARN to submit the application and allocate resources; YARN then creates an ApplicationMaster and containers that communicate back to the ApplicationMaster, which the driver then becomes aware of
  • The client host is responsible for the liveness of the driver: if the driver goes down, the whole application goes down
  • Availability depends on the client
  • Interactive (Spark shell / REPL)

          CLUSTER MODE
  • The driver itself (the Spark scheduler) is launched as a container within your cluster manager
  • The driver actually lives within the boundary of the ApplicationMaster
  • It communicates with executors in other containers just as it does in client mode
  • The cluster manager manages the driver's liveness
  • Better availability guarantees
  • Non-interactive
  • A container is an allocation, or share, of memory and CPU
  • One job may need multiple containers
  • Containers are allocated across the cluster depending on availability
  • Tasks are executed inside the containers
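As a sketch, the size and number of those executor containers can be requested through standard Spark properties (the values below are arbitrary examples, not recommendations):

```scala
import org.apache.spark.sql.SparkSession

// Ask the cluster manager for 4 executor containers,
// each with 2 cores and 4 GB of heap
val spark = SparkSession.builder()
  .appName("container-sizing-sketch")
  .config("spark.executor.instances", "4")
  .config("spark.executor.cores", "2")
  .config("spark.executor.memory", "4g")
  .getOrCreate()
```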

  • Each of these containers is a Spark executor, which carries task slots and cache
  • Spark splits its body of work into partitions that it has to work on
  • Every partition consumes one of these task slots (similar to input splits)
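A small sketch of the partition-to-task-slot relationship (the object name is illustrative; `local[4]` stands in for an executor with 4 task slots):

```scala
import org.apache.spark.sql.SparkSession

object PartitionDemo {
  def main(args: Array[String]): Unit = {
    // local[4]: one executor-like JVM with 4 task slots (worker threads)
    val spark = SparkSession.builder().master("local[4]").appName("partition-demo").getOrCreate()
    val sc = spark.sparkContext

    // 8 partitions but only 4 slots: the 8 tasks run in two waves of 4
    val rdd = sc.parallelize(1 to 1000, numSlices = 8)
    println(rdd.getNumPartitions)  // 8
    println(sc.defaultParallelism) // 4
    spark.stop()
  }
}
```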

  • The ApplicationMaster is a standalone application that YARN runs in a YARN container to manage a Spark application running in a YARN cluster
  • It is a framework-specific library tasked with negotiating cluster resources from the YARN ResourceManager and working with the YARN NodeManagers to execute and monitor tasks

The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing.
And from here:

import org.apache.spark.sql.SparkSession

object Ingest {
  // Main method: application entry point
  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .getOrCreate()
  }
}


local : Run Spark locally with one worker thread (i.e. no parallelism at all).

local[K] : Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).

local[K,F] : Run Spark locally with K worker threads and F maxFailures (see spark.task.maxFailures for an explanation of this variable)

local[*] : Run Spark locally with as many worker threads as logical cores on your machine.

local[*,F] : Run Spark locally with as many worker threads as logical cores on your machine and F maxFailures.

spark://HOST:PORT : Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.

spark://HOST1:PORT1,HOST2:PORT2 : Connect to the given Spark standalone cluster with standby masters with Zookeeper. The list must have all the master hosts in the high availability cluster set up with Zookeeper. The port must be whichever each master is configured to use, which is 7077 by default.

mesos://HOST:PORT : Connect to the given Mesos cluster. The port must be whichever one your master is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://.... To submit with --deploy-mode cluster, the HOST:PORT should be configured to connect to the MesosClusterDispatcher.

yarn : Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variables.
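For local testing, the master URL can also be hard-coded in the builder (note that a master set in code takes precedence over the --master flag, so in production it is normally left to spark-submit):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("master-url-demo")
  .master("local[*]") // in a real cluster: e.g. "spark://host:7077" or "yarn"
  .getOrCreate()
println(spark.sparkContext.master) // local[*]
spark.stop()
```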

