Web Snippets: Spark Executors and Drivers

DRIVER PROGRAM

Our code, which defines the data sources to read from and the data sources to write to, and which drives all execution
Responsible for creating the SparkContext
Knows how to communicate with the cluster manager (YARN, Mesos, or standalone)

CLUSTER MANAGER

Provides the requested resources in the form of worker nodes. Each worker node hosts executors
The driver acts as the scheduler for the tasks that run on those executors

EXECUTOR

It is a JVM process that can run multiple tasks
These tasks perform the specific work that the driver assigns to them

EXECUTOR MEMORY (SPARK CACHE)

Each executor has memory that it uses as a cache. This is known as the Spark cache
It is a portion of the JVM memory inside the executor

Note: Spark can also store data off-heap, that is, outside the JVM heap
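
As a rough illustration of how much of an executor's JVM heap ends up available for the cache, the sketch below applies the defaults documented for Spark's unified memory manager: about 300 MiB reserved, spark.memory.fraction = 0.6, and spark.memory.storageFraction = 0.5. The object and method names are hypothetical, and real numbers depend on the Spark version and configuration.

```scala
// Hypothetical helper: estimates executor memory regions from the JVM heap size.
// Assumes Spark's documented defaults; real values depend on version and settings.
object ExecutorMemorySketch {
  val ReservedMiB = 300L      // memory Spark sets aside for its own use
  val MemoryFraction = 0.6    // spark.memory.fraction (default)
  val StorageFraction = 0.5   // spark.memory.storageFraction (default)

  // Heap shared by execution and storage (the unified region), in MiB.
  def unifiedMiB(heapMiB: Long): Long =
    ((heapMiB - ReservedMiB) * MemoryFraction).toLong

  // Part of the unified region where cached data is protected from eviction, in MiB.
  def storageMiB(heapMiB: Long): Long =
    (unifiedMiB(heapMiB) * StorageFraction).toLong
}
```

For a 4096 MiB heap this estimates about 2277 MiB of unified memory, of which roughly half is protected for cached data.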

DRIVER

Requests resources from the cluster manager's scheduler (YARN)
Once the program gets the resources it needs in the form of executors, it is the driver's job to schedule the tasks that it sends to those executors
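
The resources a driver asks for are usually expressed as configuration. A minimal sketch, assuming a YARN cluster; the application name and all numeric values are illustrative assumptions, not recommendations:

```scala
import org.apache.spark.SparkConf

object ResourceRequestSketch {
  // Builds the configuration only; no connection to a cluster is made here.
  def conf: SparkConf = new SparkConf()
    .setAppName("resource-request-sketch")  // hypothetical name
    .setMaster("yarn")
    .set("spark.executor.instances", "4")   // number of executors to request
    .set("spark.executor.cores", "2")       // concurrent tasks per executor
    .set("spark.executor.memory", "4g")     // JVM heap per executor
}
```

The same settings can also be passed on the command line when submitting the application, which keeps them out of the code.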

SCHEDULING

There are two types of scheduler:

CLUSTER MANAGER SCHEDULER (YARN)

Responsible for responding to resource requests for executors.

SPARK SCHEDULER (DRIVER)

Responsible for translating transformations and actions into a DAG
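
A small sketch of what this looks like from the driver's side, assuming a local master with two threads. The object name and application name are hypothetical. The transformations only record lineage; the action at the end is what makes the driver build the DAG and schedule tasks.

```scala
import org.apache.spark.sql.SparkSession

object DagSketch {
  def run(): Long = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("dag-sketch")  // hypothetical name
      .getOrCreate()
    try {
      val numbers = spark.sparkContext.parallelize(1 to 10)
      // Transformations: nothing runs yet, Spark only records the lineage.
      val multiplesOfFour = numbers.map(_ * 2).filter(_ % 4 == 0)
      // Prints the recorded lineage that the scheduler will turn into stages.
      println(multiplesOfFour.toDebugString)
      // Action: the driver now builds the DAG and sends tasks to the executors.
      multiplesOfFour.count()
    } finally {
      spark.stop()
    }
  }
}
```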

DEPLOY MODE

There are two deploy modes:

1. CLIENT MODE

When submitted in client mode, the machine (the client host) that you submit your application from hosts the driver code
Requests are made to YARN to submit the application and to request resources. YARN then creates an application master and containers; the containers communicate back to the application master, which the driver becomes aware of
The client host is responsible for keeping the driver alive. If the driver goes down, the whole application goes down
Availability depends on the client
Interactive (Spark shell/REPL)

2. CLUSTER MODE

The driver itself (the Spark scheduler) runs as a container within your cluster manager
The driver lives within the boundary of the application master
It communicates with the executors in other containers just as it does in client mode
The cluster manager is responsible for keeping the driver alive
Better availability guarantees
Non-interactive
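
The difference between the two modes shows up as a single flag at submission time. The sketch below builds a spark-submit argument list for YARN; the object name, the main class, and the jar name used in the usage note are placeholders, not anything from a real project.

```scala
object SubmitCommandSketch {
  private val DeployModes = Set("client", "cluster")

  // Builds a spark-submit argument list for YARN in the given deploy mode.
  // mainClass and appJar are placeholders supplied by the caller.
  def command(deployMode: String, mainClass: String, appJar: String): Seq[String] = {
    require(DeployModes.contains(deployMode), s"unknown deploy mode: $deployMode")
    Seq("spark-submit",
      "--master", "yarn",
      "--deploy-mode", deployMode,
      "--class", mainClass,
      appJar)
  }
}
```

For example, SubmitCommandSketch.command("cluster", "Ingest", "ingest.jar").mkString(" ") gives spark-submit --master yarn --deploy-mode cluster --class Ingest ingest.jar; passing "client" instead keeps the driver on the submitting host.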

CONTAINER

A container is an allocation, or share, of memory and CPU
One job may need multiple containers
Containers are allocated across the cluster depending on availability
Tasks are executed inside the containers
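
To give a feel for the memory share of a container, the sketch below estimates what an executor asks YARN for: the executor heap plus a memory overhead, which by documented default is 10% of the heap with a 384 MiB minimum. The names are hypothetical, and YARN may round the request up to its own allocation increment, which is not modeled here.

```scala
// Hypothetical helper: estimates the memory of a YARN container for one executor.
// Assumes the documented default overhead rule; rounding by YARN is not modeled.
object ContainerSizeSketch {
  val MinOverheadMiB = 384L   // documented minimum overhead
  val OverheadFactor = 0.10   // documented default overhead factor

  // Off-heap overhead added on top of the executor heap, in MiB.
  def overheadMiB(executorMemoryMiB: Long): Long =
    math.max(MinOverheadMiB, (executorMemoryMiB * OverheadFactor).toLong)

  // Total memory requested for the container, in MiB.
  def containerMiB(executorMemoryMiB: Long): Long =
    executorMemoryMiB + overheadMiB(executorMemoryMiB)
}
```

With a 4096 MiB executor heap the estimate is 4505 MiB; with 2048 MiB the minimum overhead applies and the estimate is 2432 MiB.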

PARTITIONS

Each of these containers is a Spark executor. It holds task slots and the cache
Spark splits its body of work into partitions that it has to work on
Each partition consumes one of these task slots (similar to input splits)
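
Since each partition occupies one task slot, the number of slots determines how many partitions can be processed at once. A back-of-the-envelope sketch, assuming one CPU per task (the default for spark.task.cpus); the object and method names are hypothetical:

```scala
object TaskSlotSketch {
  // Total task slots across the application, assuming one core per task.
  def slots(executors: Int, coresPerExecutor: Int): Int =
    executors * coresPerExecutor

  // Number of rounds needed to process all partitions of one stage.
  def waves(partitions: Int, executors: Int, coresPerExecutor: Int): Int = {
    val available = slots(executors, coresPerExecutor)
    require(available > 0, "at least one task slot is required")
    (partitions + available - 1) / available  // ceiling division
  }
}
```

With 4 executors of 2 cores each there are 8 slots, so a stage with 200 partitions needs 25 rounds of tasks.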

APPLICATION MASTER

A standalone application that YARN runs in a YARN container to manage a Spark application running in a YARN cluster
A framework-specific library tasked with negotiating cluster resources from the YARN ResourceManager and working with the YARN NodeManagers to execute and monitor tasks

MASTER IN SPARKSESSION

The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing.

The master can also be set in code when building the SparkSession:

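import org.apache.spark.sql.SparkSession

object Ingest {
  def main(args: Array[String]): Unit = {
    // Run locally with two worker threads; suitable for testing.
    val spark = SparkSession
      .builder()
      .master("local[2]")
      .appName("Spark SQL basic example")
      .getOrCreate()

    // Release resources when the application is done.
    spark.stop()
  }
}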

local : Run Spark locally with one worker thread (i.e. no parallelism at all).

local[K] : Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).

local[K,F] : Run Spark locally with K worker threads and F maxFailures (see spark.task.maxFailures for an explanation of this variable)

local[*] : Run Spark locally with as many worker threads as logical cores on your machine.

local[*,F] : Run Spark locally with as many worker threads as logical cores on your machine and F maxFailures.

spark://HOST:PORT : Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.

spark://HOST1:PORT1,HOST2:PORT2 : Connect to the given Spark standalone cluster with standby masters with Zookeeper. The list must have all the master hosts in the high availability cluster set up with Zookeeper. The port must be whichever each master is configured to use, which is 7077 by default.

mesos://HOST:PORT : Connect to the given Mesos cluster. The port must be whichever one your Mesos master is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://.... To submit with --deploy-mode cluster, the HOST:PORT should be configured to connect to the MesosClusterDispatcher.

yarn : Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
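
The forms listed above can be told apart with simple pattern matching. The sketch below is only an illustration of the URL shapes and is not Spark's own parser; the object name and the wording of the descriptions are assumptions.

```scala
object MasterUrlSketch {
  // local[K]: a fixed number of worker threads.
  private val LocalN = """local\[(\d+)\]""".r
  // local[K,F] or local[*,F]: worker threads plus maxFailures.
  private val LocalWithFailures = """local\[(\d+|\*),\s*(\d+)\]""".r

  // Describes which kind of master a URL refers to.
  def describe(master: String): String = master match {
    case "local"                      => "local, 1 worker thread"
    case "local[*]"                   => "local, one worker thread per logical core"
    case LocalN(k)                    => s"local, $k worker threads"
    case LocalWithFailures(k, f)      => s"local, $k worker threads, $f maxFailures"
    case "yarn"                       => "YARN cluster"
    case m if m.startsWith("spark://") => "Spark standalone cluster"
    case m if m.startsWith("mesos://") => "Mesos cluster"
    case other                        => s"unrecognized master URL: $other"
  }
}
```

For instance, MasterUrlSketch.describe("local[2]") returns "local, 2 worker threads", matching the master used in the Ingest example.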

YARN NODE MANAGER

YARN RESOURCE MANAGER

Tuesday, January 21, 2020
