Friday, January 26, 2018

Yarn Architecture


Hadoop version 2 came with a fundamental change to the architecture. The framework was divided into two parts: MapReduce and YARN.

MapReduce: Responsible for defining the operations you want to perform on the data

YARN: Yet Another Resource Negotiator
  • Determines and coordinates all the tasks running on all the nodes in the cluster
  • Framework responsible for providing the computational resources (CPU, memory, etc.) needed for application execution
  • Assigns new tasks to nodes based on their existing capacity. If a node fails and all the processes on that node stop, it assigns the affected tasks to new nodes
  • It is a better resource negotiator
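To make the idea concrete, here is a minimal toy sketch in Python of what "negotiating resources" means. The class and method names are invented for illustration and this is not the Hadoop API; it only models the two behaviors listed above: assigning a task to the node with the most free capacity, and reassigning tasks when a node fails.

```python
# Toy model of a resource negotiator (NOT the real YARN API).

class Node:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity   # free task "slots" on this node
        self.tasks = []
        self.alive = True

class ResourceNegotiator:
    def __init__(self, nodes):
        self.nodes = nodes

    def assign(self, task):
        # Pick the live node with the most free capacity.
        candidates = [n for n in self.nodes if n.alive and n.capacity > 0]
        best = max(candidates, key=lambda n: n.capacity)
        best.tasks.append(task)
        best.capacity -= 1
        return best.name

    def handle_failure(self, dead):
        # All processes on the failed node stop; reassign its tasks elsewhere.
        dead.alive = False
        orphaned, dead.tasks = dead.tasks, []
        return [self.assign(t) for t in orphaned]
```

For example, with two nodes `n1` (2 free slots) and `n2` (3 free slots), a new task lands on `n2`; if `n2` then dies, its task is reassigned to `n1`.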


YARN Components
YARN is made up of two components:

Resource manager:
  • Runs on a single master node
  • Schedules tasks across the nodes
Node manager:
  • Runs on all the other nodes
  • Manages tasks on the individual node
Container:
  • All processes on a node run within a container
  • It's a logical container - a logical unit for the resources the process needs (memory, CPU, etc.)
  • Is defined by its resources
  • Responsible for running any task assigned to it. It executes that application
  • One node manager can have more than one container
Note: When a new process needs to be spun up on a node, the resource request for that process is made in the form of a container
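A short Python sketch of the note above (the names here, like `ContainerRequest` and `launch`, are invented for illustration and are not Hadoop classes): a container is just a bundle of resource requirements, and a node manager launches it only if the node still has enough free memory and CPU - which is also why one node manager can host several containers at once.

```python
# Toy model of containers as resource requests (NOT the Hadoop API).
from dataclasses import dataclass

@dataclass
class ContainerRequest:
    memory_mb: int
    vcores: int

class NodeManager:
    def __init__(self, memory_mb, vcores):
        self.free_memory = memory_mb
        self.free_vcores = vcores
        self.containers = []

    def launch(self, req):
        # Launch only if this node can satisfy the container's resources;
        # multiple containers can run as long as capacity remains.
        if req.memory_mb <= self.free_memory and req.vcores <= self.free_vcores:
            self.free_memory -= req.memory_mb
            self.free_vcores -= req.vcores
            self.containers.append(req)
            return True
        return False
```

On a node with 4096 MB and 4 vcores free, two requests of 1024 MB/1 vcore and 2048 MB/2 vcores both succeed, but a further 2048 MB request is refused because only 1024 MB remains.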

Application Master Process
  • After a container has been assigned on a node manager, the resource manager (master process) starts an application master process within the container
  • Responsible for performing the computation and processing the data
  • In the case of MapReduce, the application master process runs the mapper or reducer logic
  • Responsible for determining if additional resources are required to complete the job (for example, if there are pending mapper or reducer tasks that still need to run)
 Note: 
1) If more tasks need to be run, the application master asks the resource manager running on the master node for additional resources (containers) for the new mappers and reducers. This request specifies the CPU requirement, memory requirement, etc.

2) The resource manager continuously scans for nodes with available capacity. An individual node manager does not have this cluster-wide information

The node managers and the resource manager work together to accomplish parallel processing
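The two notes above can be tied together in one last toy sketch (again, invented names - not the Hadoop API): only the resource manager has the cluster-wide view of free capacity, so the application master asks *it* for extra containers, one per pending mapper or reducer task.

```python
# Toy end-to-end flow: AM asks RM for containers (NOT the Hadoop API).

class ResourceManager:
    def __init__(self):
        self.nodes = {}            # node name -> free container slots

    def register_node(self, name, slots):
        # The RM is the only component tracking every node's capacity.
        self.nodes[name] = slots

    def allocate(self, n_containers):
        # Scan all known nodes and grant containers where slots are free.
        granted = []
        for name in self.nodes:
            while self.nodes[name] > 0 and len(granted) < n_containers:
                self.nodes[name] -= 1
                granted.append(name)
        return granted

class ApplicationMaster:
    def __init__(self, rm):
        self.rm = rm

    def run(self, pending_tasks):
        # One container per pending mapper/reducer task.
        containers = self.rm.allocate(len(pending_tasks))
        return dict(zip(pending_tasks, containers))
```

With node `n1` offering 1 slot and `n2` offering 2, three pending tasks get placed as one container on `n1` and two on `n2` - the application master never needs to know the per-node capacities itself.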


