Friday, January 26, 2018

Map Reduce Data Flow

Pre loaded local input data and Mapping
  • MapReduce inputs typically come from input files loaded onto our processing cluster in HDFS. These files are evenly distributed across all the nodes.
  • Running a MapReduce program involves running these mapping tasks across all the nodes in our cluster.
  • Each of these mapping tasks are equivalent (No mappers have particular identity associated with them ). Therefore any mapper can process any input file.
  • Each mapper loads the set of file local to that machine and process them.

Intermediate data from the mappers and shuffling
  • When mapping phase is completed. the intermediate (key,value) pairs must be exchanged between machines to send all values with the same key to a single reducer
Reducing process
  • The reduce task are spread across the same nodes across the same nodes across the clusters.This is the only task in MapReduce.
  • Individual map task do no exchange information with one another, nor they are aware of  one anoothers'existence 
  • The user never explicitly marshals information from one machine to another; all data transfer is handled by the Hadoop MapReduce platform itself, guided implicitly by the different keys associated with values. This is a fundamental element of Hadoop MapReduce's reliability. 
If nodes in the cluster fail, tasks must be able to be restarted. If they have been performing side-effects, e.g., communicating with the outside world, then the shared state must be restored in a restarted task. By eliminating communication and side-effects, restarts can be handled more gracefully.

Word Count 

In the example below we can see the data flow for word count

Note: Combiner is a semi-reducer in mapreduce. This is an optional class which can be specified in mapreduce driver class to process the output of map tasks before submitting it to reducer tasks.