Friday, March 3, 2017

Overview of Hadoop MapReduce

  • Apache Hadoop is an open-source software framework used for distributed storage and processing of big data sets using the MapReduce programming model.
  • It runs on computer clusters built from commodity hardware.
  • All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

  • The map phase fetches data in parallel from all the nodes in the cluster
  • Its output is a set of key-value pairs
  • Each map task processes the data that is present on its own machine
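
To make the map step concrete, here is a minimal sketch in plain Python (not the Hadoop API). The word-count task, the `map_words` function and the `splits` data are assumptions made up for illustration; each inner list stands for the lines stored on one machine.

```python
from typing import Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Emit one (word, 1) key-value pair per word in a single input line."""
    for word in line.lower().split():
        yield (word, 1)


# Hypothetical input: each inner list is the data present on one machine.
splits = [
    ["hadoop stores data", "hadoop processes data"],
    ["data is big"],
]

# Each split is mapped independently of the others. On a real cluster these
# would run in parallel on different machines; here they run one by one.
mapped: List[List[Tuple[str, int]]] = [
    [pair for line in split for pair in map_words(line)]
    for split in splits
]

for machine, pairs in enumerate(mapped):
    print(machine, pairs)
```

Note that the map function never looks at another machine's data; it only turns its local lines into key-value pairs.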

  • The reduce phase works on the data produced during the map phase
  • Typical computations are an average or a sum, depending on the requirement
  • It is the step that combines the intermediate results

  • Combiners are very similar to the reduce phase. A combiner works on the mapper output before it goes to the reducer.
  • For example, if we have n mapper nodes, we would have n combiners
  • The output of the combiner is sent to the reducer
  • It takes load off the reducer, which makes the overall process more efficient
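
The reduce and combine steps above can be sketched in plain Python as follows (again, not the Hadoop API). The function names `combine` and `reduce_counts` and the sample `mapper_outputs` are hypothetical, and the sketch assumes one combiner run per mapper, as in the example above.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, int]


def combine(pairs: Iterable[Pair]) -> List[Pair]:
    """Sum the values per key for ONE mapper's output, before anything is sent."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return sorted(local.items())


def reduce_counts(pairs: Iterable[Pair]) -> Dict[str, int]:
    """Sum the values per key across the output of ALL mappers."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical map output, one list per mapper node.
mapper_outputs = [
    [("data", 1), ("hadoop", 1), ("data", 1), ("data", 1)],
    [("hadoop", 1), ("hadoop", 1), ("data", 1)],
]

# One combiner pass per mapper shrinks what has to travel to the reducer.
combined = [combine(output) for output in mapper_outputs]

sent_without_combiner = sum(len(output) for output in mapper_outputs)
sent_with_combiner = sum(len(output) for output in combined)

# The reducer gets the same final answer from the smaller, combined input.
final = reduce_counts(pair for output in combined for pair in output)

print(sent_without_combiner, sent_with_combiner, final)
```

A sum is used here because partial sums can safely be summed again. An average cannot be combined this way directly; the intermediate value would need to carry both a sum and a count.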

As developers, we only have to write two functions: Map and Reduce.
Hadoop does the rest behind the scenes.
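
As a rough illustration of this division of labour, the sketch below separates the two developer-written functions from a stand-in for the framework. It is plain Python, not Hadoop code: `run_job` is a hypothetical helper that only imitates the grouping Hadoop performs between map and reduce, and the "year,temperature" records are made-up sample data.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


# --- The two functions the developer writes ---

def mapper(record: str) -> Iterator[Tuple[str, int]]:
    """Turn one 'year,temperature' record into a (year, temperature) pair."""
    year, temperature = record.split(",")
    yield (year, int(temperature))


def reducer(year: str, temperatures: List[int]) -> Tuple[str, int]:
    """Keep the highest temperature seen for a year."""
    return (year, max(temperatures))


# --- Stand-in for the work the framework does behind the scenes ---

def run_job(
    records: Iterable[str],
    map_fn: Callable[[str], Iterator[Tuple[str, int]]],
    reduce_fn: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Run map on every record, group the values by key, then run reduce per key."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


records = ["2015,31", "2016,35", "2015,38", "2016,29"]
result = run_job(records, mapper, reducer)
print(result)
```

In a real Hadoop job the grouping, data movement and failure handling would be done by the framework across many machines; only `mapper` and `reducer` correspond to code the developer supplies.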

