- Apache Hadoop is an open-source software framework used for distributed storage and processing of big data sets using the MapReduce programming model.
- It is designed to run on computer clusters built from commodity hardware.
- All the modules in Hadoop are built with the fundamental assumption that hardware failures are common and should be automatically handled by the framework.
 
Map
- Reads and processes data in parallel across all the nodes in the cluster
- Output is a set of key-value pairs
- Each map task processes the data that is present on its own machine (sketched below)
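As a rough illustration of the map step, here is what a word-count mapper might look like with Hadoop's Java MapReduce API; the class name WordCountMapper and the word-count task itself are assumptions made for the example, not anything Hadoop prescribes.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative word-count mapper: each map task reads the input split stored
// on (or near) its own machine and emits (word, 1) key-value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // value is one line of the local split; the output is key-value pairs
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```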
 
Reduce
- Works on the intermediate data produced during the map phase
- The usual computation is an aggregate such as a sum or an average, depending on the requirement
- This is the step that combines the intermediate results (sketched below)
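A matching sketch of the reduce step, again assuming the word-count example: the hypothetical WordCountReducer sums the counts emitted for each word, and an average or any other aggregate would follow the same pattern.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative word-count reducer: combines the intermediate (word, count)
// pairs produced by the map phase into one total per word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();   // combine the intermediate results
        }
        result.set(sum);
        context.write(key, result);
    }
}
```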
 
Combiner
- Very similar to the reduce phase; it works on the mapper output before that output goes to the reducer
- E.g., if we have n mapper nodes, we would have n combiners (one per mapper)
- The output of the combiners is sent to the reducer
- Takes load off the reducer, making the whole process more efficient (sketched below)
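A combiner is written against the same Reducer contract; the hypothetical WordCountCombiner below simply pre-sums counts on the mapper's own node before the shuffle, so far less data crosses the network to the reducers. For an associative and commutative operation like this sum, the reducer class itself is often reused as the combiner instead of writing a separate class.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative combiner: same contract as a reducer, but it runs on each
// mapper's local output before the shuffle, producing partial sums.
public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable partialSum = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();   // pre-aggregate locally on the mapper node
        }
        partialSum.set(sum);
        context.write(key, partialSum);
    }
}
```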
 
As developers, we only have to write two functions, Map and Reduce; Hadoop does the rest behind the scenes.
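A minimal driver sketch, assuming the illustrative WordCountMapper, WordCountCombiner, and WordCountReducer classes above: the developer only wires these pieces together, while Hadoop handles input splitting, task scheduling, shuffling, sorting, and failure recovery.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver: the developer supplies only the map and reduce code
// (and, optionally, a combiner); the framework does everything else.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountCombiner.class);  // optional map-side pre-aggregation
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The compiled classes would typically be packaged into a jar and submitted with the hadoop jar command, passing the HDFS input and output paths as the two arguments.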