- Apache Hadoop is an open-source software framework used for distributed storage and processing of big data sets using the MapReduce programming model.
- It runs on computer clusters built from commodity hardware.
- All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.
Map
- Reads and processes the input data in parallel across all the nodes in the cluster
- Output is a set of key-value pairs
- Each map task processes the data that is present on its own machine (data locality); see the mapper sketch after this list
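Below is a minimal sketch of a mapper, assuming the classic word-count example (the class name WordCountMapper is illustrative, not something from the original post). Each call to map() receives one line from the local input split and emits (word, 1) pairs.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each map() call sees one line of the machine-local input split
        // and emits (word, 1) key-value pairs.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```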
Reduce
- Operates on the intermediate key-value pairs produced during the map phase
- Typical computations are aggregations such as a sum or an average, depending on the requirement
- This is the step that combines the intermediate results into the final output; a matching reducer sketch follows this list
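A minimal sketch of the matching reducer for the word-count example above (again, the class name is illustrative). All values for a given key arrive together, and here the aggregation is a simple sum, though it could just as well be an average or a maximum.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // All intermediate values for the same key are grouped together;
        // the computation here is a sum, per the word-count requirement.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```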
Combiner
- Very similar to the reduce phase; it works on each mapper's output before that output is sent to the reducer
- For example, with n mapper nodes there would typically be n combiners, one running locally on each mapper's output
- The output of the combiner is what gets sent to the reducer
- Takes load off the reducer and cuts down the data shuffled across the network, making the job more efficient (see the combiner sketch after this list)
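A minimal sketch of a combiner for the word-count example; it performs a local partial sum on each mapper's output before the shuffle. It is written here as its own class for clarity (the name is illustrative); in practice the reducer class itself is often reused as the combiner, since summation is associative and commutative.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountCombiner
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable partial = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Pre-aggregate counts on the mapper node so far less data
        // has to cross the network to reach the reducer.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        partial.set(sum);
        context.write(key, partial);
    }
}
```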
As developers we only have to write two functions, Map and Reduce (plus an optional combiner); Hadoop does the rest behind the scenes.
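To tie it together, here is a minimal driver sketch that wires the classes above into a job and submits it (class names and paths are illustrative). Beyond this, the framework handles input splitting, task scheduling, shuffling, and recovery from hardware failures.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // The developer supplies only the map and reduce logic
        // (and an optional combiner); Hadoop takes care of the rest.
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountCombiner.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Once packaged into a jar, a job like this is typically submitted with the hadoop jar command, passing the input and output paths as arguments.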