Thursday, January 18, 2018

Hadoop ecosystem

  • The Apache Hadoop project actively supports multiple projects intended to extend Hadoop’s capabilities and make it easier to use.
  • Several top-level projects provide development tools, and others manage Hadoop data flow and processing.

Data Ingestion

Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data.
Kafka: A messaging broker that is often used in place of traditional brokers in the Hadoop environment because it is designed for higher throughput and provides replication and greater fault tolerance.
Sqoop: A tool designed to transfer data between Hadoop and relational database servers such as MySQL or Oracle.


HDFS (Hadoop Distributed File System): A distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems.
HBase: A distributed, column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable.

Data Formats

Avro: An opinionated format that recognizes that data stored in HDFS is usually not a simple key/value pair such as Int/String. The format encodes the schema of its contents directly in the file, which allows you to store complex objects natively.
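To make the idea of a schema travelling with the data more concrete, here is a minimal sketch. It is not Avro's real binary encoding and it does not use an Avro library; it only imitates the layout idea with plain JSON lines. The User record and the helper names write_container and read_container are hypothetical, chosen for illustration.

```python
import json

# Hypothetical record type. Real Avro schemas are also written in JSON.
USER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "emails", "type": {"type": "array", "items": "string"}},
    ],
}

def write_container(schema, records):
    """Return text whose first line is the schema and whose other lines are records."""
    lines = [json.dumps(schema)]
    lines += [json.dumps(record) for record in records]
    return "\n".join(lines)

def read_container(text):
    """Recover both the schema and the records from the text alone."""
    header, *body = text.split("\n")
    return json.loads(header), [json.loads(line) for line in body]

users = [
    {"name": "Asha", "emails": ["asha@example.com", "a@example.org"]},
    {"name": "Ravi", "emails": []},
]
blob = write_container(USER_SCHEMA, users)
schema, records = read_container(blob)
print([field["name"] for field in schema["fields"]])
print(records[0]["emails"])
```

A reader of the file needs no outside knowledge of the record layout, which is what allows nested values such as the emails array to be stored and read back as they are.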

Parquet (columnar file format): Stores groups of rows adjacent to one another and, within each group, stores the values of each column adjacent to each other, so datasets are partitioned both horizontally and vertically. This is particularly useful if your data processing reads only a subset of the columns.
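The horizontal and vertical partitioning can be sketched in a few lines of plain Python. This is only an illustration of the layout idea, not the actual Parquet file format and not a Parquet library. The function names to_row_groups and read_column and the sample rows are assumptions made for the example.

```python
def to_row_groups(rows, columns, group_size):
    """Split rows into groups (horizontal), then store each group by column (vertical)."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        groups.append({col: [row[col] for row in chunk] for col in columns})
    return groups

def read_column(groups, column):
    """Collect one column across all groups without touching the other columns."""
    return [value for group in groups for value in group[column]]

rows = [
    {"id": 1, "city": "Pune", "amount": 10.0},
    {"id": 2, "city": "Delhi", "amount": 7.5},
    {"id": 3, "city": "Pune", "amount": 2.5},
]
groups = to_row_groups(rows, ["id", "city", "amount"], group_size=2)
print(len(groups))
print(sum(read_column(groups, "amount")))
```

A query that only sums amount reads the amount lists and skips id and city entirely, which is where a columnar layout saves work.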


MapReduce: The programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster.
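As a rough illustration of the paradigm, the classic word count can be simulated on a single machine. This sketch does not use the Hadoop API; the function names are hypothetical, and a real cluster would run the map and reduce steps in parallel on many servers.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values that share one key."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

print(word_count(["big data", "big cluster"]))
# prints {'big': 2, 'data': 1, 'cluster': 1}
```

Each call to map_phase depends on only one line and each call to reduce_phase on only one key, so the work can be spread over many machines.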

Resource Management

YARN: The architectural center of Hadoop. It allows multiple data processing engines, such as interactive SQL, real-time streaming, data science, and batch processing, to handle data stored in a single platform.


Pig: A procedural language for developing parallel processing applications for large data sets in the Hadoop environment. Pig is an alternative to Java programming for MapReduce and automatically generates MapReduce functions. It was originally developed at Yahoo.

Hive: Data warehousing software that addresses how data is structured and queried in distributed Hadoop clusters. It provides tools for ETL operations and brings some SQL-like capabilities to the environment.

Spark SQL: Apache Spark's module for working with structured data.

Spark MLlib: Apache Spark's scalable machine learning library.

GraphX: Apache Spark's API for graphs and graph-parallel computation.


Solr: A standalone enterprise search server with a REST-like API.
Elasticsearch: A distributed, RESTful search and analytics engine capable of solving a growing number of use cases.


Hue: A self-service analytics workbench for browsing, querying, and visualizing data.

Tableau: Helps users quickly and easily find valuable insights in vast Hadoop datasets. It removes the need for advanced knowledge of query languages by providing a clean visual analysis interface that makes working with big data more manageable for more stakeholders.


ZooKeeper: A high-performance coordination service for distributed applications.

Cluster Management

Hadoop is an open source project, and several vendors have stepped in to develop their own distributions on top of the Hadoop framework to make it enterprise ready. Some of the best-known companies are Hortonworks, Cloudera, and MapR.

Other Apache Hadoop-related open source projects

Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop.
Cassandra: A scalable multi-master database with no single point of failure.
Chukwa: A data collection system for managing large distributed systems.
Impala: The open source, native analytic database for Apache Hadoop. Impala is shipped by Cloudera, MapR, Oracle, and Amazon.
Mahout: A scalable machine learning and data mining library.
Tajo: A robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL on large data sets stored on HDFS and other data sources.
Tez: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases.
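
To show what executing a DAG of tasks means, as in the Tez entry above, here is a small single-machine sketch. It does not use the Tez API. The run_dag helper, the task names, and the sample numbers are all hypothetical, and it relies on graphlib from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

def run_dag(tasks, dependencies):
    """Run each task after the tasks it depends on and return results by task name."""
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        # Pass the outputs of the upstream tasks as positional inputs.
        inputs = [results[dep] for dep in dependencies.get(name, ())]
        results[name] = tasks[name](*inputs)
    return results

# Hypothetical tasks: two independent loads feed one aggregation, then a report.
tasks = {
    "load_orders": lambda: [3, 5, 8],
    "load_refunds": lambda: [1, 2],
    "net_total": lambda orders, refunds: sum(orders) - sum(refunds),
    "report": lambda total: f"net total = {total}",
}
dependencies = {
    "net_total": ["load_orders", "load_refunds"],
    "report": ["net_total"],
}
results = run_dag(tasks, dependencies)
print(results["report"])
# prints net total = 13
```

The two load tasks do not depend on each other, so an engine is free to run them at the same time. Expressing a job as a graph rather than a fixed chain of map and reduce steps is what exposes that freedom.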