Thursday, January 18, 2018

Hadoop ecosystem

  • The Apache Hadoop project actively supports multiple projects intended to extend Hadoop's capabilities and make it easier to use.
  • Several top-level projects provide development tools and manage Hadoop data flow and processing.

Data Ingestion

Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data.
Kafka: A messaging broker that is often used in place of traditional brokers in the Hadoop environment because it is designed for higher throughput and provides replication and greater fault tolerance.
Sqoop: A tool designed to transfer data between Hadoop and relational database servers such as MySQL or Oracle.


HDFS (Hadoop Distributed File System): A distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems.
HBase: A distributed, column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable.
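
As a rough illustration of the column-oriented model, the sketch below treats a table as a map from row key to column family to column qualifier to value. It is plain Python, not the HBase client API; the helpers `put` and `get`, the `user1`/`user2` rows, and the `info`/`stats` families are all hypothetical names chosen for the example.

```python
from collections import defaultdict

# table[row_key][column_family][qualifier] = value
# Plain-Python stand-in for a wide-column table; not the HBase API.
table = defaultdict(lambda: defaultdict(dict))

def put(row_key, family, qualifier, value):
    """Store one cell under its row key, column family, and qualifier."""
    table[row_key][family][qualifier] = value

def get(row_key, family, qualifier, default=None):
    """Read one cell without creating missing rows or families."""
    return table.get(row_key, {}).get(family, {}).get(qualifier, default)

put("user1", "info", "name", "Ada")
put("user1", "stats", "logins", 3)
put("user2", "info", "name", "Linus")  # user2 has no "stats" columns at all

print(get("user1", "stats", "logins"))
print(get("user2", "stats", "logins"))
```

Rows do not have to share the same set of columns, which is why the second lookup falls back to the default instead of failing.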

Data Formats

Avro: An opinionated format built on the understanding that data stored in HDFS is usually not a simple key/value combination such as Int/String. The format encodes the schema of its contents directly in the file, which lets you store complex objects natively.
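
The idea of a file that carries its own schema can be sketched in a few lines. The example below is only an illustration under simplifying assumptions: it writes the schema and records as JSON lines rather than Avro's real binary container format, and `write_with_schema`, `read_with_schema`, and the `User` record are hypothetical names. The schema itself follows the JSON style Avro uses for record definitions.

```python
import io
import json

# Hypothetical record schema, written in the JSON style Avro uses for schemas.
SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "emails", "type": {"type": "array", "items": "string"}},
    ],
}

def write_with_schema(stream, schema, records):
    """Write the schema as a header line, then one record per line."""
    stream.write(json.dumps(schema) + "\n")
    for record in records:
        stream.write(json.dumps(record) + "\n")

def read_with_schema(stream):
    """Recover the schema from the file itself, then the records."""
    schema = json.loads(stream.readline())
    records = [json.loads(line) for line in stream if line.strip()]
    return schema, records

buffer = io.StringIO()
write_with_schema(buffer, SCHEMA, [{"name": "Ada", "emails": ["ada@example.org"]}])
buffer.seek(0)
schema, records = read_with_schema(buffer)
print([field["name"] for field in schema["fields"]])
print(records[0]["emails"])
```

A reader needs nothing but the file to interpret the nested `emails` array, which is the property the description above refers to.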

Parquet (columnar file format): Stores records adjacent to one another and also stores column values adjacent to each other, so datasets are partitioned both horizontally and vertically. This is particularly useful if your data processing


MapReduce: A programming paradigm that allows massive scalability across hundreds or thousands of servers in a Hadoop cluster.
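
The paradigm can be sketched on a single machine as a word count. This is a minimal, in-memory illustration and not Hadoop's Java API: the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample lines are assumptions made for the example. In a real cluster, the map and reduce calls would run in parallel on different servers.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

lines = ["Hadoop stores data", "Hadoop processes data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Because each map call looks at one line and each reduce call looks at one key, the work splits cleanly across machines, which is where the scalability comes from.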

Resource Management

YARN: The architectural center of Hadoop. It allows multiple data processing engines, such as interactive SQL, real-time streaming, data science, and batch processing, to handle data stored in a single platform.


Pig: A procedural language for developing parallel processing applications for large data sets in the Hadoop environment. Pig is an alternative to Java programming for MapReduce and automatically generates MapReduce functions. It was originally developed at Yahoo.

Hive: Data warehousing software that addresses how data is structured and queried in distributed Hadoop clusters. It provides tools for ETL operations and brings some SQL-like capabilities to the environment.

Spark SQL: Apache Spark's module for working with structured data.

Spark MLlib: Apache Spark's scalable machine learning library.

GraphX: Apache Spark's API for graphs and graph-parallel computation.


Solr: A standalone enterprise search server with a REST-like API.
Elasticsearch: A distributed, RESTful search and analytics engine.


Hue: A self-service analytics workbench for browsing, querying, and visualizing data.

Tableau: Helps users quickly and easily find valuable insights in vast Hadoop datasets.
It removes the need for advanced knowledge of query languages by providing a clean visual analysis interface, which makes working with big data more manageable for more stakeholders.


ZooKeeper: A high-performance coordination service for distributed applications.

Cluster Management

Hadoop is an open source project, and several vendors have developed their own distributions on top of the Hadoop framework to make it enterprise ready. Well-known vendors include Hortonworks, Cloudera, and MapR.

Other Apache Hadoop-related open source projects

Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop.
Cassandra: A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
Impala: The open source, native analytic database for Apache Hadoop. Impala is shipped by Cloudera, MapR, Oracle, and Amazon.
Mahout: A scalable machine learning and data mining library.
Tajo: A robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL on large data sets stored on HDFS and other data sources.
Tez: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases.
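
To make "an arbitrary DAG of tasks" concrete, the sketch below runs a small dependency graph in a valid order. It uses Python's standard `graphlib` module (Python 3.9 or later) and is not the Tez API; the task names and the `run_dag` helper are hypothetical, and the tasks run one after another here, whereas an engine like Tez would distribute them across a cluster.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key depends on the tasks in its set.
dag = {
    "load_orders": set(),
    "load_users": set(),
    "join": {"load_orders", "load_users"},
    "aggregate": {"join"},
}

def run_dag(dag, run_task):
    """Run every task only after all of its dependencies have finished."""
    finished = []
    for task in TopologicalSorter(dag).static_order():
        run_task(task)
        finished.append(task)
    return finished

order = run_dag(dag, lambda task: print("running", task))
```

The two load tasks have no dependency on each other, so their relative order is not fixed; only the constraint that `join` follows both and `aggregate` follows `join` is guaranteed.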

