Tuesday, January 21, 2020

Difference between AWS Glue and Hive warehouse




Apache Hive vs AWS Glue: What are the differences?
Apache Hive: data warehouse software for reading, writing, and managing large datasets. Hive facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage, so existing files can be queried as tables.
AWS Glue: a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.
Both Apache Hive and AWS Glue are primarily classified as "Big Data" tools.
Some of the features offered by Apache Hive are:
  • Built on top of Apache Hadoop
  • Tools to enable easy access to data via SQL
  • Support for extract/transform/load (ETL), reporting, and data analysis
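To make "projecting structure onto data already in storage" more concrete, here is a minimal Python sketch that builds a HiveQL CREATE EXTERNAL TABLE statement for delimited text files. The table name, column names, and HDFS path are hypothetical and chosen only for illustration, and the helper build_external_table_ddl is not part of Hive. The generated statement would still have to be run through a Hive client.

```python
def build_external_table_ddl(table, columns, location, delimiter=","):
    """Return a HiveQL statement that declares a schema over existing files.

    An EXTERNAL table only records metadata; the files under `location`
    are left where they are and are read at query time.
    """
    column_lines = ",\n".join(
        f"  {name} {hive_type}" for name, hive_type in columns
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{column_lines}\n"
        f")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table, columns, and path, used only for illustration.
ddl = build_external_table_ddl(
    table="page_views",
    columns=[("user_id", "BIGINT"), ("url", "STRING"), ("view_time", "TIMESTAMP")],
    location="hdfs:///data/page_views",
)
print(ddl)
```

Once such a table is declared, the files can be queried with ordinary SQL, for example SELECT url, COUNT(*) FROM page_views GROUP BY url.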



On the other hand, AWS Glue provides the following key features:
  • Easy - AWS Glue automates much of the effort in building, maintaining, and running ETL jobs. AWS Glue crawls your data sources, identifies data formats, and suggests schemas and transformations. AWS Glue automatically generates the code to execute your data transformations and loading processes.
  • Integrated - AWS Glue is integrated across a wide range of AWS services.
  • Serverless - AWS Glue is serverless. There is no infrastructure to provision or manage. AWS Glue handles provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scale-out Apache Spark environment. You pay only for the resources used while your jobs are running.
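The idea of crawling a data source and suggesting a schema can be illustrated with a toy example. The sketch below is not how AWS Glue is implemented and does not call any AWS API; it only shows, under simplified assumptions, how column types might be guessed from sample CSV rows. The function names, the sample data, and the choice of Hive-style type names (bigint, double, string) are assumptions made for this illustration.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type that fits every non-empty sample value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    # Try integer first, then floating point; fall back to string.
    for type_name, parser in (("bigint", int), ("double", float)):
        try:
            for v in non_empty:
                parser(v)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text):
    """Guess a column-name to type mapping from CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, samples = rows[0], rows[1:]
    # Transpose the sample rows into per-column tuples of values.
    columns = list(zip(*samples)) if samples else [()] * len(header)
    return {name: infer_type(col) for name, col in zip(header, columns)}


# Hypothetical sample data, used only for illustration.
SAMPLE = "user_id,url,duration_sec\n1,/home,3.5\n2,/about,12\n"
schema = infer_schema(SAMPLE)
print(schema)
```

A real crawler also has to detect the file format itself, handle nested data, and cope with inconsistent records, which this sketch deliberately ignores.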
Apache Hive is an open source tool with 2.62K GitHub stars and 2.58K GitHub forks; its source repository is hosted on GitHub.
According to the StackShare community, Apache Hive has broader approval: it is mentioned in 27 company stacks and 12 developer stacks, compared to AWS Glue, which is listed in 13 company stacks and 7 developer stacks.
