Web Snippets: Performance in Hive

Tuesday, February 20, 2018

Performance in Hive

Performance in Hive can be improved by:

1. PARTITIONING

Logically breaks up the data
Any time a new value is added to the partition column that doesn't match any of the existing partitions, a new partition is created

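Types of partitioning
Static
We must know the partition values in advance
Data is loaded manually into each partition
Dynamic
Partitions are determined by Hive from the data
Hive sets a default maximum number of dynamic partitions
We can increase it through configuration (e.g. hive.exec.max.dynamic.partitions)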
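
To make the idea concrete, here is a minimal Python sketch (not Hive code) of how partitioning lays rows out as one "directory" per column value, so a query that filters on the partition column only reads one of them. The function names, the `country` column, and the in-memory dict standing in for HDFS directories are all assumptions made for illustration.

```python
from collections import defaultdict

def partition_rows(rows, partition_col):
    """Group rows into 'directories' keyed by the partition column's value."""
    partitions = defaultdict(list)
    for row in rows:
        # A value that has not been seen before automatically gets a new partition.
        key = f"{partition_col}={row[partition_col]}"
        partitions[key].append(row)
    return dict(partitions)

def scan_partition(partitions, partition_col, value):
    """Read only the partition that matches the filter, skipping all the others."""
    return partitions.get(f"{partition_col}={value}", [])

# Hypothetical sample data.
rows = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "IN"},
    {"id": 3, "country": "US"},
]
partitions = partition_rows(rows, "country")
us_rows = scan_partition(partitions, "country", "US")
```

Here a filter on `country = 'US'` touches only the `country=US` group, which is roughly what partition pruning buys us in Hive.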

2. BUCKETING

Aims to make the splits roughly the same size
Lets us specify the number of buckets up front
Records are assigned to individual buckets by applying a hash function to the values in a particular column
Buckets in Hive are files on HDFS, which store the records whose values map to that bucket

Hash function

Takes a large range of input values and maps them to a finite number of categories
Buckets are organized on disk as a separate file for each bucket

Advantages of bucketing

Helps with sampling of data and with join operations
Joins become more efficient because we know exactly which bucket the matching row will fall into
We end up scanning only one file instead of the entire dataset
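
The join advantage can be sketched in Python as follows. This is an illustration rather than Hive's actual implementation: it assumes integer join keys, uses a plain modulo in place of the hash function, and assumes both tables are bucketed on the same column with the same number of buckets. The table contents and function names are made up.

```python
def bucket_of(key, num_buckets):
    # Integer keys only in this sketch; modulo stands in for the hash function.
    return key % num_buckets

def bucketize(rows, key_col, num_buckets):
    """Distribute rows into num_buckets lists, one list per bucket 'file'."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[key_col], num_buckets)].append(row)
    return buckets

def bucketed_join(left_buckets, right_buckets, key_col):
    """Join bucket i of the left table only with bucket i of the right table."""
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        # Index the right-hand bucket by key so each lookup stays inside one bucket.
        lookup = {}
        for r in right:
            lookup.setdefault(r[key_col], []).append(r)
        for l in left:
            for r in lookup.get(l[key_col], []):
                joined.append({**l, **r})
    return joined

# Hypothetical sample tables.
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 4, "item": "pen"},
    {"user_id": 2, "item": "lamp"},
]
users = [
    {"user_id": 1, "name": "asha"},
    {"user_id": 2, "name": "ravi"},
    {"user_id": 4, "name": "mei"},
]
result = bucketed_join(
    bucketize(orders, "user_id", 3),
    bucketize(users, "user_id", 3),
    "user_id",
)
```

Because matching keys always hash to the same bucket index, each row is compared only against one bucket of the other table instead of the whole table.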

Implementing buckets

We need to specify the number of buckets
We need to use a hash function for moving the records into buckets

e.g. with 3 buckets:
1 % 3 = 1 --> sent to bucket 1
2 % 3 = 2 --> sent to bucket 2
3 % 3 = 0 --> sent to bucket 0
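
The modulo example can be reproduced with a few lines of Python. This is only a sketch of the idea: it assumes integer column values and uses `value % num_buckets` as the hash function, and `assign_bucket` is a hypothetical helper, not a Hive API.

```python
def assign_bucket(value, num_buckets):
    """Return the bucket index for an integer column value."""
    return value % num_buckets

num_buckets = 3
# One list per bucket; in Hive each of these would be a file on HDFS.
buckets = {i: [] for i in range(num_buckets)}
for value in [1, 2, 3, 4, 5, 6]:
    buckets[assign_bucket(value, num_buckets)].append(value)
```

With consecutive values like these, every bucket receives the same number of records, which is the even spread that bucketing aims for.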

Sampling of data
Involves getting a small portion of the dataset in order to run tests or debug queries

Note: Partitions are directories, and buckets are files under these directories

3. OPTIMIZE JOIN OPERATIONS

Joins are MapReduce operations in Hive
We can optimize joins in 2 ways.
1) Reducing the amount of data that is held in memory while performing the join
The smaller the data held in memory, the faster the lookup for specific records in the table
e.g. a 500 GB table joined with a 5 GB table
The smaller table should be the one held in memory

2) Eliminating the reduce phase by structuring the join as a map-only operation.
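
Both ideas can be illustrated with a small Python sketch of a map-side join: the smaller table is loaded into an in-memory lookup, and the larger table is streamed past it one row at a time, so no shuffle or reduce step is required. This is a simplified model, not how Hive is implemented internally, and the table contents and the `map_side_join` name are assumptions.

```python
def map_side_join(big_rows, small_rows, key_col):
    """Join by holding the small table in memory and streaming the big one."""
    # Build the in-memory lookup from the smaller table only.
    small_lookup = {}
    for row in small_rows:
        small_lookup.setdefault(row[key_col], []).append(row)

    # Each big-table row is joined on its own, so rows never need to be
    # regrouped by key in a separate reduce phase.
    for big_row in big_rows:
        for small_row in small_lookup.get(big_row[key_col], []):
            yield {**big_row, **small_row}

# Hypothetical sample tables: 'clicks' plays the large table, 'pages' the small one.
clicks = [
    {"page_id": 10, "user": "a"},
    {"page_id": 20, "user": "b"},
    {"page_id": 10, "user": "c"},
    {"page_id": 30, "user": "d"},
]
pages = [
    {"page_id": 10, "title": "home"},
    {"page_id": 20, "title": "about"},
]
joined = list(map_side_join(clicks, pages, "page_id"))
```

The approach only works when the smaller table fits comfortably in memory, which is why the choice of which side to hold matters.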

4. WINDOW FUNCTIONS

Are syntactic sugar
They don't make our queries faster; they make Hive queries more robust and maintainable by allowing complex queries to be expressed in a simple manner.
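
As a rough illustration of what a window function expresses in one line, the Python sketch below emulates a running total per group, similar in spirit to `SUM(amount) OVER (PARTITION BY dept ORDER BY day)`. It assumes the ordering values are distinct within each group, and the `sales` data and the `running_total` helper are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def running_total(rows, partition_col, order_col, value_col):
    """Compute a per-partition running sum, ordered by order_col."""
    # Sort by partition first, then by the ordering column inside each partition.
    ordered = sorted(rows, key=itemgetter(partition_col, order_col))
    out = []
    for _, group in groupby(ordered, key=itemgetter(partition_col)):
        total = 0
        for row in group:
            total += row[value_col]
            # Every input row is kept and gains one extra computed column.
            out.append({**row, "running_total": total})
    return out

# Hypothetical sample data.
sales = [
    {"dept": "toys", "day": 2, "amount": 5},
    {"dept": "toys", "day": 1, "amount": 10},
    {"dept": "books", "day": 1, "amount": 7},
]
result = running_total(sales, "dept", "day", "amount")
```

Written by hand this takes a sort, a grouping step, and a loop; the window function states the same intent declaratively, which is where the maintainability benefit comes from.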
