Performance in Hive can be improved by:
1. PARTITIONING
- Logically breaks up the data based on the values of a column
- Whenever a new value is added to the partition column and it doesn't match any of the existing partitions, a new partition is created
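Types of partitioning
Static
- We must know the partition values in advance
- The data for each partition is loaded manually
Dynamic
- Partitions are determined by Hive from the data being loaded
- The default maximum number of partitions is set by Hive
- We can increase it through configuration (for example, the hive.exec.max.dynamic.partitions property)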
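To make the idea of partitions concrete, here is a minimal Python sketch of how rows could be grouped into one directory per partition value. It only simulates the layout in memory; the function name partition_paths, the table directory /warehouse/sales and the sample rows are made up for illustration and are not part of Hive.

```python
from collections import defaultdict


def partition_paths(rows, partition_col, table_dir="/warehouse/sales"):
    """Group rows by the partition column, one directory per distinct value."""
    partitions = defaultdict(list)
    for row in rows:
        # Hive names partition directories as <column>=<value>.
        path = f"{table_dir}/{partition_col}={row[partition_col]}"
        # The partition value lives in the path, so it is dropped from the stored row.
        stored = {k: v for k, v in row.items() if k != partition_col}
        partitions[path].append(stored)
    return dict(partitions)


# Hypothetical sample data.
rows = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "IN"},
    {"id": 3, "country": "US"},
]
layout = partition_paths(rows, "country")
for path in sorted(layout):
    print(path, layout[path])
```

A row with a country value that has not been seen before would simply produce one more directory, which is roughly what happens when a new partition is created.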
2. BUCKETING
- Makes sure the splits are of roughly the same size
- Lets us specify the number of buckets (categories) up front
- Records are assigned to individual buckets by applying a hashing function to the values in a particular column
- Buckets in Hive are files on HDFS, which store those records whose values map to that bucket
- The hash function takes a large range of input values and maps them to a finite number of categories
- On disk, each bucket is stored as a separate file
Advantages of bucketing
- Helps with sampling of data and with join operations
- Joins become more efficient because we know exactly which bucket the matching row will fall into
- We end up scanning only one file instead of the entire dataset
Implementing buckets
- We need to specify the number of buckets
- We need to use a hash function for moving the records into buckets
ex (with 3 buckets):
1 % 3 = 1 --> sent to bucket 1
2 % 3 = 2 --> sent to bucket 2
3 % 3 = 0 --> sent to bucket 0
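The modulo example can be sketched in a few lines of Python. This is only an illustration that assumes integer keys and uses the key itself as the hash; the function name assign_buckets is made up, and the hash Hive actually applies depends on the column type.

```python
def assign_buckets(keys, num_buckets):
    """Place each integer key into bucket (key % num_buckets)."""
    buckets = {b: [] for b in range(num_buckets)}
    for key in keys:
        # Assumption: the hash of an integer key is the key itself.
        buckets[key % num_buckets].append(key)
    return buckets


buckets = assign_buckets([1, 2, 3, 4, 5, 6], 3)
print(buckets)  # {0: [3, 6], 1: [1, 4], 2: [2, 5]}

# To look up key 5 we only need to scan one bucket, not the whole dataset.
candidates = buckets[5 % 3]
print(5 in candidates)  # True
```

The same property is what helps joins: rows with the same key always land in the same bucket number, so only matching buckets need to be compared.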
Sampling of data
Involves getting a small portion of the dataset in order to run tests or debug
queries
Note : Partitions are directories and buckets are files under these directories
3. OPTIMIZE JOIN OPERATIONS
Joins are MapReduce operations in Hive. We can optimize joins in 2 ways.
1) Reducing the amount of data that is held in memory while performing the join
The smaller the data held in memory, the faster the lookup for specific records in the table
ex: when a 500 GB table is joined with a 5 GB table, the smaller (5 GB) table should be the one held in memory
2) Eliminating the reduce phase by structuring the join as a map-only operation
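Both ideas can be sketched together in Python: the smaller table is loaded into an in-memory lookup, and the larger table is streamed through it record by record, so no shuffle or reduce step is needed. This is a simplified illustration of the technique, not Hive's actual implementation; the function name map_side_join and the sample tables are made up.

```python
def map_side_join(large_rows, small_rows, key):
    """Inner join that keeps only the smaller table in memory."""
    # Build a hash lookup from the smaller table.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    # Stream the larger table; each record is joined independently,
    # which is what makes the join a map-only operation.
    for row in large_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}


# Hypothetical sample data: orders is the "large" table, customers the "small" one.
orders = [
    {"order_id": 100, "cust_id": 1},
    {"order_id": 101, "cust_id": 2},
    {"order_id": 102, "cust_id": 9},  # no matching customer, dropped by the inner join
]
customers = [
    {"cust_id": 1, "name": "Asha"},
    {"cust_id": 2, "name": "Ben"},
]
joined = list(map_side_join(orders, customers, "cust_id"))
for row in joined:
    print(row)
```

In Hive itself this kind of join is usually requested through settings such as hive.auto.convert.join or a MAPJOIN hint rather than written by hand.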
4. WINDOW FUNCTIONS
- Are syntactic sugar
- Don't make our queries faster, but they allow Hive queries to be more robust and maintainable by letting complex queries be expressed in a simple manner