Tuesday, February 20, 2018

Performance in Hive

Performance in Hive can be improved by:

  1. PARTITIONING

  •  Logically breaks up the data
  •  Anytime a new value is added to the partition column that doesn't match any of the
       existing partitions, a new partition is created



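     Types of partitioning
        Static
           Partition values must be known in advance
           Data is loaded manually into each partition
        Dynamic
           Partitions are determined by Hive from the data
           The default maximum number of partitions is set by Hive
           We can increase it through configuration (for example, hive.exec.max.dynamic.partitions)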
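
To make the dynamic case concrete, here is a minimal Python sketch of how rows could be routed to Hive-style partition directories of the form `<table>/<column>=<value>`. The table path, the column name and the helper functions are made up for illustration; this only models the idea and is not how Hive is implemented.

```python
from collections import defaultdict


def partition_path(table_dir, column, value):
    # Hive-style partition directory name: <table>/<column>=<value>
    return f"{table_dir}/{column}={value}"


def load_dynamic(rows, table_dir, column):
    """Group rows by the partition column.

    A new "directory" appears the first time an unseen value shows up,
    which mirrors what dynamic partitioning does.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_path(table_dir, column, row[column])].append(row)
    return dict(partitions)


# Hypothetical sample data and table location.
rows = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "IN"},
    {"id": 3, "country": "US"},
]
layout = load_dynamic(rows, "/warehouse/sales", "country")
for path in sorted(layout):
    print(path, len(layout[path]))
```

With static partitioning the set of `country=...` directories would be decided by us up front, and each load would name its target partition explicitly.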
   

      2. BUCKETING


  •  Makes sure the splits are of roughly the same size
  •  Allows us to specify the number of categories (buckets) up front
  •  Records are assigned to individual buckets by applying a hash function to the values in a particular column
  •  Buckets in Hive are files on HDFS, each storing the records whose values map to that bucket
         Hash function

    •  Takes a large range of input values and maps it to a finite number of categories
    •  On disk, buckets are organized as a separate file for each bucket

            Advantages of bucketing

    •  Helps with sampling of data and with join operations
    •  Joins become more efficient because you know exactly which bucket the matching row will fall into
    •  We end up scanning only one file instead of the entire dataset

            Implementing buckets

    •  We need to specify the number of buckets
    •  We need to use a hash function for moving the records into buckets

                  ex   1 % 3 = 1 --> sent to bucket 1
                       2 % 3 = 2 --> sent to bucket 2
                       3 % 3 = 0 --> sent to bucket 0
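
The modulo rule above can be sketched in a few lines of Python. This assumes non-negative integer keys and uses the key itself as the hash value; the function names are hypothetical, and a real hash function would also handle strings and other types.

```python
def bucket_for(value, num_buckets):
    # Assumption: non-negative integer keys, so the key acts as its own hash.
    return value % num_buckets


def assign_buckets(values, num_buckets):
    # One list per bucket, standing in for one file per bucket on HDFS.
    buckets = {b: [] for b in range(num_buckets)}
    for v in values:
        buckets[bucket_for(v, num_buckets)].append(v)
    return buckets


buckets = assign_buckets(range(1, 7), 3)
print(buckets)  # {0: [3, 6], 1: [1, 4], 2: [2, 5]}
```

Because the number of buckets is fixed up front, the same key always lands in the same bucket, which is what makes bucketed joins and sampling possible.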
     
          Sampling of data
              Involves getting a small portion of the dataset in order to run tests or debug
              queries
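
As a rough illustration of why bucketing helps sampling, the sketch below models bucket files as Python lists and takes a sample by reading a single bucket instead of the whole dataset. The data and function names are hypothetical; in Hive itself the TABLESAMPLE clause is the real mechanism for this.

```python
def build_bucket_files(ids, num_buckets):
    # Stand-in for bucket files on HDFS: bucket number -> list of records.
    files = {b: [] for b in range(num_buckets)}
    for i in ids:
        files[i % num_buckets].append(i)
    return files


def sample_one_bucket(bucket_files, bucket):
    # Only one "file" is read; the other buckets are never touched.
    return bucket_files[bucket]


bucket_files = build_bucket_files(range(1, 101), 4)
sample = sample_one_bucket(bucket_files, 0)
print(len(sample), "of 100 records read")
```

With 4 buckets, reading one bucket gives roughly a quarter of the data, as long as the hash spreads the keys evenly.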
     
    Note : Partitions are directories and buckets are files under these directories
 

 3. OPTIMIZE JOIN OPERATIONS 

Joins are MapReduce operations in Hive.
    We can optimize joins in 2 ways:
          1) Reducing the amount of data that is held in memory while performing the join
             The smaller the data held in memory, the faster the lookup for specific records in the table
             e.g. when a 500 GB table is joined with a 5 GB table,
              the smaller (5 GB) table should be the one held in memory

          2) Eliminating the reduce phase by structuring the join as a map-only operation.
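
Both ideas can be seen in a small Python sketch of a map-side (hash) join: the small table is loaded into an in-memory lookup once, and the big table is streamed past it, so no shuffle or reduce step is needed. The table contents and the function name are made up for illustration, and the sketch does an inner join only.

```python
def map_side_join(big_rows, small_rows, key):
    # Build the in-memory hash table from the SMALL table, once.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Stream the big table and probe the hash table row by row.
    # Each row is handled independently, so this is a map-only operation.
    for big in big_rows:
        for small in lookup.get(big[key], []):
            yield {**small, **big}


# Hypothetical data: "orders" plays the big table, "customers" the small one.
orders = [
    {"order_id": 1, "cust_id": 10},
    {"order_id": 2, "cust_id": 20},
    {"order_id": 3, "cust_id": 10},
]
customers = [
    {"cust_id": 10, "name": "Ann"},
    {"cust_id": 20, "name": "Bob"},
]
joined = list(map_side_join(orders, customers, "cust_id"))
for row in joined:
    print(row)
```

Swapping the roles (hashing the big table) would still give the right answer, but the lookup structure would be far larger, which is exactly what point 1 warns against.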

 4. WINDOW FUNCTIONS

  •  Are syntactic sugar
  •  Don't make our queries faster; they make Hive queries more robust and maintainable by allowing complex queries to be expressed in a simple manner.
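
To get a feel for what the sugar saves us, here is a hand-written Python version of a running total per group, the kind of result a window function such as SUM(...) OVER (PARTITION BY ... ORDER BY ...) expresses in a single clause. The column names and data are hypothetical, and this is only a sketch of the logic, not of how Hive evaluates windows.

```python
from itertools import groupby
from operator import itemgetter


def running_total(rows, partition_key, order_key, value_key):
    out = []
    # Sort by partition, then by the ordering column inside each partition.
    ordered = sorted(rows, key=itemgetter(partition_key, order_key))
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        total = 0  # the total restarts for every partition
        for row in group:
            total += row[value_key]
            out.append({**row, "running_total": total})
    return out


sales = [
    {"dept": "a", "day": 2, "amount": 5},
    {"dept": "a", "day": 1, "amount": 10},
    {"dept": "b", "day": 1, "amount": 7},
]
result = running_total(sales, "dept", "day", "amount")
for row in result:
    print(row)
```

The manual version needs sorting, grouping and bookkeeping; the window function states the same intent declaratively, which is where the maintainability gain comes from.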
