Sunday, June 24, 2018

Permanent UDF in Hive


A temporary function is available only in the current CLI session.
Every time we want to use it in a new session, we have to add the JAR and create the temporary function again:

hive> ADD JAR /home/cloudera/mask.jar;
Added [/home/cloudera/mask.jar] to class path
Added resources: [/home/cloudera/mask.jar]
hive> CREATE TEMPORARY FUNCTION MASK AS 'hiveudf.PImask';


HIVE PERMANENT FUNCTION

Note: If a temporary function has already been created in the session, use a different function name when creating the permanent function.
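
If the temporary function is no longer needed, an alternative to picking a new name may be to drop it before creating the permanent one. This is a minimal sketch, assuming the temporary function MASK from the session above. DROP TEMPORARY FUNCTION is standard HiveQL, but whether a name clash actually occurs may depend on the Hive version.

```sql
-- Assumption: MASK was created earlier in this session with CREATE TEMPORARY FUNCTION.
-- IF EXISTS avoids an error when no such temporary function is registered.
DROP TEMPORARY FUNCTION IF EXISTS MASK;
```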

The problem with a temporary function is that it is valid only while the session in which it was created is alive; it is lost as soon as we log off.
We often need functions to be permanent so that they can be used across sessions and from different edge nodes. Let us create a permanent function from a JAR file stored on HDFS and test it.



EXAMPLES

1. Store the JAR file in an HDFS location instead of on the local file system, so that all nodes always have access to it.
    $> hadoop fs -put MaskingData.jar /user/cloudera/

2. Next, create the permanent function, including the HDFS path of the JAR.
hive> CREATE FUNCTION MASK AS 'HiveUDF.Masking' using JAR 'hdfs://localhost:8020/user/cloudera/MaskingData.jar';

Note that when the function is created, Hive copies the JAR from HDFS to the local file system (under /tmp/), but the resource is registered with its hdfs:// location.
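
To check the registration, the function can be inspected from the Hive prompt. This is a short sketch using standard HiveQL. It assumes the function was created while the default database was in use, in which case it should also be reachable through its qualified name default.mask.

```sql
-- Print the description Hive holds for the function, if any.
DESCRIBE FUNCTION EXTENDED MASK;

-- Permanent functions belong to a database, so the qualified name also works
-- (assumption: the function was created in the default database).
SELECT default.mask(category_name) FROM categories LIMIT 10;
```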


First let’s run it in the same session in which we created the function:

hive> SELECT category_id, category_name, MASK(category_name) FROM categories LIMIT 10;

The function is working as expected and the category_name field is masked.
Next, we log out of the session and log back in (or log in from another edge node, if possible) and perform the same test. Had this been a temporary function, it would have been lost as soon as we logged out.

Let's see how Permanent Functions work

hive> exit;
$> hive
hive> SELECT category_id, category_name, MASK(category_name) FROM categories LIMIT 10;

Note that even in the new session, as soon as we use the function MASK, Hive automatically fetches and adds the required JAR file.
This means that the JAR file must not be moved from the location that was specified when the function was created.
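
If the JAR does have to move, one way to handle it is to drop the function and create it again with the new location. This is a sketch only: the directory /user/cloudera/udfs/ is a hypothetical new location, not one used in this post.

```sql
-- Remove the old definition, which still points at the previous JAR location.
DROP FUNCTION IF EXISTS MASK;

-- Re-create the function against the new (hypothetical) HDFS path.
CREATE FUNCTION MASK AS 'HiveUDF.Masking'
USING JAR 'hdfs://localhost:8020/user/cloudera/udfs/MaskingData.jar';
```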


1 comment:

  1. Very informative. Could you please explain how you created the MaskingData.jar?
