Sunday, June 24, 2018

Permanent UDF in Hive


If we create a temporary function, it is available only for the current CLI session.
Every time we want to use the function, we have to add the JAR and create the temporary function again:

hive> ADD JAR /home/cloudera/mask.jar;
Added [/home/cloudera/mask.jar] to class path
Added resources: [/home/cloudera/mask.jar]
hive> CREATE TEMPORARY FUNCTION MASK AS 'hiveudf.PImask';


HIVE PERMANENT FUNCTION

Note: If we have already created a temporary function with a given name, we need to use a new function name when creating the permanent function.

The problem with a temporary function is that it is valid only while the session in which it was created is alive, and it is lost as soon as we log off.
Many times we have requirements where the functions need to be permanent so that they can be used across sessions and across different edge nodes. Let us create a permanent function from the same JAR file and test it.
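A minimal sketch of the permanent-function flow, assuming the same mask.jar and hiveudf.PImask class from the temporary-function example; the HDFS path and the function name mask_p are hypothetical choices:

```shell
# Copy the JAR to HDFS so every session and every edge node can reach it
# (a local path only works on the machine where the JAR lives)
hadoop fs -put /home/cloudera/mask.jar /user/cloudera/udfs/

# Register a permanent function backed by the JAR on HDFS.
# Note the new name mask_p: MASK is already taken by the temporary function.
hive -e "CREATE FUNCTION mask_p AS 'hiveudf.PImask' USING JAR 'hdfs:///user/cloudera/udfs/mask.jar';"
```

Because the function definition is stored in the metastore and the JAR lives on HDFS, it survives logoff and is visible from any new session.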

Create UDF functions in hive



HIVE UDF FUNCTIONS
Functions are built for a specific purpose, to perform mathematical, arithmetic, logical, and relational operations on the operands of table columns.

We can write the UDF in Java as shown below.
In this example, we replace the characters of a string with "*", masking characters which should not be shown to the user.
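A minimal sketch of such a masking UDF, using the classic org.apache.hadoop.hive.ql.exec.UDF API; the class name PImask matches the earlier example, but the exact masking rule (keep only the last four characters visible) is an assumption:

```java
package hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical masking UDF: replaces all but the last 4 characters with '*'.
public class PImask extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null; // Hive passes NULL through
        }
        String s = input.toString();
        StringBuilder masked = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            // mask everything except the last four characters
            masked.append(i < s.length() - 4 ? '*' : s.charAt(i));
        }
        return new Text(masked.toString());
    }
}
```

Compile this against the hive-exec JAR and package it as mask.jar; the evaluate method is what Hive invokes for each row.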

Eclipse Marketplace for Scala


We can find the Scala IDE for Eclipse in the Eclipse Marketplace.

STEPS INVOLVED
  • In the Eclipse IDE, go to the Help menu and click on Eclipse Marketplace


Linux file system hierarchy


If you look at the Linux file hierarchy, you will find the following:

/bin - Common binaries

/sbin - Binaries used for system administration are placed here.

/boot - Static files of the boot loader. It usually contains the Linux kernel, GRUB boot loader files, and so on.

/dev - Device files such as your CD drive, hard disk, and any other physical device. 

In Linux/Unix, the common premise is that everything is a file.

/home - User home directories are found here. In Unix variants like FreeBSD, the home directories are found in /usr/home, and in Solaris they are in /export. So there are quite big differences here.
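A quick way to see this hierarchy for yourself on a Linux box (only directories that exist on the particular system will be listed):

```shell
# Show the top-level directories described above
ls -ld /bin /dev
```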

Install SBT using yum


sbt is an open-source build tool for Scala and Java projects, similar to Maven and Ant for Java.

Its main features are:

  • Native support for compiling Scala code and integrating with many Scala test frameworks
  • Continuous compilation, testing, and deployment
  • Incremental testing and compilation (only changed sources are re-compiled, only affected tests are re-run etc.)
  • Build descriptions written in Scala using a DSL
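The installation itself is a short command sequence; the repository URL below is the one sbt's documentation pointed RPM-based distributions to at the time (Bintray has since been retired, so newer installs use a different repository):

```shell
# Add the sbt RPM repository definition for yum
curl https://bintray.com/sbt/rpm/rpm | sudo tee /etc/yum.repos.d/bintray-sbt-rpm.repo

# Install sbt with yum
sudo yum install -y sbt

# Verify the installation (first run downloads sbt's own dependencies)
sbt sbtVersion
```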


Tuesday, June 19, 2018

Accumulators and Broadcast Variables in Spark


SPARK PROCESSING
  • Distributed and parallel processing
  • Each executor gets its own copies of variables and functions
  • No propagation of data back to the driver (except in certain necessary cases)
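The two sharing mechanisms that work around this model can be sketched in PySpark as follows; the variable names and data are illustrative, and a local SparkContext is assumed:

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "shared-vars-demo")

# Broadcast variable: a read-only copy shipped once to every executor
lookup = sc.broadcast({"a": 1, "b": 2})

# Accumulator: executors can only add to it; the driver reads the total
bad_records = sc.accumulator(0)

def score(word):
    if word not in lookup.value:
        bad_records.add(1)  # incremented on executors, summed on the driver
        return 0
    return lookup.value[word]

total = sc.parallelize(["a", "b", "x", "a"]).map(score).sum()
print(total)              # 4  (1 + 2 + 0 + 1)
print(bad_records.value)  # 1  ("x" was not in the lookup)
sc.stop()
```

Broadcasts avoid re-shipping the same lookup data with every task, while the accumulator is the one sanctioned way for executors to feed a count back to the driver.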

Wednesday, June 13, 2018

Simple demo for spark streaming

USING PYSPARK

1. Log in to your shell and open pyspark
[cloudera@quickstart ~]$ pyspark
Python 2.6.6 (r266:84292, Jul 23 2015, 15:22:56) 


2. Run Netcat
[cloudera@quickstart ~]$  nc -l localhost 2182
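With netcat listening, a minimal DStream word count can be run from the pyspark shell; the 2-second batch interval is an illustrative choice, while the host and port mirror the netcat command above:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-demo")
ssc = StreamingContext(sc, 2)  # 2-second micro-batches

# Connect to the netcat server started above
lines = ssc.socketTextStream("localhost", 2182)

# Classic streaming word count: every batch prints the counts
# for the words typed into the netcat terminal during that batch
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```

Type a few words into the netcat terminal and they appear, counted, in the pyspark terminal every two seconds.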