Sunday, June 24, 2018

Permanent UDF in Hive


If we create a temporary function, it is available only for the current CLI session.
Every time we want to use the function, we have to add the JAR and create the temporary function again:

hive> ADD JAR /home/cloudera/mask.jar;
Added [/home/cloudera/mask.jar] to class path
Added resources: [/home/cloudera/mask.jar]
hive> CREATE TEMPORARY FUNCTION MASK AS 'hiveudf.PImask';


HIVE PERMANENT FUNCTION

Note: If we have already created a temporary function with a given name, we need to use a new function name when creating the permanent function.

The problem with a temporary function is that it is valid only while the session in which it was created is alive, and it is lost as soon as we log off.
Many times we have requirements where the functions need to be permanent so that they can be used across sessions and across different edge nodes. Let us create a permanent function from the same JAR file and test it.
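A minimal sketch of the permanent-function flow, assuming the same mask.jar and hiveudf.PImask class from the temporary-function example; the HDFS path and the function name mask_p are hypothetical choices:

```shell
# Copy the JAR to HDFS so every session and every edge node can reach it
# (a local path only works on the machine where the JAR lives)
hadoop fs -put /home/cloudera/mask.jar /user/cloudera/udfs/

# Register a permanent function backed by the JAR on HDFS.
# Note the new name mask_p: MASK is already taken by the temporary function.
hive -e "CREATE FUNCTION mask_p AS 'hiveudf.PImask' USING JAR 'hdfs:///user/cloudera/udfs/mask.jar';"
```

Because the function definition is stored in the metastore and the JAR lives on HDFS, it survives logoff and is visible from any new session.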

Create UDF functions in hive



HIVE UDF FUNCTIONS
Functions are built for a specific purpose, to perform mathematical, arithmetic, logical, and relational operations on the operands of table columns.

We can write the UDF in Java as shown below.
In this example, we replace the characters of a string with "*", masking characters which should not be shown to the user.
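A minimal sketch of such a masking UDF, using the classic org.apache.hadoop.hive.ql.exec.UDF API; the class name PImask matches the earlier example, but the exact masking rule (keep only the last four characters visible) is an assumption:

```java
package hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical masking UDF: replaces all but the last 4 characters with '*'.
public class PImask extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null; // Hive passes NULL through
        }
        String s = input.toString();
        StringBuilder masked = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            // mask everything except the last four characters
            masked.append(i < s.length() - 4 ? '*' : s.charAt(i));
        }
        return new Text(masked.toString());
    }
}
```

Compile this against the hive-exec JAR and package it as mask.jar; the evaluate method is what Hive invokes for each row.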

Eclipse Marketplace for Scala


We can find the Scala IDE for Eclipse in the Eclipse Marketplace.

STEPS INVOLVED
  • In the Eclipse IDE, go to the Help menu and click on Eclipse Marketplace


Linux file system hierarchy


If you look at the Linux file hierarchy, you will find the following:

/bin - Common binaries

/sbin - Binaries used for system administration are placed here.

/boot - Static files of the boot loader. It usually contains the Linux kernel, GRUB boot loader files, and so on.

/dev - Device files such as your CD drive, hard disk, and any other physical device. 

In Linux/Unix, the common premise is that everything is a file.

/home - User home directories are found here. In Unix variants like FreeBSD, the home directories are found in /usr/home, and in Solaris they are in /export. So there are quite big differences here.
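A quick way to see this hierarchy for yourself on a Linux box (only directories that exist on the particular system will be listed):

```shell
# Show the top-level directories described above
ls -ld /bin /dev
```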

Install SBT using yum


sbt is an open-source build tool for Scala and Java projects, similar to Maven and Ant for Java.

Its main features are:

  • Native support for compiling Scala code and integrating with many Scala test frameworks
  • Continuous compilation, testing, and deployment
  • Incremental testing and compilation (only changed sources are re-compiled, only affected tests are re-run etc.)
  • Build descriptions written in Scala using a DSL
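The installation itself is a short command sequence; the repository URL below is the one sbt's documentation pointed RPM-based distributions to at the time (Bintray has since been retired, so newer installs use a different repository):

```shell
# Add the sbt RPM repository definition for yum
curl https://bintray.com/sbt/rpm/rpm | sudo tee /etc/yum.repos.d/bintray-sbt-rpm.repo

# Install sbt with yum
sudo yum install -y sbt

# Verify the installation (first run downloads sbt's own dependencies)
sbt sbtVersion
```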


Tuesday, June 19, 2018

Accumulators and Broadcast Variables in Spark


SPARK PROCESSING
  • Distributed and parallel processing
  • Each executor gets its own copies of variables and functions
  • No propagation of data back to the driver (except in certain necessary cases)
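The two sharing mechanisms that work around this model can be sketched in PySpark as follows; the variable names and data are illustrative, and a local SparkContext is assumed:

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "shared-vars-demo")

# Broadcast variable: a read-only copy shipped once to every executor
lookup = sc.broadcast({"a": 1, "b": 2})

# Accumulator: executors can only add to it; the driver reads the total
bad_records = sc.accumulator(0)

def score(word):
    if word not in lookup.value:
        bad_records.add(1)  # incremented on executors, summed on the driver
        return 0
    return lookup.value[word]

total = sc.parallelize(["a", "b", "x", "a"]).map(score).sum()
print(total)              # 4  (1 + 2 + 0 + 1)
print(bad_records.value)  # 1  ("x" was not in the lookup)
sc.stop()
```

Broadcasts avoid re-shipping the same lookup data with every task, while the accumulator is the one sanctioned way for executors to feed a count back to the driver.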

Wednesday, June 13, 2018

Simple demo for spark streaming

USING PYSPARK

1. Log in to your shell and open pyspark
[cloudera@quickstart ~]$ pyspark
Python 2.6.6 (r266:84292, Jul 23 2015, 15:22:56) 


2. Run Netcat
[cloudera@quickstart ~]$  nc -l localhost 2182
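With netcat listening, a minimal DStream word count can be run from the pyspark shell; the 2-second batch interval is an illustrative choice, while the host and port mirror the netcat command above:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-demo")
ssc = StreamingContext(sc, 2)  # 2-second micro-batches

# Connect to the netcat server started above
lines = ssc.socketTextStream("localhost", 2182)

# Classic streaming word count: every batch prints the counts
# for the words typed into the netcat terminal during that batch
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```

Type a few words into the netcat terminal and they appear, counted, in the pyspark terminal every two seconds.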