Sunday, June 24, 2018

Simple transformations in Spark



MAP:
  • map is a transformation operation in Spark, hence it is lazily evaluated
  • It is a narrow operation, as it does not shuffle data from one partition to multiple partitions
Note: In this example, map takes each value from the list and pairs it with 1, producing (element, 1) tuples
scala> val x=sc.parallelize(List("spark","rdd","example","sample","example"),3)
x: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[4]  
at parallelize at <console>:27

scala> val y=x.map(x=>(x,1))
y: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[5] 
at map at <console>:29

scala> y.collect
res0: Array[(String, Int)] = Array((spark,1), (rdd,1), (example,1), 
(sample,1), (example,1))
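
Because map is a narrow operation, the child RDD keeps the parent RDD's partitioning. We can verify this by continuing the session above:

scala> y.partitions.size
res1: Int = 3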



Running cron Job




What is cron?

Cron is the name of a program that enables Unix users to execute commands
or scripts (groups of commands) automatically at a specified time/date.
It is normally used for sysadmin tasks, like makewhatis, which builds a
search database for the man -k command, or for running a backup script,
but it can be used for anything. A common use for it today is connecting to
the internet and downloading your email.
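
As a minimal example, we can schedule a script with crontab -e (the script path below is hypothetical):

[cloudera@quickstart ~]$ crontab -e

# m h dom mon dow command
# Run backup.sh every day at 2:30 AM
30 2 * * * /home/cloudera/scripts/backup.sh >> /tmp/backup.log 2>&1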

Overview of HBase


  • A non-relational (NoSQL) database that runs on top of HDFS
  • An open-source NoSQL database that provides real-time read/write access to large datasets
  • Scales linearly to handle huge data sets with billions of rows and millions of columns, and easily combines data sources that use a wide variety of different structures and schemas
  • Natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN
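
As a quick sketch of the read/write model in the HBase shell (the table and column family names here are made up for illustration):

hbase(main):001:0> create 'customers', 'info'
hbase(main):002:0> put 'customers', 'row1', 'info:name', 'John'
hbase(main):003:0> get 'customers', 'row1'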

Permanent UDF in hive


If we create a temporary function, it is available only for the current CLI session.
Every time we want to use the function, we need to add the JAR and create the temporary function again:

hive> ADD JAR /home/cloudera/mask.jar;
Added [/home/cloudera/mask.jar] to class path
Added resources: [/home/cloudera/mask.jar]
hive> CREATE TEMPORARY FUNCTION MASK AS 'hiveudf.PImask';


HIVE PERMANENT FUNCTION

Note: If we have already created a temporary function, we need to choose a different function name when creating the permanent function.

The problem with a temporary function is that it is valid only while the session in which it was created is alive, and it is lost as soon as we log off.
Many times we have requirements where we need functions to be permanent so that they can be used across sessions and across different edge nodes. Let us create a permanent function from the same JAR file and test it, as shown below.
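
A sketch of the permanent version, assuming the JAR has been copied to a (hypothetical) HDFS path so that all sessions and edge nodes can reach it:

hive> CREATE FUNCTION maskp AS 'hiveudf.PImask'
    > USING JAR 'hdfs:///user/cloudera/mask.jar';

Note the new name maskp, since MASK was already taken by the temporary function in this session.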

Create UDF functions in hive



HIVE UDF FUNCTIONS
Functions are built for a specific purpose, to perform mathematical, arithmetic, logical, and relational operations on the operands of table columns.

We can write the UDF in Java, as shown below.
In this example, we replace each character of a string with "*", masking characters that should not be shown to the user.
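
A minimal sketch of such a UDF, using the classic org.apache.hadoop.hive.ql.exec.UDF base class. The class name matches the hiveudf.PImask used in the earlier examples, but the exact masking logic here is an assumption:

package hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class PImask extends UDF {
    // Replace every character of the input with '*'
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        String value = input.toString();
        StringBuilder masked = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            masked.append('*');
        }
        return new Text(masked.toString());
    }
}

Package this class into a JAR (for example, the mask.jar used above) and register it with ADD JAR and CREATE TEMPORARY FUNCTION as shown earlier.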

Eclipse market place for scala


We can find the Scala IDE for Eclipse in the Eclipse Marketplace.

STEPS INVOLVED
  • In the Eclipse IDE, go to the Help tab and click on Eclipse Marketplace
  • Search for "Scala IDE" in the Marketplace and click Install


Linux file system hierarchy


If you look at the Linux file hierarchy, you will find the following:

/bin - Common binaries

/sbin - Binaries used for system administration are placed here.

/boot - Static files of the boot loader. Usually it contains the Linux kernel, GRUB boot loader files, and so on.

/dev - Device files such as your CD drive, hard disk, and any other physical device. 

In Linux/Unix, the common premise is that everything is a file.

/home - User HOME directories are found here. In other Unix systems, such as FreeBSD, the HOME directories are found in /usr/home, and in Solaris they are in /export. So quite a big difference here.
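
To explore this layout on your own machine (the listing varies by distribution):

$ ls /
$ man hier     # the manual page that documents this hierarchy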

Install SBT using yum


sbt is an open-source build tool for Scala and Java projects, similar to Java's Maven and Ant.

Its main features are:

  • Native support for compiling Scala code and integrating with many Scala test frameworks
  • Continuous compilation, testing, and deployment
  • Incremental testing and compilation (only changed sources are re-compiled, only affected tests are re-run etc.)
  • Build descriptions written in Scala using a DSL
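
On a yum-based system, the installation at the time of writing looked like the following (the repository URL comes from the sbt documentation of that era and may have changed since):

$ curl https://bintray.com/sbt/rpm/rpm | sudo tee /etc/yum.repos.d/bintray-sbt-rpm.repo
$ sudo yum install sbt
$ sbt sbtVersion     # verify the installation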


Tuesday, June 19, 2018

Accumulators and Broadcast Variables in Spark


SPARK PROCESSING
  • Distributed and parallel processing
  • Each executor gets its own copies of variables and functions
  • No propagation of data back to the driver (except in certain necessary cases)
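
This is where accumulators (write-only counters aggregated on the driver) and broadcast variables (read-only values shipped once to each executor) come in. A minimal sketch in the Spark 1.x shell, matching the examples above (Spark 2.x replaces sc.accumulator with sc.longAccumulator):

scala> val acc = sc.accumulator(0)
scala> sc.parallelize(1 to 100).foreach(x => acc += 1)
scala> acc.value
res0: Int = 100

scala> val factor = sc.broadcast(10)
scala> sc.parallelize(1 to 3).map(_ * factor.value).collect
res1: Array[Int] = Array(10, 20, 30)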

Wednesday, June 13, 2018

Simple demo for spark streaming


USING PYSPARK

1. Log in to your shell and start pyspark
[cloudera@quickstart ~]$ pyspark
Python 2.6.6 (r266:84292, Jul 23 2015, 15:22:56) 


2. Run Netcat
[cloudera@quickstart ~]$  nc -l localhost 2182
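
3. With Netcat listening, a minimal word-count job against that socket might look like the following (a sketch using the classic DStream API; the host and port match the nc command above):

# In the pyspark shell, sc is already available
from pyspark.streaming import StreamingContext

# Create a streaming context with a 5-second batch interval
ssc = StreamingContext(sc, 5)

# Connect to the Netcat server started in step 2
lines = ssc.socketTextStream("localhost", 2182)

# Count words in each batch and print the result
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()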