Sunday, June 24, 2018

Simple transformations in Spark



MAP:
  • map is a transformation operation in Spark, hence it is lazily evaluated
  • It is a narrow operation, as it does not shuffle data from one partition to multiple partitions
Note: In this example, map takes each value from the list and pairs it with 1, producing (element, 1) tuples
scala> val x=sc.parallelize(List("spark","rdd","example","sample","example"),3)
x: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[4]  
at parallelize at <console>:27

scala> val y=x.map(x=>(x,1))
y: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[5] 
at map at <console>:29

scala> y.collect
res0: Array[(String, Int)] = Array((spark,1), (rdd,1), (example,1), 
(sample,1), (example,1))
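
Because map is a narrow operation, the child RDD keeps the parent RDD's partitioning. We can verify this by continuing the session above:

scala> y.partitions.size
res1: Int = 3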



Running cron Job




What is cron?

Cron is the name of a program that enables Unix users to execute commands
or scripts (groups of commands) automatically at a specified time/date.
It is normally used for sysadmin tasks, like makewhatis, which builds a
search database for the man -k command, or for running a backup script,
but it can be used for anything. A common use for it today is connecting to
the internet and downloading your email.
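
As a minimal example, we can schedule a script with crontab -e (the script path below is hypothetical):

[cloudera@quickstart ~]$ crontab -e

# m h dom mon dow command
# Run backup.sh every day at 2:30 AM
30 2 * * * /home/cloudera/scripts/backup.sh >> /tmp/backup.log 2>&1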

Overview of HBase


  • A non-relational (NoSQL) database that runs on top of HDFS
  • An open-source NoSQL database that provides real-time read/write access to large datasets
  • Scales linearly to handle huge data sets with billions of rows and millions of columns, and easily combines data sources that use a wide variety of different structures and schemas
  • Natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN
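
As a quick sketch of the read/write model in the HBase shell (the table and column family names here are made up for illustration):

hbase(main):001:0> create 'customers', 'info'
hbase(main):002:0> put 'customers', 'row1', 'info:name', 'John'
hbase(main):003:0> get 'customers', 'row1'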

Permanent UDF in hive


If we create a temporary function, it is available only for the current CLI session.
Every time we want to use the function, we need to add the JAR and create the temporary function again:

hive> ADD JAR /home/cloudera/mask.jar;
Added [/home/cloudera/mask.jar] to class path
Added resources: [/home/cloudera/mask.jar]
hive> CREATE TEMPORARY FUNCTION MASK AS 'hiveudf.PImask';


HIVE PERMANENT FUNCTION

Note: If we have already created a temporary function, we need to choose a different function name when creating the permanent function.

The problem with a temporary function is that it is valid only while the session in which it was created is alive, and it is lost as soon as we log off.
Many times we have requirements where we need functions to be permanent so that they can be used across sessions and across different edge nodes. Let us create a permanent function from the same JAR file and test it, as shown below.
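
A sketch of the permanent version, assuming the JAR has been copied to a (hypothetical) HDFS path so that all sessions and edge nodes can reach it:

hive> CREATE FUNCTION maskp AS 'hiveudf.PImask'
    > USING JAR 'hdfs:///user/cloudera/mask.jar';

Note the new name maskp, since MASK was already taken by the temporary function in this session.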

Create UDF functions in hive



HIVE UDF FUNCTIONS
Functions are built for a specific purpose, to perform mathematical, arithmetic, logical, and relational operations on the operands of table columns.

We can write the UDF in Java, as shown below.
In this example, we replace each character of a string with "*", masking characters that should not be shown to the user.
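
A minimal sketch of such a UDF, using the classic org.apache.hadoop.hive.ql.exec.UDF base class. The class name matches the hiveudf.PImask used in the earlier examples, but the exact masking logic here is an assumption:

package hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class PImask extends UDF {
    // Replace every character of the input with '*'
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        String value = input.toString();
        StringBuilder masked = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            masked.append('*');
        }
        return new Text(masked.toString());
    }
}

Package this class into a JAR (for example, the mask.jar used above) and register it with ADD JAR and CREATE TEMPORARY FUNCTION as shown earlier.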

Eclipse market place for scala


We can find the Scala IDE for Eclipse in the Eclipse Marketplace.

STEPS INVOLVED
  • In the Eclipse IDE, go to the Help tab and click on Eclipse Marketplace
  • Search for "Scala IDE" in the Marketplace and click Install


Linux file system hierarchy


If you look at the Linux file hierarchy, you will find the following:

/bin - Common binaries

/sbin - Binaries used for system administration are placed here.

/boot - Static files of the boot loader. Usually it contains the Linux kernel, GRUB boot loader files, and so on.

/dev - Device files such as your CD drive, hard disk, and any other physical device. 

In Linux/Unix, the common premise is that everything is a file.

/home - User HOME directories are found here. In other Unix systems, such as FreeBSD, the HOME directories are found in /usr/home, and in Solaris they are in /export. So quite a big difference here.
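
To explore this layout on your own machine (the listing varies by distribution):

$ ls /
$ man hier     # the manual page that documents this hierarchy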

Install SBT using yum


sbt is an open-source build tool for Scala and Java projects, similar to Java's Maven and Ant.

Its main features are:

  • Native support for compiling Scala code and integrating with many Scala test frameworks
  • Continuous compilation, testing, and deployment
  • Incremental testing and compilation (only changed sources are re-compiled, only affected tests are re-run etc.)
  • Build descriptions written in Scala using a DSL
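
On a yum-based system, the installation at the time of writing looked like the following (the repository URL comes from the sbt documentation of that era and may have changed since):

$ curl https://bintray.com/sbt/rpm/rpm | sudo tee /etc/yum.repos.d/bintray-sbt-rpm.repo
$ sudo yum install sbt
$ sbt sbtVersion     # verify the installation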


Tuesday, June 19, 2018

Accumulators and Broadcast Variables in Spark


SPARK PROCESSING
  • Distributed and parallel processing
  • Each executor gets its own copies of variables and functions
  • No propagation of data back to the driver (except in certain necessary cases)
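
This is where accumulators (write-only counters aggregated on the driver) and broadcast variables (read-only values shipped once to each executor) come in. A minimal sketch in the Spark 1.x shell, matching the examples above (Spark 2.x replaces sc.accumulator with sc.longAccumulator):

scala> val acc = sc.accumulator(0)
scala> sc.parallelize(1 to 100).foreach(x => acc += 1)
scala> acc.value
res0: Int = 100

scala> val factor = sc.broadcast(10)
scala> sc.parallelize(1 to 3).map(_ * factor.value).collect
res1: Array[Int] = Array(10, 20, 30)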

Wednesday, June 13, 2018

Simple demo for spark streaming


USING PYSPARK

1. Log in to your shell and start pyspark
[cloudera@quickstart ~]$ pyspark
Python 2.6.6 (r266:84292, Jul 23 2015, 15:22:56) 


2. Run Netcat
[cloudera@quickstart ~]$  nc -l localhost 2182
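
3. With Netcat listening, a minimal word-count job against that socket might look like the following (a sketch using the classic DStream API; the host and port match the nc command above):

# In the pyspark shell, sc is already available
from pyspark.streaming import StreamingContext

# Create a streaming context with a 5-second batch interval
ssc = StreamingContext(sc, 5)

# Connect to the Netcat server started in step 2
lines = ssc.socketTextStream("localhost", 2182)

# Count words in each batch and print the result
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()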