Wednesday, January 31, 2018

Masking PII data using Hive

Hive table creation

Create a table to import the CSV data
hive> create table Account(id int,name string,phone string)
row format delimited
fields terminated by ',';

Create a table for the secured account data, where the PII column would be masked
hive> create table Accountmasked(id int,name string,phone string)
row format delimited
fields terminated by ',';

Create contact table 
hive> create table contact(id int,accountid int,firstname string,lastName string,
phone string,email string) 
row format delimited 
fields terminated by ','; 
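The tables above only stage the data; the masking itself would be applied by the query that populates Accountmasked (for example, an INSERT ... SELECT that rewrites the phone column). As a sketch of that masking logic, here it is in Python (the function name mask_phone and the X-for-digit scheme are just assumptions for illustration):

```python
import re

def mask_phone(phone):
    """Mask every digit of a phone number except the last four."""
    digits = re.sub(r'\D', '', phone)              # strip formatting, keep digits only
    return 'X' * (len(digits) - 4) + digits[-4:]   # e.g. XXXXXX1234

print(mask_phone('408-555-1234'))   # XXXXXX1234
```

The same rule could be expressed in HiveQL with string functions when inserting into the masked table.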

Friday, January 26, 2018

Benefits of YARN ( Hadoop version 2.0 )

The 5 key Benefits of YARN

  • New Applications and services

  • Improved cluster utilization
    • Generic resource container model replaces fixed Map/Reduce slots.
    • Sharing clusters across multiple applications     

Limitations of Hadoop Version 1

  • Max cluster size ~5000 nodes
  • Max concurrent tasks ~40,000
  • Coarse Synchronization in JobTracker

Yarn Architecture

Hadoop version 2 came with a fundamental change to the architecture. The framework was divided into two parts: MapReduce and YARN.

MapReduce: Responsible for what operations you want to perform on the data

YARN: Yet Another Resource Negotiator
  • Determines and is responsible for coordinating all the tasks running on all the nodes in the cluster
  • The framework responsible for providing the computational resources (CPU, memory, etc.) needed for application execution
  • Assigns new tasks to nodes based on their existing capacity. If a node has failed and all the processes on that node have stopped, it assigns those tasks to a different node
  • It is a better resource negotiator

Map Reduce Data Flow

Pre loaded local input data and Mapping
  • MapReduce inputs typically come from input files loaded onto our processing cluster in HDFS. These files are evenly distributed across all the nodes.
  • Running a MapReduce program involves running mapping tasks across all the nodes in our cluster.
  • Each of these mapping tasks is equivalent (no mapper has a particular identity associated with it). Therefore any mapper can process any input file.
  • Each mapper loads the set of files local to that machine and processes them.

Features of Hadoop

Parallel Execution
  • Extremely good at high-volume batch processing because of its ability to do parallel processing.
  • Can perform many times faster than a single-threaded server or a mainframe
Data Locality
  • Data is not moved.
  • Processing happens where the data resides
      Note: This is the ideal choice; however, it might not always be possible to achieve data locality

SQOOP Cheat Sheet

Sqoop commands:

Connect and list all databases
Ex: sqoop list-databases \
       >--connect jdbc:mysql://quickstart.cloudera \
       >--username root --password cloudera

 List all tables in a specified database
sqoop list-tables --connect jdbc:mysql://quickstart.cloudera/retail_db  
            --username root --password cloudera

Hive Cheat Sheet

Start the hive shell
$ hive

Create schema in hive
hive> create schema hiveschema location '/hivedatabase/';

Create table with location
use hiveschema;
create external table employee(emp_id int,emp_name string,emp_phone string)
row format delimited
fields terminated by '\t'
location '/hivedatabase/employee';

Phases in Map Reduce

Map Reduce has 2 phases

  • The input to each function is a set of key-value pairs
  • Map is the Mapper function and Reduce is the Reducer function

Mapper phase 

  • First phase in the execution of map-reduce program.
  • Data in each split is passed to a mapping function to produce output values.
  • Several Map tasks are executed
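The two phases can be sketched as a toy word count in plain Python (the function names mapper, shuffle and reducer are illustrative, not part of any Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (key, value) pair for every word in the split
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle/sort: group the intermediate pairs by key
    pairs.sort(key=itemgetter(0))
    return groupby(pairs, key=itemgetter(0))

def reducer(key, group):
    # Reduce phase: aggregate the values for one key
    return key, sum(v for _, v in group)

splits = ["big data big", "data lake"]                 # one line per input split
intermediate = [p for line in splits for p in mapper(line)]
counts = dict(reducer(k, g) for k, g in shuffle(intermediate))
print(counts)   # {'big': 2, 'data': 2, 'lake': 1}
```

In a real cluster each mapper runs on the node holding its split, and the shuffle moves data between nodes.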

HDFS Rack Awareness

What is rack awareness?

  • In a large Hadoop cluster, in order to reduce network traffic while reading/writing HDFS files, the namenode chooses a datanode that is on the same rack or a nearby rack to serve the read/write request.
  • The namenode obtains rack information by maintaining the rack ids of each datanode. This concept of choosing closer datanodes based on rack information is called Rack Awareness in Hadoop.
  • Rack awareness is having the knowledge of Cluster topology or more specifically how the different data nodes are distributed across the racks of a Hadoop cluster. 
  • Default Hadoop installation assumes that all data nodes belong to the same rack.

MySql Cheat Sheet

Login to MySQL
$ mysql -u root -p
Password: cloudera
(the password is not echoed to the screen, but we still have to type it)

Show databases;
mysql> show databases;

Use Database
mysql> use <database name>; Ex: mysql> use retail_db;

Exit from mysql
mysql> quit;

Working with Database

Create database
create database [IF NOT EXISTS] testdatabase;

Drop database
drop database [IF EXISTS] testdatabase;

Create table or temporary table

create [temporary] table tablename (
   c1 type(size) NOT NULL,
   c2 type(size) NULL,
   ...
);

Permanent User Defined Function (UDF) in Hive

  • We can create a temporary function in Hive and use the function in our query; however, this is only valid while the current session is alive.
  • If we want the function to be permanent we need to create a permanent function.

Steps involved to create a permanent function in Hive

  • Write the UDF class in Java and package it as a jar
  • Copy the jar to a location on HDFS
  • Register it with: CREATE FUNCTION <function name> AS '<class name>' USING JAR 'hdfs:///<path to jar>';

Thursday, January 18, 2018

5 Daemons of Hadoop

Hadoop is comprised of five separate daemons. Each of these daemons runs in its own JVM.

Daemons run on Master nodes:

NameNode – This daemon stores and maintains the metadata for HDFS.
Secondary NameNode – Performs housekeeping functions for the NameNode.
JobTracker – Manages MapReduce jobs, distributes individual tasks to machines running the TaskTracker.

Daemons run on Slave nodes:

DataNode – Stores the actual HDFS data blocks and serves read/write requests.
TaskTracker – Runs the individual Map and Reduce tasks assigned by the JobTracker.

Hadoop eco system

  • The Apache Hadoop project actively supports multiple projects intended to extend Hadoop’s capabilities and make it easier to use. 
  • There are several top-level projects to create development tools as well as for managing Hadoop data flow and processing

Data Ingestion

Flume  :A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data.
Kafka  :A messaging broker that is often used in place of traditional brokers in the Hadoop environment because it is designed for higher throughput and provides replication and greater fault tolerance.
SQOOP : Is a tool designed to transfer data between Hadoop and relational database servers like  MySQL or Oracle

Using filters in Python

Creates a list of elements for which a function returns true. Here is a short and concise example:

number_list = range(-5, 5)
less_than_zero = list(filter(lambda x: x < 0, number_list))
print(less_than_zero)  # [-5, -4, -3, -2, -1]

List comprehension in Python

  • List comprehension is an elegant way to define and create lists in Python.
  • These lists often have the qualities of sets, but are not in all cases sets.
  • List comprehension can be a complete substitute for the lambda function as well as the functions map(), filter() and reduce().
  • The syntax of list comprehension is easier to grasp.
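For instance, the filter example from the earlier post can be rewritten as a comprehension:

```python
number_list = range(-5, 5)

# Equivalent of list(filter(lambda x: x < 0, number_list))
less_than_zero = [x for x in number_list if x < 0]
print(less_than_zero)  # [-5, -4, -3, -2, -1]

# Map and filter combined: square only the negative numbers
squares = [x ** 2 for x in number_list if x < 0]
print(squares)  # [25, 16, 9, 4, 1]
```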

Logical operators in Python

There are the following logical operators supported by the Python language (assume a and b are true):

and (Logical AND): If both the operands are true then the condition becomes true. (a and b) is true.
or (Logical OR): If any of the two operands are non-zero then the condition becomes true. (a or b) is true.
not (Logical NOT): Used to reverse the logical state of its operand. not(a and b) is false.
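A quick check of these operators (using a = True and b = False here so both outcomes show up):

```python
a, b = True, False

print(a and b)  # False - both operands must be true
print(a or b)   # True  - at least one operand is true
print(not a)    # False - reverses the logical state

# Non-boolean operands: any non-zero number is treated as true
print(bool(3 and 0))  # False
print(bool(3 or 0))   # True
```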

Conditional statements in Python

The if-then construct (sometimes called if-then-else) is common across many programming languages, but the syntax varies from language to language.

The general form of the if statement in Python looks like this:

if condition_1:
    statement_block_1
elif condition_2:
    statement_block_2
else:
    statement_block_3
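A runnable example (the grade function and its thresholds are made up for the demo):

```python
def grade(score):
    # Only the first branch whose condition is true runs
    if score >= 90:
        return 'A'
    elif score >= 75:
        return 'B'
    else:
        return 'C'

print(grade(95))  # A
print(grade(80))  # B
print(grade(50))  # C
```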

Comparison operators in Python

Python Comparison Operators: these operators compare the values on either side of them and decide the relation among them. They are also called relational operators.
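Each comparison evaluates to a boolean (using a = 10 and b = 21 as sample values):

```python
a, b = 10, 21

print(a == b)  # False - equal
print(a != b)  # True  - not equal
print(a < b)   # True  - less than
print(a > b)   # False - greater than
print(a <= b)  # True  - less than or equal
print(a >= b)  # False - greater than or equal
```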

Loops in Python

  • There are two types of loops in Python: for and while.
  • For loops iterate over a given sequence.

Note: For loops can iterate over a sequence of numbers using the "range" function
Here is an example:
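Both loop types in a small demo (the names fruit and count are just for illustration):

```python
# A for loop iterates over a given sequence
for fruit in ['apple', 'banana', 'cherry']:
    print(fruit)

# A for loop over a sequence of numbers using range
for i in range(3):
    print(i)       # prints 0, 1, 2

# A while loop repeats as long as its condition holds
count = 0
while count < 3:
    count += 1
print(count)       # 3
```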

Wednesday, January 17, 2018

Sets in Python

  • A collection of unique elements
  • Looks similar to a dictionary with the curly braces { }, but holds single values rather than key:value pairs
  • Duplicates are discarded and not added to the collection
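A small demo of those points (the set contents are arbitrary):

```python
# Curly braces like a dictionary, but no key:value pairs
langs = {'python', 'java', 'python', 'scala'}

# The duplicate 'python' was discarded
print(len(langs))        # 3
print('java' in langs)   # True

langs.add('hive')
langs.add('java')        # already present, nothing happens
print(len(langs))        # 4
```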

Tuples in Python

  • Tuples are sequences of immutable objects
  • They do not support item assignment
  • They are created using ()
tup1 = ('physics', 'chemistry', 1997, 2000)
tup2 = (1, 2, 3, 4, 5)
tup3 = "a", "b", "c", "d"
tup1 = ()   # empty tuple
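The "no item assignment" point can be seen directly: reading by index works, but writing raises a TypeError.

```python
tup = ('physics', 'chemistry', 1997, 2000)

# Reading by index works like a list
print(tup[0])    # physics

# Item assignment is not supported: tuples are immutable
try:
    tup[0] = 'maths'
except TypeError:
    print('cannot modify a tuple')
```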

Dictionary in Python

  • Created using curly braces
  • Holds key-value pairs
  • Each key is separated from its value by a colon (:), the items are separated by commas, and the whole thing is enclosed in curly braces
  • Items do not have any order
Accessing Values in Dictionary
To access dictionary elements, you can use the familiar square brackets along with the key to obtain its value. 

info = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}
print("info['Name']: ", info['Name'])
print("info['Age']: ", info['Age'])

Lists in Python

Python Lists

  • The list is the most versatile datatype available in Python, which can be written as a list of comma-separated values (items) between square brackets. 
  • The items in a list need not be of the same type.
List is a collection which is ordered and changeable. Allows duplicate members.
Tuple is a collection which is ordered and unchangeable. Allows duplicate members.
Set is a collection which is unordered and unindexed. No duplicate members.
Dictionary is a collection which is unordered, changeable and indexed. No duplicate members.

Creating a list is as simple as putting different comma-separated values between square brackets. For example −
list1 = ['physics', 'chemistry', 1997, 2000]
list2 = [1, 2, 3, 4, 5]
list3 = ["a", "b", "c", "d"]

Arithmetic operators in Python

(assume a = 10 and b = 21)

+ (Addition): Adds values on either side of the operator. a + b = 31
- (Subtraction): Subtracts the right hand operand from the left hand operand. a - b = -11
* (Multiplication): Multiplies values on either side of the operator. a * b = 210
/ (Division): Divides the left hand operand by the right hand operand. b / a = 2.1
% (Modulus): Divides the left hand operand by the right hand operand and returns the remainder. b % a = 1
** (Exponent): Performs exponential (power) calculation on the operands. a**b = 10 to the power 21
// (Floor Division): The division of operands where the digits after the decimal point are removed. But if one of the operands is negative, the result is floored, i.e., rounded towards negative infinity: 9//2 = 4 and 9.0//2.0 = 4.0, -11//3 = -4, -11.0//3.0 = -4.0
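The same values, checked in the interpreter:

```python
a, b = 10, 21

print(a + b)     # 31
print(a - b)     # -11
print(a * b)     # 210
print(b / a)     # 2.1
print(b % a)     # 1
print(a ** 2)    # 100
print(9 // 2)    # 4
print(-11 // 3)  # -4  (floored towards negative infinity)
```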

Methods for list object (pop, append..) in Python

list.append(x) Add an item to the end of the list; equivalent to a[len(a):] = [x].

list.extend(L) Extend the list by appending all the items in the given list; equivalent to a[len(a):] = L. 

list.insert(i, x) Insert an item at a given position. The first argument is the index of the element before which to insert, so a.insert(0, x) inserts at the front of the list, and a.insert(len(a), x) is equivalent to a.append(x).

list.remove(x) Remove the first item from the list whose value is x. It is an error if there is no such item.
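The methods above, plus pop() from the heading, in one short session (the list contents are arbitrary):

```python
a = [1, 2]
a.append(3)          # [1, 2, 3]
a.extend([4, 5])     # [1, 2, 3, 4, 5]
a.insert(0, 0)       # [0, 1, 2, 3, 4, 5]
a.remove(5)          # removes the first item whose value is 5
last = a.pop()       # removes and returns the last item
print(a)             # [0, 1, 2, 3]
print(last)          # 4
```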

Overview of Python

  • Is dynamically typed
  • Uses duck typing
  • General purpose language that can be used in any particular domain or environment
  • Interpreted language
  • Clear, readable and expressive
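A quick illustration of duck typing (the classes and function here are made up for the example): what matters is the behaviour an object supports, not its declared type.

```python
class Duck:
    def speak(self):
        return 'quack'

class Dog:
    def speak(self):
        return 'woof'

def make_it_speak(animal):
    # No type check - anything with a speak() method works
    return animal.speak()

print(make_it_speak(Duck()))  # quack
print(make_it_speak(Dog()))   # woof
```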

Lambda and map in Python

Lambda takes a number of parameters and an expression combining these parameters, and creates an anonymous function that returns the value of the expression:

Ex:
adder = lambda x, y: x + y

print_assign = lambda name, value: name + '=' + str(value)

map() is a function which takes two arguments: a function and an iterable. It applies the function to every element of the iterable.
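Putting the two together (the sample lists are arbitrary):

```python
adder = lambda x, y: x + y
print(adder(2, 3))  # 5

# map() applies a function to every element of an iterable
squared = list(map(lambda x: x ** 2, [1, 2, 3, 4]))
print(squared)  # [1, 4, 9, 16]

# With two iterables, the function receives one element from each
sums = list(map(adder, [1, 2, 3], [10, 20, 30]))
print(sums)  # [11, 22, 33]
```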

Friday, January 12, 2018

7 v's of big data

Big data can be defined with 7 V's


Volume is how much data we have – what used to be measured in Gigabytes is now measured in Zettabytes (ZB) or even Yottabytes (YB)

Thursday, January 11, 2018

Create virtual machine instance

  1.  Specify the location for the virtual box

Quick review of machine learning algorithms

These are some of the important machine learning algorithms

Decision tree

  • Belongs to the family of supervised learning algorithms.
  • Can be used for solving both regression and classification problems.
  • The general motive of using a Decision Tree is to create a training model which can be used to predict the class or value of target variables by learning decision rules inferred from prior data (training data)
       Ex: A banker deciding whether to grant a loan.

Stack exchange dump

We can easily get the Stack Exchange dump from
The dump data might be too large.

While downloading, we might also need a 7z extractor, since the dump files are distributed as .7z archives