Wednesday, January 31, 2018

Masking PII data using Hive


Hive table creation

Create a table for importing CSV data (comma-separated fields)
hive> create table Account(id int,name string,phone string)
row format delimited
fields terminated by ',';

Create a table for the secured account, where the PII column will be masked
hive> create table Accountmasked(id int,name string,phone string)
row format delimited
fields terminated by ',';

Create contact table 
hive> create table contact(id int,accountid int,firstname string,lastName string,
phone string,email string) 
row format delimited 
fields terminated by ','; 


Friday, January 26, 2018

Benefits of YARN (Hadoop version 2.0)


The 5 key Benefits of YARN

  • New Applications and services
       

  • Improved cluster utilization
    • Generic resource container model replaces fixed Map/Reduce slots.
    • Sharing clusters across multiple applications     

Limitations of Hadoop Version 1


Limitations of Hadoop 1

Scalability 
  • Max cluster size ~5000 nodes
  • Max concurrent tasks ~40,000
  • Coarse Synchronization in JobTracker

Yarn Architecture


Hadoop version 2 came with a fundamental change to the architecture. The framework was divided into two parts: MapReduce and YARN.

MapReduce: Responsible for what operations you want to perform on the data

YARN: Yet Another Resource Negotiator
  • Responsible for scheduling and coordinating all the tasks running on all the nodes in the cluster
  • Framework responsible for providing the computational resources (CPU, memory, etc.) needed for application execution
  • Assigns new tasks to nodes based on their available capacity. If a node fails and all the processes on that node stop, YARN assigns the affected tasks to other nodes
  • Negotiates resources better than the fixed Map/Reduce slots of Hadoop 1

Map Reduce Data Flow


Pre loaded local input data and Mapping
  • MapReduce inputs typically come from input files loaded onto our processing cluster in HDFS. These files are evenly distributed across all the nodes.
  • Running a MapReduce program involves running these mapping tasks across all the nodes in our cluster.
  • All of these mapping tasks are equivalent (no mapper has a particular identity associated with it). Therefore any mapper can process any input file.
  • Each mapper loads the set of files local to its machine and processes them.

Features of Hadoop




Parallel Execution
  • Extremely good at high-volume batch processing because of the ability to do parallel processing.
  • Can run about 10 times faster than on a single-threaded server or a mainframe
Data Locality
  • Data is not moved.
  • Data is processed where it resides.
      Note: This is the ideal case; however, it is not always possible to achieve data locality.

SQOOP Cheat Sheet

Sqoop commands:

Connect and list all databases
Ex: sqoop list-databases \
       --connect jdbc:mysql://quickstart.cloudera \
       --username root --password cloudera

 List all tables in a specified database
sqoop list-tables --connect jdbc:mysql://quickstart.cloudera/retail_db \
            --username root --password cloudera

Hive Cheat Sheet



Start the hive shell
hive

Create schema in hive
hive> create schema hiveschema location '/hivedatabase/';

Create table with location
use hiveschema;
create external table employee(emp_id int,emp_name string,emp_phone string)
row format delimited
fields terminated by '\t'
location '/hivedatabase/employee';


Phases in Map Reduce


Map Reduce has 2 phases


  • The input to each function is a set of key-value pairs
  • Map is the Mapper function and Reduce is the Reducer function

Mapper phase 

  • The first phase in the execution of a MapReduce program.
  • Data in each split is passed to a mapping function to produce output values.
  • Several Map tasks are executed
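
To make the two phases concrete, here is a minimal pure-Python sketch of a word count. It only imitates the data flow on one machine; it does not use the Hadoop API, and the names map_phase, shuffle, reduce_phase and the sample splits are assumptions made up for illustration.

```python
from collections import defaultdict

def map_phase(split):
    """Mapper: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]

def shuffle(pairs):
    """Group all emitted values by key, as happens between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single (key, total) pair."""
    return key, sum(values)

# Hypothetical input splits; on a real cluster each would be part of an HDFS file.
splits = ["big data big cluster", "data node data"]

# One map task per split; here their outputs are simply concatenated.
mapped = [pair for split in splits for pair in map_phase(split)]
word_counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(word_counts)
```

Both the mapper and the reducer take and return key-value pairs, which is what lets the framework run several map tasks independently.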


HDFS Rack Awareness


What is rack awareness?


  • In a large Hadoop cluster, to reduce network traffic while reading or writing HDFS files, the NameNode chooses DataNodes on the same rack or a nearby rack to serve the read/write request.
  • The NameNode obtains rack information by maintaining the rack ID of each DataNode. Choosing closer DataNodes based on this rack information is called rack awareness in Hadoop.
  • Rack awareness means knowing the cluster topology, or more specifically how the DataNodes are distributed across the racks of a Hadoop cluster.
  • A default Hadoop installation assumes that all DataNodes belong to the same rack.

MySql Cheat Sheet



Login to MySQL
$ mysql -u root -p
Password: cloudera
(the password is not displayed as we type, but we still have to type it)

Show databases
mysql> show databases;

Use Database
mysql> use <database name>;
Ex: mysql> use retail_db;

Exit from mysql
mysql> quit;

Working with Database

Create database
create database [IF NOT EXISTS] testdatabase;

Drop database
drop database [IF EXISTS] testdatabase;

Create table or temporary table

CREATE [TEMPORARY] TABLE [IF NOT EXISTS] table(
   key type(size) NOT NULL PRIMARY KEY AUTO_INCREMENT,
   c1 type(size) NOT NULL,
   c2 type(size) NULL,
   ...
);


Permanent User Defined Function (UDF) in Hive



  • We can create a temporary function in Hive and use it in our queries; however, it is valid only while the current session is alive.
  • If we want the function to be permanent, we need to create a permanent function.


Steps involved to create a permanent function in Hive

Thursday, January 18, 2018

5 Daemons of Hadoop




Hadoop comprises five separate daemons. Each of these daemons runs in its own JVM.


Daemons run on Master nodes:

NameNode – This daemon stores and maintains the metadata for HDFS.
Secondary NameNode – Performs housekeeping functions for the NameNode.
JobTracker – Manages MapReduce jobs and distributes individual tasks to machines running the TaskTracker.

Hadoop eco system




  • The Apache Hadoop project actively supports multiple projects intended to extend Hadoop’s capabilities and make it easier to use. 
  • There are several top-level projects to create development tools as well as for managing Hadoop data flow and processing

Data Ingestion

Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data.
Kafka: A messaging broker that is often used in place of traditional brokers in the Hadoop environment because it is designed for higher throughput and provides replication and greater fault tolerance.
Sqoop: A tool designed to transfer data between Hadoop and relational database servers such as MySQL or Oracle.

Using filters in Python



filter() keeps the elements of a sequence for which a function returns true. In Python 3 it returns an iterator, so wrap it in list() to get a list. Here is a short example:

number_list = range(-5, 5)
less_than_zero = list(filter(lambda x: x < 0, number_list))
print(less_than_zero)


List comprehension in Python

  • List comprehension is an elegant way to define and create lists in Python.
  • These lists often have the qualities of sets, but are not sets in all cases.
  • A list comprehension can replace most uses of map() and filter() with a lambda function. reduce() has no direct list-comprehension equivalent, because it collapses a sequence into a single value.
  • The syntax of list comprehension is easier to grasp.
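
As a small illustration, the sketch below writes the same filtering and mapping once with filter()/map() plus lambda and once as list comprehensions. The variable names are chosen only for this example.

```python
number_list = range(-5, 5)

# filter() with a lambda, and the equivalent list comprehension
negatives_filter = list(filter(lambda x: x < 0, number_list))
negatives_comp = [x for x in number_list if x < 0]

# map() with a lambda, and the equivalent list comprehension
squares_map = list(map(lambda x: x * x, number_list))
squares_comp = [x * x for x in number_list]

print(negatives_comp)
print(squares_comp)
```

Both forms give the same lists; the comprehension keeps the condition and the expression in one readable line.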

Logical operators in Python


Python supports the following logical operators:
Operator | Description | Example
and (Logical AND) | If both operands are true, the condition becomes true. | (a and b) is true.
or (Logical OR) | If either of the two operands is non-zero, the condition becomes true. | (a or b) is true.
not (Logical NOT) | Reverses the logical state of its operand. | not (a and b) is false.
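
A quick way to check these rows is to run them. The sketch below assumes the sample values a = 10 and b = 20 (any non-zero numbers behave the same way) and also shows that and/or return one of their operands rather than a strict True or False.

```python
a, b = 10, 20  # assumed sample values; any non-zero number counts as true

both_true = bool(a and b)      # both operands are true
either_true = bool(a or 0)     # at least one operand is true
negated = not (a and b)        # reverses the result of (a and b)

# and/or return one of their operands, not necessarily a bool
last_operand = a and b         # second operand, because the first is true
first_truthy = 0 or b          # first operand that is true

print(both_true, either_true, negated)
print(last_operand, first_truthy)
```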

Conditional statements in Python


The if-then construct (sometimes called if-then-else) is common across many programming languages, but the syntax varies from language to language.

The general form of the if statement in Python looks like this:

if condition_1:
    statement_block_1
elif condition_2:
    statement_block_2
else:
    statement_block_3
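
As a concrete instance of this general form, here is a minimal sketch; the function name classify is made up for the example.

```python
def classify(n):
    """Return a label describing the sign of n."""
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    else:
        return "positive"

for value in (-3, 0, 7):
    print(value, "is", classify(value))
```

Only the first branch whose condition is true runs; the else block runs when none of the conditions match.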

Comparison operators in Python


These operators compare the values on either side of them and decide the relation between them. They are also called relational operators.
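
The sketch below tries each comparison operator on two assumed sample values, a = 10 and b = 20, and prints the result of every expression.

```python
a, b = 10, 20  # assumed sample values

comparisons = {
    "a == b": a == b,
    "a != b": a != b,
    "a > b": a > b,
    "a < b": a < b,
    "a >= b": a >= b,
    "a <= b": a <= b,
}
for expression, result in comparisons.items():
    print(expression, "->", result)

# Comparisons can also be chained
in_range = 0 < a < b
print("0 < a < b ->", in_range)
```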

Loops in Python



  • There are two types of loops in Python, for and while. 
  •  For loops iterate over a given sequence. 

Note: For loops can iterate over a sequence of numbers using the range() function.
Here is an example:
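
A minimal sketch of both loop types follows; the variable names are chosen only for this illustration.

```python
# for loop over the numbers 0 to 4 produced by range(5)
squares = []
for i in range(5):
    squares.append(i * i)
print(squares)

# while loop that repeats as long as its condition is true
countdown = 3
while countdown > 0:
    print("countdown:", countdown)
    countdown -= 1
```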

Wednesday, January 17, 2018

Sets in Python



  • A collection of unique elements
  • Written with curly braces { } like a dictionary, but holds single values rather than key: value pairs
  • Duplicates are discarded and not added to the collection
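
A short sketch of this behaviour, with sample values made up for the example:

```python
# The duplicate "apple" in the literal is discarded
fruits = {"apple", "banana", "apple"}

fruits.add("banana")   # already present, so the set does not change
fruits.add("cherry")   # new element, so it is added
print(fruits)

# An empty set needs set(); {} on its own creates an empty dictionary
empty_set = set()
empty_dict = {}
```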

Tuples in Python


  • A tuple is an immutable sequence of objects
  • Tuples do not support item assignment
  • They are created using parentheses ()
Example
tup1 = ('physics', 'chemistry', 1997, 2000)
tup2 = (1, 2, 3, 4, 5 )
tup3 = "a", "b", "c", "d"
tup4 = ()  # empty tuple
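
To see what "do not support item assignment" means in practice, here is a minimal sketch; the names tup and single are chosen for the example.

```python
tup = ('physics', 'chemistry', 1997, 2000)
print(tup[0])      # reading an item by index works
print(tup[1:3])    # slicing works and returns a new tuple

# Assigning to an item raises TypeError because tuples are immutable
try:
    tup[0] = 'biology'
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)

# A one-element tuple needs a trailing comma
single = (42,)
```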

Dictionary in Python


  • Created using curly braces { }
  • Holds key-value pairs
  • Each key is separated from its value by a colon (:), the items are separated by commas, and the whole thing is enclosed in curly braces
  • Not indexed by position; since Python 3.7, items keep their insertion order
Accessing Values in Dictionary
To access dictionary elements, you can use the familiar square brackets along with the key to obtain its value. 
#!/usr/bin/python3

# Named student rather than dict, so the built-in dict type is not shadowed
student = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}
print("student['Name']: ", student['Name'])
print("student['Age']: ", student['Age'])

Lists in Python



Python Lists

  • The list is the most versatile datatype available in Python, which can be written as a list of comma-separated values (items) between square brackets. 
  • The items in a list need not be of the same type.
List is a collection which is ordered and changeable. Allows duplicate members.
Tuple is a collection which is ordered and unchangeable. Allows duplicate members.
Set is a collection which is unordered and unindexed. No duplicate members.
Dictionary is a collection which is changeable and indexed by key (insertion-ordered since Python 3.7). No duplicate keys.

Creating a list is as simple as putting different comma-separated values between square brackets. For example:
list1 = ['physics', 'chemistry', 1997, 2000]
list2 = [1, 2, 3, 4, 5]
list3 = ["a", "b", "c", "d"]

Arithmetic operators in Python



(Examples assume a = 10 and b = 21.)
Operator | Description | Example
+ Addition | Adds values on either side of the operator. | a + b = 31
- Subtraction | Subtracts the right-hand operand from the left-hand operand. | a - b = -11
* Multiplication | Multiplies values on either side of the operator. | a * b = 210
/ Division | Divides the left-hand operand by the right-hand operand. | b / a = 2.1
% Modulus | Divides the left-hand operand by the right-hand operand and returns the remainder. | b % a = 1
** Exponent | Performs exponential (power) calculation on the operands. | a**b = 10 to the power 21
// Floor Division | Division in which the digits after the decimal point are removed from the quotient. If one of the operands is negative, the result is floored, i.e., rounded away from zero (towards negative infinity). | 9//2 = 4, 9.0//2.0 = 4.0, -11//3 = -4, -11.0//3 = -4.0
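
These operators can be checked directly in Python. The sketch below assumes a = 10 and b = 21, the values that the sums, differences and products in the table imply.

```python
a, b = 10, 21  # values implied by the examples in the table

print("a + b  =", a + b)
print("a - b  =", a - b)
print("a * b  =", a * b)
print("b / a  =", b / a)    # true division always returns a float
print("b % a  =", b % a)
print("a ** b =", a ** b)

# Floor division rounds towards negative infinity
print("9 // 2     =", 9 // 2)
print("9.0 // 2.0 =", 9.0 // 2.0)
print("-11 // 3   =", -11 // 3)
print("-11.0 // 3 =", -11.0 // 3)
```

Note that floor division keeps the float type when either operand is a float.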

Methods for list object (pop, append..) in Python



list.append(x) Add an item to the end of the list; equivalent to a[len(a):] = [x].

list.extend(L) Extend the list by appending all the items in the given list; equivalent to a[len(a):] = L. 

list.insert(i, x) Insert an item at a given position. The first argument is the index of the element before which to insert, so a.insert(0, x) inserts at the front of the list, and a.insert(len(a), x) is equivalent to a.append(x).

list.remove(x) Remove the first item from the list whose value is x. It is an error if there is no such item.
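
The methods above, plus pop() from the title, can be tried on a small list. This is only a sketch; the list a and its values are made up for the example.

```python
a = [1, 2, 3]

a.append(4)         # add one item at the end
a.extend([5, 6])    # append every item of another list
a.insert(0, 0)      # insert the value 0 at index 0 (the front)
a.remove(3)         # remove the first item whose value is 3
print(a)

# remove() raises ValueError when the value is not in the list
try:
    a.remove(99)
    missing = False
except ValueError:
    missing = True

last = a.pop()      # remove and return the last item
print(last, a)
```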

Overview of Python

  • Is dynamically typed
  • Uses duck typing
  • General-purpose language that can be used in almost any domain or environment
  • Interpreted language
  • Clear, readable, and expressive






Lambda and map in Python


Lambda:
Takes a number of parameters and an expression combining these parameters, and creates an anonymous function that returns the value of the expression:

Ex:
adder = lambda x, y: x+y

print_assign = lambda name, value: name + '=' + str(value)

Map:
map() is a function that takes two arguments: a function and an iterable.
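
A small sketch that calls the two lambdas defined above and then uses them with map(). In Python 3, map() returns an iterator, so the results are wrapped in list(); the sample numbers are made up for the example.

```python
adder = lambda x, y: x + y
print_assign = lambda name, value: name + '=' + str(value)

print(adder(2, 3))
print(print_assign('x', 1))

# map() applies the function to every item of the iterable
doubled = list(map(lambda x: x * 2, [1, 2, 3]))

# With a two-parameter function, map() can take two iterables and pairs their items
totals = list(map(adder, [1, 2, 3], [10, 20, 30]))

print(doubled)
print(totals)
```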

Friday, January 12, 2018

7 v's of big data


Big data can be defined with 7 V's

Volume 

Volume is how much data we have – what used to be measured in Gigabytes is now measured in Zettabytes (ZB) or even Yottabytes (YB)

Thursday, January 11, 2018

Create virtual machine instance


  1.  Specify the location for the virtual box

Quick review of machine learning algorithms

These are some of the important machine learning algorithms

Decision tree

  • Belongs to the family of supervised learning algorithms.
  • Can be used for solving both regression and classification problems.
  • The general motive of using a decision tree is to create a training model that can be used to predict the class or value of target variables by learning decision rules inferred from prior data (training data).
       Ex: A banker deciding whether to grant a loan.

Stack exchange dump


We can easily get the Stack Exchange data dump from https://archive.org/download/stackexchange
The dump files can be very large.


We might also need 7-Zip (7z) to extract the downloaded files.