
Monday, February 4, 2019

SSH to Hortonworks sandbox



1) Download the Hortonworks sandbox image
2) Launch VirtualBox and start the sandbox VM
3) Once the sandbox is up and running, we will see a screen like the one below (it shows the localhost URL and the SSH server details)


[Console screenshot: the HORTONWORKS SANDBOX banner, listing the localhost URL and SSH details]
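
That banner also lists how to connect over SSH. Below is a minimal sketch using the paramiko library; the host, port, user, and password are assumptions based on the VirtualBox defaults usually printed on that screen (SSH forwarded to 127.0.0.1 on port 2222, equivalent to running ssh root@127.0.0.1 -p 2222, with initial root password hadoop), so substitute whatever your console shows.

import paramiko

# Connect to the sandbox over the forwarded SSH port (assumed defaults).
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("127.0.0.1", port=2222, username="root", password="hadoop")

# Run a command to confirm we are inside the sandbox VM.
stdin, stdout, stderr = client.exec_command("hostname -f")
print(stdout.read().decode())
client.close()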

Sunday, April 22, 2018

Overview of Flume



Flume


  • Distributed data collection service
  • Gets streaming event data from different sources
  • Moves large amounts of log data from many different sources to a centralized data store

Note: We cannot use Flume to ingest relational data
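
Flume agents are usually fed by sources such as exec, spooling-directory, or HTTP sources. As an illustration, here is a minimal, hypothetical sketch that posts one event to an agent configured with an HTTP source (its JSON handler expects a list of events, each with headers and a body); the host and the port 44444 are assumptions that must match your agent's configuration.

import json
import urllib.request

# One Flume event: headers plus a string body (JSONHandler format).
events = [{"headers": {"host": "webapp-01"}, "body": "user login at 12:01"}]

req = urllib.request.Request(
    "http://localhost:44444",
    data=json.dumps(events).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # 200 means the source accepted the event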

Tuesday, February 27, 2018

Apache ZooKeeper


APACHE KAFKA DISTRIBUTED ARCHITECTURE

  • At the heart of Apache Kafka we have a cluster, which can consist of hundreds of independent brokers.
  • Closely associated with the Kafka cluster, we have a ZooKeeper ensemble, which provides the brokers within a cluster the metadata they need to operate at scale and reliably. As this metadata is constantly changing, ongoing connectivity and chatter between the cluster members and ZooKeeper is required.
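
We can see this metadata first-hand: each live broker registers itself under /brokers/ids in ZooKeeper. Below is a minimal sketch using the kazoo Python client; the ZooKeeper address is an assumption.

import json
from kazoo.client import KazooClient

# Connect to the ZooKeeper ensemble backing the Kafka cluster (assumed address).
zk = KazooClient(hosts="localhost:2181")
zk.start()

# Each live broker holds an ephemeral znode under /brokers/ids.
for broker_id in zk.get_children("/brokers/ids"):
    data, _stat = zk.get("/brokers/ids/" + broker_id)
    print(broker_id, json.loads(data))  # host, port, endpoints, ...

zk.stop()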


Team formation in Kafka


CONTROLLER ELECTION
  • The hierarchy starts with a controller/supervisor
  • It is a worker node elected by its peers to officiate in the administrative capacity of a controller
  • The worker node selected as controller is the one that has been around the longest

RESPONSIBILITIES OF THE CONTROLLER
  • Maintain an inventory of which workers are available to take on work.
  • Maintain a list of work items that have been committed to and assigned to workers.
  • Maintain the active status of the workers and their progress on assigned tasks.

Overview of Kafka



  • Apache Kafka is a distributed commit log service
  • Functions much like a publish/subscribe messaging system
  • Offers better throughput than traditional messaging systems
  • Has built-in partitioning, replication, and fault tolerance
  • Increasingly popular for log collection and stream processing
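
To make the publish/subscribe behaviour concrete, here is a minimal sketch using the kafka-python client; the broker address and the topic name "logs" are assumptions.

from kafka import KafkaProducer, KafkaConsumer

# Publish one record to the commit log (assumed broker and topic).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("logs", b"app-server-01: request took 42ms")
producer.flush()  # block until the broker acknowledges the write

# Subscribe and replay the topic from the beginning of the log.
consumer = KafkaConsumer(
    "logs",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.offset, message.value)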

Wednesday, February 21, 2018

Types of data store


After collecting the data, we need to store it in a data store. There are different types of data stores.

Types of data store 

In memory: Caches, data structure servers
Database: SQL and NoSQL databases
Search: Search engines
File store: File systems
Queue: Message queues
Stream storage: Pub/sub message queues

Temperature of Big Data



What is data temperature?


  • It is classifying data from hot to cold based on how frequently it is accessed.
  • Hot data is accessed most frequently; cold data is accessed infrequently.

  Hot Data
    • Measurements in large-scale analytic environments consistently indicate that less than 20% of the data accounts for more than 90% of the I/Os. Such data belongs in memory so we can retrieve it very fast.

Types of data


There are 3 types of data 

Transactions


  • These data usually come from mobile apps, web apps, and data centers.
  • They are structured and are received as records from an RDBMS such as MySQL or Oracle.
    • They could also be received from in-memory data structures.
  • We can also receive these data directly from our own data centers.

Tuesday, February 20, 2018

Performance in Hive

Performance in Hive can be improved by

  1. PARTITIONING

  •  Logically breaks up the data
  •  Any time a value that does not match any existing partition is added to the partition column, a new partition is created
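
As a sketch of what this looks like in practice, the following creates a date-partitioned table through a HiveServer2 endpoint using PyHive; the host, port, and table definition are assumptions for illustration.

from pyhive import hive

# Connect to HiveServer2 (assumed host/port).
conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()

# Queries filtering on dt only scan the matching partition directories.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS access_log (ip STRING, url STRING)
    PARTITIONED BY (dt STRING)
""")

# Allow Hive to create partitions on the fly as new dt values arrive.
cursor.execute("SET hive.exec.dynamic.partition.mode=nonstrict")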

Friday, February 9, 2018

Spark Architecture


Steps explained

1) We write the program in Scala, Java, or Python and submit the application to the Spark cluster
2) The Spark cluster is made up of multiple machines
3) One of these machines is assigned as the coordinator (the driver)
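
A minimal PySpark sketch of such an application is shown below; we would hand it to the cluster with spark-submit, and the coordinator (driver) process runs this code while the workers execute the distributed tasks. The application name and sample data are made up for illustration.

from pyspark.sql import SparkSession

# The driver builds the session; spark-submit decides which cluster runs it.
spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.parallelize(["spark cluster", "spark driver"])  # toy input
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)  # runs in parallel on the workers
)
print(counts.collect())  # results come back to the driver
spark.stop()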

Data Representation in RDD


Spark has three data representations: RDD, DataFrame, and Dataset.


  1. RDD (Resilient Distributed Dataset)

    • A collection of elements that can be divided across multiple nodes in a cluster for parallel processing.
    • A fault-tolerant collection of elements, which means it can automatically recover from failures.
    • Immutable: we can create an RDD once but cannot change it.


Common methods in RDD

Creation of an RDD

RDDs can be created in 2 ways (a sketch follows the list):
  1. Read a file: individual rows or records become elements of the RDD
  2. Transform another RDD: applying a transformation (such as map or filter) to an existing RDD produces a new RDD
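
A minimal sketch of both creation paths in PySpark (the file path is an assumption):

from pyspark import SparkContext

sc = SparkContext(appName="rdd-creation-sketch")

# 1) Read a file: each line becomes one element of the RDD.
lines = sc.textFile("hdfs:///data/app.log")

# 2) Transform another RDD: filter() returns a brand-new RDD,
#    leaving the original untouched (RDDs are immutable).
errors = lines.filter(lambda line: "ERROR" in line)
print(errors.count())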

Apache Spark RDD


RDD (RESILIENT DISTRIBUTED DATASETS)

  • The basic programming abstraction in Spark
  • All operations are performed on in-memory objects
  • A collection of entities
  • It can be assigned to a variable, and methods can be invoked on it. Methods either return values or apply transformations to the RDD (see the sketch below)
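
A minimal sketch of that distinction, where transformations return new RDDs and actions return plain values to the driver:

from pyspark import SparkContext

sc = SparkContext(appName="rdd-methods-sketch")
numbers = sc.parallelize([1, 2, 3, 4, 5])

squares = numbers.map(lambda n: n * n)      # transformation -> new RDD
total = squares.reduce(lambda a, b: a + b)  # action -> plain value (55)
print(total)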

Overview of Spark Streaming

How much data does Google deal with?
  • Stores about 15 exabytes (1 EB = 10^18 bytes) of data
  • Processes about 100 petabytes of data per day
  • 60 trillion pages are indexed
  • 1 billion Google Search users per month
Note: Fraud detection is an example of real-time processing

Limitations of MapReduce
  • An entire MapReduce job is a batch-processing job
  • It does not allow real-time processing of the data (see the streaming sketch below)
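
For contrast, here is a minimal Spark Streaming sketch that processes events as they arrive; it assumes text is being pushed to a socket on localhost:9999 (for example with nc -lk 9999), and the DECLINED filter is a toy stand-in for fraud detection.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-sketch")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
alerts = lines.filter(lambda line: "DECLINED" in line)
alerts.pprint()  # print each micro-batch of suspicious events

ssc.start()
ssc.awaitTermination()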

Flavors of Hadoop Distribution


Hortonworks 

  • It is very similar to the Apache Hadoop distribution.
  • We can use Azure Blob Storage as the default DFS. With that, we can start the cluster only when we need compute power.
  • The rest of the time, we can bring data into the storage through REST APIs or SDKs in different languages. Therefore we can create a cluster of the required size only when we want the computation. There is a lot of flexibility, but we lose data locality (which matters mainly in the first map phase).

Wednesday, January 31, 2018

Masking PII data using Hive


Hive table creation

Create a table to import data with comma-separated fields
hive> create table Account(id int,name string,phone string)
row format delimited
fields terminated by ',';

Create a table for the secured account, where the PII column will be masked
hive> create table Accountmasked(id int,name string,phone string)
row format delimited
fields terminated by ',';

Create the contact table
hive> create table contact(id int,accountid int,firstname string,lastName string,
phone string,email string) 
row format delimited 
fields terminated by ','; 
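
With the tables in place, the masking step copies Account into Accountmasked while obfuscating the phone number. A minimal sketch through PyHive is below; the host and port are assumptions, and the XXX-XXX- masking rule is just one possible choice.

from pyhive import hive

cursor = hive.connect(host="localhost", port=10000).cursor()
cursor.execute("""
    INSERT OVERWRITE TABLE Accountmasked
    SELECT id,
           name,
           CONCAT('XXX-XXX-', SUBSTR(phone, -4))  -- keep only the last 4 digits
    FROM Account
""")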


Friday, January 26, 2018

Benefits of YARN (Hadoop version 2.0)


The 5 key Benefits of YARN

  • New applications and services

  • Improved cluster utilization
    • A generic resource container model replaces fixed Map/Reduce slots.
    • Clusters can be shared across multiple applications.

Limitations of Hadoop Version 1


Limitations of Hadoop 1

Scalability 
  • Max cluster size ~5000 nodes
  • Max concurrent tasks ~40,000
  • Coarse-grained synchronization in the JobTracker

YARN Architecture


Hadoop version 2 came with a fundamental change to the architecture. The framework was divided into two parts: MapReduce and YARN.

MapReduce: Responsible for what operations you want to perform on the data

YARN: Yet Another Resource Negotiator
  • Determines and coordinates all the tasks running on all the nodes in the cluster
  • The framework responsible for providing the computational resources (CPU, memory, etc.) needed for application execution
  • Assigns new tasks to nodes based on their existing capacity. If a node has failed and all the processes on that node have stopped, it assigns those tasks to other nodes
  • In short, it is a better resource negotiator than the Hadoop 1 JobTracker

Map Reduce Data Flow


Pre-loaded local input data and mapping
  • MapReduce inputs typically come from input files loaded onto our processing cluster in HDFS. These files are evenly distributed across all the nodes.
  • Running a MapReduce program involves running mapping tasks across all the nodes in our cluster.
  • Each of these mapping tasks is equivalent (no mapper has a particular identity associated with it). Therefore any mapper can process any input file.
  • Each mapper loads the set of files local to that machine and processes them (see the mapper sketch below).
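
A minimal sketch of such an identity-free mapper, written in the Hadoop Streaming style: every node runs this same script over whichever input split happens to be local, reading records from stdin and emitting tab-separated key/value pairs. The word-count logic is just an example.

#!/usr/bin/env python3
import sys

# Emit (word, 1) for every word in every input record.
for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")  # key<TAB>value, one pair per word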