
Monday, February 4, 2019

SSH to Hortonworks sandbox



1) Download the Hortonworks sandbox image
2) Launch VirtualBox and start the sandbox VM
3) Once the sandbox is up and running, we will see a screen like the one below (it shows the localhost URL and the SSH server details)


[Console screenshot: the HORTONWORKS SANDBOX banner, listing the localhost URL and SSH details]
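
That banner also lists how to connect over SSH. Below is a minimal sketch using the paramiko library; the host, port, user, and password are assumptions based on the VirtualBox defaults usually printed on that screen (SSH forwarded to 127.0.0.1 on port 2222, equivalent to running ssh root@127.0.0.1 -p 2222, with initial root password hadoop), so substitute whatever your console shows.

import paramiko

# Connect to the sandbox over the forwarded SSH port (assumed defaults).
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("127.0.0.1", port=2222, username="root", password="hadoop")

# Run a command to confirm we are inside the sandbox VM.
stdin, stdout, stderr = client.exec_command("hostname -f")
print(stdout.read().decode())
client.close()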

Sunday, April 22, 2018

Overview of Flume



Flume


  • Distributed data collection service
  • Gets streaming event data from different sources
  • Moves large amounts of log data from many different sources to a centralized data store

Note: We cannot use Flume to ingest relational data
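
Flume agents are usually fed by sources such as exec, spooling-directory, or HTTP sources. As an illustration, here is a minimal, hypothetical sketch that posts one event to an agent configured with an HTTP source (its JSON handler expects a list of events, each with headers and a body); the host and the port 44444 are assumptions that must match your agent's configuration.

import json
import urllib.request

# One Flume event: headers plus a string body (JSONHandler format).
events = [{"headers": {"host": "webapp-01"}, "body": "user login at 12:01"}]

req = urllib.request.Request(
    "http://localhost:44444",
    data=json.dumps(events).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # 200 means the source accepted the event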

Tuesday, February 27, 2018

Apache ZooKeeper


APACHE KAFKA DISTRIBUTED ARCHITECTURE

  • At the heart of Apache Kafka we have a cluster, which can consist of hundreds of independent brokers.
  • Closely associated with the Kafka cluster, we have a ZooKeeper ensemble, which provides the brokers within a cluster the metadata they need to operate at scale and reliably. As this metadata is constantly changing, ongoing connectivity and chatter between the cluster members and ZooKeeper is required.
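
We can see this metadata first-hand: each live broker registers itself under /brokers/ids in ZooKeeper. Below is a minimal sketch using the kazoo Python client; the ZooKeeper address is an assumption.

import json
from kazoo.client import KazooClient

# Connect to the ZooKeeper ensemble backing the Kafka cluster (assumed address).
zk = KazooClient(hosts="localhost:2181")
zk.start()

# Each live broker holds an ephemeral znode under /brokers/ids.
for broker_id in zk.get_children("/brokers/ids"):
    data, _stat = zk.get("/brokers/ids/" + broker_id)
    print(broker_id, json.loads(data))  # host, port, endpoints, ...

zk.stop()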


Team formation in Kafka


CONTROLLER ELECTION
  • The hierarchy starts with a controller/supervisor
  • It is a worker node elected by its peers to officiate in the administrative capacity of a controller
  • The worker node selected as controller is the one that has been around the longest

RESPONSIBILITIES OF THE CONTROLLER
  • Maintain an inventory of which workers are available to take on work.
  • Maintain a list of work items that have been committed to and assigned to workers.
  • Maintain the active status of the workers and their progress on assigned tasks.

Overview of Kafka



  • Apache Kafka is a distributed commit log service
  • Functions much like a publish/subscribe messaging system
  • Offers better throughput than traditional messaging systems
  • Has built-in partitioning, replication, and fault tolerance
  • Increasingly popular for log collection and stream processing
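
To make the publish/subscribe behaviour concrete, here is a minimal sketch using the kafka-python client; the broker address and the topic name "logs" are assumptions.

from kafka import KafkaProducer, KafkaConsumer

# Publish one record to the commit log (assumed broker and topic).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("logs", b"app-server-01: request took 42ms")
producer.flush()  # block until the broker acknowledges the write

# Subscribe and replay the topic from the beginning of the log.
consumer = KafkaConsumer(
    "logs",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.offset, message.value)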

Wednesday, February 21, 2018

Types of data store


After collecting the data, we need to store it in a data store. There are different types of data stores.

Types of data store 

In memory: Caches, data structure servers
Database: SQL and NoSQL databases
Search: Search engines
File store: File systems
Queue: Message queues
Stream storage: Pub/sub message queues

Temperature of Big Data



What is data temperature?


  • It is classifying data from hot to cold based on how frequently it is accessed.
  • Hot data is accessed most frequently; cold data is accessed infrequently.

  Hot Data
    • Measurements in large-scale analytic environments consistently indicate that less than 20% of the data accounts for more than 90% of the I/Os. Such data belongs in memory so we can retrieve it very fast.

Types of data


There are 3 types of data 

Transactions


  • These data usually come from mobile apps, web apps, and data centers.
  • They are structured and are received as records from an RDBMS such as MySQL or Oracle.
    • They could also be received from in-memory data structures.
  • We can also receive these data directly from our own data centers.

Tuesday, February 20, 2018

Performance in Hive

Performance in Hive can be improved by

  1. PARTITIONING

  •  Logically breaks up the data
  •  Any time a value that does not match any existing partition is added to the partition column, a new partition is created
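
As a sketch of what this looks like in practice, the following creates a date-partitioned table through a HiveServer2 endpoint using PyHive; the host, port, and table definition are assumptions for illustration.

from pyhive import hive

# Connect to HiveServer2 (assumed host/port).
conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()

# Queries filtering on dt only scan the matching partition directories.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS access_log (ip STRING, url STRING)
    PARTITIONED BY (dt STRING)
""")

# Allow Hive to create partitions on the fly as new dt values arrive.
cursor.execute("SET hive.exec.dynamic.partition.mode=nonstrict")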

Friday, February 9, 2018

Spark Architecture


Steps explained

1) We write the program in Scala, Java, or Python and submit the application to the Spark cluster
2) The Spark cluster is made up of multiple machines
3) One of these machines is assigned as the coordinator (the driver)
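
A minimal PySpark sketch of such an application is shown below; we would hand it to the cluster with spark-submit, and the coordinator (driver) process runs this code while the workers execute the distributed tasks. The application name and sample data are made up for illustration.

from pyspark.sql import SparkSession

# The driver builds the session; spark-submit decides which cluster runs it.
spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.parallelize(["spark cluster", "spark driver"])  # toy input
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)  # runs in parallel on the workers
)
print(counts.collect())  # results come back to the driver
spark.stop()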

Data Representation in RDD


Spark has three data representations: RDD, DataFrame, and Dataset.


  1. RDD (Resilient Distributed Dataset)

    • A collection of elements that can be divided across multiple nodes in a cluster for parallel processing.
    • A fault-tolerant collection of elements, which means it can automatically recover from failures.
    • Immutable: we can create an RDD once but cannot change it.


Common methods in RDD

Creation of an RDD

RDDs can be created in 2 ways (a sketch follows the list):
  1. Read a file: individual rows or records become elements of the RDD
  2. Transform another RDD: applying a transformation (such as map or filter) to an existing RDD produces a new RDD
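
A minimal sketch of both creation paths in PySpark (the file path is an assumption):

from pyspark import SparkContext

sc = SparkContext(appName="rdd-creation-sketch")

# 1) Read a file: each line becomes one element of the RDD.
lines = sc.textFile("hdfs:///data/app.log")

# 2) Transform another RDD: filter() returns a brand-new RDD,
#    leaving the original untouched (RDDs are immutable).
errors = lines.filter(lambda line: "ERROR" in line)
print(errors.count())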

Apache Spark RDD


RDD (RESILIENT DISTRIBUTED DATASETS)

  • The basic programming abstraction in Spark
  • All operations are performed on in-memory objects
  • A collection of entities
  • It can be assigned to a variable, and methods can be invoked on it. Methods either return values or apply transformations to the RDD (see the sketch below)
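
A minimal sketch of that distinction, where transformations return new RDDs and actions return plain values to the driver:

from pyspark import SparkContext

sc = SparkContext(appName="rdd-methods-sketch")
numbers = sc.parallelize([1, 2, 3, 4, 5])

squares = numbers.map(lambda n: n * n)      # transformation -> new RDD
total = squares.reduce(lambda a, b: a + b)  # action -> plain value (55)
print(total)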

Overview of Spark Streaming

How much data does Google deal with?
  • Stores about 15 exabytes (1 EB = 10^18 bytes) of data
  • Processes about 100 petabytes of data per day
  • 60 trillion pages are indexed
  • 1 billion Google Search users per month
Note: Fraud detection is an example of real-time processing

Limitations of MapReduce
  • An entire MapReduce job is a batch-processing job
  • It does not allow real-time processing of the data (see the streaming sketch below)
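
For contrast, here is a minimal Spark Streaming sketch that processes events as they arrive; it assumes text is being pushed to a socket on localhost:9999 (for example with nc -lk 9999), and the DECLINED filter is a toy stand-in for fraud detection.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-sketch")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
alerts = lines.filter(lambda line: "DECLINED" in line)
alerts.pprint()  # print each micro-batch of suspicious events

ssc.start()
ssc.awaitTermination()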

Flavors of Hadoop Distribution


Hortonworks 

  • It is very similar to the Apache Hadoop distribution.
  • We can use Azure Blob Storage as the default DFS. With that, we can start the cluster only when we need compute power.
  • The rest of the time, we can bring data into the storage through REST APIs or SDKs in different languages. Therefore we can create a cluster of the required size only when we want the computation. There is a lot of flexibility, but we lose data locality (which matters mainly in the first map phase).

Wednesday, January 31, 2018

Masking PII data using Hive


Hive table creation

Create a table to import data with comma-separated fields
hive> create table Account(id int,name string,phone string)
row format delimited
fields terminated by ',';

Create a table for the secured account, where the PII column will be masked
hive> create table Accountmasked(id int,name string,phone string)
row format delimited
fields terminated by ',';

Create the contact table
hive> create table contact(id int,accountid int,firstname string,lastName string,
phone string,email string) 
row format delimited 
fields terminated by ','; 
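
With the tables in place, the masking step copies Account into Accountmasked while obfuscating the phone number. A minimal sketch through PyHive is below; the host and port are assumptions, and the XXX-XXX- masking rule is just one possible choice.

from pyhive import hive

cursor = hive.connect(host="localhost", port=10000).cursor()
cursor.execute("""
    INSERT OVERWRITE TABLE Accountmasked
    SELECT id,
           name,
           CONCAT('XXX-XXX-', SUBSTR(phone, -4))  -- keep only the last 4 digits
    FROM Account
""")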


Friday, January 26, 2018

Benefits of YARN (Hadoop version 2.0)


The 5 key Benefits of YARN

  • New applications and services

  • Improved cluster utilization
    • A generic resource container model replaces fixed Map/Reduce slots.
    • Clusters can be shared across multiple applications.

Limitations of Hadoop Version 1


Limitations of Hadoop 1

Scalability 
  • Max cluster size ~5000 nodes
  • Max concurrent tasks ~40,000
  • Coarse-grained synchronization in the JobTracker

YARN Architecture


Hadoop version 2 came with a fundamental change to the architecture. The framework was divided into two parts: MapReduce and YARN.

MapReduce: Responsible for what operations you want to perform on the data

YARN: Yet Another Resource Negotiator
  • Determines and coordinates all the tasks running on all the nodes in the cluster
  • The framework responsible for providing the computational resources (CPU, memory, etc.) needed for application execution
  • Assigns new tasks to nodes based on their existing capacity. If a node has failed and all the processes on that node have stopped, it assigns those tasks to other nodes
  • In short, it is a better resource negotiator than the Hadoop 1 JobTracker

Map Reduce Data Flow


Pre-loaded local input data and mapping
  • MapReduce inputs typically come from input files loaded onto our processing cluster in HDFS. These files are evenly distributed across all the nodes.
  • Running a MapReduce program involves running mapping tasks across all the nodes in our cluster.
  • Each of these mapping tasks is equivalent (no mapper has a particular identity associated with it). Therefore any mapper can process any input file.
  • Each mapper loads the set of files local to that machine and processes them (see the mapper sketch below).
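
A minimal sketch of such an identity-free mapper, written in the Hadoop Streaming style: every node runs this same script over whichever input split happens to be local, reading records from stdin and emitting tab-separated key/value pairs. The word-count logic is just an example.

#!/usr/bin/env python3
import sys

# Emit (word, 1) for every word in every input record.
for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")  # key<TAB>value, one pair per word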