Tuesday, February 27, 2018
Apache ZooKeeper
- At the heart of Apache Kafka we have a cluster, which can consist of hundreds of independent brokers.
- Closely associated with the Kafka cluster is a ZooKeeper environment, which provides the brokers in the cluster with the metadata they need to operate at scale and reliably. Because this metadata is constantly changing, continuous connectivity and chatter between the cluster members and ZooKeeper is required.
Team formation in Kafka
CONTROLLER ELECTION
- The hierarchy starts with a controller/supervisor.
- It is a worker node elected by its peers to act in the administrative capacity of a controller.
- The worker node selected as controller is the one that has been around the longest (a minimal election sketch follows below).
RESPONSIBILITIES OF THE CONTROLLER
- Maintain an inventory of which workers are available to take on work.
- Maintain a list of work items that have been committed to and assigned to workers.
- Maintain the active status of the workers and their progress on assigned tasks.
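With ZooKeeper, Kafka brokers elect the controller by racing to create an ephemeral /controller znode; when the controller's session dies the znode disappears and a new election takes place. Below is a minimal sketch of that idea using the kazoo Python client; the ZooKeeper address and broker id are placeholders rather than values from the post.

# Sketch of ZooKeeper-based controller election with the kazoo client (assumed library).
from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts="127.0.0.1:2181")   # placeholder ZooKeeper address
zk.start()

broker_id = "broker-1"                     # placeholder broker id
try:
    # The first broker to create this ephemeral node wins the election.
    # Ephemeral nodes vanish when the session ends, which triggers a re-election.
    zk.create("/controller", broker_id.encode(), ephemeral=True)
    print(f"{broker_id} is now the controller")
except NodeExistsError:
    current, _ = zk.get("/controller")
    print(f"Controller already elected: {current.decode()}")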
Wednesday, February 21, 2018
Types of data store
After collecting the data, we need to store it in a data store. There are different types of data stores.
Types of data store
- In memory: caches, data structure servers
- Database: SQL & NoSQL databases
- Search: search engines
- File store: file systems
- Queue: message queues
- Stream storage: pub/sub message queues
Temperature of Big Data
What is data temperature?
- It’s classifying data from hot to cold based on how frequently it is accessed.
- Hot data is accessed most frequently and cold data is accessed infrequently.
- Measurements in large-scale analytic environments consistently indicate that less than 20% of the data accounts for more than 90% of the I/Os. Such hot data belongs in memory so it can be retrieved very quickly.
Types of data
There are 3 types of data
Transactions
- These data usually come from mobile apps, web apps, and data centers.
- They are structured and are received as records from an RDBMS such as MySQL or Oracle.
- They could also be received from in-memory data structures.
- We can also receive these data from our own data centers.
Friday, February 9, 2018
Data Representation in RDD
Spark has three data representations:
- RDD (Resilient Distributed Dataset)
- A collection of elements that can be divided across multiple nodes in a cluster for parallel processing.
- It is also a fault-tolerant collection of elements, which means it can automatically recover from failures.
- It is immutable: we can create an RDD once but cannot change it (see the sketch after this list).
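A minimal PySpark sketch, assuming a local Spark installation (the app name, numbers, and partition count are made up), illustrating the points above: the collection is split into partitions, and transformations return a new RDD instead of modifying the original.

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-example")

# Create an RDD from a Python collection, divided into 4 partitions.
numbers = sc.parallelize(range(1, 11), numSlices=4)

# Transformations never modify an existing RDD; they produce a new one,
# and lineage lets Spark recompute lost partitions after a failure.
squares = numbers.map(lambda x: x * x)

print(numbers.collect())   # [1, 2, ..., 10] - the original RDD is unchanged
print(squares.collect())   # [1, 4, ..., 100]
sc.stop()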
Overview of Spark streaming
How much data does Google deal with?
- Stores about 15 exabytes (1 exabyte = 10^18 bytes) of data
- Processes about 100 petabytes of data per day
- Has about 60 trillion pages indexed
- Serves about 1 billion Google Search users per month
Limitations of MapReduce
- An entire MapReduce job is a batch-processing job.
- It does not allow real-time processing of the data (a streaming sketch follows below).
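Spark Streaming addresses this by processing data in small micro-batches. Here is a minimal word-count sketch, assuming PySpark and a text source on localhost port 9999 (for example started with nc -lk 9999); the host, port, and batch interval are placeholders.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-word-count")
ssc = StreamingContext(sc, batchDuration=5)   # one micro-batch every 5 seconds

# Count words arriving on the socket in near real time.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()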
Flavors of Hadoop Distribution
Hortonworks
- It is very similar to the Apache Hadoop distribution.
- We can use Azure Blob storage as the default DFS. With that, we can start the cluster only when we need the compute power.
- The rest of the time we can bring data into the storage through the REST API or SDKs in different languages (a small upload sketch follows below). Therefore we can create a cluster of the required size whenever we want to run a computation. There is a lot of flexibility, but we lose data locality (which matters mainly in the first map phase).
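As a small illustration, here is a sketch of pushing a file into Blob storage with the azure-storage-blob Python SDK while no cluster is running; the connection string, container, and blob names are placeholders, not values from the post.

from azure.storage.blob import BlobServiceClient

conn_str = "<storage-account-connection-string>"   # placeholder
service = BlobServiceClient.from_connection_string(conn_str)

# Upload a local file; a cluster sized for the job can be created later to process it.
blob = service.get_blob_client(container="raw-data", blob="events/2018-02-27.csv")
with open("events.csv", "rb") as data:
    blob.upload_blob(data, overwrite=True)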
Wednesday, January 31, 2018
Masking PII data using Hive
Hive table creation
Create a table for the imported data, with fields delimited by commas (CSV):
hive> create table Account(id int,name string,phone string) row format delimited fields terminated by ',';
Create a table for the secured account data, where the PII column will be masked:
hive> create table Accountmasked(id int,name string,phone string) row format delimited fields terminated by ',';
Create contact table
hive> create table contact(id int,accountid int,firstname string,lastName string, phone string,email string) row format delimited fields terminated by ',';
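The post does not show the masking query itself, so here is a minimal sketch of one way to populate Accountmasked, assuming PySpark with Hive support is configured; using regexp_replace to turn every digit of the phone column into 'X' is an assumed masking rule, not taken from the post.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mask-pii")
         .enableHiveSupport()
         .getOrCreate())

# Copy Account into Accountmasked with the phone digits obfuscated.
spark.sql("""
    INSERT OVERWRITE TABLE Accountmasked
    SELECT id, name, regexp_replace(phone, '[0-9]', 'X') AS phone
    FROM Account
""")
spark.stop()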
Friday, January 26, 2018
YARN Architecture
Hadoop version 2 came with a fundamental change to the architecture. The framework was divided into two parts: MapReduce and YARN.
MapReduce: Responsible for what operations you want to perform on the data
YARN: Yet Another Resource Negotiator
- Determines and coordinates all the tasks running on all the nodes in the cluster.
- It is the framework responsible for providing the computational resources (CPU, memory, etc.) needed for application execution.
- Assigns new tasks to nodes based on their existing capacity. If a node fails and all the processes on it stop, it reassigns those tasks to other nodes.
- It is, as the name suggests, a better resource negotiator (a small sketch of requesting resources from YARN follows below).
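A minimal sketch of what asking YARN for resources can look like from an application's side, assuming a configured Hadoop/YARN cluster and PySpark; the executor counts and sizes are placeholders.

from pyspark.sql import SparkSession

# YARN negotiates the containers (CPU and memory) that back these executors.
spark = (SparkSession.builder
         .appName("yarn-resource-example")
         .master("yarn")
         .config("spark.executor.instances", "4")   # containers to request
         .config("spark.executor.memory", "2g")     # memory per container
         .config("spark.executor.cores", "2")       # cores per container
         .getOrCreate())

print(spark.sparkContext.master)   # "yarn"
spark.stop()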
MapReduce Data Flow
Pre-loaded local input data and mapping
- MapReduce inputs typically come from input files loaded onto our processing cluster in HDFS. These files are evenly distributed across all the nodes.
- Running a MapReduce program involves running these mapping tasks across the nodes in our cluster.
- Each of these mapping tasks is equivalent (no mapper has a particular identity associated with it), therefore any mapper can process any input file.
- Each mapper loads the set of files local to that machine and processes them (a minimal mapper sketch follows below).
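A minimal Hadoop Streaming-style mapper as a sketch of this idea; word count is an assumed example, since the post does not name one. Any instance of this script can run on any node, reading whatever input split is local to it from stdin.

#!/usr/bin/env python3
import sys

# Emit "word<TAB>1" for every word; the framework shuffles these intermediate
# pairs by key to reducers, which sum the counts per word.
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")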