Tuesday, February 27, 2018

Message Consumption in Apache Kafka



MESSAGE OFFSET

  • A critical concept to understand, because it is how consumers read messages at their own pace and process them independently. 
  • A placeholder, like a bookmark that maintains the last read position.
  • In the case of a Kafka topic, it is the last message read.
  • The offset is entirely established and maintained by the consumer, since the consumer is responsible for reading and processing messages on its own.
  • The consumer keeps track of what it has and has not read.
  • The offset refers to a message identifier.

STEPS INVOLVED

  1.  When a consumer wishes to read from a topic, it must establish a connection with a Broker
  2. After establishing the connection, the consumer will decide what messages it wants to consume
  3. If the consumer has not previously read from the topic, or it has to start over, it will issue a statement to read from the beginning of the topic (the consumer establishing that its message offset for the topic is 0). A minimal consumer sketch follows these steps.
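
A rough sketch of these steps, written in Scala against the Kafka Java client. The broker address (localhost:9092), topic name (my_topic), and group id (demo-group) are assumptions for illustration; auto.offset.reset=earliest makes the consumer start from the beginning of the topic (offset 0) when it has no previously saved position.

    import java.time.Duration
    import java.util.Properties
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import scala.jdk.CollectionConverters._

    object ReadFromBeginning extends App {
      val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092")          // step 1: broker to connect to (assumed address)
      props.put("group.id", "demo-group")
      props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      props.put("auto.offset.reset", "earliest")                 // step 3: no saved offset -> start at the beginning

      val consumer = new KafkaConsumer[String, String](props)
      consumer.subscribe(java.util.Arrays.asList("my_topic"))    // step 2: decide what to consume

      // Each record carries the offset that becomes the consumer's new "bookmark".
      val records = consumer.poll(Duration.ofSeconds(5))
      for (record <- records.asScala)
        println(s"offset=${record.offset} value=${record.value}")

      consumer.close()
    }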

Apache ZooKeeper


APACHE KAFKA DISTRIBUTED ARCHITECTURE

  • At the heart of Apache Kafka is the cluster, which can consist of hundreds of independent brokers.
  • Closely associated with the Kafka cluster is a ZooKeeper environment, which provides the brokers within the cluster the metadata they need to operate at scale and with reliability. Because this metadata is constantly changing, ongoing connectivity and chatter between the cluster members and ZooKeeper is required.


Team formation in Kafka


CONTROLLER ELECTION
  • Hierarchy starts with a controller/supervisor 
  • It is a worker node elected by its peers to officiate in the administrative capacity of a controller
  • The worker node selected as controller is the one that has been around the longest.

RESPONSIBILITIES OF THE CONTROLLER
  • Maintain an inventory of which workers are available to take on work.
  • Maintain a list of work items that have been committed to and assigned to workers.
  • Maintain the active status of the workers and their progress on assigned tasks.

Overview of Kafka



  • Apache Kafka is a distributed commit log service
  • Functions much like a publish/subscribe messaging system
  • Better throughput than traditional messaging systems
  • Built-in partitioning, replication, and fault tolerance. 
  • Increasingly popular for log collection and stream processing.

Wednesday, February 21, 2018

Why Stream Storage?


Need for stream storage

  • Decouple producers & consumers
  • Persistent buffer
  • Collect multiple streams
  • Preserve client ordering
  • Parallel consumption
  • Streaming Map Reduce



Message and Stream Storage



Amazon SQS

  • Amazon Simple Queue Service (SQS) is a fully managed message queuing service that makes it easy to decouple and scale microservices, distributed systems, and serverless applications.
  • Building applications from individual components that each perform a discrete function improves scalability and reliability, and is a best-practice design for modern applications (see the sketch below).
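
As a rough sketch of that decoupling, the snippet below sends and receives a message with the AWS SDK for Java (v1) from Scala; the queue URL is a made-up placeholder, and credentials/region are assumed to come from the default provider chain.

    import com.amazonaws.services.sqs.AmazonSQSClientBuilder
    import com.amazonaws.services.sqs.model.SendMessageRequest
    import scala.jdk.CollectionConverters._

    object SqsDemo extends App {
      val queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  // placeholder URL
      val sqs = AmazonSQSClientBuilder.defaultClient()

      // Producer side: enqueue a message and move on.
      sqs.sendMessage(new SendMessageRequest(queueUrl, "hello from the producer"))

      // Consumer side: pick up the message later, then delete it so it is not redelivered.
      for (msg <- sqs.receiveMessage(queueUrl).getMessages.asScala) {
        println(s"received: ${msg.getBody}")
        sqs.deleteMessage(queueUrl, msg.getReceiptHandle)
      }
    }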

Types of data store


After collecting the data, we need to store it in a data store. There are different types of data stores.

Types of data store 

In memory: Caches, data structure servers
Database: SQL & NoSQL databases
Search: Search engines
File store: File systems
Queue: Message queues
Stream storage: Pub/sub message queues

Temperature of Big Data



What is data temperature?


  • It's classifying data from hot to cold based on how frequently it is accessed.
  • Hot data is accessed most frequently and cold data is accessed infrequently.

Hot Data
  • Measurements in large-scale analytic environments consistently indicate that less than 20% of the data is accessed by more than 90% of the I/Os. Such data belongs in memory so we can retrieve it very fast.

Types of data


There are 3 types of data 

Transactions


  • The sources of these data are usually mobile apps, web apps, and data centers.
  • These data are structured and are received as records from RDBMSs like MySQL, Oracle, etc.
  • They could also be received from in-memory data structures.
  • We can also receive these data from our own data centers.

Tuesday, February 20, 2018

Performance in Hive

Performance in Hive can be achieved by 

  1. PARTITIONING

  • Logically breaks up the data
  • Any time a new value is added to the partition column that doesn't match any of the existing partitions, a new partition is created (see the sketch below)
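
A small sketch of partitioning through Spark SQL with Hive support; the table name events and the partition column event_date are made up for illustration.

    import org.apache.spark.sql.SparkSession

    object HivePartitioningDemo extends App {
      val spark = SparkSession.builder()
        .appName("hive-partitioning")
        .enableHiveSupport()
        .getOrCreate()

      // Partition by event_date so each date's rows land in their own directory.
      spark.sql(
        """CREATE TABLE IF NOT EXISTS events (user_id STRING, action STRING)
          |PARTITIONED BY (event_date STRING)""".stripMargin)

      // Inserting into a partition value that does not exist yet creates it.
      spark.sql(
        """INSERT INTO events PARTITION (event_date = '2018-02-20')
          |VALUES ('u1', 'click')""".stripMargin)

      // Filtering on the partition column lets Hive scan only the matching partition.
      spark.sql("SELECT * FROM events WHERE event_date = '2018-02-20'").show()

      spark.stop()
    }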

Friday, February 9, 2018

Receivers in Spark Streaming


  • A task that collects input data from different sources
  • Spark allocates a receiver for each input source
  • Receivers are special tasks that run on the executors (see the sketch below)
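
A minimal sketch, assuming a text source is listening on localhost:9999: socketTextStream allocates one receiver task on an executor to collect that input.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SocketReceiverDemo extends App {
      // local[2]: at least one thread for the receiver task, one for processing.
      val conf = new SparkConf().setMaster("local[2]").setAppName("socket-receiver")
      val ssc = new StreamingContext(conf, Seconds(5))

      // The receiver continuously collects lines from the assumed host/port.
      val lines = ssc.socketTextStream("localhost", 9999)
      lines.count().print()   // how many lines arrived in each 5-second batch

      ssc.start()
      ssc.awaitTermination()
    }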

Spark Architecture


Steps explained

1) We write the program in Scala, Java, or Python and submit the application to the Spark cluster
2) The Spark cluster is made up of multiple machines
3) One of these machines is assigned as the coordinator (see the sketch below)
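
A sketch of a driver program that would be packaged and submitted to the cluster; the standalone master URL and the input path are assumptions for illustration.

    import org.apache.spark.sql.SparkSession

    object WordCountApp extends App {
      val spark = SparkSession.builder()
        .appName("word-count")
        .master("spark://coordinator-host:7077")   // assumed coordinator/master address
        .getOrCreate()

      val counts = spark.sparkContext
        .textFile("hdfs:///data/input.txt")        // assumed input path
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)

      counts.take(10).foreach(println)
      spark.stop()
    }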

Data Representation in RDD


Spark has 3 data representation


  1. RDD (Resilient Distributed Dataset) 

    • A collection of elements that can be divided across multiple nodes in a cluster for parallel processing. 
    • It is also a fault-tolerant collection of elements, which means it can automatically recover from failures. 
    • It is immutable: we can create an RDD once but can't change it (see the sketch below).
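
A small sketch of those properties (local mode, made-up numbers): parallelize distributes a collection across partitions, and a transformation leaves the original RDD untouched.

    import org.apache.spark.sql.SparkSession

    object RddBasics extends App {
      val spark = SparkSession.builder().appName("rdd-basics").master("local[*]").getOrCreate()
      val sc = spark.sparkContext

      // Distribute a local collection across 2 partitions of an RDD.
      val nums = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 2)

      // Transformations never modify an RDD in place; they return a new one.
      val doubled = nums.map(_ * 2)

      println(nums.collect().toList)     // List(1, 2, 3, 4, 5) - the original is unchanged
      println(doubled.collect().toList)  // List(2, 4, 6, 8, 10)

      spark.stop()
    }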


Lineage of RDD


RDD tracking 

Every RDD keeps track of :

  1. Where it came from
  2. All the transformations it took to reach its current state

These steps are called the lineage (or DAG) of the RDD
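
The lineage can be inspected with toDebugString; a minimal sketch (local mode, made-up data):

    import org.apache.spark.sql.SparkSession

    object LineageDemo extends App {
      val spark = SparkSession.builder().appName("lineage").master("local[*]").getOrCreate()
      val sc = spark.sparkContext

      val words  = sc.parallelize(Seq("spark", "kafka", "hive", "spark"))
      val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

      // Prints where the RDD came from and every transformation applied on the
      // way to its current state - i.e. its lineage/DAG.
      println(counts.toDebugString)

      spark.stop()
    }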

Common methods in RDD

Creation of an RDD

RDDs can be created in two ways:
  1. Read a file: individual rows or records become elements in the RDD
  2. Transform another RDD: applying a transformation to an existing RDD produces a new RDD (see the sketch below)
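
Both ways in one short sketch (the file path is a made-up example):

    import org.apache.spark.sql.SparkSession

    object RddCreation extends App {
      val spark = SparkSession.builder().appName("rdd-creation").master("local[*]").getOrCreate()
      val sc = spark.sparkContext

      // 1. Read a file: each line becomes one element of the RDD.
      val lines = sc.textFile("data/server.log")        // assumed path

      // 2. Transform another RDD: the filter produces a brand-new RDD.
      val errors = lines.filter(_.contains("ERROR"))

      println(s"error lines: ${errors.count()}")
      spark.stop()
    }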

Apache Spark RDD


RDD (RESILIENT DISTRIBUTED DATASETS)

  • The basic programming abstraction in Spark
  • All operations are performed on in-memory objects
  • Collection of entities
  • It can be assigned to a variable, and methods can be invoked on it; methods either return values or apply transformations on the RDDs (see the example below)
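
A quick sketch of the two kinds of methods, using made-up temperature readings: filter applies a transformation and hands back another RDD, while count and max return plain values.

    import org.apache.spark.sql.SparkSession

    object RddMethods extends App {
      val spark = SparkSession.builder().appName("rdd-methods").master("local[*]").getOrCreate()
      val sc = spark.sparkContext

      // The RDD is assigned to a variable and methods are invoked on it.
      val temps = sc.parallelize(Seq(21.5, 30.2, 18.9, 35.0))

      // Transformation: returns another RDD.
      val hotDays = temps.filter(_ > 30.0)

      // Actions: return plain values to the driver.
      println(hotDays.count())   // 2
      println(temps.max())       // 35.0

      spark.stop()
    }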

Overview of Spark streaming

How much data does Google deal with?
  • Stores about 15 exabytes (1 exabyte = 10^18 bytes) of data 
  • Processes 100 petabytes of data per day
  • 60 trillion pages are indexed
  • 1 billion Google Search users per month
Note: Fraud detection is an example of real time processing

Limitations of Map reduce
  • An entire MapReduce job is a batch-processing job
  • It does not allow real-time processing of the data

Modules in Apache Spark





Spark SQL

  • Is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine (see the example below).
    • DataFrames 
      • Is a distributed collection of data organized into named columns. 
      • It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. 
      • DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
      • DataFrame API is available in Scala, Java, and Python.
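
A short sketch of the DataFrame API (local mode, made-up rows); the same data could equally come from JSON/Parquet files, Hive tables, or a JDBC source.

    import org.apache.spark.sql.SparkSession

    object DataFrameDemo extends App {
      val spark = SparkSession.builder().appName("dataframe-demo").master("local[*]").getOrCreate()
      import spark.implicits._

      // A distributed collection of data organized into named columns.
      val people = Seq(("Alice", 34), ("Bob", 45), ("Carol", 29)).toDF("name", "age")

      // Query through the DataFrame API ...
      people.filter($"age" > 30).select("name").show()

      // ... or through plain SQL against a temporary view.
      people.createOrReplaceTempView("people")
      spark.sql("SELECT name FROM people WHERE age > 30").show()

      spark.stop()
    }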

Flavors of Hadoop Distribution


Hortonworks 

  • It is very similar to the Apache Hadoop distribution. 
  • We can use Azure Blob storage as the default DFS. With that, we can start the cluster only when we need compute power. 
  • We can bring data into the storage through a REST API or SDKs in different languages the rest of the time. Therefore we can create a cluster of the required size whenever we want the computation. There is a lot of flexibility, but we lose data locality (which is mainly important in the first map phase).

Overview of Scala


  • Is a general-purpose programming language providing support for functional programming and a strong static type system. 
  • It is designed to be concise; many of Scala's design decisions aim to address criticisms of Java
  • Source code is intended to be compiled to Java bytecode, so that the resulting executable code runs on a Java virtual machine. Scala provides language interoperability with Java
  • Scala is object-oriented, and  has many features of functional programming languages like Scheme, Standard ML and Haskell, including currying, type inference, immutability, lazy evaluation, and pattern matching
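
A few of those features in one small sketch (all names and values are made up):

    object ScalaFeatures extends App {
      // Type inference and immutability: vals cannot be reassigned, and their types are inferred.
      val rate  = 0.07
      val price = 100.0

      // Currying: the function takes its arguments one parameter list at a time.
      def applyTax(rate: Double)(amount: Double): Double = amount * (1 + rate)
      val withTax = applyTax(rate) _          // partially applied function
      println(withTax(price))                 // price plus 7% tax

      // Lazy evaluation: the right-hand side runs only on first use.
      lazy val expensive = { println("computing..."); 42 }
      println(expensive)

      // Pattern matching on a case class.
      case class Message(topic: String, value: String)
      Message("logs", "ok") match {
        case Message("logs", v) => println(s"log entry: $v")
        case Message(t, _)      => println(s"other topic: $t")
      }
    }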