Tuesday, February 27, 2018

Message Consumption in Apache Kafka


  • Offsets are a critical concept to understand, because they are how consumers can read messages at their own pace and process them independently. 
  • An offset is a placeholder: like a bookmark, it maintains the last read position.
  • In the case of a Kafka topic, it is the position of the last read message.
  • The offset is entirely established and maintained by the consumer, since the consumer is solely responsible for reading messages and processing them on its own.
  • It lets the consumer keep track of what it has and has not read.
  • An offset is effectively a message identifier.


  1. When a consumer wishes to read from a topic, it must establish a connection with a Broker.
  2. After establishing the connection, the consumer decides which messages it wants to consume.
  3. If the consumer has not previously read from the topic, or it has to start over, it issues a statement to read from the beginning of the topic (the consumer establishing that its message offset for the topic is 0).
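
The offset bookkeeping described above can be sketched in plain Python. This is illustrative only: the class and method names are made up, and a real consumer would use a Kafka client library rather than reading a list directly.

```python
# Sketch of consumer-side offset tracking (hypothetical names; a real
# consumer would use a Kafka client library, not a Python list).

class Consumer:
    """Tracks its own read position (offset) in a topic's message log."""

    def __init__(self, topic_log, offset=0):
        self.topic_log = topic_log  # the broker's append-only message log
        self.offset = offset        # bookmark: index of the next unread message

    def poll(self):
        """Return all messages from the current offset onward, then advance it."""
        unread = self.topic_log[self.offset:]
        self.offset = len(self.topic_log)
        return unread

log = ["msg-0", "msg-1", "msg-2"]
c = Consumer(log)   # offset 0 = "start from the beginning of the topic"
print(c.poll())     # all three messages
log.append("msg-3")
print(c.poll())     # only the message produced since the last poll
```

Note that the broker does not push the consumer forward: the consumer alone decides where its bookmark sits, which is what lets different consumers read the same topic at different paces.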

Apache ZooKeeper


  • At the heart of Apache Kafka we have a cluster, which can consist of hundreds of independent Brokers.
  • Closely associated with the Kafka cluster is a ZooKeeper environment, which provides the Brokers within a cluster the metadata they need to operate at scale and reliably. As this metadata is constantly changing, connectivity and chatter between the cluster members and ZooKeeper is required.

Team formation in Kafka

  • The hierarchy starts with a controller/supervisor.
  • It is a worker node elected by its peers to officiate in the administrative capacity of a controller.
  • The worker node selected as controller is the one that has been around the longest.

  • Maintain an inventory of which workers are available to take on work.
  • Maintain a list of work items that have been committed to and assigned to workers.
  • Maintain the active status of the staff and their progress on assigned tasks.
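
The controller's three bookkeeping duties above can be sketched as a simple data structure. The names here are hypothetical and purely illustrative; real Kafka keeps this cluster metadata in ZooKeeper rather than in an in-process object.

```python
# Illustrative sketch of the controller's bookkeeping duties
# (hypothetical names; Kafka stores this metadata via ZooKeeper).

class Controller:
    def __init__(self):
        self.available_workers = set()  # inventory of workers able to take on work
        self.assignments = {}           # work item -> worker it was assigned to
        self.status = {}                # worker -> progress on assigned tasks

    def register_worker(self, worker):
        self.available_workers.add(worker)
        self.status[worker] = "idle"

    def assign(self, work_item, worker):
        if worker not in self.available_workers:
            raise ValueError("unknown worker")
        self.assignments[work_item] = worker
        self.status[worker] = "working"

ctrl = Controller()
ctrl.register_worker("broker-1")
ctrl.assign("partition-0", "broker-1")
```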

Overview of Kafka

  • Apache Kafka is a distributed commit log service
  • Functions much like a publish/subscribe messaging system
  • Better throughput than traditional messaging systems
  • Built-in partitioning, replication, and fault tolerance. 
  • Increasingly popular for log collection and stream processing.

Wednesday, February 21, 2018

Why Stream Storage?

Need for stream storage

  • Decouple producers & consumers
  • Persistent buffer
  • Collect multiple streams
  • Preserve client ordering
  • Parallel consumption
  • Streaming Map Reduce

Message and Stream Storage

Amazon SQS

  • Amazon Simple Queue Service (SQS) is a fully managed message queuing service that makes it easy to decouple and scale microservices, distributed systems, and serverless applications.
  • Building applications from individual components that each perform a discrete function improves scalability and reliability, and is best practice design for modern applications.
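
The decoupling idea behind SQS can be sketched with a standard-library queue: the producer and consumer never call each other directly, and the queue is the only contract between them. This is a local stand-in only; a real application would talk to SQS through an AWS SDK.

```python
# Decoupling via a queue, sketched with the stdlib (a local stand-in
# for SQS; a real app would use an AWS SDK to reach the managed queue).
import queue
import threading

buffer = queue.Queue()

def producer():
    for i in range(3):
        buffer.put(f"order-{i}")      # component A emits work items

def consumer(results):
    for _ in range(3):
        results.append(buffer.get())  # component B processes at its own pace

results = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(results,))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # the three orders, in the order they were produced
```

Because neither function knows the other exists, either side can be scaled, replaced, or taken down without changing the other, which is the scalability and reliability benefit the bullet above describes.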

Types of data store

After collecting the data, we need to store it in a data store. There are different types of data stores.


  • In memory: caches, data structure servers
  • Database: SQL & NoSQL databases
  • Search: search engines
  • File store: file systems
  • Queue: message queues
  • Stream storage: pub/sub message queues

Temperature of Big Data

What is data temperature?

  • It’s classifying data from hot to cold based on how frequently it is accessed. 
  • Hot data is accessed most frequently and cold data is accessed infrequently. 

Hot Data

    • Measurements in large-scale analytic environments consistently indicate that less than 20% of the data is accessed by more than 90% of the I/Os. Such data belongs in memory so we can retrieve it very fast.
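
The hot-to-cold classification can be sketched as a simple function of access frequency. The thresholds below are made up for illustration; in practice they depend on the workload and storage tiers available.

```python
# Toy data-temperature classifier (thresholds are invented for
# illustration; real tiering policies are workload-specific).

def temperature(accesses_per_day):
    if accesses_per_day >= 1000:
        return "hot"    # keep in memory for fast retrieval
    elif accesses_per_day >= 10:
        return "warm"
    return "cold"       # archive on cheap, slow storage

print(temperature(50000))  # hot
print(temperature(2))      # cold
```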

Types of data

There are 3 types of data 


  • The sources of these data are usually mobile apps, web apps, and data centers.
  • These data are structured and are received as records from RDBMSs like MySQL, Oracle, etc.
    • They could also be received from in-memory data structures.
  • We can also receive these data from our own data centers.

Tuesday, February 20, 2018

Performance in Hive

Performance in Hive can be achieved by:


  • Logically breaking up the data (partitioning)
  • Any time a new value is added to a partition column that doesn't match any of the existing partitions, a new partition is created.
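
The partition-on-new-value behaviour above can be sketched in plain Python: rows are routed into per-value buckets, and an unseen partition-column value creates a new bucket on the fly. This is only a conceptual model; Hive does this with `PARTITIONED BY` tables and dynamic partition inserts.

```python
# Conceptual sketch of Hive-style dynamic partitioning: a row whose
# partition-column value has no existing partition triggers the
# creation of a new one. (Hive itself uses PARTITIONED BY tables.)

partitions = {}  # partition-column value -> list of rows

def insert(row, partition_col):
    value = row[partition_col]
    if value not in partitions:   # no matching partition yet
        partitions[value] = []    # -> a new partition is created
    partitions[value].append(row)

insert({"id": 1, "country": "US"}, "country")
insert({"id": 2, "country": "IN"}, "country")
insert({"id": 3, "country": "US"}, "country")
print(sorted(partitions))  # ['IN', 'US']
```

Queries that filter on the partition column can then skip every partition whose value does not match, which is where the performance gain comes from.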

Friday, February 9, 2018

Receivers in Spark Streaming

  • Tasks which collect input data from different sources
  • Spark allocates a receiver for each input source
  • Special tasks that run on the executors 

Spark Architecture

Steps explained

1) We write the program in Scala, Java or Python and submit the application to the Spark cluster
2) The Spark cluster is made up of multiple systems
3) One of these machines is assigned as the coordinator

Data Representation in RDD

Spark has 3 data representation

  1. RDD (Resilient Distributed Dataset) 

    • Is a collection of elements that can be divided across multiple nodes in a cluster for parallel processing. 
    • Is also a fault-tolerant collection of elements, which means it can automatically recover from failures. 
    • Is immutable: we can create an RDD once but can’t change it.
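
The immutability contract can be sketched in plain Python: transformations never mutate the original collection, they return a new one. This is only a single-machine model with invented names; a real RDD would be created with `SparkContext.parallelize` and partitioned across the cluster's nodes.

```python
# Plain-Python sketch of the RDD contract (hypothetical MiniRDD class;
# real RDDs come from SparkContext.parallelize and are distributed).

class MiniRDD:
    def __init__(self, elements):
        self._elements = tuple(elements)  # immutable backing store

    def map(self, fn):
        # Returns a NEW collection; the original is never changed.
        return MiniRDD(fn(x) for x in self._elements)

    def collect(self):
        return list(self._elements)

rdd = MiniRDD([1, 2, 3])
doubled = rdd.map(lambda x: x * 2)
print(rdd.collect())      # [1, 2, 3]  -- original untouched
print(doubled.collect())  # [2, 4, 6]
```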

Lineage of RDD

RDD tracking 

Every RDD keeps track of:

  1. Where it came from
  2. All transformations it took to reach its current state

These steps are called the Lineage/DAG of an RDD
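
The lineage idea can be sketched as a dataset that records its parent and the transformation that produced it, so it can be rebuilt from the source after a failure. The class below is invented for illustration; Spark records this DAG internally for real RDDs.

```python
# Sketch of lineage tracking (hypothetical Tracked class; Spark keeps
# the real lineage DAG internally and uses it to recompute lost data).

class Tracked:
    def __init__(self, data, parent=None, transform=None):
        self.data = data
        self.parent = parent        # where it came from
        self.transform = transform  # the step that produced it

    def map(self, fn):
        return Tracked([fn(x) for x in self.data], parent=self, transform=fn)

    def lineage(self):
        node, steps = self, []
        while node.parent is not None:
            steps.append(node.transform)
            node = node.parent
        return list(reversed(steps))  # transformations in application order

    def recompute(self, source):
        # Replay the lineage against the source data, e.g. after a failure.
        data = list(source)
        for fn in self.lineage():
            data = [fn(x) for x in data]
        return data

base = Tracked([1, 2, 3])
result = base.map(lambda x: x + 1).map(lambda x: x * 10)
print(result.data)                  # [20, 30, 40]
print(result.recompute([1, 2, 3]))  # same values, rebuilt from lineage
```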

Common methods in RDD

Creation of an RDD

RDDs can be created in 2 ways:
  1. Read a file: individual rows or records become elements in the RDD
  2. Transform another RDD: applying a transformation to an existing RDD produces a new RDD

Apache Spark RDD


  • Basic programming abstraction in Spark
  • All operations are performed on in-memory objects
  • Collection of entities
  • It can be assigned to a variable and methods can be invoked on it. Methods return values or apply transformations on the RDDs.

Overview of Spark streaming

How much data does Google deal with?
  • Stores about 15 exabytes (1 EB = 1,000,000,000,000,000,000 B) of data 
  • Processes 100 petabytes of data per day
  • 60 trillion pages are indexed
  • 1 billion Google Search users per month
Note: Fraud detection is an example of real-time processing

Limitations of MapReduce
  • An entire MapReduce job is a batch processing job
  • It does not allow real-time processing of the data

Modules in Apache Spark

Spark SQL

  • Is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
    • DataFrames 
      • Is a distributed collection of data organized into named columns. 
      • It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. 
      • DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
      • DataFrame API is available in Scala, Java, and Python.
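
The named-column idea can be sketched in plain Python by contrasting it with positional tuples. This is a conceptual model only; in real Spark SQL you would build the DataFrame with `spark.createDataFrame(...)` and query it with `df.select(...)`.

```python
# Conceptual sketch: a DataFrame organizes data into named columns,
# unlike an RDD of positional tuples. (Real Spark SQL would use
# spark.createDataFrame and df.select; this is plain Python.)

rows = [("alice", 34), ("bob", 29)]  # RDD-style: positional tuples
columns = ["name", "age"]

# DataFrame-style: the same data, addressable by column name.
dataframe = [dict(zip(columns, r)) for r in rows]

# A "select" over named columns reads like SQL instead of tuple indexing.
ages = [row["age"] for row in dataframe]
print(ages)  # [34, 29]
```

The named columns are also what lets the engine optimize queries: it knows which columns a query touches, which an opaque tuple cannot reveal.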

Flavors of Hadoop Distribution


  • It is very similar to the Apache Hadoop distribution. 
  • We can use Azure Blob storage as the default DFS. With that, we can start the cluster only when we need compute power. 
  • The rest of the time, we can bring data to the storage through the REST API or SDKs in different languages. Therefore we can create a cluster of the required size whenever we want the computation. There is a lot of flexibility, but we lose data co-locality (which is mainly important in the first map phase).

Overview of Scala

  • Is a general-purpose programming language providing support for functional programming and a strong static type system. 
  • Designed to be concise; many of Scala's design decisions aimed to address criticisms of Java.
  • Source code is intended to be compiled to Java bytecode, so that the resulting executable code runs on a Java virtual machine. Scala provides language interoperability with Java.
  • Scala is object-oriented and has many features of functional programming languages like Scheme, Standard ML and Haskell, including currying, type inference, immutability, lazy evaluation, and pattern matching.