Friday, February 9, 2018

Receivers in Spark Streaming


  • A receiver is a task that collects input data from a source
  • Spark allocates a receiver for each input source
  • Receivers are special tasks that run on the executors
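
The "one receiver per input source" idea can be sketched in plain Python. This is a toy model and not Spark's receiver API: `ToyReceiver`, the two feed names, and the shared queue are all hypothetical stand-ins, with threads playing the role of tasks running on executors.

```python
import queue
import threading


class ToyReceiver(threading.Thread):
    """Hypothetical stand-in for a receiver: drains one source into a shared buffer."""

    def __init__(self, name, source, buffer):
        super().__init__(name=name)
        self.source = source  # any iterable standing in for a socket, message queue, etc.
        self.buffer = buffer  # shared queue that downstream processing would read from

    def run(self):
        # Tag every record with the receiver that collected it.
        for record in self.source:
            self.buffer.put((self.name, record))


# Two made-up input sources; one receiver is allocated for each.
sources = {
    "sensor-feed": ["t=21", "t=22"],
    "click-feed": ["home", "cart", "checkout"],
}
buffer = queue.Queue()
receivers = [ToyReceiver(name, data, buffer) for name, data in sources.items()]

for receiver in receivers:
    receiver.start()
for receiver in receivers:
    receiver.join()

# Drain everything the receivers collected.
collected = []
while not buffer.empty():
    collected.append(buffer.get())

print(len(receivers), "receivers collected", len(collected), "records")
```

Records from different sources may interleave in any order, but each receiver preserves the order of its own source.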

Spark Architecture


Steps explained

1) We write the program in Scala, Java or Python and submit the application to the Spark cluster
2) A Spark cluster is made up of multiple machines
3) One of these machines is assigned as the coordinator

Data Representation in RDD


Spark has 3 data representations

  1. RDD (Resilient Distributed Dataset)
    • Is a collection of elements that can be divided across multiple nodes in a cluster for parallel processing.
    • Is a fault-tolerant collection of elements, which means it can automatically recover from failures.
    • Is immutable: once an RDD is created, it cannot be changed.
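
A rough plain-Python sketch of two of these properties, partitioning for parallel work and immutability. It is only an analogy: the `partition` and `process_partition` helpers are made up for illustration, tuples stand in for read-only partitions, and threads stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(elements, num_partitions):
    """Split a sequence into num_partitions read-only slices (round-robin)."""
    return [tuple(elements[i::num_partitions]) for i in range(num_partitions)]


def process_partition(part):
    """Work done independently on one partition: sum of squares."""
    return sum(x * x for x in part)


data = tuple(range(1, 11))   # the whole collection, 1..10
parts = partition(data, 3)   # three partitions, one per pretend node

# Each partition is processed in parallel, then the partial results are combined.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(process_partition, parts))
total = sum(partial_sums)
print(total)  # 385

# Immutability: the collection cannot be changed in place.
mutation_allowed = True
try:
    data[0] = 99
except TypeError:
    mutation_allowed = False
print(mutation_allowed)  # False
```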

Lineage of RDD


RDD tracking 

Every RDD keeps track of:

  1. Where it came from
  2. All the transformations it took to reach its current state

These steps are called the lineage (or DAG) of an RDD
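
Lineage tracking can be illustrated with a small toy class. This is a sketch of the idea only, not how Spark implements it: `ToyRDD`, its `label` arguments, and the `lineage` and `compute` methods are hypothetical names. Each object remembers its parent and the step that produced it, so the data can be rebuilt from the source by replaying the steps.

```python
class ToyRDD:
    """Hypothetical toy collection that records where it came from."""

    def __init__(self, source=None, parent=None, step=None, fn=None):
        self.source = source  # original data (only set on the root)
        self.parent = parent  # the ToyRDD this one was derived from
        self.step = step      # human-readable name of the step
        self.fn = fn          # function that rebuilds this data from the parent's data

    @classmethod
    def from_list(cls, items):
        return cls(source=tuple(items), step="from_list")

    def map(self, fn, label):
        return ToyRDD(parent=self, step=f"map({label})",
                      fn=lambda items: [fn(x) for x in items])

    def filter(self, pred, label):
        return ToyRDD(parent=self, step=f"filter({label})",
                      fn=lambda items: [x for x in items if pred(x)])

    def lineage(self):
        """Return every step from the original source to this object."""
        steps = [] if self.parent is None else self.parent.lineage()
        return steps + [self.step]

    def compute(self):
        """Rebuild the data by replaying the lineage from the source."""
        if self.parent is None:
            return list(self.source)
        return self.fn(self.parent.compute())


numbers = ToyRDD.from_list([1, 2, 3, 4, 5])
evens_squared = (numbers
                 .filter(lambda x: x % 2 == 0, "even")
                 .map(lambda x: x * x, "square"))

print(evens_squared.lineage())  # ['from_list', 'filter(even)', 'map(square)']
print(evens_squared.compute())  # [4, 16]
```

Because nothing is stored except the source and the steps, a lost result can always be recomputed, which is the intuition behind recovering from failures.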

Common methods in RDD

Creation of an RDD

RDDs can be created in 2 ways:
  1. Read a file: Individual rows or records become an element in the RDD
  2. Transform another RDD:
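
The two creation routes can be mimicked in plain Python. The file name and its contents are invented for this sketch; in Spark itself the corresponding calls would be `SparkContext.textFile` and a transformation such as `map`.

```python
import os
import tempfile

# Way 1: read a file, where each line (record) becomes one element.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "orders.txt")  # hypothetical input file
    with open(path, "w") as f:
        f.write("apple,3\nbanana,5\ncherry,7\n")
    with open(path) as f:
        records = tuple(line.rstrip("\n") for line in f)

print(records)     # ('apple,3', 'banana,5', 'cherry,7')

# Way 2: transform an existing collection into a new one.
# The original `records` is left untouched.
quantities = tuple(int(record.split(",")[1]) for record in records)

print(quantities)  # (3, 5, 7)
```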

Apache Spark RDD


RDD (RESILIENT DISTRIBUTED DATASETS)

  • Basic programming abstraction in Spark
  • All operations are performed on in-memory objects
  • Collection of entities
  • It can be assigned to a variable, and methods can be invoked on it. Methods either return values or apply transformations to the RDD

Overview of Spark streaming

How much data does Google deal with?
  • Stores about 15 exabytes of data (1 exabyte = 10^18 bytes)
  • Processes 100 petabytes of data per day
  • 60 trillion pages are indexed
  • 1 billion Google search users per month
Note: Fraud detection is an example of real-time processing

Limitations of MapReduce
  • An entire MapReduce job is a batch processing job
  • Does not allow real-time processing of the data