Friday, February 9, 2018

Overview of Spark Streaming

How much data does Google deal with?
  • Stores about 15 exabytes of data (1 exabyte = 1000000000000000000 B, i.e. 10^18 bytes)
  • Processes about 100 petabytes of data per day
  • Indexes about 60 trillion pages
  • Serves about 1 billion Google search users per month
Note: Fraud detection is an example of real-time processing

Limitations of MapReduce
  • An entire MapReduce job runs as a batch processing job
  • It does not allow real-time processing of the data
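
The difference between batch and real-time processing can be sketched in plain Python. This is only an illustration, not MapReduce or Spark code; the function names and the payment amounts are made up for the example. The batch function produces one answer after all the input has been read, while the streaming function produces an updated answer after every record.

```python
from typing import Iterable, Iterator, List


def batch_total(amounts: List[float]) -> float:
    """Batch style: the whole data set must exist before a result is produced."""
    return sum(amounts)


def running_totals(amounts: Iterable[float]) -> Iterator[float]:
    """Streaming style: emit an updated result as each record arrives."""
    total = 0.0
    for amount in amounts:
        total += amount
        yield total


# Hypothetical payment amounts, used only for illustration.
payments = [20.0, 35.5, 4.5]
print(batch_total(payments))           # one answer, after all input is read
print(list(running_totals(payments)))  # one answer after every record
```

A use case such as fraud detection needs the second style, because the result has to be available while the data is still arriving.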

Streaming data:
 A continuous flow of information from one or more sources is called streaming data.

Stream processing:
 The mutations/transformations that we perform on this data are called
stream processing.
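
As a rough sketch of these two definitions, a Python generator can stand in for a stream, and functions that consume one generator and return another can stand in for the transformations. The log lines and the function names `log_source`, `errors_only`, and `to_upper_message` are assumptions made for this example; they are not part of any Spark API.

```python
from typing import Iterable, Iterator


def log_source() -> Iterator[str]:
    """Hypothetical source: yields log lines one at a time, as a stream would."""
    yield "INFO service started"
    yield "ERROR disk full"
    yield "INFO request served"
    yield "ERROR timeout"


def errors_only(lines: Iterable[str]) -> Iterator[str]:
    """Transformation 1 (a filter): keep only the error lines."""
    return (line for line in lines if line.startswith("ERROR"))


def to_upper_message(lines: Iterable[str]) -> Iterator[str]:
    """Transformation 2 (a map): drop the level and upper-case the message."""
    return (line.split(" ", 1)[1].upper() for line in lines)


# Each line flows through both transformations as soon as it is produced.
for message in to_upper_message(errors_only(log_source())):
    print(message)
```

Nothing is stored in between: every record is transformed as it passes through, which is the basic idea of stream processing.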

Spark Streaming

  • It works on streaming data and performs stream processing on the stream
  • It deals with real data in real time
  • It is a better alternative to Hadoop MapReduce when manipulating data streams
  • It is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
  • Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join, and window.
  • Processed data can be pushed out to filesystems, databases, and live dashboards.
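
To give a feel for functions like map, reduce, and window, here is a minimal plain-Python sketch. It does not use Spark at all; it assumes the incoming lines have already been grouped into small batches, and the names `count_words` and `windowed_counts` are invented for this example. Each batch is split into words (map), the words are counted (reduce), and the counts of the most recent batches are combined (window).

```python
from collections import Counter, deque
from typing import Deque, Iterable, Iterator, List


def count_words(batch: List[str]) -> Counter:
    """map + reduce on one batch: split lines into words, then count them."""
    words = (word for line in batch for word in line.split())
    return Counter(words)


def windowed_counts(batches: Iterable[List[str]], window_size: int) -> Iterator[Counter]:
    """window: after each batch, combine the counts of the last `window_size` batches."""
    recent: Deque[Counter] = deque(maxlen=window_size)  # old batches fall out automatically
    for batch in batches:
        recent.append(count_words(batch))
        total: Counter = Counter()
        for counts in recent:
            total.update(counts)
        yield total


# Hypothetical input: three small batches of text lines.
batches = [["spark kafka"], ["spark flume"], ["kinesis"]]
for result in windowed_counts(batches, window_size=2):
    print(dict(result))
```

With a window of two batches, the word from the first batch no longer appears in the third result, because that batch has slid out of the window.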

Spark Streaming Module

  • Streams of data are made up of discrete entities
  • Streams arrive at an input and need to be processed in real time

  Ex: Log messages, tweets, GPS location information (latitude and longitude)

1) We need to process individual entities or groups of entities
  Ex: mood on Twitter
2) Once we have processed the entities, we transform them into the desired result form
3) The result might be stored in reliable storage, passed on to another application,
   or acted on in a certain way

  Ex: Trigger an alert, show trending graphs, display a route on the map
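
The three steps above can be sketched in plain Python using the "mood on Twitter" example. Everything here is an assumption made for illustration: the word lists, the scoring rule, the threshold, and the function names `mood_score`, `summarize`, and `act`. A real system would use a proper sentiment model and a real alerting channel.

```python
from typing import Dict, Iterable, List

# Hypothetical word lists, for illustration only.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}


def mood_score(tweet: str) -> int:
    """Step 1: process one entity. Positive words add 1, negative words subtract 1."""
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def summarize(tweets: Iterable[str]) -> Dict[str, float]:
    """Step 2: transform a group of entities into the desired result form."""
    scores = [mood_score(t) for t in tweets]
    average = sum(scores) / len(scores) if scores else 0.0
    return {"tweets": len(scores), "average_mood": average}


def act(summary: Dict[str, float], alerts: List[str], threshold: float = 0.0) -> None:
    """Step 3: act on the result. Here, record an alert when the mood is below the threshold."""
    if summary["average_mood"] < threshold:
        alerts.append(f"ALERT: average mood {summary['average_mood']:.2f}")


alerts: List[str] = []
batch = ["I hate delays", "awful service", "great coffee though"]
act(summarize(batch), alerts)
print(alerts)
```

In this sketch the alert list plays the role of "another application"; the same summary could just as well be written to storage or drawn on a dashboard.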
