Web Snippets: Overview of Spark streaming

Friday, February 9, 2018

Overview of Spark streaming

How much data does Google deal with?

Stores about 15 exabytes of data (1 exabyte = 10^18 bytes)
Processes about 100 petabytes of data per day
Indexes 60 trillion pages
Serves 1 billion Google Search users per month
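
To put these figures in perspective, the arithmetic below converts the daily volume into a per-second rate. It is a rough sketch that assumes decimal (SI) units and a load spread evenly over the day; the variable names are illustrative only.

```python
# Decimal (SI) unit sizes in bytes.
PETABYTE = 10**15
EXABYTE = 10**18
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

stored_bytes = 15 * EXABYTE    # about 15 exabytes stored
daily_bytes = 100 * PETABYTE   # about 100 petabytes processed per day

# Average throughput, assuming the load is uniform across the day.
bytes_per_second = daily_bytes / SECONDS_PER_DAY
terabytes_per_second = bytes_per_second / 10**12

# Number of days needed to pass over the whole store once at that rate.
days_for_full_pass = stored_bytes / daily_bytes

print(f"{terabytes_per_second:.2f} TB/s")  # 1.16 TB/s
print(f"{days_for_full_pass:.0f} days")    # 150 days
```

At more than a terabyte per second on average, waiting for a nightly batch job is often not an option, which is what motivates real-time processing.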

Note: Fraud detection is an example of real-time processing

Limitations of MapReduce

An entire MapReduce job is a batch processing job
It does not allow real-time processing of the data

Streaming data:
A continuous flow of information from one or more sources is called streaming data

Stream processing:
The mutations/transformations that we perform on this data are called
stream processing
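
The two definitions above can be sketched in plain Python with generators. This is not Spark code: `transaction_source`, `flag_large`, and the 5000.0 limit are hypothetical names and values chosen for illustration, loosely following the fraud detection example mentioned earlier.

```python
from typing import Dict, Iterable, Iterator

def transaction_source(records: Iterable[Dict]) -> Iterator[Dict]:
    """Hypothetical source: hands over one record at a time, like a live feed."""
    for record in records:
        yield record

def flag_large(stream: Iterable[Dict], limit: float) -> Iterator[Dict]:
    """Transformation applied to each record as soon as it arrives."""
    for tx in stream:
        # Copy the record and add a flag; the input record is left untouched.
        yield {**tx, "suspicious": tx["amount"] > limit}

# Sample data standing in for a continuous flow of transactions.
sample = [
    {"id": 1, "amount": 25.0},
    {"id": 2, "amount": 9800.0},
    {"id": 3, "amount": 40.0},
]

flagged = list(flag_large(transaction_source(sample), limit=5000.0))
suspicious_ids = [tx["id"] for tx in flagged if tx["suspicious"]]
print(suspicious_ids)  # [2]
```

Each record is handled the moment it is pulled from the source, so a result is available without waiting for the whole data set, unlike a batch job.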

Spark Streaming

It works on streaming data and performs stream processing on the stream
It deals with real data in real time
It is a better alternative to Hadoop MapReduce when manipulating data streams
It is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
Data can be ingested from many sources such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join, and window.
Processed data can be pushed out to filesystems, databases, and live dashboards.
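
To make the map, reduce, and window operations more concrete, here is a minimal plain-Python sketch of a sliding-window word count over small batches of lines. It deliberately does not use the Spark API; `windowed_word_counts` is a hypothetical helper, and the batches stand in for data arriving from a source such as a TCP socket.

```python
from collections import Counter, deque
from typing import Deque, Iterable, Iterator, List

def windowed_word_counts(
    batches: Iterable[List[str]], window_size: int
) -> Iterator[Counter]:
    """For each incoming batch, yield word counts over the last `window_size` batches."""
    window: Deque[Counter] = deque(maxlen=window_size)
    for lines in batches:
        # "map": split every line into words; "reduce": count each word.
        batch_counts = Counter(
            word for line in lines for word in line.lower().split()
        )
        # Appending to a full deque drops the oldest batch automatically.
        window.append(batch_counts)
        # "window": merge the counts of the batches still inside the window.
        total: Counter = Counter()
        for counts in window:
            total.update(counts)
        yield total

batches = [["spark streaming"], ["spark kafka"], ["kafka flume"]]
results = list(windowed_word_counts(batches, window_size=2))

print(results[1]["spark"])      # 2 (batches 1 and 2 are in the window)
print(results[2]["kafka"])      # 2 (batches 2 and 3 are in the window)
print(results[2]["streaming"])  # 0 (batch 1 has left the window)
```

In a real deployment the output of each window would be pushed to a filesystem, database, or dashboard rather than printed.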

Spark Streaming Module

Streams of data are made up of discrete entities
Streams arrive at an input and need to be processed in real time

Ex: Log messages, tweets, GPS location information (latitude and longitude)

Note:
1) We need to process individual entities or groups of entities
Ex: mood on Twitter
2) Once we have processed the entities, we transform them into the desired resultant form
3) The result might be stored in reliable storage, passed on to another application,
or acted on in a certain way

Output
Trigger an alert, show trending graphs, display a route on the map
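
The three steps in the note, ending in an output action, can be sketched as follows for the "mood on Twitter" example. This is an assumption-laden toy: the word lists, `mood_score`, `process_group`, and the alert threshold are all hypothetical and only illustrate the flow from entities to a result that is acted on.

```python
from typing import Dict, List

# Hypothetical word lists; a real system would use a proper sentiment model.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}

def mood_score(tweet: str) -> int:
    """Step 1: process an individual entity (one tweet)."""
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def process_group(tweets: List[str], alert_below: float) -> Dict:
    """Steps 2 and 3: transform a group into a result and decide how to act."""
    scores = [mood_score(t) for t in tweets]
    average = sum(scores) / len(scores) if scores else 0.0
    return {
        "count": len(tweets),
        "average_mood": average,
        "alert": average < alert_below,  # output action: trigger an alert
    }

group = ["I love this phone", "awful battery", "hate the update", "bad bad service"]
result = process_group(group, alert_below=0.0)
print(result)  # {'count': 4, 'average_mood': -0.75, 'alert': True}
```

The returned dictionary is the "desired resultant form"; it could equally be written to storage or handed to another application instead of raising an alert.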
