Wednesday, June 13, 2018

A simple demo of Spark Streaming

Log in to your shell and start pyspark:
[cloudera@quickstart ~]$ pyspark
Python 2.6.6 (r266:84292, Jul 23 2015, 15:22:56) 


Run netcat in a second terminal; it acts as the data source that Spark Streaming will read from:
[cloudera@quickstart ~]$ nc -l localhost 2182

Type or paste the following code snippet into the pyspark shell:
from pyspark.streaming import StreamingContext

# Create a StreamingContext with a batch interval of 2 seconds,
# reusing the SparkContext `sc` that the pyspark shell provides
ssc = StreamingContext(sc, 2)

# Create a DStream that reads text lines from the netcat socket
lines = ssc.socketTextStream("localhost", 2182)

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten counts of each batch to the console
wordCounts.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate
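The three transformations above amount to a per-batch word count. The same logic can be sketched in plain Python (the `word_counts` helper is hypothetical, purely for illustration; Spark runs the equivalent work distributed across the cluster):

```python
from collections import defaultdict

def word_counts(batch):
    """What each 2-second micro-batch of lines goes through."""
    # flatMap: split every line into words, producing one flat list
    words = [w for line in batch for w in line.split(" ")]
    # map + reduceByKey: pair each word with 1, then sum the 1s per key
    counts = defaultdict(int)
    for w in words:
        counts[w] += 1
    return dict(counts)

# One micro-batch containing two lines:
print(word_counts(["hello world", "hello hello hello"]))
# {'hello': 4, 'world': 1}
```

Note that the counting restarts with every batch; nothing carries over from one 2-second window to the next unless you add stateful operations.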

You should see lots of INFO log lines interspersed with a timestamp header for each batch, printed every 2 seconds:
-------------------------------------------
Time: 2018-06-13 15:24:24
-------------------------------------------

When we type the following lines in the netcat window, the counts appear in the Spark Streaming job:
[cloudera@quickstart ~]$  nc -l localhost 2182
hello world
hello hello hello
world world world

Output
18/06/13 15:34:06 WARN storage.BlockManager: Block input-0-1528929246600 replicated to only 0 peer(s) instead of 1 peers
-------------------------------------------
Time: 2018-06-13 15:34:08
-------------------------------------------
(u'hello', 3)

(Counts are per batch: only the words that arrive within the same 2-second window are summed together, which is why this batch shows just the three hellos typed on one line.)
