Showing posts with label Apache Spark.

Tuesday, January 21, 2020

Spark Executors and Drivers




SPARK CONSISTS OF A DRIVER PROGRAM
  • Our code, which defines the data sources to consume from and the destinations to write to, and which drives all of our execution
  • Responsible for creating the SparkContext (see the sketch below)
  • Knows how to communicate with the cluster manager (YARN, Mesos, or standalone)
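
A minimal sketch of such a driver program: it creates the SparkSession (and with it the SparkContext) and defines a source and a sink. The application name, master URL, and file paths here are made up for illustration.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("driver-example")
         .master("local[*]")   # cluster manager URL; local[*] runs everything in one JVM
         .getOrCreate())

# the driver defines what to read and where to write
df = spark.read.csv("./spark-warehouse/LOADS.csv", header=True)
df.write.csv("./spark-warehouse/LOADS_copy.csv", header=True)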

Installing Spark on Windows




 https://www.youtube.com/watch?v=2CvtwKTjI4Q&vl=en

1) Download a specific version of Spark
 http://spark.apache.org/downloads.html
2) Unzip it and create a directory for Spark

DStream vs RDD


Monday, February 4, 2019

Twitter location clustering based on tweets (Spark MLlib)



1)  Create a directory for twitter streams
 cd /usr/lib/spark 
 sudo mkdir tweets 
 cd tweets
 sudo mkdir data 
 sudo mkdir training
 sudo chmod  777 /usr/lib/spark/tweets/ 

These are the two folders we will be using in this project:
data : contains the master copy of the CSV files, which we will pretend are arriving from a streaming source.
training : the source used to train our machine learning algorithm
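
As a rough illustration of how these folders might be wired together, the sketch below trains a streaming K-means model (Spark MLlib) on files dropped into the training folder and clusters points arriving in the data folder. The CSV layout (latitude and longitude as the last two fields), the number of clusters, and the batch interval are assumptions, not part of the original post.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.clustering import StreamingKMeans

sc = SparkContext("local[2]", "TweetLocationClustering")
ssc = StreamingContext(sc, 10)   # 10-second batches

def parse_point(line):
    # assumption: each CSV line ends with "lat,lon" as its last two fields
    fields = line.split(",")
    return Vectors.dense([float(fields[-2]), float(fields[-1])])

training_stream = ssc.textFileStream("/usr/lib/spark/tweets/training").map(parse_point)
data_stream = ssc.textFileStream("/usr/lib/spark/tweets/data").map(parse_point)

model = StreamingKMeans(k=4, decayFactor=1.0).setRandomCenters(2, 0.0, 42)
model.trainOn(training_stream)        # keep updating centers from training files
model.predictOn(data_stream).pprint() # print cluster assignments for new points

ssc.start()
ssc.awaitTermination()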

Tuesday, July 10, 2018

DSL in Spark


DSL 
  • Stands for Domain Specific Language
  • A language designed for a specific purpose
  • DataFrames are schema aware
  • They expose a rich domain specific language
  • Used for structured data manipulation
  • In a SQL-like way (think in SQL), as in the example below
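
A small illustration of the DSL (the column names and data are made up): the same query you would write in SQL, expressed as DataFrame method calls.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dsl-example").getOrCreate()

df = spark.createDataFrame(
    [("alice", "US", 34), ("bob", "UK", 28), ("carol", "US", 45)],
    ["name", "country", "age"])

# equivalent to: SELECT country, avg(age) FROM df WHERE age > 30 GROUP BY country
(df.filter(F.col("age") > 30)
   .groupBy("country")
   .agg(F.avg("age").alias("avg_age"))
   .show())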

Dynamically create DataFrames



We can dynamically build a string of rows and then generate a DataFrame from it.
However, the whole string would be treated as a single line and would throw an error.
We need to split the lines based on a delimiter; this can be done by writing a split function, as shown below.

CREATE DATAFRAME
from pyspark.sql.functions import lit

# baseline_row is assumed to be an existing DataFrame with columns id, h1 .. h24,
# and spark_session an existing SparkSession
data_string = ""
for rw in baseline_row.collect():
    for i in range(24):
        hour = "h" + str(i + 1)
        hour_value = str(rw[hour])
        data = str(rw.id) + "," + hour + "," + hour_value + "\n"
        data_string = data_string + data

# dynamically generated data for hours
print(data_string)

# split each generated line on the delimiter before building the DataFrame
def split_the_line(line):
    return line.split(",")

rdds = spark_session.sparkContext.parallelize(data_string.strip().split("\n"))
rdds.map(split_the_line).toDF(["id", "hour", "value"]).show()

Deployment mode in Spark


DEPLOYMENT MODE 
We can specify client or cluster mode

Cluster mode :
  • The driver runs on the cluster even if the application is launched from outside. 
  • The driver process is not killed if the machine that submitted the application goes down
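
In practice the deploy mode is usually chosen when the application is submitted (for example via spark-submit's --deploy-mode option). The sketch below shows the equivalent configuration property; the application name and master URL are made up, and this is only an illustration of the setting, not a full submission workflow.

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("deploy-mode-example")
        .setMaster("yarn")
        .set("spark.submit.deployMode", "cluster"))  # driver is launched on the cluster

sc = SparkContext(conf=conf)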

Examples of compression and file formats in Spark

PARQUET
  • Design is based on Google's Dremel paper
  • Schema is stored in the file footer
  • Column-major format with stripes
  • Simpler type model with logical types
  • All data is pushed to the leaves of the tree (see the example below)
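
A minimal sketch (with made-up paths) of writing and reading Parquet with an explicit compression codec in PySpark.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-example").getOrCreate()

df = spark.read.csv("./spark-warehouse/LOADS.csv", header=True)

# snappy is Parquet's default codec; gzip trades speed for a smaller file
df.write.parquet("./spark-warehouse/LOADS_parquet", compression="snappy")

# the schema comes back from the Parquet footer, no header option needed
parquet_df = spark.read.parquet("./spark-warehouse/LOADS_parquet")
parquet_df.printSchema()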

Monday, July 9, 2018

Spark samples (Spark SQL, window functions, persist)


WRITE AS CSV
df_sample.write.csv("./spark-warehouse/SAMPLE.csv")

WRITE AS CSV WITH HEADER
df_sample.write.csv("./spark-warehouse/SAMPLE_5.csv",header=True) 


DISPLAY ALL COLUMNS 
#Load csv as dataframe
data = spark.read.csv("./spark-warehouse/LOADS.csv", header=True)

#Register temp view
data.createOrReplaceTempView("vw_data")

#load data based on the select query
load = spark.sql("Select * from vw_data limit 5")
load.show()
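
The post title also mentions window functions and persist, so here is a small sketch of both, continuing from the data DataFrame loaded above. The columns used for partitioning and ordering (group_col, load_ts) are made up.

from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark import StorageLevel

# rank rows within each group, newest first
w = Window.partitionBy("group_col").orderBy(F.col("load_ts").desc())
ranked = data.withColumn("rn", F.row_number().over(w))

# keep the ranked DataFrame in memory (spilling to disk if needed) so the
# actions below reuse it instead of recomputing the whole plan
ranked.persist(StorageLevel.MEMORY_AND_DISK)
ranked.filter(F.col("rn") == 1).show()
ranked.count()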


Sunday, June 24, 2018

Simple transformations in Spark



MAP :
  • map is a transformation operation in Spark, hence it is lazily evaluated
  • It is a narrow operation, as it does not shuffle data from one partition to multiple partitions
Note: In this example, map takes each value from the list and pairs it with 1
scala> val x=sc.parallelize(List("spark","rdd","example","sample","example"),3)
x: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[4]  
at parallelize at <console>:27

scala> val y=x.map(x=>(x,1))
y: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[5] 
at map at <console>:29

scala> y.collect
res0: Array[(String, Int)] = Array((spark,1), (rdd,1), (example,1), 
(sample,1), (example,1))



Eclipse Marketplace for Scala


We can find the Scala IDE for Eclipse in the Eclipse Marketplace.

STEPS INVOLVED
  • In the Eclipse IDE, go to the Help tab and click on Eclipse Marketplace


Install SBT using yum


sbt is an open-source build tool for Scala and Java projects, similar to Java's Maven and Ant.

Its main features are:

  • Native support for compiling Scala code and integrating with many Scala test frameworks
  • Continuous compilation, testing, and deployment
  • Incremental testing and compilation (only changed sources are re-compiled, only affected tests are re-run etc.)
  • Build descriptions written in Scala using a DSL


Tuesday, June 19, 2018

Accumulators and Broadcast Variables in Spark


SPARK PROCESSING
  • Distributed and parallel processing
  • Each executor has its own separate copies of variables and functions
  • Changes are not propagated back to the driver (except in certain necessary cases); see the sketch below
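
A minimal sketch of the two shared-variable types this motivates: a broadcast variable (a read-only copy shipped to every executor) and an accumulator (a counter whose updates are sent back to the driver). The lookup table and keys are made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-variables").getOrCreate()
sc = spark.sparkContext

lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})  # read-only on the executors
missing = sc.accumulator(0)                      # executors add, only the driver reads

def translate(key):
    if key not in lookup.value:
        missing.add(1)
        return -1
    return lookup.value[key]

result = sc.parallelize(["a", "b", "x", "c", "y"]).map(translate).collect()
print(result)          # [1, 2, -1, 3, -1]
print(missing.value)   # 2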

Friday, February 9, 2018

Receivers in Spark Streaming


  • A task that collects input data from different sources
  • Spark allocates a receiver for each input source
  • Receivers are special tasks that run on the executors (see the example below)
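
For example, socketTextStream below creates a receiver that runs on an executor and continuously pulls lines from a socket. The host, port, and batch interval are made up.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "receiver-example")  # the receiver itself occupies one core
ssc = StreamingContext(sc, 5)                      # 5-second batches

lines = ssc.socketTextStream("localhost", 9999)    # one receiver for this input source
lines.count().pprint()

ssc.start()
ssc.awaitTermination()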

Spark Architecture


Steps explained

1) We write the program in Scala, Java, or Python and submit the application to the Spark cluster
2) The Spark cluster is made up of multiple systems
3) One of these machines is assigned as the coordinator

Data Representation in RDD


Spark has three data representations


  1. RDD (Resilient Distributed Dataset) 

    • A collection of elements that can be divided across multiple nodes in a cluster for parallel processing. 
    • It is also a fault-tolerant collection of elements, which means it can automatically recover from failures. 
    • It is immutable; we can create an RDD once but cannot change it.


Lineage of RDD


RDD tracking 

Every RDD keeps track of:

  1. Where it came from
  2. All the transformations it took to reach its current state

These steps are called the lineage (DAG) of an RDD.
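
A quick way to see this lineage is toDebugString(), sketched below with a throwaway RDD.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-example").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10))
mapped = rdd.map(lambda x: x * 2).filter(lambda x: x > 5)

# prints the chain of parent RDDs and the transformations Spark tracked
print(mapped.toDebugString().decode("utf-8"))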

Common methods in RDD

Creation of an RDD

RDDs can be created in two ways:
  1. Read a file: individual rows or records become elements in the RDD
  2. Transform another RDD: applying a transformation to an existing RDD produces a new RDD (see the sketch below)
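
A minimal sketch of both routes; the file path is made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-creation").getOrCreate()
sc = spark.sparkContext

# 1) read a file: each line becomes an element of the RDD
lines = sc.textFile("./spark-warehouse/LOADS.csv")

# 2) transform another RDD: map produces a new RDD from the existing one
upper = lines.map(lambda line: line.upper())
print(upper.take(2))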

Apache Spark RDD


RDD (RESILIENT DISTRIBUTED DATASETS)

  • The basic programming abstraction in Spark
  • All operations are performed on in-memory objects
  • A collection of entities
  • It can be assigned to a variable and methods can be invoked on it; methods either return values or apply transformations to the RDD (see below)
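
For instance (values made up), count() returns a value while filter() applies a transformation and returns a new RDD:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-methods").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])
total = numbers.count()                        # returns a value: 5
evens = numbers.filter(lambda n: n % 2 == 0)   # returns a new, transformed RDD
print(total, evens.collect())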

Overview of Spark streaming

How much data does Google deal with?
  • Stores about 15 exabytes (an exabyte is 1,000,000,000,000,000,000 bytes) of data 
  • Processes 100 petabytes of data per day
  • 60 trillion pages are indexed
  • 1 billion Google Search users per month
Note: Fraud detection is an example of real-time processing

Limitations of MapReduce
  • The entire MapReduce job is a batch processing job
  • It does not allow real-time processing of the data
