Showing posts with label Apache Spark.

Tuesday, January 21, 2020

Spark Executors and Drivers




SPARK CONSISTS OF A DRIVER PROGRAM
  • Our code, which defines the data sources to consume from and the destinations to write to, and which drives all of our execution
  • Responsible for creating the SparkContext (see the sketch below)
  • Knows how to communicate with the cluster manager (YARN, Mesos, or standalone)
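
A minimal sketch of such a driver program: it creates the SparkSession (and with it the SparkContext) and defines a source and a sink. The application name, master URL, and file paths here are made up for illustration.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("driver-example")
         .master("local[*]")   # cluster manager URL; local[*] runs everything in one JVM
         .getOrCreate())

# the driver defines what to read and where to write
df = spark.read.csv("./spark-warehouse/LOADS.csv", header=True)
df.write.csv("./spark-warehouse/LOADS_copy.csv", header=True)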

Installing Spark on Windows




 https://www.youtube.com/watch?v=2CvtwKTjI4Q&vl=en

1) Download a specific version of Spark
 http://spark.apache.org/downloads.html
2) Unzip it and create a directory for Spark

DStream vs RDD


Monday, February 4, 2019

Twitter location clustering based on tweets (Spark MLlib)



1)  Create a directory for twitter streams
 cd /usr/lib/spark 
 sudo mkdir tweets 
 cd tweets
 sudo mkdir data 
 sudo mkdir training
 sudo chmod  777 /usr/lib/spark/tweets/ 

These are the two folders we will be using in this project:
data : contains the master copy of the CSV files, which we will pretend are arriving from a streaming source.
training : the source used to train our machine learning algorithm
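
As a rough illustration of how these folders might be wired together, the sketch below trains a streaming K-means model (Spark MLlib) on files dropped into the training folder and clusters points arriving in the data folder. The CSV layout (latitude and longitude as the last two fields), the number of clusters, and the batch interval are assumptions, not part of the original post.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.clustering import StreamingKMeans

sc = SparkContext("local[2]", "TweetLocationClustering")
ssc = StreamingContext(sc, 10)   # 10-second batches

def parse_point(line):
    # assumption: each CSV line ends with "lat,lon" as its last two fields
    fields = line.split(",")
    return Vectors.dense([float(fields[-2]), float(fields[-1])])

training_stream = ssc.textFileStream("/usr/lib/spark/tweets/training").map(parse_point)
data_stream = ssc.textFileStream("/usr/lib/spark/tweets/data").map(parse_point)

model = StreamingKMeans(k=4, decayFactor=1.0).setRandomCenters(2, 0.0, 42)
model.trainOn(training_stream)        # keep updating centers from training files
model.predictOn(data_stream).pprint() # print cluster assignments for new points

ssc.start()
ssc.awaitTermination()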

Tuesday, July 10, 2018

DSL in Spark


DSL 
  • Stands for Domain Specific Language
  • A language designed for a specific purpose
  • DataFrames are schema aware
  • They expose a rich domain specific language
  • Used for structured data manipulation
  • In a SQL-like way (think in SQL), as in the example below
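
A small illustration of the DSL (the column names and data are made up): the same query you would write in SQL, expressed as DataFrame method calls.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dsl-example").getOrCreate()

df = spark.createDataFrame(
    [("alice", "US", 34), ("bob", "UK", 28), ("carol", "US", 45)],
    ["name", "country", "age"])

# equivalent to: SELECT country, avg(age) FROM df WHERE age > 30 GROUP BY country
(df.filter(F.col("age") > 30)
   .groupBy("country")
   .agg(F.avg("age").alias("avg_age"))
   .show())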

Dynamically create DataFrames



We can dynamically build a string of rows and then generate a DataFrame from it.
However, the whole string would be treated as a single line and would throw an error.
We need to split the lines based on a delimiter; this can be done by writing a split function, as shown below.

CREATE DATAFRAME
from pyspark.sql.functions import lit

# baseline_row is assumed to be an existing DataFrame with columns id, h1 .. h24,
# and spark_session an existing SparkSession
data_string = ""
for rw in baseline_row.collect():
    for i in range(24):
        hour = "h" + str(i + 1)
        hour_value = str(rw[hour])
        data = str(rw.id) + "," + hour + "," + hour_value + "\n"
        data_string = data_string + data

# dynamically generated data for hours
print(data_string)

# split each generated line on the delimiter before building the DataFrame
def split_the_line(line):
    return line.split(",")

rdds = spark_session.sparkContext.parallelize(data_string.strip().split("\n"))
rdds.map(split_the_line).toDF(["id", "hour", "value"]).show()

Deployment mode in Spark


DEPLOYMENT MODE 
We can specify client or cluster mode

Cluster mode :
  • The driver runs on the cluster even if the application is launched from outside. 
  • The driver process is not killed if the machine that submitted the application goes down
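
In practice the deploy mode is usually chosen when the application is submitted (for example via spark-submit's --deploy-mode option). The sketch below shows the equivalent configuration property; the application name and master URL are made up, and this is only an illustration of the setting, not a full submission workflow.

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("deploy-mode-example")
        .setMaster("yarn")
        .set("spark.submit.deployMode", "cluster"))  # driver is launched on the cluster

sc = SparkContext(conf=conf)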

Examples of compression and file formats in Spark

PARQUET
  • Design is based on Google's Dremel paper
  • Schema is stored in the file footer
  • Column-major format with stripes
  • Simpler type model with logical types
  • All data is pushed to the leaves of the tree (see the example below)
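
A minimal sketch (with made-up paths) of writing and reading Parquet with an explicit compression codec in PySpark.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-example").getOrCreate()

df = spark.read.csv("./spark-warehouse/LOADS.csv", header=True)

# snappy is Parquet's default codec; gzip trades speed for a smaller file
df.write.parquet("./spark-warehouse/LOADS_parquet", compression="snappy")

# the schema comes back from the Parquet footer, no header option needed
parquet_df = spark.read.parquet("./spark-warehouse/LOADS_parquet")
parquet_df.printSchema()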

Monday, July 9, 2018

Spark samples (Spark SQL, window functions, persist)


WRITE AS CSV
df_sample.write.csv("./spark-warehouse/SAMPLE.csv")

WRITE AS CSV WITH HEADER
df_sample.write.csv("./spark-warehouse/SAMPLE_5.csv",header=True) 


DISPLAY ALL COLUMNS 
#Load csv as dataframe
data = spark.read.csv("./spark-warehouse/LOADS.csv", header=True)

#Register temp view
data.createOrReplaceTempView("vw_data")

#load data based on the select query
load = spark.sql("Select * from vw_data limit 5")
load.show()
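
The post title also mentions window functions and persist, so here is a small sketch of both, continuing from the data DataFrame loaded above. The columns used for partitioning and ordering (group_col, load_ts) are made up.

from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark import StorageLevel

# rank rows within each group, newest first
w = Window.partitionBy("group_col").orderBy(F.col("load_ts").desc())
ranked = data.withColumn("rn", F.row_number().over(w))

# keep the ranked DataFrame in memory (spilling to disk if needed) so the
# actions below reuse it instead of recomputing the whole plan
ranked.persist(StorageLevel.MEMORY_AND_DISK)
ranked.filter(F.col("rn") == 1).show()
ranked.count()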


Sunday, June 24, 2018

Simple transformations in Spark



MAP :
  • map is a transformation operation in Spark, hence it is lazily evaluated
  • It is a narrow operation, as it does not shuffle data from one partition to multiple partitions
Note: In this example, map takes each value from the list and pairs it with 1
scala> val x=sc.parallelize(List("spark","rdd","example","sample","example"),3)
x: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[4]  
at parallelize at <console>:27

scala> val y=x.map(x=>(x,1))
y: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[5] 
at map at <console>:29

scala> y.collect
res0: Array[(String, Int)] = Array((spark,1), (rdd,1), (example,1), 
(sample,1), (example,1))



Eclipse Marketplace for Scala


We can find the Scala IDE for Eclipse in the Eclipse Marketplace.

STEPS INVOLVED
  • In the Eclipse IDE, go to the Help tab and click on Eclipse Marketplace


Install SBT using yum


sbt is an open-source build tool for Scala and Java projects, similar to Java's Maven and Ant.

Its main features are:

  • Native support for compiling Scala code and integrating with many Scala test frameworks
  • Continuous compilation, testing, and deployment
  • Incremental testing and compilation (only changed sources are re-compiled, only affected tests are re-run etc.)
  • Build descriptions written in Scala using a DSL


Tuesday, June 19, 2018

Accumulators and Broadcast Variables in Spark


SPARK PROCESSING
  • Distributed and parallel processing
  • Each executor has its own separate copies of variables and functions
  • Changes are not propagated back to the driver (except in certain necessary cases); see the sketch below
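
A minimal sketch of the two shared-variable types this motivates: a broadcast variable (a read-only copy shipped to every executor) and an accumulator (a counter whose updates are sent back to the driver). The lookup table and keys are made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-variables").getOrCreate()
sc = spark.sparkContext

lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})  # read-only on the executors
missing = sc.accumulator(0)                      # executors add, only the driver reads

def translate(key):
    if key not in lookup.value:
        missing.add(1)
        return -1
    return lookup.value[key]

result = sc.parallelize(["a", "b", "x", "c", "y"]).map(translate).collect()
print(result)          # [1, 2, -1, 3, -1]
print(missing.value)   # 2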

Friday, February 9, 2018

Receivers in Spark Streaming


  • A task that collects input data from different sources
  • Spark allocates a receiver for each input source
  • Receivers are special tasks that run on the executors (see the example below)
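
For example, socketTextStream below creates a receiver that runs on an executor and continuously pulls lines from a socket. The host, port, and batch interval are made up.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "receiver-example")  # the receiver itself occupies one core
ssc = StreamingContext(sc, 5)                      # 5-second batches

lines = ssc.socketTextStream("localhost", 9999)    # one receiver for this input source
lines.count().pprint()

ssc.start()
ssc.awaitTermination()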

Spark Architecture


Steps explained

1) We write the program in Scala, Java, or Python and submit the application to the Spark cluster
2) The Spark cluster is made up of multiple systems
3) One of these machines is assigned as the coordinator

Data Representation in RDD


Spark has three data representations


  1. RDD (Resilient Distributed Dataset) 

    • A collection of elements that can be divided across multiple nodes in a cluster for parallel processing. 
    • It is also a fault-tolerant collection of elements, which means it can automatically recover from failures. 
    • It is immutable; we can create an RDD once but cannot change it.


Lineage of RDD


RDD tracking 

Every RDD keeps track of:

  1. Where it came from
  2. All the transformations it took to reach its current state

These steps are called the lineage (DAG) of an RDD.
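
A quick way to see this lineage is toDebugString(), sketched below with a throwaway RDD.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-example").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10))
mapped = rdd.map(lambda x: x * 2).filter(lambda x: x > 5)

# prints the chain of parent RDDs and the transformations Spark tracked
print(mapped.toDebugString().decode("utf-8"))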

Common methods in RDD

Creation of an RDD

RDDs can be created in two ways:
  1. Read a file: individual rows or records become elements in the RDD
  2. Transform another RDD: applying a transformation to an existing RDD produces a new RDD (see the sketch below)
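
A minimal sketch of both routes; the file path is made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-creation").getOrCreate()
sc = spark.sparkContext

# 1) read a file: each line becomes an element of the RDD
lines = sc.textFile("./spark-warehouse/LOADS.csv")

# 2) transform another RDD: map produces a new RDD from the existing one
upper = lines.map(lambda line: line.upper())
print(upper.take(2))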

Apache Spark RDD


RDD (RESILIENT DISTRIBUTED DATASETS)

  • The basic programming abstraction in Spark
  • All operations are performed on in-memory objects
  • A collection of entities
  • It can be assigned to a variable and methods can be invoked on it; methods either return values or apply transformations to the RDD (see below)
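
For instance (values made up), count() returns a value while filter() applies a transformation and returns a new RDD:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-methods").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])
total = numbers.count()                        # returns a value: 5
evens = numbers.filter(lambda n: n % 2 == 0)   # returns a new, transformed RDD
print(total, evens.collect())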

Overview of Spark streaming

How much data does Google deal with?
  • Stores about 15 exabytes (an exabyte is 1,000,000,000,000,000,000 bytes) of data 
  • Processes 100 petabytes of data per day
  • 60 trillion pages are indexed
  • 1 billion Google Search users per month
Note: Fraud detection is an example of real-time processing

Limitations of MapReduce
  • The entire MapReduce job is a batch processing job
  • It does not allow real-time processing of the data
