Tuesday, June 19, 2018

Accumulators and Broadcast Variables in Spark


SPARK PROCESSING
  • Distributed and parallel processing
  • Each executor works with its own copies of variables and functions
  • Updates made on executors are not propagated back to the driver (except in certain cases, such as accumulators)
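The "separate copies" point can be illustrated without Spark at all. The sketch below is a plain-Python simulation (not Spark code): each simulated "executor" receives a deep copy of a driver-side variable, so its mutations never reach the driver, which is exactly why Spark needs accumulators.

```python
import copy

# Plain-Python sketch of executor-side copies: each simulated "executor"
# mutates its own deep copy of a driver variable, so the driver's
# original value is never updated.
driver_counter = {"count": 0}

def run_task(partition, executor_state):
    # The executor increments its private copy, not the driver's variable.
    for _ in partition:
        executor_state["count"] += 1

partitions = [[1, 2], [3, 4, 5]]
for partition in partitions:
    executor_copy = copy.deepcopy(driver_counter)  # separate copy per "executor"
    run_task(partition, executor_copy)

print(driver_counter["count"])  # still 0: the updates stayed on the copies
```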

ACCUMULATOR
  • Variables that are only added to through an associative and commutative operation, so they can be efficiently supported in parallel
  • Used to implement counters or sums
  • Natively supports numeric types; programmers can add support for new types
  • May not be reliable: a failed or re-executed task can apply its updates more than once, causing duplicate counts
  • Can be named or unnamed; named accumulators are displayed on the Web UI page
# Create an accumulator on the driver with an initial value of 0
accum = spark_one.sparkContext.accumulator(0)
print(accum)

# Tasks can only add to the accumulator; the driver reads the result
def add_accum():
    accum.add(1)
    accum.add(2)

add_accum()
print(accum)

0
3
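Why must the operation be associative and commutative? Because each task computes a partial result and Spark may merge those partials in any order. A plain-Python sketch (no Spark required) of that merge behaviour for addition:

```python
from functools import reduce
from itertools import permutations

# Sketch of why accumulator operations must be associative and commutative:
# per-partition partial sums can then be merged in any order and still
# produce the same total.
partitions = [[1, 2], [3, 4], [5]]
partial_sums = [sum(p) for p in partitions]  # computed independently per task

# Merge the partials in every possible order; addition gives one answer.
totals = {reduce(lambda a, b: a + b, order)
          for order in permutations(partial_sums)}
print(totals)  # {15}: merge order does not matter
```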

BROADCAST VARIABLES
  • Read-only, immutable variables
  • Should fit in the memory of each executor
  • Distributed efficiently to the cluster (shipped once per executor, not once per task)
  • Must not be modified after being shipped
  • Well suited to machine-learning models and lookup tables
  • DataFrames cannot be broadcast this way; broadcast plain Python collections instead
# Ship a small read-only list to every executor once
broadcast_var = spark_one.sparkContext.broadcast([1, 2, 3])
print(broadcast_var)
broadcast_var.value  # access the broadcast data

<pyspark.broadcast.Broadcast object at 0x7f1a312e3208>
[1, 2, 3]
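The lookup-table use case mentioned above can be sketched in plain Python (the dict and record names below are illustrative, not from the original): one small, read-only mapping is shared with every task, which enriches records locally instead of performing a full join. In Spark, `country_names` would be wrapped with `sparkContext.broadcast(...)` and read via `.value` inside the task.

```python
# Plain-Python sketch of the broadcast lookup-table pattern:
# a small read-only dict is shared with every task.
country_names = {"US": "United States", "IN": "India", "DE": "Germany"}

records = [("alice", "US"), ("bob", "IN"), ("carol", "DE")]

def enrich(record, lookup):
    # In Spark, lookup would be broadcast_var.value on the executor.
    user, code = record
    return (user, lookup.get(code, "unknown"))

print([enrich(r, country_names) for r in records])
# [('alice', 'United States'), ('bob', 'India'), ('carol', 'Germany')]
```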


STORAGE LEVELS
  • MEMORY_ONLY
  • MEMORY_AND_DISK
  • DISK_ONLY
  • MEMORY_ONLY_2 (replicated on two nodes)
  • MEMORY_AND_DISK_2 (replicated on two nodes)
  • OFF_HEAP

Serialized variants (Scala/Java; in Python, data is always serialized):
  • MEMORY_ONLY_SER
  • MEMORY_AND_DISK_SER
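Each storage level is just a combination of five flags mirroring Spark's `StorageLevel` constructor: use disk, use memory, use off-heap, keep deserialized, and a replication factor. The table below is a plain-Python sketch of those combinations (following the Scala/Java defaults), not the actual `pyspark.StorageLevel` objects:

```python
from collections import namedtuple

# Sketch: storage levels as flag combinations
# (use_disk, use_memory, use_off_heap, deserialized, replication),
# following the Scala/Java StorageLevel defaults.
Level = namedtuple("Level", "use_disk use_memory use_off_heap deserialized replication")

LEVELS = {
    "MEMORY_ONLY":         Level(False, True,  False, True,  1),
    "MEMORY_AND_DISK":     Level(True,  True,  False, True,  1),
    "DISK_ONLY":           Level(True,  False, False, False, 1),
    "MEMORY_ONLY_2":       Level(False, True,  False, True,  2),
    "MEMORY_AND_DISK_2":   Level(True,  True,  False, True,  2),
    "OFF_HEAP":            Level(True,  True,  True,  False, 1),
    "MEMORY_ONLY_SER":     Level(False, True,  False, False, 1),
    "MEMORY_AND_DISK_SER": Level(True,  True,  False, False, 1),
}

# The "_2" levels replicate each partition on two nodes;
# the "_SER" levels store data serialized to save memory.
print(LEVELS["MEMORY_AND_DISK_2"].replication)  # 2
```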
