Tuesday, June 19, 2018

Accumulators and Broadcast Variables in Spark


SPARK PROCESSING
  • Distributed and parallel processing
  • Each executor works with its own copies of variables and functions
  • Updates made on executors are not propagated back to the driver (except in certain cases, such as accumulators)
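The "separate copies" point can be illustrated without Spark at all. The sketch below is a plain-Python simulation (not Spark code): each simulated "executor" receives a deep copy of a driver-side variable, so its mutations never reach the driver, which is exactly why Spark needs accumulators.

```python
import copy

# Plain-Python sketch of executor-side copies: each simulated "executor"
# mutates its own deep copy of a driver variable, so the driver's
# original value is never updated.
driver_counter = {"count": 0}

def run_task(partition, executor_state):
    # The executor increments its private copy, not the driver's variable.
    for _ in partition:
        executor_state["count"] += 1

partitions = [[1, 2], [3, 4, 5]]
for partition in partitions:
    executor_copy = copy.deepcopy(driver_counter)  # separate copy per "executor"
    run_task(partition, executor_copy)

print(driver_counter["count"])  # still 0: the updates stayed on the copies
```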

ACCUMULATOR
  • Variables that are only added to through an associative and commutative operation, so they can be efficiently supported in parallel
  • Used to implement counters or sums
  • Natively supports numeric types; programmers can add support for new types
  • May not be reliable: a failed or re-executed task can apply its updates more than once, causing duplicate counts
  • Can be named or unnamed; named accumulators are displayed on the Web UI page
# Create an accumulator on the driver with an initial value of 0
accum = spark_one.sparkContext.accumulator(0)
print(accum)

# Tasks can only add to the accumulator; the driver reads the result
def add_accum():
    accum.add(1)
    accum.add(2)

add_accum()
print(accum)

0
3
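Why must the operation be associative and commutative? Because each task computes a partial result and Spark may merge those partials in any order. A plain-Python sketch (no Spark required) of that merge behaviour for addition:

```python
from functools import reduce
from itertools import permutations

# Sketch of why accumulator operations must be associative and commutative:
# per-partition partial sums can then be merged in any order and still
# produce the same total.
partitions = [[1, 2], [3, 4], [5]]
partial_sums = [sum(p) for p in partitions]  # computed independently per task

# Merge the partials in every possible order; addition gives one answer.
totals = {reduce(lambda a, b: a + b, order)
          for order in permutations(partial_sums)}
print(totals)  # {15}: merge order does not matter
```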

BROADCAST VARIABLES
  • Read-only, immutable variables
  • Should fit in the memory of each executor
  • Distributed efficiently to the cluster (shipped once per executor, not once per task)
  • Must not be modified after being shipped
  • Well suited to machine-learning models and lookup tables
  • DataFrames cannot be broadcast this way; broadcast plain Python collections instead
# Ship a small read-only list to every executor once
broadcast_var = spark_one.sparkContext.broadcast([1, 2, 3])
print(broadcast_var)
broadcast_var.value  # access the broadcast data

<pyspark.broadcast.Broadcast object at 0x7f1a312e3208>
[1, 2, 3]
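The lookup-table use case mentioned above can be sketched in plain Python (the dict and record names below are illustrative, not from the original): one small, read-only mapping is shared with every task, which enriches records locally instead of performing a full join. In Spark, `country_names` would be wrapped with `sparkContext.broadcast(...)` and read via `.value` inside the task.

```python
# Plain-Python sketch of the broadcast lookup-table pattern:
# a small read-only dict is shared with every task.
country_names = {"US": "United States", "IN": "India", "DE": "Germany"}

records = [("alice", "US"), ("bob", "IN"), ("carol", "DE")]

def enrich(record, lookup):
    # In Spark, lookup would be broadcast_var.value on the executor.
    user, code = record
    return (user, lookup.get(code, "unknown"))

print([enrich(r, country_names) for r in records])
# [('alice', 'United States'), ('bob', 'India'), ('carol', 'Germany')]
```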


STORAGE LEVELS
  • MEMORY_ONLY
  • MEMORY_AND_DISK
  • DISK_ONLY
  • MEMORY_ONLY_2 (replicated on two nodes)
  • MEMORY_AND_DISK_2 (replicated on two nodes)
  • OFF_HEAP

Serialized variants (Scala/Java; in Python, data is always serialized):
  • MEMORY_ONLY_SER
  • MEMORY_AND_DISK_SER
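Each storage level is just a combination of five flags mirroring Spark's `StorageLevel` constructor: use disk, use memory, use off-heap, keep deserialized, and a replication factor. The table below is a plain-Python sketch of those combinations (following the Scala/Java defaults), not the actual `pyspark.StorageLevel` objects:

```python
from collections import namedtuple

# Sketch: storage levels as flag combinations
# (use_disk, use_memory, use_off_heap, deserialized, replication),
# following the Scala/Java StorageLevel defaults.
Level = namedtuple("Level", "use_disk use_memory use_off_heap deserialized replication")

LEVELS = {
    "MEMORY_ONLY":         Level(False, True,  False, True,  1),
    "MEMORY_AND_DISK":     Level(True,  True,  False, True,  1),
    "DISK_ONLY":           Level(True,  False, False, False, 1),
    "MEMORY_ONLY_2":       Level(False, True,  False, True,  2),
    "MEMORY_AND_DISK_2":   Level(True,  True,  False, True,  2),
    "OFF_HEAP":            Level(True,  True,  True,  False, 1),
    "MEMORY_ONLY_SER":     Level(False, True,  False, False, 1),
    "MEMORY_AND_DISK_SER": Level(True,  True,  False, False, 1),
}

# The "_2" levels replicate each partition on two nodes;
# the "_SER" levels store data serialized to save memory.
print(LEVELS["MEMORY_AND_DISK_2"].replication)  # 2
```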
