SPARK PROCESSING
- Distributed and parallel processing
- Each executor works on its own copy of variables and functions
- Data is not propagated back to the driver (except in certain cases, such as accumulator updates; see the sketch below)
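A minimal sketch of this behaviour, assuming the same spark_one session used in the examples below (the counter and the increment helper are made up for illustration): a plain Python variable updated inside a task never changes on the driver.

counter = 0

def increment(x):
    # Runs on the executors, so only their local copy of counter changes
    global counter
    counter += 1

spark_one.sparkContext.parallelize(range(10)).foreach(increment)
print(counter)   # still 0 on the driver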
ACCUMULATORS
- Variables that are only added to, through an associative and commutative operation, so updates can be applied in parallel
- Used to implement counters or sums
- Natively support numeric types; programmers can add support for new types
- Updates may not be reliable: failed or re-executed tasks can apply them more than once
- This can lead to duplicate counts, particularly for updates made inside transformations rather than actions
- Accumulators can be named or unnamed; named accumulators are displayed in the Spark Web UI
accum = spark_one.sparkContext.accumulator(0)
print(accum.value)      # 0

def add_accum():
    accum.add(1)
    accum.add(2)

add_accum()
print(accum.value)      # 3
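The example above only updates the accumulator on the driver. A minimal sketch of the more typical distributed use, counting blank lines inside an action (the sample data and the count_blanks helper are made up for illustration):

blank_lines = spark_one.sparkContext.accumulator(0)

def count_blanks(line):
    # Runs on the executors; Spark merges the updates back on the driver
    if line.strip() == "":
        blank_lines.add(1)

lines = spark_one.sparkContext.parallelize(["a", "", "b", ""])
lines.foreach(count_blanks)      # foreach is an action, so each task's update is applied once
print(blank_lines.value)         # 2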
BROADCAST VARIABLES
- Read-only variables
- Immutable
- Fits in memory
- Distributed efficiently to the cluster
- Should not be modified after they have been shipped to the executors
- Preferred for lookup tables and for shipping machine learning models or feature data
- DataFrames cannot be broadcast directly as broadcast variables; collect the data first (e.g. into a list or dictionary)
broadcast_var = spark_one.sparkContext.broadcast([1, 2, 3])
print(broadcast_var)    # <pyspark.broadcast.Broadcast object at 0x7f1a312e3208>
broadcast_var.value     # [1, 2, 3]
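A minimal sketch of the lookup-table use case, using the same spark_one session (the country codes and names are made up for illustration):

country_names = spark_one.sparkContext.broadcast({"US": "United States", "IN": "India"})

codes = spark_one.sparkContext.parallelize(["US", "IN", "US"])
# Executors read the single broadcast copy instead of shipping the table with every task
full_names = codes.map(lambda c: country_names.value.get(c, "Unknown"))
print(full_names.collect())      # ['United States', 'India', 'United States']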
STORAGE LEVELS
- MEMORY_ONLY
- MEMORY_AND_DISK
- DISK_ONLY
- MEMORY_ONLY_2
- MEMORY_AND_DISK_2
- OFF_HEAP
- MEMORY_ONLY_SER
- MEMORY_AND_DISK_SER
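A minimal sketch of choosing one of these levels with persist(), again assuming the spark_one session:

from pyspark import StorageLevel

rdd = spark_one.sparkContext.parallelize(range(1000))
rdd.persist(StorageLevel.MEMORY_AND_DISK)   # keep in memory, spill partitions to disk if needed
print(rdd.count())                          # 1000; the first action materializes and caches the RDD
rdd.unpersist()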