SPARK PROCESSING
- Distributed and parallel processing
- Each executor works on its own copies of the variables and functions shipped with a task
- Updates to those copies are not propagated back to the driver (except in certain necessary cases)
ACCUMULATORS
- Variables that are only added to through an associative and commutative operation, so they can be supported efficiently in parallel
- Used to implement counters or sums
- Natively support numeric types; programmers can add support for new types
- May not be reliable when updated inside transformations: a failed task can be re-executed
- A re-executed task can apply its update again, producing duplicate counts
- Can be named or unnamed; named accumulators are displayed in the Web UI
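The points above can be modelled without a cluster. The following is only a plain-Python sketch, not Spark itself: `run_task`, `driver_state` and `partitions` are hypothetical names, and `pickle` stands in for the way task state is serialized and shipped. It shows that a plain variable on the driver stays unchanged, that an additive merge gives the same total in any order, and how a re-executed task can double its contribution.

```python
import pickle
from functools import reduce
from operator import add


def run_task(shipped_state, values):
    """Simulate one task: unpickle a private copy, update it, return the partial sum."""
    local = pickle.loads(shipped_state)   # executor-side copy of the driver's variable
    for v in values:
        local["counter"] += v             # this change is invisible to the driver
    return local["counter"]


driver_state = {"counter": 0}
shipped = pickle.dumps(driver_state)      # what travels with each task
partitions = [[1, 2, 3], [4, 5], [6]]

# One partial result per task.
partials = [run_task(shipped, p) for p in partitions]

# The plain variable on the driver is untouched.
print(driver_state["counter"])            # 0

# Accumulator-style merge: addition is associative and commutative,
# so the order in which task results arrive does not matter.
total = reduce(add, partials, 0)
total_reversed = reduce(add, reversed(partials), 0)
print(total, total_reversed)              # 21 21

# If the task for the second partition is re-executed and both updates
# are merged, its contribution is counted twice.
total_with_retry = reduce(add, partials + [partials[1]], 0)
print(total_with_retry)                   # 30
```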
    # spark_one is assumed to be an existing SparkSession
    accum = spark_one.sparkContext.accumulator(0)
    print(accum)

    def add_accum():
        accum.add(1)
        accum.add(2)

    add_accum()
    print(accum)

    Output:
    0
    3
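The example above updates the accumulator on the driver only. A more typical use is to update it from inside tasks. The sketch below assumes an existing SparkSession is passed in (for instance `spark_one`); `is_blank` and `count_blank_lines` are hypothetical helper names. The update happens inside `foreach`, which is an action; Spark applies each task's accumulator update exactly once for actions, so this avoids the duplicate-count problem.

```python
def is_blank(line):
    """Return True for lines that are empty or contain only whitespace."""
    return not line.strip()


def count_blank_lines(spark, lines):
    """Count blank lines with an accumulator that is updated inside tasks."""
    sc = spark.sparkContext
    blank_lines = sc.accumulator(0)

    def check(line):
        if is_blank(line):
            blank_lines.add(1)   # executor-side update, merged on the driver

    # foreach is an action, so the updates are applied when this line runs.
    sc.parallelize(lines).foreach(check)

    # Only the driver can read the accumulated value.
    return blank_lines.value
```

With a running session, `count_blank_lines(spark_one, ["a", "", "  ", "b"])` returns 2.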
BROADCAST VARIABLES
- Read-only variables
- Immutable
- Must fit in memory
- Distributed efficiently to the cluster
- Must not be modified after being shipped
- Preferred for lookup tables and machine learning workloads
- DataFrames cannot be placed in the broadcast cache directly; collect the data to a local object first
    # spark_one is assumed to be an existing SparkSession
    broadcast_var = spark_one.sparkContext.broadcast([1, 2, 3])
    print(broadcast_var)
    print(broadcast_var.value)

    Output:
    <pyspark.broadcast.Broadcast object at 0x7f1a312e3208>
    [1, 2, 3]
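A lookup table is the usual reason to broadcast. The following is a minimal sketch, assuming an existing SparkSession is passed in; `expand_codes`, the country-code data and the `"unknown"` default are made up for illustration. The dictionary is broadcast once and each task reads it through `.value` without modifying it.

```python
def expand_codes(spark, codes, lookup):
    """Replace each code with its full name using a broadcast lookup table."""
    sc = spark.sparkContext

    # Cached read-only on each executor instead of being shipped with every task.
    lookup_bc = sc.broadcast(lookup)

    def expand(code):
        # Read-only access on the executor; the table is never modified.
        return lookup_bc.value.get(code, "unknown")

    return sc.parallelize(codes).map(expand).collect()
```

For example, `expand_codes(spark_one, ["IN", "DE", "XX"], {"IN": "India", "DE": "Germany"})` yields `["India", "Germany", "unknown"]`.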
STORAGE LEVELS
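- MEMORY_ONLY: keep partitions in memory; partitions that do not fit are recomputed when needed
- MEMORY_AND_DISK: keep partitions in memory and spill those that do not fit to disk
- DISK_ONLY: keep partitions on disk only
- MEMORY_ONLY_2: same as MEMORY_ONLY, with each partition replicated on two nodes
- MEMORY_AND_DISK_2: same as MEMORY_AND_DISK, with each partition replicated on two nodes
- OFF_HEAP: store serialized data in off-heap memory
- MEMORY_ONLY_SER: keep partitions in memory as serialized objects (Java and Scala)
- MEMORY_AND_DISK_SER: same as MEMORY_ONLY_SER, but spill partitions that do not fit to disk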
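A storage level is chosen when a dataset is persisted. The sketch below assumes `df` is an existing DataFrame and `level` is one of the `pyspark.StorageLevel` constants; `count_twice` is a hypothetical helper name.

```python
def count_twice(df, level):
    """Persist df at the given storage level, run two actions, then release it."""
    df.persist(level)       # lazy: nothing is stored yet
    first = df.count()      # first action computes the partitions and stores them
    second = df.count()     # second action can read the persisted partitions
    df.unpersist()          # free the storage once it is no longer needed
    return first, second
```

It could be called as `count_twice(df, StorageLevel.MEMORY_AND_DISK)` after `from pyspark import StorageLevel`.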