Creation of an RDD
RDDs can be created in 2 ways:
- Read a file: individual rows or records become elements of the RDD
- Transform another RDD: applying a transformation to an existing RDD yields a new one (see the sketch below)
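A minimal sketch of both paths, assuming an existing SparkContext sc and a hypothetical input path:

val lines = sc.textFile("hdfs:///data/input.txt") // hypothetical path; each line of the file becomes one element
val lengths = lines.map(_.length)                 // transforming lines yields a new RDD of line lengths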
Methods in RDD
RDD methods are of 2 types:
Transformation:
A processing step applied to every element of an RDD, transforming one RDD into another
Action:
Reads data out of an RDD and returns it to the driver program (e.g. take)
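A minimal sketch of the distinction, assuming a SparkContext sc: transformations are lazy and only describe the work; an action actually triggers it.

val nums = sc.parallelize(1 to 100)
val squared = nums.map(n => n * n) // transformation: lazily recorded, nothing computed yet
squared.take(3)                    // action: triggers computation, returns Array(1, 4, 9)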
Basic transformation methods
parallelize
Creates an RDD in memory from a local collection.
More generally, the data for an RDD can be provided:
- From an external source: a file on HDFS, MapR-FS, or an object store like S3
- From another RDD
val x = sc.parallelize(1 to 10, 2) // distributes the range 1..10 across 2 partitions
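A quick sanity check on the partitioning (a sketch; getNumPartitions and glom are standard RDD methods):

x.getNumPartitions // Int = 2
x.glom().collect   // glom gathers each partition into an array: Array(Array(1, 2, 3, 4, 5), Array(6, 7, 8, 9, 10))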
filter
Return a new dataset formed by selecting those elements of the source on which func returns true.
val x = sc.parallelize(1 to 10, 2)
val y = x.filter(_ % 2 == 0)
y.collect
Array[Int] = Array(2, 4, 6, 8, 10)
map
Return a new distributed dataset formed by passing each element of the source through a function func.
val x = sc.parallelize(List("spark", "rdd", "example", "sample", "example"), 3)
val y = x.map(x => (x, 1))
y.collect
Array[(String, Int)] = Array((spark,1), (rdd,1), (example,1), (sample,1), (example,1))

flatMap
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
val x = sc.parallelize(List("spark", "rdd", "example", "sample", "example"), 3)
val wordcount = x.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordcount.collect
Array[(String, Int)] = Array((example,2), (spark,1), (rdd,1), (sample,1))

groupBy
Takes a function as its parameter and groups the elements by the key that the function returns, collecting the values for each key into an Iterable.
val x = sc.parallelize(List("spark rdd example", "sample example"), 2)
val y = x.groupBy(_.charAt(3))
y.collect
Array[(Char, Iterable[String])] = Array((p,CompactBuffer(sample example)), (r,CompactBuffer(spark rdd example)))
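As a follow-on sketch, the grouped Iterables can be aggregated further, for example counting the strings in each group with mapValues (a standard pair-RDD method; output order may vary):

val counts = y.mapValues(_.size) // size of each group
counts.collect                   // Array[(Char, Int)] = Array((p,1), (r,1))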
Basic Action methods
collect
Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
val x = sc.parallelize(1 to 10, 2)
val y = x.filter(_ % 2 == 0)
y.collect
Array[Int] = Array(2, 4, 6, 8, 10)
take
Takes the first num elements of the RDD.
scala> val names2 = sc.parallelize(List("apple", "beatty", "beatrice"))
names2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1478] at parallelize at <console>:12

scala> names2.take(2)
res786: Array[String] = Array(apple, beatty)