Friday, February 9, 2018

Common methods in RDD

Creation of an RDD

RDDs can be created in 2 ways
  1. Read a file: individual rows or records become elements in the RDD
  2. Transform another RDD: applying a transformation to an existing RDD produces a new one

Methods in RDD

RDDs support 2 types of operations

Transformation: a processing step that can be applied to every element in an RDD to transform one RDD into another

Action: reads data back from an RDD to the driver program (e.g. take, collect)
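Transformations are lazy: Spark only records them, and nothing is computed until an action runs. A rough plain-Scala analogy (no SparkContext needed, using a lazy view in place of an RDD) looks like this:

```scala
// A lazy view plays the role of a transformation: nothing is computed here.
val evens = (1 to 10).view.filter(_ % 2 == 0)
// Forcing the view plays the role of an action such as collect.
val result = evens.toList
// result: List(2, 4, 6, 8, 10)
```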

Basic transformation methods

parallelize :
Creates an RDD in memory from data. The data can come
  • from the driver program (a local collection, as below)
  • from external data: HDFS, MapR-FS, or an object store like S3
  • from another RDD
val x = sc.parallelize(1 to 10, 2)
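The second argument to parallelize is the number of partitions. As a plain-Scala sketch (not Spark itself), splitting 1 to 10 across 2 partitions is roughly like slicing the collection in half:

```scala
// Plain Scala sketch (no Spark needed): parallelize(1 to 10, 2) distributes
// the data over 2 partitions, roughly like grouping the range into two slices.
val partitions = (1 to 10).grouped(5).toList
// partitions(0) holds 1..5, partitions(1) holds 6..10
```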

filter :
Return a new dataset formed by selecting those elements of the source on which func returns true.

val x = sc.parallelize(1 to 10, 2)
val y = x.filter(_ % 2 == 0)
y.collect
Array[Int] = Array(2, 4, 6, 8, 10)

map :
Return a new distributed dataset formed by passing each element of the source through a function func
val x = sc.parallelize(List("spark", "rdd", "example", "sample", "example"), 3)
val y = => (x, 1))
y.collect

Array[(String, Int)] = Array((spark,1), (rdd,1), (example,1), (sample,1), (example,1))
flatMap :
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)
val x = sc.parallelize(List("spark", "rdd", "example", "sample", "example"), 3)
val wordcount = x.flatMap(line => line.split(" ")).map(word => (word, 1))
                 .reduceByKey(_ + _)
wordcount.collect
Array[(String, Int)] = Array((example,2), (spark,1), (rdd,1), (sample,1))
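What reduceByKey(_ + _) does can be mirrored on a plain Scala collection (no Spark needed): it is like grouping the (word, 1) pairs by word and summing the counts within each group.

```scala
// Plain Scala sketch of the word-count pipeline above: flatMap splits lines
// into words, map pairs each word with 1, and the groupBy + sum step plays
// the role of reduceByKey(_ + _).
val words = List("spark", "rdd", "example", "sample", "example")
val pairs = words.flatMap(_.split(" ")).map(w => (w, 1))
val counts = pairs.groupBy(_._1).map { case (w, ps) => (w, }
// counts: Map(example -> 2, spark -> 1, rdd -> 1, sample -> 1)
```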
groupBy :
Method takes a key-generating function as its parameter and uses it to group elements by
key and values into a Map collection.

val x = sc.parallelize(List("spark rdd example", "sample example"), 2)
val y = x.groupBy(_.charAt(3))
y.collect

Array[(Char, Iterable[String])] = Array((p,CompactBuffer(sample example)),
 (r,CompactBuffer(spark rdd example)))
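The same grouping can be seen on a plain Scala collection (no Spark needed): groupBy keys each element by the result of the supplied function, here the character at index 3.

```scala
// Plain Scala sketch: "spark rdd example".charAt(3) is 'r' and
// "sample example".charAt(3) is 'p', so the strings land in two groups.
val xs = List("spark rdd example", "sample example")
val grouped = xs.groupBy(_.charAt(3))
// grouped: Map(r -> List(spark rdd example), p -> List(sample example))
```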

Basic Action methods

collect :
Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
val x = sc.parallelize(1 to 10, 2)
val y = x.filter(_ % 2 == 0)
y.collect
Array[Int] = Array(2, 4, 6, 8, 10)

take :
Return the first num elements of the RDD.
scala> val names2 = sc.parallelize(List("apple", "beatty", "beatrice"))
names2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1478] 
at parallelize at <console>:12
scala> names2.take(2)
res786: Array[String] = Array(apple, beatty)
