Friday, February 9, 2018

Common methods in RDD

Creation of an RDD

RDDs can be created in 2 ways:
  1. Read a file: each row or record of the file becomes an element in the RDD
  2. Transform another RDD: applying a transformation to an existing RDD produces a new one (see the sketch below)
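
A minimal sketch of both creation paths, assuming a running SparkContext named sc and a placeholder input path:

// 1. Read a file ("hdfs:///path/to/input.txt" is a hypothetical path)
val lines = sc.textFile("hdfs:///path/to/input.txt")
// 2. Transform another RDD: filter produces a new RDD from lines
val nonEmpty = lines.filter(_.nonEmpty)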


Methods in RDD

RDDs support 2 types of methods:

Transformation:
An operation that transforms one RDD into another, typically by applying a function to each element. Transformations are lazy: they are only computed when an action requires the result.

Action:
An operation that returns data from an RDD to the driver program (e.g., take, collect) and triggers execution of the transformations that lead up to it.
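
A minimal sketch of the distinction: the map below only records a transformation, and nothing is computed until the collect action runs.

val nums = sc.parallelize(1 to 5)
val doubled = nums.map(_ * 2)   // transformation: lazy, builds lineage only
doubled.collect                 // action: triggers the actual computation
Array[Int] = Array(2, 4, 6, 8, 10)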


Basic transformation methods

parallelize
Creates an RDD in memory from a local collection in the driver. (Strictly speaking, parallelize is a method on SparkContext rather than a transformation.) More generally, the data for an RDD can come from:
  • A local collection in the driver, via parallelize
  • External data: from HDFS, MapR-FS, or an object store like S3
  • Another RDD, via a transformation

val x = sc.parallelize(1 to 10, 2)   // second argument: number of partitions
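
As a follow-up sketch, glom gathers the elements of each partition into an array, which makes the effect of the second argument (the number of partitions) visible:

x.glom().collect
Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5), Array(6, 7, 8, 9, 10))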


filter
Return a new dataset formed by selecting those elements of the source on which func returns true.

val x = sc.parallelize(1 to 10, 2)
val y = x.filter(_ % 2 == 0)
y.collect
Array[Int] = Array(2, 4, 6, 8, 10)

map
Return a new distributed dataset formed by passing each element of the source through a function func.
val x = sc.parallelize(List("spark", "rdd", "example", "sample", "example"), 3)
val y = x.map(x => (x, 1))
y.collect

Array[(String, Int)] = Array((spark,1), (rdd,1), (example,1), (sample,1), (example,1))

flatMap
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)
val x = sc.parallelize(List("spark", "rdd", "example", "sample", "example"), 3)
val wordcount = x.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
wordcount.collect
Array[(String, Int)] = Array((example,2), (spark,1), (rdd,1), (sample,1))
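
The elements above contain no spaces, so the split does not actually flatten anything here. A small sketch that makes the difference from map visible, using multi-word strings:

val lines = sc.parallelize(List("spark rdd", "example"))
lines.map(_.split(" ")).collect      // map: one array per input element
Array[Array[String]] = Array(Array(spark, rdd), Array(example))
lines.flatMap(_.split(" ")).collect  // flatMap: the arrays are flattened
Array[String] = Array(spark, rdd, example)
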
groupBy
Takes a function that computes a key for each element and returns an RDD of (key, Iterable) pairs, grouping together all elements that produce the same key.


val x = sc.parallelize(List("spark rdd example", "sample example"), 2)
val y = x.groupBy(_.charAt(3))   // group by the character at index 3 of each string
y.collect

Array[(Char, Iterable[String])] = Array((p,CompactBuffer(sample example)),
 (r,CompactBuffer(spark rdd example)))
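As a follow-up, the grouped values can be processed further; for example, a small sketch counting the strings per key with mapValues:

y.mapValues(_.size).collect
Array[(Char, Int)] = Array((p,1), (r,1))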

Basic action methods

collect 
Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
val x = sc.parallelize(1 to 10, 2)
val y = x.filter(_ % 2 == 0)
y.collect
Array[Int] = Array(2, 4, 6, 8, 10)

take 
Returns the first num elements of the RDD to the driver.
scala> val names2 = sc.parallelize(List("apple", "beatty", "beatrice"))
names2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1478] at parallelize at <console>:12

scala> names2.take(2)
res786: Array[String] = Array(apple, beatty)
