Web Snippets: Common methods in RDD

Creation of an RDD

RDD's can be created in 2 ways

Read a file: Individual rows or records become an element in the RDD

Transform another RDD:

Methods in RDD

RDD's are created using 2 types of RDD

Transformation:

Processing step that can be applied to every element in an RDD to transform one RDD to another

Action:

Read data from an RDD (Ex take)

Basic transformation methods

parallelize

Creates an RDD in memory from data

The data can be provided

By external source
Can be created from external data : From hdfs, maprfs or object store like s3
Can be created from other rdd

val x = sc.parallelize(1 to 10, 2)

filter
Return a new dataset formed by selecting those elements of the source on which func returns true.

val x = sc.parallelize(1 to 10, 2)
val y = x.filter(_ % 2 == 0)
y.collect
Array[Int] = Array(2, 4, 6, 8, 10)

map :

Return a new distributed dataset formed by passing each element of the source through a function func


val x = sc.parallelize(List("spark", "rdd", "example", "sample", "example"), 3)
val y = x.map(x => (x, 1))
y.collect

Array[(String, Int)] = Array((spark,1), (rdd,1), (example,1), (sample,1), (example,1))






flatMap

Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)



val x   =   sc.parallelize(List("spark","rdd", "example","sample", "example"),3)  
val wordcount = x.flatMap(line => line.split(" ")).map(word => (word, 1))
                .reduceByKey(_ + _)
wordcount.collect
Array[(String, Int)] = Array((example,2), (spark,1), (rdd,1), (sample,1))





groupBy

Method takes a predicate function as its parameter and uses it to group elements by

key and values into a Map collection.



val x = sc.parallelize(List("spark rdd example", "sample example"), 2)
val y = x.groupBy(_.charAt(3))
y.collect

Array[(Char, Iterable[String])] = Array((p,CompactBuffer(sample example)),
 (r,CompactBuffer(spark rdd example)))

Basic Action methods

collect
Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

val x = sc.parallelize(1 to 10, 2)
val y = x.filter(_ % 2 == 0)
y.collect
Array[Int] = Array(2, 4, 6, 8, 10)

take
take the first num elements of the RDD

scala> val names2 = sc.parallelize(List("apple", "beatty", "beatrice"))
names2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1478] 
at parallelize at <console>:12
 
scala> names2.take(2)
res786: Array[String] = Array(apple, beatty)

Web Snippets

Labels

Friday, February 9, 2018

Common methods in RDD

Creation of an RDD

RDD's can be created in 2 ways

Read a file: Individual rows or records become an element in the RDD

Transform another RDD:

Methods in RDD

RDD's are created using 2 types of RDD

Basic transformation methods

Basic Action methods

1 comment:

Labels

Blog Archive

Labels

Friday, February 9, 2018

Common methods in RDD

Creation of an RDD

RDD's can be created in 2 ways Read a file: Individual rows or records become an element in the RDD Transform another RDD: Methods in RDD

RDD's are created using 2 types of RDD

Basic transformation methods

Basic Action methods

1 comment:

Labels

RDD's can be created in 2 ways

Read a file: Individual rows or records become an element in the RDD

Transform another RDD:

Methods in RDD