Friday, February 9, 2018

Data Representation in RDD

Spark has three data representations:

  1. RDD (Resilient Distributed Dataset): 

    • It is a collection of elements that can be divided across multiple nodes in a cluster for parallel processing. 
    • It is a fault-tolerant collection of elements, meaning it can automatically recover from failures. 
    • It is immutable: we can create an RDD once but cannot change it.
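The RDD properties above can be sketched as follows. This is a minimal illustration, assuming Spark is available on the classpath and run in local mode; the app name and data are arbitrary:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local-mode context for illustration; in a cluster, setMaster would differ.
val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
val sc = new SparkContext(conf)

// A collection of elements divided into partitions for parallel processing.
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 2)

// RDDs are immutable: map() returns a *new* RDD; `nums` itself is unchanged.
val doubled = nums.map(_ * 2)

println(doubled.collect().mkString(","))
// call sc.stop() when finished
```

Fault tolerance comes from lineage: if a partition is lost, Spark recomputes it from the transformations that produced it rather than replicating the data.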

  2. Dataset: 
    • It is also a distributed collection of data. 
    • A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). 
    • The Dataset API is only available in Scala and Java; it is not available in Python or R.
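A short sketch of constructing a Dataset from JVM objects and applying functional transformations, assuming a local-mode SparkSession; the `Person` case class and sample rows are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

// A JVM object type; Spark derives an Encoder for case classes.
case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("dataset-sketch").master("local[*]").getOrCreate()
import spark.implicits._  // brings toDS() and case-class Encoders into scope

// Constructed from JVM objects...
val people = Seq(Person("Ann", 34), Person("Bo", 19)).toDS()

// ...then manipulated with functional transformations (filter, map, etc.).
val adults = people.filter(_.age >= 21).map(_.name)

println(adults.collect().mkString(","))
// call spark.stop() when finished
```

Because the element type is a case class, these transformations are checked at compile time, which is the main draw of the Dataset API over DataFrames.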
  3. DataFrame: 
    • It is a distributed collection of data organized into named columns. 
    • It is conceptually equivalent to a table in a relational database or a data frame in R/Python. 
    • It is mostly used for structured data processing. 
    • In Scala, a DataFrame is represented by a Dataset of Rows. 
    • A DataFrame can be constructed from a wide range of sources, for example existing RDDs, Hive tables, database tables, and structured data files.
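The DataFrame points above can be sketched like this, again assuming a local-mode SparkSession; the column names and rows are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("df-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Distributed data organized into *named columns*, like a relational table.
val df = Seq(("Ann", 34), ("Bo", 19)).toDF("name", "age")

// Operations read like SQL over a table.
df.filter(col("age") >= 21).select("name").show()

// DataFrames can also be constructed from other sources, e.g.
// (path below is hypothetical):
// val fromJson = spark.read.json("people.json")
// call spark.stop() when finished
```

In Scala this `df` really is a `Dataset[Row]`, which is why the DataFrame and Dataset bullets overlap so much: a DataFrame is just a Dataset whose element type is the untyped `Row`.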

History of Spark API

The snapshot shows the history of the DataFrame API.
