Friday, February 9, 2018

Data Representation in Spark


Spark has three data representations:


  1. RDD (Resilient Distributed Dataset):

    • Is a collection of elements that can be partitioned across multiple nodes in a cluster for parallel processing. 
    • Is a fault-tolerant collection of elements, which means it can automatically recover from failures. 
    • Is immutable: once an RDD is created it cannot be changed, and a transformation produces a new RDD.



  2. Dataset:

    • Is also a distributed collection of data. 
    • Can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). 
    • The Dataset API is available only in Scala and Java, not in Python or R.

  3. DataFrame:

    • Is a distributed collection of data organized into named columns. 
    • Is conceptually equivalent to a table in a relational database or a data frame in R or Python. 
    • Is mostly used for structured data processing. 
    • In Scala, a DataFrame is represented by a Dataset of Rows (Dataset[Row]). 
    • Can be constructed from a wide range of sources, for example existing RDDs, Hive tables, and database tables.
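
The three RDD properties listed above (partitioning, fault tolerance, immutability) can be sketched in plain Python. This is only a toy model and not Spark's API: ToyRDD, from_list and compute_partition are hypothetical names invented for this illustration, and rebuilding a partition by replaying the recorded functions is a simplified picture of how recovery might work.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not part of Spark)."""
    source: Tuple[Tuple[int, ...], ...]             # original partitions
    lineage: Tuple[Callable[[int], int], ...] = ()  # recorded map functions

    @staticmethod
    def from_list(data: List[int], num_partitions: int) -> "ToyRDD":
        # Split the data into contiguous chunks; each chunk stands in for
        # a partition that would live on a different node of the cluster.
        size = max(1, -(-len(data) // num_partitions))  # ceiling division
        parts = tuple(tuple(data[i:i + size]) for i in range(0, len(data), size))
        return ToyRDD(source=parts)

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # A transformation never modifies this object; it returns a new
        # ToyRDD that remembers one more step in its lineage.
        return ToyRDD(self.source, self.lineage + (fn,))

    def compute_partition(self, index: int) -> List[int]:
        # Rebuild a single partition by replaying the lineage on its
        # source data, as one would after losing that partition.
        values = list(self.source[index])
        for fn in self.lineage:
            values = [fn(v) for v in values]
        return values

    def collect(self) -> List[int]:
        # Gather every partition into one list, in partition order.
        return [v for i in range(len(self.source))
                for v in self.compute_partition(i)]


numbers = ToyRDD.from_list([1, 2, 3, 4, 5, 6], num_partitions=3)
doubled = numbers.map(lambda x: x * 2)

print(numbers.collect())             # [1, 2, 3, 4, 5, 6] (original unchanged)
print(doubled.collect())             # [2, 4, 6, 8, 10, 12]
print(doubled.compute_partition(1))  # [6, 8] (one partition rebuilt on its own)
```

Calling map leaves numbers untouched and returns a new object, and any single partition of doubled can be recomputed without touching the others, which is the idea behind the immutability and fault-tolerance bullets.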

History of Spark API

The snapshot shows the history of dataframes.


