Friday, February 9, 2018

Modules in Apache Spark

Spark SQL

  • Is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
    • DataFrames 
      • Is a distributed collection of data organized into named columns. 
      • It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. 
      • DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
      • The DataFrame API is available in Scala, Java, Python, and R.
  • Lets developers query structured data as a distributed dataset (RDD) in Spark, with integrated APIs in Python, Scala and Java. 
  • Allows unified data access, which means that it can accept data from multiple sources seamlessly. 
  • Is compatible with Hive, which means that Hive queries can be run on existing warehouses without any modification. 
  • Can use existing Hive metastores, serializer/deserializers (SerDes), and user-defined functions (UDFs).
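
The idea of a distributed collection with named columns can be sketched in plain Python. This is not Spark code: the `Person` schema, the sample rows, and the `count_adults_by_city` helper are all made up for illustration. It only shows how a grouped query can be answered per partition and then merged, which is roughly the shape of work a DataFrame query is turned into.

```python
from collections import Counter, namedtuple

# Hypothetical schema: these are the "named columns" of the table.
Person = namedtuple("Person", ["name", "city", "age"])

# The rows are split into partitions, as they would be across a cluster.
partitions = [
    [Person("Ana", "Lima", 34), Person("Bo", "Oslo", 19)],
    [Person("Cy", "Lima", 41), Person("Di", "Oslo", 52), Person("Ed", "Lima", 17)],
]

def count_adults_by_city(partition):
    """Partial result for one partition: adults (age >= 18) counted per city."""
    return Counter(p.city for p in partition if p.age >= 18)

# Roughly: SELECT city, COUNT(*) FROM people WHERE age >= 18 GROUP BY city
partials = [count_adults_by_city(part) for part in partitions]

# Merge the per-partition counts into the final answer.
adults_by_city = sum(partials, Counter())

print(dict(adults_by_city))
```

In Spark itself the partitioning, the query plan, and the merge step are handled by the engine; the sketch only makes the named-column and partition ideas concrete.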

Spark Streaming

  • Processes live data in near real time.
  • A better fit than Hadoop MapReduce for manipulating data streams, since MapReduce is batch-oriented.
  • Extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. 
  • Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window.
  • Processed data can be pushed out to filesystems, databases, and live dashboards.
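
The window function mentioned above can be illustrated with a small plain-Python sketch. It is not the Spark Streaming API: `windowed_counts` and the sample batches are hypothetical, and a list of lists stands in for a live stream cut into micro-batches. Each batch is counted on its own, then the counts of the most recent batches are merged.

```python
from collections import Counter, deque

def windowed_counts(batches, window_length):
    """Yield word counts over the last `window_length` micro-batches."""
    window = deque(maxlen=window_length)  # oldest batch drops out automatically
    for batch in batches:
        # map + reduce within one batch: split lines into words and count them
        window.append(Counter(word for line in batch for word in line.split()))
        # window: merge the counts of the batches currently inside the window
        yield sum(window, Counter())

# Hypothetical stream: three micro-batches of text lines.
batches = [
    ["spark streaming", "spark"],
    ["kafka spark"],
    ["kafka kafka"],
]

results = [dict(counts) for counts in windowed_counts(batches, window_length=2)]
for counts in results:
    print(counts)
```

With a window of two batches, the third result no longer includes the words from the first batch, which is the sliding behaviour a windowed operation provides.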


MLlib

    • MLlib is Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction and underlying optimization primitives.
    • Popular machine learning algorithms are available for everyone to use in an easy manner.
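
As a rough picture of what an "optimization primitive" under a regression algorithm looks like, here is a minimal gradient descent sketch in plain Python. It is not MLlib code: `fit_line`, the learning rate, the step count, and the toy data are assumptions chosen for illustration, and the library's own optimizers are distributed and considerably more elaborate.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every sample under the current parameters.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy data generated from y = 2x + 1, so the fit should approach w = 2, b = 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```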


GraphX

    • GraphX is a Spark component for graphs and graph-parallel computation.
    • Extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators like subgraph, joinVertices, and aggregateMessages as well as an optimized variant of the Pregel API. 
    • Includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
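
The property graph and the aggregateMessages operator can be sketched in plain Python. This models the idea only and is not the GraphX API (which is a Scala API): the `Edge` tuple, the `aggregate_messages` helper, and the sample graph are hypothetical. Each edge may send messages to its endpoints, and messages arriving at the same vertex are merged.

```python
from collections import namedtuple

# A directed edge with a property attached; parallel edges are allowed (multigraph).
Edge = namedtuple("Edge", ["src", "dst", "attr"])

# Vertex id -> vertex property.
vertices = {1: "alice", 2: "bob", 3: "carol"}
edges = [
    Edge(1, 2, "follows"),
    Edge(1, 2, "likes"),    # second edge between the same pair of vertices
    Edge(3, 2, "follows"),
    Edge(2, 1, "follows"),
]

def aggregate_messages(vertices, edges, send, merge):
    """Let every edge send messages, then merge the messages per target vertex."""
    inbox = {}
    for edge in edges:
        for target, message in send(edge, vertices[edge.src], vertices[edge.dst]):
            if target in inbox:
                inbox[target] = merge(inbox[target], message)
            else:
                inbox[target] = message
    return inbox

# In-degree: every edge sends the value 1 to its destination; messages are summed.
in_degree = aggregate_messages(
    vertices,
    edges,
    send=lambda edge, src_attr, dst_attr: [(edge.dst, 1)],
    merge=lambda a, b: a + b,
)

print(in_degree)
```

In this sketch, vertices that receive no message (vertex 3 here) simply do not appear in the result.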
