Web Snippets: Spark Sql


Tuesday, July 10, 2018

SHARK: THE BEGINNING OF THE API

SCHEMA RDD

WRITE AS CSV

df_sample.write.csv("./spark-warehouse/SAMPLE.csv")

WRITE AS CSV WITH HEADER

df_sample.write.csv("./spark-warehouse/SAMPLE_5.csv",header=True)

DISPLAY ALL COLUMNS

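# Load the CSV file as a DataFrame, using the first row as column names
data = spark.read.csv("./spark-warehouse/LOADS.csv", header=True)

# Register the DataFrame as a temporary view so it can be queried with SQL
data.createOrReplaceTempView("vw_data")

# Select all columns from the view, limited to the first 5 rows
load = spark.sql("SELECT * FROM vw_data LIMIT 5")
load.show()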

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

A DataFrame is a distributed collection of data organized into named columns.
It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.
The DataFrame API is available in Scala, Java, Python, and R.