Wednesday, July 25, 2018

File format using hive




SEQUENCE FILE
1
2
3
4
5
6
7
Sequencefile
======================
create external table flight_seq 
 (year smallint,month tinyint,dayofmonth tinyint,dayofweek tinyint,
  lateaircraftdelay smallint)
 stored as sequencefile
location '/user/raj_ops/rawdata/handson_train/airline_performance/flights_seq';


Hive partition

  • Partitioning improves the time taken to access data by restricting query to only a certain portion of the dataset.
  • Care has to be taken as to what will make the partition column.
  • Once partition has been created, you can alter some definitions of the partition different from other partitions.
  • There is no hard limit on the number of partitions that a hive table can contain.However we still need to be careful
  • Querying without the partition column would increase the amount of time the query will complete compared to a non-partitioned table.
  •  Prefer static partitioning to dynamic for day-to-day data ingestion
  •  Pre-empt small file scenarios

Tuesday, July 17, 2018

Create pair key in Aws







SSH to Aws



Create new AWS cluster



Movie ratings project part 2

RATINGS

RATINGS FILE DESCRIPTION
================================================================================

All ratings are contained in the file "ratings.dat" and are in the
following format:

UserID::MovieID::Rating::Timestamp

- UserIDs range between 1 and 6040 
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings

FIND AND REPLACE DELIMITER
more ratings.dat | sed -e 's/::/@/g' > ratings_clean

CREATE DIRECTORY FOR RATINGS
hdfs dfs -mkdir /hackerday_ratings/million/ratings

Movie ratings project part 1



ADD USER
sudo useradd hduser

CREATE DIRECTORY
hdfs dfs -mkdir /hackerday_ratings

LIST CREATED DIRECTORY
hdfs dfs -ls /

ADD NEW USER
sudo usermod -G hadoop hduser

CHECK FOR EXISTING USER
id hduser

CREATE PWD FOR THE USER
sudo passwd hduser

CHANGE THE OWNERSHIP FOR THAT FILE
hdfs dfs -chown -R hduser:hadoop /hackerday_ratings

Login as root


hdfs dfs -chown -R hduser:hadoop /hackerday_ratings

CHECK FOR OWNERSHIP CHANGES
hdfs dfs -ls /

Tuesday, July 10, 2018

DSL in Spark


DSL 
  • Stands for domain specific language
  • Language designed for specific purpose
  • Data-frames are schema aware
  • Expose rich domain specific language
  • Structure data manipulation
  • SQL like way (Think in SQL)

Dynamically create DataFrames



We can dynamically create a string of rows and then generate a dataframe.
However it would be considered as a single line and would throw an error.
We need to split lines based on the delimiter. This can be done by writing a split function as shown below

CREATE DATAFRAME
from pyspark.sql.functions  import lit
# create rdd for new id
data_string =""
for rw in baseline_row.collect():
    for i in range(24):
        hour="h" + str(i+1)
        hour_value= str(rw[hour])
        data = 'Row('+ rw.id +', "unique_id"),'
        data_string = data_string + data

#dynamically generated data for hours 
print(hourly_data)
rdds=spark_session.sparkContext.parallelize([data_string])
rdds.map(split_the_line).toDF().show()

Deployment mode is spark


DEPLOYMENT MODE 
We can specify local/ cluster mode

Cluster mode :
  • Driver runs on the cluster even if launched from outside. 
  • Process not killed if the computer submitted is not killed

Spark samples (RDD, DataFrames,DSL)



SHARK :THE BEGGING OF THE API 
  • SQL using Spark execution engine
  • Evolved into Spark SQL in 1.0
SCHEMA RDD
  • RDD with schema information
  • For unit testing and debugging Spark SQL
  • Drew attention by spark developers
  • Released as DataFrame API in 1.3

Examples for compression and file format in spark

PARQUET
  • Design based on Google's Dremel paper
  • Schema segregated into footer
  • Column major format with stripes
  • Simpler type-model with logical types
  • All data pushed to leaves of the tree

Monday, July 9, 2018

Spark samples (Spark SQL, Window functions , persist )

Image result for spark sql png

WRITE AS CSV
df_sample.write.csv("./spark-warehouse/SAMPLE.csv")

WRITE AS CSV WITH HEADER
df_sample.write.csv("./spark-warehouse/SAMPLE_5.csv",header=True) 


DISPLAY ONLY COLUMNS 
#Load csv as dataframe
data = spark.read.csv("./spark-warehouse/LOADS.csv", header=True)

#Register temp viw
data.createOrReplaceTempView("vw_data")

#load data based on the select query
load = spark.sql("Select * from vw_data limit 5")
load.columns