Web Snippets: July 2018

Wednesday, July 25, 2018

File format using hive

SEQUENCE FILE

Sequencefile
======================
create external table flight_seq 
 (year smallint,month tinyint,dayofmonth tinyint,dayofweek tinyint,
  lateaircraftdelay smallint)
 stored as sequencefile
location '/user/raj_ops/rawdata/handson_train/airline_performance/flights_seq';

Partitioning reduces data access time by restricting a query to only the relevant portion of the dataset.
Choose the partition column with care: it should be one that queries commonly filter on.
Once a partition has been created, you can alter some of its definitions (for example its location or file format) independently of the other partitions.
There is no hard limit on the number of partitions that a Hive table can contain. However, we still need to be careful, because a very large number of partitions adds load on the metastore and tends to produce many small files.
A query that does not filter on the partition column can take longer than the same query on a non-partitioned table, because it has to read every partition.
Prefer static partitioning to dynamic partitioning for day-to-day data ingestion.
Pre-empt small-file scenarios.
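
To make the pruning idea concrete, here is a toy Python sketch of a Hive-style partition layout. It only models directory selection and does not talk to Hive; the table root, the partition values, and the function names are assumptions made up for illustration.

```python
# Hypothetical table root and partition values, for illustration only
TABLE_ROOT = "/warehouse/flights"
PARTITIONS = [{"year": y, "month": m} for y in (2017, 2018) for m in range(1, 13)]

def partition_path(part):
    """Build a Hive-style partition directory such as .../year=2018/month=7."""
    return "{}/year={}/month={}".format(TABLE_ROOT, part["year"], part["month"])

def paths_to_scan(**filters):
    """Return the directories a query must read for the given partition filters.

    With no filter on a partition column, every directory is returned.
    """
    return [
        partition_path(p)
        for p in PARTITIONS
        if all(p[col] == val for col, val in filters.items())
    ]

print(len(paths_to_scan()))               # no partition filter: all 24 directories
print(paths_to_scan(year=2018, month=7))  # filter on both columns: one directory
```

A filter on the partition columns shrinks the list of directories to read, while a query without such a filter still has to visit all of them.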

Create pair key in Aws

Continued from movie recommendation part 1


MILLIONS DATASET (MOVIES, RATINGS, USER)

RENAME FILES

cd million
mv movies.dat movies
mv ratings.dat ratings
mv users.dat users

RATINGS

RATINGS FILE DESCRIPTION
================================================================================

All ratings are contained in the file "ratings.dat" and are in the
following format:

UserID::MovieID::Rating::Timestamp

- UserIDs range between 1 and 6040 
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
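
As a minimal sketch of how a line in this format could be parsed, the Python snippet below splits on the `::` delimiter and converts the epoch timestamp to a UTC datetime. The `Rating` tuple, the `parse_rating` name, and the sample line are illustrative assumptions, not part of the dataset's own tooling.

```python
from collections import namedtuple
from datetime import datetime, timezone

# Hypothetical record type for one line of the ratings file
Rating = namedtuple("Rating", ["user_id", "movie_id", "rating", "rated_at"])

def parse_rating(line):
    """Parse 'UserID::MovieID::Rating::Timestamp' into a Rating."""
    user_id, movie_id, rating, ts = line.strip().split("::")
    rating = int(rating)
    if not 1 <= rating <= 5:
        raise ValueError("rating must be a whole number of stars from 1 to 5")
    # The timestamp is seconds since the epoch, so interpret it as UTC
    rated_at = datetime.fromtimestamp(int(ts), tz=timezone.utc)
    return Rating(int(user_id), int(movie_id), rating, rated_at)

# Made-up sample line in the documented format
sample = parse_rating("1::1193::5::978300760")
print(sample.user_id, sample.movie_id, sample.rating, sample.rated_at.isoformat())
```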


ADD USER
sudo useradd hduser

CREATE DIRECTORY
hdfs dfs -mkdir /hackerday_ratings

LIST CREATED DIRECTORY
hdfs dfs -ls /

ADD USER TO THE HADOOP GROUP
sudo usermod -G hadoop hduser

CHECK FOR EXISTING USER
id hduser

CREATE PASSWORD FOR THE USER
sudo passwd hduser

CHANGE THE OWNERSHIP OF THE DIRECTORY
hdfs dfs -chown -R hduser:hadoop /hackerday_ratings

LOG IN AS ROOT AND RUN THE COMMAND AGAIN

hdfs dfs -chown -R hduser:hadoop /hackerday_ratings

CHECK FOR OWNERSHIP CHANGES
hdfs dfs -ls /

DSL in Spark

DSL

Stands for domain-specific language
A language designed for a specific purpose
DataFrames are schema-aware
DataFrames expose a rich domain-specific language
Used for structured data manipulation
Works in a SQL-like way (think in SQL)

We can dynamically build a string of rows and then generate a DataFrame from it.

However, the whole string is treated as a single line, and converting it throws an error.
We need to split the string into lines based on a delimiter. This can be done by writing a split function, as shown below.

CREATE DATAFRAME

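from pyspark.sql import Row

# Assumes spark_session (a SparkSession) and baseline_row (a DataFrame with
# columns id and h1..h24) already exist.

def split_the_line(line):
    # turn one "id,hour,value" record into a Row
    row_id, hour, hour_value = line.split(",")
    return Row(id=row_id, hour=hour, value=hour_value)

# build one delimited string: fields separated by ",", records by ";"
data_string = ""
for rw in baseline_row.collect():
    for i in range(24):
        hour = "h" + str(i + 1)
        hour_value = str(rw[hour])
        data = str(rw.id) + "," + hour + "," + hour_value + ";"
        data_string = data_string + data

# dynamically generated data for hours
print(data_string)
rdds = spark_session.sparkContext.parallelize([data_string])
# split the single string into records first, then each record into fields
records = rdds.flatMap(lambda s: s.rstrip(";").split(";"))
records.map(split_the_line).toDF().show()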

DEPLOYMENT MODE

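We can run Spark in local mode or submit the application to a cluster. On a cluster, the deploy mode is either client (the default) or cluster.

Cluster mode:

The driver runs on the cluster even if the application is launched from outside it.
The process is not killed if the computer that submitted it is shut down or disconnected.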
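
As a rough sketch of how the mode is chosen at submit time, the Python helper below assembles a `spark-submit` command line with the `--master` and `--deploy-mode` options. The helper name and the application file `ratings_job.py` are hypothetical; the snippet only builds the argument list and does not launch anything.

```python
def spark_submit_command(app, master, deploy_mode="client"):
    """Build the argument list for spark-submit with an explicit deploy mode."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy mode must be 'client' or 'cluster'")
    return ["spark-submit", "--master", master, "--deploy-mode", deploy_mode, app]

# Hypothetical application file, submitted to YARN in cluster mode
command = spark_submit_command("ratings_job.py", "yarn", "cluster")
print(" ".join(command))
```

The resulting list could be passed to `subprocess.run` on a machine where Spark is installed.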

Spark samples (RDD, DataFrames,DSL)

SHARK: THE BEGINNING OF THE API

SQL using Spark execution engine
Evolved into Spark SQL in 1.0

SCHEMA RDD

RDD with schema information
For unit testing and debugging Spark SQL
Drew the attention of Spark developers
Released as DataFrame API in 1.3

PARQUET

Design based on Google's Dremel paper
Schema segregated into footer
Column-major format with stripes (Parquet calls them row groups)
Simpler type-model with logical types
All data pushed to leaves of the tree
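
The column-major idea can be illustrated with a toy Python example. This is not Parquet's actual on-disk encoding, only a sketch of why storing values column by column helps when a query reads a single column; the sample rows and the `to_columns` helper are made up for illustration.

```python
# Made-up rows in the usual row-major form (one record after another)
rows = [
    {"year": 2018, "month": 7, "lateaircraftdelay": 12},
    {"year": 2018, "month": 7, "lateaircraftdelay": 0},
    {"year": 2018, "month": 8, "lateaircraftdelay": 45},
]

def to_columns(records):
    """Pivot row-major records into a column-major mapping of name -> values."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# Aggregating one column touches a single list instead of every full record
total_delay = sum(columns["lateaircraftdelay"])
print(columns["month"], total_delay)
```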

Spark samples (Spark SQL, Window functions , persist )

WRITE AS CSV

df_sample.write.csv("./spark-warehouse/SAMPLE.csv")

WRITE AS CSV WITH HEADER

df_sample.write.csv("./spark-warehouse/SAMPLE_5.csv",header=True)

DISPLAY ALL COLUMNS

# Load CSV as a DataFrame
data = spark.read.csv("./spark-warehouse/LOADS.csv", header=True)

# Register a temp view
data.createOrReplaceTempView("vw_data")

# Load data based on the select query
load = spark.sql("SELECT * FROM vw_data LIMIT 5")
load.show()

Blog Archive

Wednesday, July 25, 2018
File format using hive
Hive partition

Tuesday, July 17, 2018
Create pair key in Aws
SSH to Aws
Create new AWS cluster
Movie ratings project part 2 (Data ingestion)
Movie ratings project part 1 (data ingestion)

Tuesday, July 10, 2018
DSL in Spark
Dynamically create DataFrames
Deployment mode is spark
Spark samples (RDD, DataFrames,DSL)
Examples for compression and file format in spark

Monday, July 9, 2018
Spark samples (Spark SQL, Window functions , persist )