
Saturday, February 2, 2019

Movie ratings project part 3 (Analysis)



***************************************************************
1) WHICH YEAR HAS THE HIGHEST NUMBER OF RATINGS
***************************************************************
select year(from_unixtime(rating_time)) rating_year,
       count(*) as cnt
from latest_ratings 
group by year(from_unixtime(rating_time))
order by rating_year DESC;
    
 RATING_YEAR    CNT
 2018    1086549
 2017    1973721
 2016    2077152
 2015    1907576
 2014    584216
 2013    633573
 2012    792701
 2011    833774
 2010    982104
 2009    993111
 2008    1210356
 2007    1095710
 2006    1210803
 2005    1849719
 2004    1201656
 2003    1079130
 2002    910350
 2001    1239432
 2000    2033738
 1999    1231173
 1998    329704
 1997    763929
 1996    1733263
 1995    4
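
To answer the question directly, the same aggregate can be ordered by the count instead of the year. A minimal sketch against the same latest_ratings table:

select year(from_unixtime(rating_time)) as rating_year,
       count(*) as cnt
from latest_ratings
group by year(from_unixtime(rating_time))
order by cnt desc
limit 1;

Based on the counts above, this returns 2016.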

Wednesday, July 25, 2018

File formats in Hive




SEQUENCE FILE
======================
create external table flight_seq 
 (year smallint,month tinyint,dayofmonth tinyint,dayofweek tinyint,
  lateaircraftdelay smallint)
 stored as sequencefile
location '/user/raj_ops/rawdata/handson_train/airline_performance/flights_seq';
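
A SEQUENCEFILE table cannot be loaded straight from plain text files, so a common pattern is to point a text-format staging table at the raw CSV and then INSERT ... SELECT into the sequencefile table. A minimal sketch, assuming a staging table flights_text and its CSV location (both hypothetical names):

create external table flights_text
 (year smallint,month tinyint,dayofmonth tinyint,dayofweek tinyint,
  lateaircraftdelay smallint)
 row format delimited
 fields terminated by ','
location '/user/raj_ops/rawdata/handson_train/airline_performance/flights_csv';

-- rewrite the text data as sequence files
insert overwrite table flight_seq
select * from flights_text;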


Hive partition

  • Partitioning reduces the time taken to access data by restricting a query to only the relevant portion of the dataset.
  • Care has to be taken when choosing the partition column.
  • Once a partition has been created, you can alter some of its definitions independently of the other partitions.
  • There is no hard limit on the number of partitions a Hive table can contain; however, every partition adds directories, files, and metastore entries, so we still need to be careful.
  • Querying without a filter on the partition column can take longer than querying a non-partitioned table, because every partition directory still has to be read.
  • Prefer static partitioning to dynamic partitioning for day-to-day data ingestion (see the sketch after this list).
  • Pre-empt small-file scenarios.
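
A minimal sketch of a partitioned table with a static-partition insert, reusing the flight columns from the sequencefile example above (the staging table flights_text and the HDFS location are assumed names):

create external table flight_part
 (month tinyint,dayofmonth tinyint,dayofweek tinyint,
  lateaircraftdelay smallint)
 partitioned by (year smallint)
 stored as sequencefile
location '/user/raj_ops/rawdata/handson_train/airline_performance/flights_part';

-- static partition: the target partition is named explicitly in the insert
insert overwrite table flight_part partition (year=2008)
select month,dayofmonth,dayofweek,lateaircraftdelay
from flights_text
where year=2008;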

Tuesday, July 17, 2018

Movie ratings project part 2 (Data ingestion)


Continued from Movie ratings project part 1.

link to github

MILLION DATASET (MOVIES, RATINGS, USERS)

RENAME FILES
cd million
mv movies.dat movies
mv ratings.dat ratings
mv users.dat users

RATINGS

RATINGS FILE DESCRIPTION
================================================================================

All ratings are contained in the file "ratings.dat" and are in the
following format:

UserID::MovieID::Rating::Timestamp

- UserIDs range between 1 and 6040 
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
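
The "::" separator is two characters, which the default delimited-row format cannot handle (fields terminated by accepts a single character). One option is the MultiDelimitSerDe from hive-contrib; a minimal sketch, assuming the renamed ratings file has been uploaded under /hackerday_ratings/million/ratings (path assumed; the SerDe package name can differ between Hive versions, and some distributions require adding the hive-contrib jar first):

create external table latest_ratings
 (userid int,movieid int,rating int,rating_time bigint)
row format serde 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
with serdeproperties ("field.delim"="::")
stored as textfile
location '/hackerday_ratings/million/ratings';

The table and column names are chosen to match the latest_ratings queries used in part 3.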

Movie ratings project part 1 (Data ingestion)


link to github

ADD USER
sudo useradd hduser

CREATE DIRECTORY
hdfs dfs -mkdir /hackerday_ratings

LIST CREATED DIRECTORY
hdfs dfs -ls /

ADD THE USER TO THE HADOOP GROUP
sudo usermod -aG hadoop hduser

CHECK FOR EXISTING USER
id hduser

CREATE PWD FOR THE USER
sudo passwd hduser

CHANGE THE OWNERSHIP OF THE DIRECTORY (log in as root or the HDFS superuser first)
hdfs dfs -chown -R hduser:hadoop /hackerday_ratings

CHECK FOR OWNERSHIP CHANGES
hdfs dfs -ls /

Sunday, June 24, 2018

Permanent UDF in hive


If we create a temporary function, it is available only for the current CLI session.
Every time we want to use the function in a new session, we have to add the jar and create the temporary function again:

hive> ADD JAR /home/cloudera/mask.jar;
Added [/home/cloudera/mask.jar] to class path
Added resources: [/home/cloudera/mask.jar]
hive> CREATE TEMPORARY FUNCTION MASK AS 'hiveudf.PImask';


HIVE PERMANENT FUNCTION

Note: If we have already created a temporary function from this jar, we need to use a new function name when creating the permanent function.

The problem with a temporary function is that it is valid only while the session in which it was created is alive, and it is lost as soon as we log off.
Many times we have requirements where functions need to be permanent so that they can be used across sessions and across different edge nodes. Let us create a permanent function from the same jar file and test it.
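
A minimal sketch of that permanent registration, assuming the jar has first been copied to HDFS (the HDFS path, the new function name mask_pi, and the account table are assumptions):

-- the jar is assumed to have been copied to HDFS beforehand, e.g.:
--   hdfs dfs -put /home/cloudera/mask.jar /user/cloudera/udfs/mask.jar
CREATE FUNCTION mask_pi AS 'hiveudf.PImask'
USING JAR 'hdfs:///user/cloudera/udfs/mask.jar';

-- unlike the temporary function, this remains available in new sessions
SELECT mask_pi(phone) FROM account LIMIT 5;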

Create UDF functions in hive



HIVE UDF FUNCTIONS
Functions are built for a specific purpose, to perform mathematical, arithmetic, logical, and relational operations on their operands, which are typically table column values.

We can write the UDF in Java as shown below.
In this example, we replace the characters of a string with "*", masking characters that should not be shown to the user.
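
Once the class is compiled into a jar, the UDF is registered and called like any built-in function. A minimal usage sketch, reusing the jar and class names from above (the account table and its columns are assumptions):

ADD JAR /home/cloudera/mask.jar;
CREATE TEMPORARY FUNCTION mask AS 'hiveudf.PImask';

-- mask the phone number before it is shown to the user
SELECT id, name, mask(phone) AS masked_phone
FROM account
LIMIT 10;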

Tuesday, February 20, 2018

Performance in Hive

Performance in Hive can be improved by

  1. PARTITIONING

  •  Logically breaks up the data.
  •  Any time a value that does not match any of the existing partitions is added to the
       partition column, a new partition is created (the query sketch below shows how a
       partition filter limits the data that is read).
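
A quick sketch of the benefit, assuming a flight_part table partitioned on year (as in the partitioning post above): a filter on the partition column lets Hive read only the matching partition directory instead of the whole table.

-- only the year=2008 partition directory is scanned
select dayofweek, avg(lateaircraftdelay) as avg_delay
from flight_part
where year=2008
group by dayofweek;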

Wednesday, January 31, 2018

Masking PII data using Hive


Hive table creation

Create a table to import the data from the CSV file
hive> create table Account(id int,name string,phone string)
row format delimited
fields terminated by ',';

Create a table for the secured account data where the PII column will be masked
hive> create table Accountmasked(id int,name string,phone string)
row format delimited
fields terminated by ',';

Create contact table 
hive> create table contact(id int,accountid int,firstname string,lastName string,
phone string,email string) 
row format delimited 
fields terminated by ','; 
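
With the masking UDF registered (see the UDF posts above), the masked table can be populated from the raw one. A minimal sketch, assuming the function is registered as mask and the raw CSV has already been loaded into account (the local file path in the comment is hypothetical):

-- load the raw CSV first, e.g.:
--   load data local inpath '/home/cloudera/account.csv' into table account;

-- populate the masked copy, hiding the phone number
insert overwrite table accountmasked
select id, name, mask(phone)
from account;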


Friday, January 26, 2018

Hive Cheat Sheet



Start the hive shell
hive

Create schema in hive
hive> create schema hiveschema location '/hivedatabase/';

Create table with location
use hiveschema;
create external table employee(emp_id int,emp_name string,emp_phone string)
row format delimited
fields terminated by '\t'
location '/hivedatabase/employee';
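
A couple of follow-up commands in the same cheat-sheet style, assuming a tab-delimited employee file is already on HDFS (the path is hypothetical):

load data inpath '/tmp/employee.tsv' into table employee;

select emp_id, emp_name
from employee
where emp_phone is not null;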


Permanent User Defined Function (UDF) in Hive



  • We can create a temporary function in Hive and use it in our queries; however, it is valid only while the current session is alive.
  • If we want the function to be available permanently, we need to create a permanent function.


Steps involved to create a permanent function in Hive
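
In outline: build the jar, copy it to HDFS, and register it with CREATE FUNCTION ... USING JAR. A minimal sketch (the jar path and the function name mask_pi are assumptions):

-- register the permanent function from a jar stored on HDFS
CREATE FUNCTION mask_pi AS 'hiveudf.PImask'
USING JAR 'hdfs:///user/cloudera/udfs/mask.jar';

-- confirm it is registered and inspect it
DESCRIBE FUNCTION mask_pi;

-- remove it when it is no longer needed
DROP FUNCTION mask_pi;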