Wednesday, July 25, 2018

File format using hive




SEQUENCE FILE
1
2
3
4
5
6
7
Sequencefile
======================
create external table flight_seq 
 (year smallint,month tinyint,dayofmonth tinyint,dayofweek tinyint,
  lateaircraftdelay smallint)
 stored as sequencefile
location '/user/raj_ops/rawdata/handson_train/airline_performance/flights_seq';


Hive partition

  • Partitioning improves the time taken to access data by restricting query to only a certain portion of the dataset.
  • Care has to be taken as to what will make the partition column.
  • Once partition has been created, you can alter some definitions of the partition different from other partitions.
  • There is no hard limit on the number of partitions that a hive table can contain.However we still need to be careful
  • Querying without the partition column would increase the amount of time the query will complete compared to a non-partitioned table.
  •  Prefer static partitioning to dynamic for day-to-day data ingestion
  •  Pre-empt small file scenarios

Tuesday, July 17, 2018

Create pair key in Aws







SSH to Aws



Create new AWS cluster



Movie ratings project part 2

RATINGS

RATINGS FILE DESCRIPTION
================================================================================

All ratings are contained in the file "ratings.dat" and are in the
following format:

UserID::MovieID::Rating::Timestamp

- UserIDs range between 1 and 6040 
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings

FIND AND REPLACE DELIMITER
more ratings.dat | sed -e 's/::/@/g' > ratings_clean

CREATE DIRECTORY FOR RATINGS
hdfs dfs -mkdir /hackerday_ratings/million/ratings

Movie ratings project part 1



ADD USER
sudo useradd hduser

CREATE DIRECTORY
hdfs dfs -mkdir /hackerday_ratings

LIST CREATED DIRECTORY
hdfs dfs -ls /

ADD NEW USER
sudo usermod -G hadoop hduser

CHECK FOR EXISTING USER
id hduser

CREATE PWD FOR THE USER
sudo passwd hduser

CHANGE THE OWNERSHIP FOR THAT FILE
hdfs dfs -chown -R hduser:hadoop /hackerday_ratings

Login as root


hdfs dfs -chown -R hduser:hadoop /hackerday_ratings

CHECK FOR OWNERSHIP CHANGES
hdfs dfs -ls /