Tuesday, July 17, 2018

Movie ratings project part 2

RATINGS

RATINGS FILE DESCRIPTION
================================================================================

All ratings are contained in the file "ratings.dat" and are in the
following format:

UserID::MovieID::Rating::Timestamp

- UserIDs range between 1 and 6040 
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
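
To make the layout above concrete, here is a small sketch that splits one line into its four fields with awk. The sample line is only an illustrative value in the documented format, and the variable names are made up for this example.

```shell
# Illustrative sample line in the UserID::MovieID::Rating::Timestamp layout
# (an assumed value for demonstration, not read from the real file).
sample_line='1::1193::5::978300760'

# awk accepts a multi-character field separator, so '::' splits the line
# into UserID, MovieID, Rating and Timestamp.
parsed=$(printf '%s\n' "$sample_line" |
  awk -F'::' '{ print "user=" $1, "movie=" $2, "rating=" $3, "ts=" $4 }')

echo "$parsed"
```

The two-character delimiter is the reason for the find-and-replace step that follows: the Hive table below declares a single-character field terminator.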

FIND AND REPLACE DELIMITER
sed -e 's/::/@/g' ratings.dat > ratings_clean
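
Before copying the file to HDFS it can be worth checking that every converted line really has four '@'-separated fields. The sketch below runs the same sed substitution on a tiny stand-in file (the two sample lines and the temporary directory are assumptions for the example) and counts malformed lines.

```shell
# Build a tiny stand-in for ratings.dat in a scratch directory
# (sample rows are assumed values in the documented format).
tmp_dir=$(mktemp -d)
printf '1::1193::5::978300760\n1::661::3::978302109\n' > "$tmp_dir/ratings.dat"

# Same substitution as in the step above: '::' becomes '@'.
sed -e 's/::/@/g' "$tmp_dir/ratings.dat" > "$tmp_dir/ratings_clean"

# Count lines that do not have exactly 4 fields; 0 means the file is clean.
bad_lines=$(awk -F'@' 'NF != 4 { n++ } END { print n + 0 }' "$tmp_dir/ratings_clean")
first_line=$(head -n 1 "$tmp_dir/ratings_clean")

echo "malformed lines: $bad_lines"
echo "first line: $first_line"

rm -rf "$tmp_dir"
```

On the real dataset the same awk check can be pointed at ratings_clean directly.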

CREATE DIRECTORY FOR RATINGS
hdfs dfs -mkdir -p /hackerday_ratings/million/ratings


MOVE RATINGS DATASET TO HDFS
hdfs dfs -copyFromLocal /home/hadoop/million/ratings_clean /hackerday_ratings/million/ratings

VALIDATE DATA
hdfs dfs -cat /hackerday_ratings/million/ratings/ratings_clean | head -n 10

CREATE EXTERNAL TABLE FOR MOVIE RATINGS
use hackerday_ratings;
drop table if exists million_ratings;
create external table million_ratings (
user_id int,
movie_id int,
rating double,
rating_time bigint
)
row format delimited
fields terminated by '@'
lines terminated by '\n'
location '/hackerday_ratings/million/ratings';

SELECT * FROM million_ratings LIMIT 10;

USERS

USERS FILE DESCRIPTION
================================================================================

User information is in the file "users.dat" and is in the following
format:

UserID::Gender::Age::Occupation::Zip-code

All demographic information is provided voluntarily by the users and is
not checked for accuracy.  Only users who have provided some demographic
information are included in this data set.

- Gender is denoted by a "M" for male and "F" for female
- Age is chosen from the following ranges:

 *  1:  "Under 18"
 * 18:  "18-24"
 * 25:  "25-34"
 * 35:  "35-44"
 * 45:  "45-49"
 * 50:  "50-55"
 * 56:  "56+"

- Occupation is chosen from the following choices:

 *  0:  "other" or not specified
 *  1:  "academic/educator"
 *  2:  "artist"
 *  3:  "clerical/admin"
 *  4:  "college/grad student"
 *  5:  "customer service"
 *  6:  "doctor/health care"
 *  7:  "executive/managerial"
 *  8:  "farmer"
 *  9:  "homemaker"
 * 10:  "K-12 student"
 * 11:  "lawyer"
 * 12:  "programmer"
 * 13:  "retired"
 * 14:  "sales/marketing"
 * 15:  "scientist"
 * 16:  "self-employed"
 * 17:  "technician/engineer"
 * 18:  "tradesman/craftsman"
 * 19:  "unemployed"
 * 20:  "writer"
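
Because age is stored as a code rather than a label, a lookup is needed to make it readable. Here is a minimal sketch of that mapping in shell; the function name age_label and the sample user line are hypothetical, and the labels are copied from the list above.

```shell
# Map an age code from users.dat to the label given in the file description.
# age_label is a hypothetical helper name used only for this example.
age_label() {
  case "$1" in
    1)  echo "Under 18" ;;
    18) echo "18-24" ;;
    25) echo "25-34" ;;
    35) echo "35-44" ;;
    45) echo "45-49" ;;
    50) echo "50-55" ;;
    56) echo "56+" ;;
    *)  echo "unknown" ;;
  esac
}

# Assumed sample line in the UserID::Gender::Age::Occupation::Zip-code layout.
sample_user='1::F::1::10::48067'

# The age code is the third '::'-separated field.
age_code=$(printf '%s\n' "$sample_user" | awk -F'::' '{ print $3 }')
label=$(age_label "$age_code")

echo "age code $age_code means: $label"
```

The occupation codes could be decoded the same way.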

FIND AND REPLACE DELIMITER
sed -e 's/::/@/g' users.dat > users_clean

CREATE USERS DIRECTORY IN HDFS
hdfs dfs -mkdir -p /hackerday_ratings/million/users

COPY USERS DATASET TO HDFS
hdfs dfs -copyFromLocal /home/hadoop/million/users_clean /hackerday_ratings/million/users

CREATE EXTERNAL TABLE FOR USERS
drop table if exists million_users;
create external table million_users (
user_id int,
gender char(1),
age tinyint,
occupation varchar(20),
zip_code varchar(10)
)
row format delimited
fields terminated by '@'
lines terminated by '\n'
location '/hackerday_ratings/million/users';

SELECT * FROM million_users LIMIT 10;


LATEST 

Ratings Data File Structure (ratings.csv)

All ratings are contained in the file ratings.csv. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:
userId,movieId,rating,timestamp
The lines within this file are ordered first by userId, then, within user, by movieId.
Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
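
Unlike ratings.dat, this file is comma-delimited and starts with a header row, so the header has to be dealt with before the data sits behind a Hive table. One possible approach, sketched here under assumptions (the sample rows, file names and temporary directory are made up for the example), is to drop the first line locally and sanity-check the half-star ratings. Another option would be the Hive table property skip.header.line.count.

```shell
# Build a tiny stand-in for ratings.csv (sample rows are assumed values).
tmp_dir=$(mktemp -d)
printf 'userId,movieId,rating,timestamp\n1,31,2.5,1260759144\n1,1029,3.0,1260759179\n' \
  > "$tmp_dir/ratings.csv"

# Drop the header row: tail -n +2 prints everything from line 2 onward.
tail -n +2 "$tmp_dir/ratings.csv" > "$tmp_dir/ratings_noheader"

# Count rows whose rating is outside 0.5-5.0 or is not a half-star step.
bad_ratings=$(awk -F',' '
  $3 < 0.5 || $3 > 5.0 || ($3 * 2) != int($3 * 2) { n++ }
  END { print n + 0 }' "$tmp_dir/ratings_noheader")

row_count=$(wc -l < "$tmp_dir/ratings_noheader" | tr -d ' ')
first_row=$(head -n 1 "$tmp_dir/ratings_noheader")

echo "rows: $row_count, bad ratings: $bad_ratings"

rm -rf "$tmp_dir"
```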



CREATE DIRECTORY FOR STORING MOVIE RATINGS
hdfs dfs -mkdir -p /hackerday_ratings/latest/latest_ratings
