Wednesday, July 25, 2018
Hive partition
- Partitioning reduces data-access time by restricting a query to only the relevant portion of the dataset.
- Choose the partition column carefully; it should match how the data is most often filtered.
- Once a partition has been created, some of its properties can be altered independently of the other partitions.
- There is no hard limit on the number of partitions a Hive table can contain. However, every partition adds metastore overhead, so we still need to be careful.
- A query that does not filter on the partition column can take longer to complete than the same query against a non-partitioned table.
- Prefer static partitioning to dynamic partitioning for day-to-day data ingestion (see the sketch after this list).
- Pre-empt small-file scenarios: many partitions holding tiny files put pressure on the NameNode.
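A minimal PySpark sketch of the static-partitioning flow described above; the table, view, and column names (events, staging_events, ingest_date) are hypothetical, and a Hive-enabled SparkSession is assumed:

from pyspark.sql import SparkSession

# Hive-enabled session (assumption: Spark is configured against the Hive metastore)
spark = (SparkSession.builder
         .appName("partition-demo")
         .enableHiveSupport()
         .getOrCreate())

# Table partitioned by ingestion date; names are hypothetical
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        event_id STRING,
        payload  STRING
    )
    PARTITIONED BY (ingest_date STRING)
""")

# Static partitioning: the partition value is spelled out in the statement,
# which keeps day-to-day ingestion predictable
spark.sql("""
    INSERT INTO events PARTITION (ingest_date = '2018-07-25')
    SELECT event_id, payload FROM staging_events
""")

# A filter on the partition column prunes the scan to a single partition
spark.sql("SELECT COUNT(*) FROM events WHERE ingest_date = '2018-07-25'").show()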
Tuesday, July 17, 2018
Movie ratings project part 2 (Data ingestion)
MOVIELENS 1M DATASET (MOVIES, RATINGS, USERS)
RENAME FILES
cd million
mv movies.dat movies
mv ratings.dat ratings
mv users.dat users
RATINGS
RATINGS FILE DESCRIPTION
All ratings are contained in the file "ratings.dat" and are in the following format:
UserID::MovieID::Rating::Timestamp
- UserIDs range between 1 and 6040
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
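Since the CSV reader in Spark 2.x only accepts a single-character delimiter, the "::"-separated ratings file can be loaded by reading it as text and splitting each line. A minimal sketch, assuming the renamed files were copied into the /hackerday_ratings/million HDFS directory (path is an assumption based on part 1 of this project):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("ratings-ingest").getOrCreate()

# Read the raw "::"-delimited file as plain text
lines = spark.sparkContext.textFile("/hackerday_ratings/million/ratings")

def parse_rating(line):
    # UserID::MovieID::Rating::Timestamp, per the file description above
    user_id, movie_id, rating, ts = line.split("::")
    return Row(user_id=int(user_id), movie_id=int(movie_id),
               rating=int(rating), timestamp=int(ts))

ratings_df = lines.map(parse_rating).toDF()
ratings_df.show(5)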
Movie ratings project part 1 (data ingestion)
link to github
ADD USER
sudo useradd hduser
CREATE DIRECTORY
hdfs dfs -mkdir /hackerday_ratings
LIST CREATED DIRECTORY
hdfs dfs -ls /
ADD THE USER TO THE HADOOP GROUP
sudo usermod -aG hadoop hduser
CHECK FOR EXISTING USER
id hduser
CREATE PWD FOR THE USER
sudo passwd hduser
CHANGE THE OWNERSHIP OF THE DIRECTORY
Log in as root (or another HDFS superuser) first, then:
hdfs dfs -chown -R hduser:hadoop /hackerday_ratings
CHECK FOR OWNERSHIP CHANGES
hdfs dfs -ls /
Tuesday, July 10, 2018
Dynamically create DataFrames
We can dynamically build a string of rows and then generate a DataFrame from it.
However, Spark would treat the whole string as a single record and throw an error, so it has to be split on the delimiter. This can be done by writing a split function, as shown below.
CREATE DATAFRAME
from pyspark.sql.functions import lit
# create rdd for new id
data_string = ""
for rw in baseline_row.collect():
    for i in range(24):
        hour = "h" + str(i + 1)
        hour_value = str(rw[hour])
        data = 'Row(' + str(rw.id) + ', "unique_id"),'
        data_string = data_string + data

# dynamically generated data for hours
print(data_string)

# the whole string is a single RDD element, so flatMap with the split
# function turns it into one record per generated Row literal
rdds = spark_session.sparkContext.parallelize([data_string])
rdds.flatMap(split_the_line).toDF().show()
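The split function itself is not shown in the original post; here is a minimal sketch, assuming each generated record has the shape Row(<id>, "unique_id"):

from pyspark.sql import Row

# Hypothetical split function: breaks the concatenated string into one
# Row per generated record so that toDF() can infer a schema
def split_the_line(line):
    rows = []
    for rec in line.rstrip(",").split("),"):
        # rec looks like: Row(<id>, "unique_id"
        body = rec.replace("Row(", "").rstrip(")")
        id_part, label = [p.strip().strip('"') for p in body.split(",")]
        rows.append(Row(id=id_part, label=label))
    return rows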
Monday, July 9, 2018
Spark samples (Spark SQL, Window functions , persist )
WRITE AS CSV
df_sample.write.csv("./spark-warehouse/SAMPLE.csv")
WRITE AS CSV WITH HEADER
df_sample.write.csv("./spark-warehouse/SAMPLE_5.csv",header=True)
DISPLAY ALL COLUMNS
# Load csv as a dataframe
data = spark.read.csv("./spark-warehouse/LOADS.csv", header=True)

# Register a temp view
data.createOrReplaceTempView("vw_data")

# Query the view through Spark SQL
load = spark.sql("SELECT * FROM vw_data LIMIT 5")
load.show()
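The post title also mentions window functions and persist, though no sample appears above. A minimal sketch against the DataFrame loaded above, with hypothetical column names (load_type, load_ts):

from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.storagelevel import StorageLevel

# Rank rows within each load_type, newest first (column names are assumptions)
w = Window.partitionBy("load_type").orderBy(F.col("load_ts").desc())
ranked = data.withColumn("rn", F.row_number().over(w))

# Persist because the ranked result is reused by more than one action
ranked.persist(StorageLevel.MEMORY_AND_DISK)
ranked.filter(F.col("rn") == 1).show()
ranked.groupBy("load_type").count().show()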