Web Snippets: AWS


Tuesday, January 21, 2020

Difference between AWS Glue and Hive warehouse

Apache Hive vs AWS Glue: What are the differences?

Apache Hive: data warehouse software for reading, writing, and managing large datasets. Hive facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage.

AWS Glue: a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.

Apache Hive and AWS Glue can be primarily classified as "Big Data" tools.

Some of the features offered by Apache Hive are:

Built on top of Apache Hadoop
Tools to enable easy access to data via SQL
Support for extract/transform/load (ETL), reporting, and data analysis
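
The idea that structure can be projected onto data already in storage (schema-on-read) can be illustrated without Hive itself. The sketch below is plain Python, not HiveQL; the sample rows and the names RAW_LINES, SCHEMA, and project are made up for this illustration.

```python
# Schema-on-read sketch: the raw text is left untouched and structure is applied at read time.
# RAW_LINES, SCHEMA and project() are hypothetical names used only for illustration.

RAW_LINES = [
    "1,Toy Story,1995",
    "2,Jumanji,1995",
]

# Column name plus the type each field is cast to, similar in spirit to a table definition.
SCHEMA = [("movie_id", int), ("title", str), ("year", int)]


def project(line, schema, delimiter=","):
    """Apply a schema to one raw delimited line and return a typed record."""
    fields = line.split(delimiter)
    if len(fields) != len(schema):
        raise ValueError(f"expected {len(schema)} fields, got {len(fields)}: {line!r}")
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}


records = [project(line, SCHEMA) for line in RAW_LINES]

# A query over the projected structure, roughly "SELECT title WHERE year = 1995".
titles_1995 = [r["title"] for r in records if r["year"] == 1995]
print(titles_1995)
```

The stored lines never change; a different SCHEMA could be projected onto the same lines later.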

AWS CloudFormation

AWS CloudFormation is a service that helps you model and set up your Amazon Web Services resources, so that you can spend less time managing those resources and more time focusing on the applications that run on AWS.

TEMPLATES
Resources are described in a template. Templates can be created and saved in AWS CloudFormation Designer.

A CloudFormation template is a JSON script (YAML is also supported). It can be written by hand or generated in CloudFormation Designer: when a resource is dragged onto the canvas, the corresponding JSON is generated.
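
As a rough idea of what such a JSON script looks like, here is a minimal sketch that builds a template as a Python dict and serializes it. The logical ID SnippetsBucket and the description text are hypothetical names chosen for this example; a real template would normally declare more properties.

```python
import json

# Minimal CloudFormation template built as a Python dict and serialized to JSON.
# The logical ID "SnippetsBucket" is a hypothetical name chosen for this example.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Minimal example template with a single S3 bucket",
    "Resources": {
        "SnippetsBucket": {
            "Type": "AWS::S3::Bucket",
        }
    },
}

template_json = json.dumps(template, indent=2)
print(template_json)
```

The printed JSON could be saved to a file and used as the template body when creating a stack.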

Create key pair in AWS

Continued from movie recommendation part 1

link to github

MILLIONS DATASET (MOVIES, RATINGS, USER)

RENAME FILES

cd million
mv movies.dat movies
mv ratings.dat ratings
mv users.dat users

RATINGS

RATINGS FILE DESCRIPTION
================================================================================

All ratings are contained in the file "ratings.dat" and are in the
following format:

UserID::MovieID::Rating::Timestamp

- UserIDs range between 1 and 6040 
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
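
The record format above can be parsed with a few lines of Python. This is a minimal sketch: the Rating class and parse_rating function are hypothetical names, the sample line is made up (not taken from the dataset), and the 1 to 5 range check is an assumption based on the 5-star scale described above.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: int
    timestamp: datetime


def parse_rating(line: str) -> Rating:
    """Parse one 'UserID::MovieID::Rating::Timestamp' line."""
    user_id, movie_id, rating, ts = line.strip().split("::")
    rating_value = int(rating)
    # Assumption: whole-star ratings on a 5-star scale means values 1 through 5.
    if not 1 <= rating_value <= 5:
        raise ValueError(f"rating out of range: {rating_value}")
    return Rating(
        user_id=int(user_id),
        movie_id=int(movie_id),
        rating=rating_value,
        # The timestamp is seconds since the epoch; convert it to a UTC datetime.
        timestamp=datetime.fromtimestamp(int(ts), tz=timezone.utc),
    )


# Made-up sample line in the documented format, not taken from the dataset.
sample = parse_rating("12::345::4::1000000000")
print(sample)
```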

link to github

ADD USER
sudo useradd hduser

CREATE DIRECTORY
hdfs dfs -mkdir /hackerday_ratings

LIST CREATED DIRECTORY
hdfs dfs -ls /

ADD THE USER TO THE HADOOP GROUP
sudo usermod -aG hadoop hduser

CHECK THE USER AND ITS GROUPS
id hduser

SET A PASSWORD FOR THE USER
sudo passwd hduser

CHANGE THE OWNERSHIP OF THE DIRECTORY
Log in as root, then run:

hdfs dfs -chown -R hduser:hadoop /hackerday_ratings

CHECK FOR OWNERSHIP CHANGES
hdfs dfs -ls /

Overview of S3

Interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web
Is an object store, not a file system.
Highly scalable, reliable, fast, inexpensive data storage infrastructure
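Historically used an eventual consistency model for overwrites and deletes; since December 2020, S3 provides strong read-after-write consistency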
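
To make "object store, not a file system" concrete, here is a toy in-memory sketch. It is not the S3 API: ToyObjectStore, its methods, and the keys are hypothetical, and it only illustrates that objects live in a flat key namespace where "folders" are just key prefixes.

```python
# Toy in-memory stand-in for an object store; it is NOT the S3 API.
# ToyObjectStore and its methods are hypothetical names for illustration only.

class ToyObjectStore:
    def __init__(self):
        self._objects = {}  # flat mapping: full key -> bytes

    def put(self, key: str, body: bytes) -> None:
        # Whole-object write: an existing key is replaced, never edited in place.
        self._objects[key] = body

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list_keys(self, prefix: str = ""):
        # "Folders" are only a naming convention: listing filters keys by prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ToyObjectStore()
store.put("ratings/2018/part-0000", b"1::10::5::0")
store.put("ratings/2018/part-0001", b"2::11::3::0")
store.put("movies/part-0000", b"10::Some Movie::Drama")

print(store.list_keys("ratings/"))
```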

Why Stream Storage?

Need for stream storage

Decouple producers & consumers
Persistent buffer
Collect multiple streams
Preserve client ordering
Parallel consumption
Streaming MapReduce
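
Several of the points above (persistent buffer, preserved ordering, parallel consumption) can be sketched with a toy append-only log in which every consumer keeps its own read position. This is not the API of any real streaming service; ToyStream, append, and read are hypothetical names.

```python
# Toy append-only stream with per-consumer offsets; not the API of a real streaming service.
# ToyStream, append() and read() are hypothetical names for illustration only.

class ToyStream:
    def __init__(self):
        self._records = []   # persistent buffer: records stay after being read
        self._offsets = {}   # consumer name -> index of the next record to read

    def append(self, record):
        # Producers only append, so arrival order is preserved.
        self._records.append(record)

    def read(self, consumer, max_records=10):
        # Each consumer tracks its own position, so consumers do not interfere.
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


stream = ToyStream()
for event in ["click-1", "click-2", "click-3"]:
    stream.append(event)

analytics_batch = stream.read("analytics", max_records=2)
archiver_batch = stream.read("archiver")
analytics_rest = stream.read("analytics")
print(analytics_batch, archiver_batch, analytics_rest)
```

The producer never needs to know which consumers exist, and a slow consumer does not hold back a fast one.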

Message and Stream Storage

Amazon SQS

Amazon Simple Queue Service (SQS) is a fully managed message queuing service that makes it easy to decouple and scale microservices, distributed systems, and serverless applications.
Building applications from individual components that each perform a discrete function improves scalability and reliability, and is a best-practice design for modern applications.
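
The decoupling idea can be sketched in-process with Python's standard queue module. Note that queue.Queue is only a stand-in here, not Amazon SQS; the producer and consumer functions and the message bodies are hypothetical.

```python
import queue
import threading

# In-process stand-in for a message queue; queue.Queue is NOT Amazon SQS.
# The producer/consumer functions and message bodies are hypothetical.

messages = queue.Queue()
processed = []
STOP = object()  # sentinel telling the consumer to shut down


def producer(n):
    # The producer only knows about the queue, not about who consumes from it.
    for i in range(n):
        messages.put(f"order-{i}")
    messages.put(STOP)


def consumer():
    # The consumer pulls messages at its own pace until it sees the sentinel.
    while True:
        msg = messages.get()
        if msg is STOP:
            break
        processed.append(msg.upper())


consumer_thread = threading.Thread(target=consumer)
producer_thread = threading.Thread(target=producer, args=(3,))
consumer_thread.start()
producer_thread.start()
producer_thread.join()
consumer_thread.join()

print(processed)
```

Because the two sides share only the queue, either one can be replaced or scaled without changing the other.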
