Showing posts with label AWS. Show all posts

Tuesday, January 21, 2020

Difference between AWS Glue and Hive warehouse




Apache Hive vs AWS Glue: What are the differences?
Apache Hive: data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage.
AWS Glue: a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.
Apache Hive and AWS Glue can be primarily classified as "Big Data" tools.
Some of the features offered by Apache Hive are:
  • Built on top of Apache Hadoop
  • Tools to enable easy access to data via SQL
  • Support for extract/transform/load (ETL), reporting, and data analysis

Thursday, January 31, 2019

AWS CloudFormation


AWS CloudFormation is a service that helps you model and set up your Amazon Web Services resources, so that you can spend less time managing those resources and more time focusing on the applications that run on AWS.

TEMPLATES
We can also create templates in AWS CloudFormation. We can use the Designer to create a template and save it.

  • A CloudFormation template is written as a JSON document (YAML is also supported). It can also be created using CloudFormation Designer: when we drag a resource onto the canvas, the corresponding JSON is generated.
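
As a minimal sketch of what such a JSON template looks like, the commands below write a template with a single S3 bucket to a local file. The file name template.json, the description text, and the logical ID ExampleBucket are assumptions made for this example, not something the Designer produces by default.

```shell
# Write a minimal CloudFormation template that declares one S3 bucket.
# "ExampleBucket" is a hypothetical logical ID chosen for this sketch.
cat > template.json <<'EOF'
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Minimal example template with a single S3 bucket",
  "Resources": {
    "ExampleBucket": {
      "Type": "AWS::S3::Bucket"
    }
  }
}
EOF

# Show the resource type declared in the template.
grep '"Type"' template.json
```

If the AWS CLI is installed and configured, a template like this can be checked with aws cloudformation validate-template --template-body file://template.json before a stack is created from it.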

Tuesday, July 17, 2018

Create key pair in AWS







SSH to AWS



Create new AWS cluster



Movie ratings project part 2 (Data ingestion)


Continued from Movie ratings project part 1

link to github

MILLION DATASET (MOVIES, RATINGS, USERS)

RENAME FILES
cd million
mv movies.dat movies
mv ratings.dat ratings
mv users.dat users

RATINGS

RATINGS FILE DESCRIPTION
================================================================================

All ratings are contained in the file "ratings.dat" and are in the
following format:

UserID::MovieID::Rating::Timestamp

- UserIDs range between 1 and 6040
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
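
A minimal sketch of reading this format, assuming awk is available: split each line on "::" and report the number of ratings and the mean rating per movie. The file ratings_sample and its three rows are made up for illustration, and movie_stats.csv is a hypothetical output name.

```shell
# Hypothetical sample rows in the UserID::MovieID::Rating::Timestamp layout.
printf '%s\n' \
  '1::1193::5::978300760' \
  '2::1193::4::978298413' \
  '1::661::3::978302109' > ratings_sample

# Split on "::" and accumulate the rating count and rating sum per MovieID,
# then print MovieID,count,mean sorted numerically by MovieID.
awk -F'::' '
  { count[$2]++; sum[$2] += $3 }
  END { for (m in count) printf "%s,%d,%.2f\n", m, count[m], sum[m] / count[m] }
' ratings_sample | sort -t, -k1,1n > movie_stats.csv

cat movie_stats.csv
```

To run the same summary on the real data, point the awk command at the ratings file in place of ratings_sample.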

Movie ratings project part 1 (data ingestion)


link to github

ADD USER
sudo useradd hduser

CREATE DIRECTORY
hdfs dfs -mkdir /hackerday_ratings

LIST CREATED DIRECTORY
hdfs dfs -ls /

ADD USER TO THE HADOOP GROUP
sudo usermod -G hadoop hduser

CHECK FOR EXISTING USER
id hduser

SET PASSWORD FOR THE USER
sudo passwd hduser

CHANGE THE OWNERSHIP OF THE DIRECTORY
hdfs dfs -chown -R hduser:hadoop /hackerday_ratings

LOGIN AS ROOT AND RUN THE COMMAND AGAIN


hdfs dfs -chown -R hduser:hadoop /hackerday_ratings

CHECK FOR OWNERSHIP CHANGES
hdfs dfs -ls /

Sunday, April 22, 2018

Overview of S3


  • Interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web
  • Is an object store, not a file system.
  • Highly scalable, reliable, fast, inexpensive data storage infrastructure
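  • Originally used an eventual consistency model for overwrites and deletes; since December 2020 it provides strong read-after-write consistency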


Wednesday, February 21, 2018

Why Stream Storage?


Need for stream storage

  • Decouple producers & consumers
  • Persistent buffer
  • Collect multiple streams
  • Preserve client ordering
  • Parallel consumption
  • Streaming MapReduce



Message and Stream Storage



Amazon SQS

  • Amazon Simple Queue Service (SQS) is a fully managed message queuing service that makes it easy to decouple and scale microservices, distributed systems, and serverless applications.
  • Building applications from individual components that each perform a discrete function improves scalability and reliability, and is a best-practice design for modern applications.