Web Snippets: Overview of S3

Sunday, April 22, 2018

Overview of S3

An interface for storing and retrieving any amount of data, at any time, from anywhere on the web
An object store, not a file system
Highly scalable, reliable, fast, and inexpensive data storage infrastructure
Used an eventual consistency model for overwrites and deletes when this was written; since December 2020, S3 provides strong read-after-write consistency
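
To make the "object store, not a file system" point concrete, here is a minimal in-memory sketch. It is a toy model, not the S3 API: the class `ToyObjectStore`, its methods, and the keys are all hypothetical names chosen for illustration. The idea it shows is that a bucket behaves like a flat map from key to bytes, and that "folders" are only shared key prefixes.

```python
class ToyObjectStore:
    """Hypothetical in-memory stand-in for a bucket: a flat map from key to bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        # In this toy model an object is always written (or replaced) as a whole.
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]

    def list_keys(self, prefix=""):
        # There are no directories here: listing just filters keys by prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ToyObjectStore()
store.put("logs/2018/04/22/part-0000.parquet", b"first")
store.put("logs/2018/04/22/part-0001.parquet", b"second")
store.put("reports/summary.csv", b"a,b\n1,2\n")

# Only the keys sharing the "logs/2018/" prefix are returned.
print(store.list_keys("logs/2018/"))
```

Nothing had to create a `logs/` directory before the first `put`; the slash is just another character in the key.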

HDFS vs S3

DistCp

DistCp (distributed copy) is a tool used for large inter/intra-cluster copying.
It uses MapReduce to effect its distribution, error handling and recovery, and reporting.
It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
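
The idea of splitting an expanded file list across map tasks can be sketched as follows. This is an illustration only, not DistCp's actual splitting code: it uses a simple greedy rule (largest file first, into the lightest group), and the function name `partition_files`, the paths, and the sizes are assumptions made up for the example.

```python
def partition_files(files, num_maps):
    """Split (path, size_bytes) pairs into num_maps groups of roughly equal total size."""
    if num_maps < 1:
        raise ValueError("num_maps must be at least 1")
    groups = [[] for _ in range(num_maps)]
    totals = [0] * num_maps
    # Place the largest files first, each into the currently lightest group.
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        lightest = totals.index(min(totals))
        groups[lightest].append(path)
        totals[lightest] += size
    return groups, totals


# Hypothetical expanded source listing: (path, size in bytes).
listing = [
    ("/data/a.parquet", 400),
    ("/data/b.parquet", 300),
    ("/data/c.parquet", 200),
    ("/data/d.parquet", 100),
]
groups, totals = partition_files(listing, num_maps=2)
for task_id, (paths, total) in enumerate(zip(groups, totals)):
    print(task_id, total, paths)
```

Each group stands for the share of the source list that one map task would copy.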
DistCp can be used to move data from S3 to HDFS; Amazon's version of the tool is S3DistCp
We can use Hive or Impala to query the data in S3
We can use DataFrames in Spark to point to Parquet files
To move data from HDFS to S3, we can also use Spark or DistCp
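
As a rough sketch of the S3 to HDFS copy mentioned above, the snippet below only assembles a `hadoop distcp` command line; it does not run it. The bucket name, the paths, and the helper `build_distcp_command` are hypothetical, and the `s3a://` scheme assumes the cluster has the Hadoop S3A connector configured. `-update` and `-m` are standard DistCp options (copy only what differs at the target, and cap the number of simultaneous maps).

```python
import shlex


def build_distcp_command(source, target, max_maps=None, update=False):
    """Return the argument list for a `hadoop distcp` invocation (hypothetical helper)."""
    cmd = ["hadoop", "distcp"]
    if update:
        cmd.append("-update")          # copy only files that differ at the target
    if max_maps is not None:
        cmd += ["-m", str(max_maps)]   # upper bound on simultaneous map tasks
    cmd += [source, target]
    return cmd


# Assumed example locations; replace with a real bucket and HDFS path.
cmd = build_distcp_command(
    "s3a://example-bucket/events/",
    "hdfs:///data/events/",
    max_maps=20,
    update=True,
)
print(" ".join(shlex.quote(part) for part in cmd))
```

Swapping the two URIs gives the HDFS to S3 direction. On a cluster, the list could be handed to `subprocess.run(cmd, check=True)`.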
