Sunday, April 22, 2018

Overview of S3


  • Interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web (a minimal upload/download sketch follows this list)
  • Is an object store, not a file system.
  • Highly scalable, reliable, fast, inexpensive data storage infrastructure
  • Uses an eventual consistency model
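
As a minimal illustration of the store-and-retrieve interface, the sketch below uses the boto3 SDK to upload and then download a small object. The bucket and key names are placeholders, and AWS credentials are assumed to be configured in the environment.

```python
import boto3

# Assumes AWS credentials are already configured
# (environment variables, ~/.aws/credentials, or an IAM role).
s3 = boto3.client("s3")

# Store an object (bucket and key names are hypothetical).
s3.put_object(Bucket="my-example-bucket", Key="notes/hello.txt", Body=b"hello from S3")

# Retrieve the same object and read its contents.
response = s3.get_object(Bucket="my-example-bucket", Key="notes/hello.txt")
print(response["Body"].read().decode("utf-8"))
```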




HDFS vs S3



DistCp 
  • DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. 
  • It uses MapReduce to effect its distribution, error handling and recovery, and reporting. 
  • It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list. 
  • DistCp can be used to move data from S3 to HDFS; Amazon also provides its own variant, S3DistCp (see the sketch after this list) 
  • We can use Hive or Impala to query the data stored on S3 
  • We can use DataFrames in Spark to read Parquet files directly from S3 (illustrated in the PySpark sketch below) 
  • To move data from HDFS to S3, we can also use Spark or DistCp
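
As one hedged example of how a DistCp copy from S3 to HDFS might be driven from a script, the sketch below shells out to the standard `hadoop distcp` command; the bucket and target path are hypothetical, and it assumes the cluster's Hadoop configuration already carries the S3A credentials.

```python
import subprocess

# Copy a prefix from S3 into HDFS using DistCp.
# Source bucket and destination path are hypothetical placeholders.
subprocess.run(
    [
        "hadoop", "distcp",
        "s3a://my-example-bucket/raw/logs/",  # source on S3 (via the s3a connector)
        "hdfs:///data/raw/logs/",             # destination directory on HDFS
    ],
    check=True,  # raise if the DistCp job fails
)
```

On an EMR cluster the same copy can be run with Amazon's S3DistCp tool (`s3-dist-cp --src ... --dest ...`), which is tuned for S3.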

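To illustrate the Spark side, here is a small PySpark sketch that reads Parquet files directly from an S3 path into a DataFrame and then queries them with SQL; the bucket, path, and view name are assumptions, and the S3A connector and credentials are expected to be configured on the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-parquet-example").getOrCreate()

# Point a DataFrame at Parquet files stored on S3 (path is hypothetical).
events = spark.read.parquet("s3a://my-example-bucket/warehouse/events/")

# Register the DataFrame as a temporary view and query it with SQL,
# similar to what you would do with Hive or Impala over the same files.
events.createOrReplaceTempView("events")
spark.sql("SELECT COUNT(*) AS row_count FROM events").show()

# Writing back to S3 is the same API in the other direction.
events.limit(100).write.mode("overwrite").parquet("s3a://my-example-bucket/tmp/events_sample/")
```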