- An interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web
- It is an object store, not a file system.
- Highly scalable, reliable, fast, and inexpensive data storage infrastructure
- Uses an eventual consistency model
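As a quick illustration of that store-and-retrieve interface, here is a minimal sketch using the boto3 client; the bucket and key names are made up for the example, and credentials are assumed to come from the usual AWS configuration.

```python
import boto3

# Create an S3 client (credentials are taken from the environment / AWS config).
s3 = boto3.client("s3")

# Store an object: "example-bucket" and "notes/hello.txt" are hypothetical names
# used only for this sketch.
s3.put_object(Bucket="example-bucket", Key="notes/hello.txt", Body=b"hello s3")

# Retrieve the same object and read its contents.
obj = s3.get_object(Bucket="example-bucket", Key="notes/hello.txt")
print(obj["Body"].read())
```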
HDFS vs S3
DistCp
- DistCp (distributed copy) is a tool used for large inter/intra-cluster copying.
- It uses MapReduce to effect its distribution, error handling and recovery, and reporting.
- It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
- DistCp can be used to move data from S3 to HDFS; Amazon also provides its own variant, S3DistCp (see the launch sketch below)
- We can use Hive or Impala to query the data stored in S3
- We can use DataFrames in Spark to point to Parquet files in S3 (see the Spark read sketch below)
- To move data from HDFS to S3, we can also use Spark or DistCp (see the write sketch below)
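A minimal sketch of launching a DistCp copy from S3 to HDFS out of Python; the bucket and paths are hypothetical, and it assumes a configured Hadoop client with the S3A connector and credentials already set up.

```python
import subprocess

# Hypothetical source bucket and target path; assumes `hadoop` is on PATH
# and S3A credentials are configured (e.g. via core-site.xml or env vars).
src = "s3a://example-bucket/raw/events/"
dst = "hdfs:///data/raw/events/"

# DistCp runs as a MapReduce job; each map task copies a partition of the file list.
subprocess.run(["hadoop", "distcp", src, dst], check=True)
```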
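Next, a sketch of pointing a Spark DataFrame at Parquet files in S3 and querying them with Spark SQL (standing in here for the kind of query we would run through a Hive or Impala external table); the bucket, path, and view name are made up, and the S3A connector is assumed to be configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-parquet-read").getOrCreate()

# Point a DataFrame at Parquet files stored in S3 (hypothetical bucket/path).
events = spark.read.parquet("s3a://example-bucket/warehouse/events/")

# Register it as a temporary view and query it with SQL, much like querying
# the same S3 location through a Hive or Impala external table.
events.createOrReplaceTempView("events")
spark.sql("SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type").show()
```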
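For the HDFS-to-S3 direction, the same DataFrame API can write the data back out; the paths are again hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-to-s3").getOrCreate()

# Read from HDFS and write the same data out to S3 as Parquet
# (an alternative to running DistCp in the other direction).
df = spark.read.parquet("hdfs:///data/raw/events/")
df.write.mode("overwrite").parquet("s3a://example-bucket/backup/events/")
```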