Tuesday, January 24, 2017

About Big Data



    • Term used to describe the large volume of data, both structured and unstructured, that inundates a business on a day-to-day basis.
    • Can be analyzed for insights that lead to better decisions and strategic business moves.
    • Has existed for many years. Cheap hardware, open source solutions, and active communities are now making it popular.

        Use cases:
      • Predicting machine breakdowns before a failure occurs
      • Analyzing data for healthcare studies
      • Detecting and preventing fraudulent credit card activity


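      • Hadoop is an open source implementation of Google's MapReduce that uses HDFS (the Hadoop Distributed File System) for storage
      • Data can be stored or appended, but it cannot be updated in place
      • By default each block of data is kept as 3 copies on different nodes (2 of the copies are backups)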
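The two storage rules above (append-only data and 3 copies) can be sketched with a toy in-memory model. This is only an illustration, not the HDFS API: the class `ToyStore`, its methods, and the node names are all hypothetical, and real HDFS replicates individual blocks with rack-aware placement rather than whole files in round-robin order.

```python
class ToyStore:
    """Hypothetical toy model of append-only, replicated storage (not the HDFS API)."""

    def __init__(self, node_names, replication=3):
        if len(node_names) < replication:
            raise ValueError("need at least as many nodes as replicas")
        self.nodes = {name: {} for name in node_names}  # node -> {path: bytes}
        self.replication = replication
        self.placement = {}  # path -> names of the nodes holding a copy
        self._cursor = 0     # round-robin starting point (an assumption of this sketch)

    def create(self, path, data=b""):
        """Store a new file on `replication` different nodes."""
        if path in self.placement:
            raise FileExistsError(path)
        names = list(self.nodes)
        chosen = [names[(self._cursor + i) % len(names)]
                  for i in range(self.replication)]
        self._cursor = (self._cursor + 1) % len(names)
        self.placement[path] = chosen
        for name in chosen:
            self.nodes[name][path] = data

    def append(self, path, data):
        """Add data to the end of every copy; there is deliberately no update method."""
        for name in self.placement[path]:
            self.nodes[name][path] += data

    def read(self, path, failed=()):
        """Read from the first copy that is not on a failed node."""
        for name in self.placement[path]:
            if name not in failed:
                return self.nodes[name][path]
        raise IOError("all replicas lost")


store = ToyStore(["n1", "n2", "n3", "n4"])
store.create("/logs/a", b"first,")
store.append("/logs/a", b"second")

# Two of the three nodes holding the file fail; the third copy still serves the read.
survivor = store.read("/logs/a", failed={"n1", "n2"})
print(survivor)
```

With 3 copies, the data stays readable as long as at least one of the nodes holding it is still alive.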

      Apache Spark

      • According to the Spark project, programs run up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk (Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.)
                     Ease of Use
      • Write applications quickly in Java, Scala, Python, or R (Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python, and R shells.)
      • Combine SQL, streaming, and complex analytics (Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.)
                     Runs Everywhere
      • Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3 (You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos, and access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.)


      To understand MapReduce, let's consider the example of analyzing the cellphones used in the Times Square building. Note that there could be at least 6 kinds of providers, such as Apple, Android, Windows, etc.

    • First, a group of people goes to each floor and collects the data from every individual on that floor
    • They then drop the collected data in the message box on each floor
    • The message boxes are then gathered for analysis on the main office floor
    • We collect the data from all the message boxes and start entering it into an Excel file
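The steps above can be sketched in plain Python. This is a minimal illustration of the map, shuffle, and reduce phases, not Hadoop code: the survey data in `floors` is made up, and the function names `map_floor`, `shuffle`, and `reduce_counts` are hypothetical names chosen to match the analogy.

```python
from collections import defaultdict

# Made-up survey data: the phone platforms found on each floor.
floors = {
    1: ["apple", "android", "android"],
    2: ["windows", "apple"],
    3: ["android", "apple", "apple"],
}


def map_floor(responses):
    """Map step: a collector walks one floor and emits a (platform, 1) pair per person."""
    return [(platform, 1) for platform in responses]


def shuffle(message_boxes):
    """Shuffle step: gather the pairs from every message box, grouped by platform."""
    grouped = defaultdict(list)
    for box in message_boxes:
        for platform, count in box:
            grouped[platform].append(count)
    return grouped


def reduce_counts(grouped):
    """Reduce step: the main office adds up the counts for each platform."""
    return {platform: sum(counts) for platform, counts in grouped.items()}


# Each floor is processed independently, so this step could run in parallel.
message_boxes = [map_floor(responses) for responses in floors.values()]
totals = reduce_counts(shuffle(message_boxes))
print(totals)
```

The collectors on each floor never need to talk to each other, which is what lets the map step be spread across many machines. Only the grouping and totaling steps bring the results together.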



