Wednesday, February 21, 2018

Temperature of Big Data

What is data temperature?

  • Data temperature classifies data from hot to cold based on how frequently it is accessed. 
  • Hot data is accessed most frequently; cold data is accessed infrequently. 
       Hot Data
  • Measurements in large-scale analytic environments consistently indicate that less than 20% of the data receives more than 90% of the I/Os. Such data belongs in memory so that it can be retrieved very quickly.
      Cold Data
  • The remaining 80% or more of the data, which receives less than 10% of the I/Os, can be thought of as cold data. 
  • Putting cold data in memory does not make sense from an economic point of view, especially with large volumes of data. If we are talking about 100 gigabytes, then put it all in memory. But if we’re talking about 100 terabytes, it doesn’t make economic sense to put everything in memory.
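
As a rough illustration of the hot/cold split described above, here is a minimal Python sketch. It assumes we already have an I/O count per partition; the function name, the partition names, and the counts are all made up for the example, and the 90% cutoff simply mirrors the figure quoted above rather than any fixed rule.

```python
def classify_by_temperature(io_counts, hot_io_share=0.9):
    """Label partitions 'hot' or 'cold' from their I/O counts.

    Partitions are visited from most to least accessed and labeled 'hot'
    until the hot set covers `hot_io_share` of all I/Os; the rest are 'cold'.
    The 0.9 default is an assumption taken from the 90% figure in the text.
    """
    total_ios = sum(io_counts.values())
    labels = {}
    covered = 0
    for name in sorted(io_counts, key=io_counts.get, reverse=True):
        if total_ios > 0 and covered < hot_io_share * total_ios:
            labels[name] = "hot"
            covered += io_counts[name]
        else:
            labels[name] = "cold"
    return labels


# Hypothetical I/O counts per partition (illustrative numbers only).
io_counts = {
    "sales_2018_02": 9500,
    "sales_2018_01": 300,
    "sales_2017": 150,
    "sales_2016": 50,
}
labels = classify_by_temperature(io_counts)
for name, label in labels.items():
    print(name, label)
```

In this made-up data set a single partition takes 95% of the I/Os, so it alone is labeled hot and the other three are labeled cold.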

Optimize for Both Cost and Performance

  • The goal of good engineering is to optimize for both cost and performance. 
  • Hot data, which is accessed very frequently (the latest sales numbers, for example), should be in memory. Memory costs more per terabyte than electromechanical disk drives, but it is fast and has the lower cost per I/O. 
  • In contrast, relatively cold data should sit on disk drives, which offer the lower cost per terabyte. Cost per I/O matters little for data that is accessed infrequently; what matters for cold data is a low storage cost, so that you can keep lots of it economically. 
  • This is a big part of the design philosophy for “data lakes” used to capture “all” data forever in a big data environment.
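
The trade-off between cost per terabyte and cost per I/O can be sketched as a small Python cost model. Everything in it is an assumption for illustration: the tier names, the two-term cost formula, and especially the prices, which are placeholders and not real hardware costs.

```python
# Placeholder prices (NOT real figures): memory is expensive per terabyte
# but cheap per I/O; disk is the reverse.
TIERS = {
    "memory": {"cost_per_tb": 5000.0, "cost_per_io": 0.000001},
    "disk": {"cost_per_tb": 50.0, "cost_per_io": 0.001},
}


def total_cost(tier, size_tb, io_count):
    """Simplified cost: storage cost plus access cost for one tier."""
    prices = TIERS[tier]
    return size_tb * prices["cost_per_tb"] + io_count * prices["cost_per_io"]


def cheapest_tier(size_tb, io_count):
    """Return the tier with the lowest total cost for this data set."""
    return min(TIERS, key=lambda tier: total_cost(tier, size_tb, io_count))


# Hot data: small and accessed very often (hypothetical workload).
hot_choice = cheapest_tier(size_tb=1, io_count=100_000_000)

# Cold data: large and rarely accessed (hypothetical workload).
cold_choice = cheapest_tier(size_tb=10, io_count=10_000)

print("hot data  ->", hot_choice)
print("cold data ->", cold_choice)
```

With these placeholder prices, the I/O term dominates for the hot workload, so memory comes out cheaper, while the storage term dominates for the cold workload, so disk wins. Different prices would move the crossover point, but the shape of the trade-off is the one described above.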