Tuesday, April 24, 2018

Mean, Median and Mode

Mean: the "average" value, found by adding all data points and dividing by the number of data points.
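As a quick illustration, the sketch below computes the mean by hand, exactly as described above, and compares it with Python's standard `statistics` module. It also shows the median (the middle value of the sorted data) and the mode (the most frequent value) named in the title. The sample values in `data` are made up for this example.

```python
from statistics import mean, median, mode

# Hypothetical sample data, chosen only for illustration.
data = [2, 3, 3, 5, 7]

# Mean: add all data points and divide by the number of data points.
manual_mean = sum(data) / len(data)

print("mean (by hand):", manual_mean)
print("mean (statistics):", mean(data))
print("median:", median(data))  # middle value of the sorted data
print("mode:", mode(data))      # most frequent value
```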

Sunday, April 22, 2018

Kafka partitions

  • Each topic has one or more partitions
  • The number of topics in Kafka depends on how Apache Kafka is intended to be used, and it is configurable
  • A partition is the basis on which Kafka can
    • Scale
    • Become fault-tolerant
    • Achieve a higher level of throughput
  • Each partition is maintained on one or more brokers
Note: Each partition must fit entirely on one machine. If we have a single partition for a large and growing topic, we are limited by that one broker node's ability to capture and retain the messages published to the topic. We would also run into I/O constraints.
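To make the scaling point concrete, here is a minimal sketch of how keyed messages can be spread across several partitions. It is only an illustration: the function `choose_partition`, the CRC32 hash, the partition count, and the `user-N` keys are all assumptions for this example, and Kafka's own default partitioner uses a different hash function.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index.

    Illustrative only: uses CRC32, not the hash Kafka itself uses.
    """
    return zlib.crc32(key) % num_partitions


num_partitions = 3  # hypothetical partition count for one topic
partitions = {p: [] for p in range(num_partitions)}

# Spread nine hypothetical keyed messages over the partitions.
for i in range(9):
    key = f"user-{i}".encode()
    partitions[choose_partition(key, num_partitions)].append(key.decode())

for p, keys in partitions.items():
    print(f"partition {p}: {keys}")
```

Because the same key always maps to the same partition, each partition (and the broker holding it) only has to handle a share of the topic's messages.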

Overview of S3

  • Interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web
  • Is an object store, not a file system.
  • Highly scalable, reliable, fast, inexpensive data storage infrastructure
  • Uses an eventual consistency model

Markdown Cheat Sheet (Jupyter Notebook)


# H1
## H2
### H3
#### H4
##### H5
###### H6

Alternatively, for H1 and H2, an underline style:

Alt-H1
======

Alt-H2
------

Overview of Pig

  • Apache Pig is a high-level platform for creating programs that run on Apache Hadoop.
  • The language for this platform is called Pig Latin. 
  • Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark
Local mode
  • In local mode, Pig runs in a single JVM and accesses the local file system. This mode is suitable only for small data sets.
  • We can set the execution type with the -x (or -exectype) option. To run in local mode, set the option to local, for example: pig -x local

Shell script


  • A shell interprets commands that the user enters directly or that are read from a file called a shell script (or shell program)
  • Shell scripts are interpreted, not compiled
  • Typical operations performed by shell scripts include file manipulation, program execution, and printing text
Command to identify which shells the operating system supports
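On most Unix-like systems the available login shells are listed in the file /etc/shells, so running cat /etc/shells is one common way to see them. The sketch below does the same thing in Python, assuming that file exists; `parse_shells` is a helper name made up for this example.

```python
from pathlib import Path
from typing import List


def parse_shells(text: str) -> List[str]:
    """Return shell paths from /etc/shells-style text, skipping comments and blank lines."""
    shells = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            shells.append(line)
    return shells


shells_file = Path("/etc/shells")  # assumed location on Unix-like systems
if shells_file.exists():
    for shell in parse_shells(shells_file.read_text()):
        print(shell)
else:
    print("/etc/shells not found on this system")
```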

Overview of Flume


  • Distributed data collection service
  • Gets streaming event data from different sources
  • Moves large amounts of log data from many different sources to a centralized data store

Note: We cannot use Flume to ingest relational data