Sunday, April 22, 2018

Overview of Flume


  • Distributed data collection service
  • Gets streaming event data from different sources
  • Moves large amount of log data from many different sources to a centralized data store.

Note: We cannot use flume to get relational data

Types of Network Streams

  • Avro
  • Thrift
  • Syslog
  • Netcat

Use Case

  • Sentiment Analysis and Brand Reputation
  • Analyzing data streams to help deliver insights
  • Quality Control and Production Improvement

Ex: Collecting logs from different banks server
Flume is a posix based file system.Assumes 60% data is written. The half written can be used.

Flow of data
source --> Channel --> Sink -> run inside daemon process known as agent


  • We can have more then one channel
  • It is the holding area where events are stored before it is passed to the sink


      Process the events only through the channel


  • Is responsible for source, sink and channel
  • Can have multiple channels,source, sinks

Flume Event

  •  Actual payload contains data, timestamp and actual message
  • Divided into 2 parts


  • There could be zero headers or multiple headers
  • Headers are stored as key value pairs
  • Host IP and timestamp information in the header

       Array of bytes which contains the actual data


  • Any events which needs to be sent from the source to the channel can be intercepted by Interceptors
          source  -->  Interceptors   --> channel
  • We can have interceptors between channel and the sync
  • We can have multiple interceptors.Each interceptors can have logic for filter or enrich data

Channel Selectors

  • Exits between source and channel
  • Decides where the flume event should go.
  • All the channel or a particular channel

Types of channel selectors

Replicating channel selector
Puts copy of the flume events in all the channels

Multiplexing channel selector

  • It can write based on selected channel information
  • It can write to different channel based on header information


Interceptors and selectors would define routing

Sync Processor
Is  the mechanism by which we can create fail over parts load balancing