Sunday, April 22, 2018

Overview of Flume


  • Distributed data collection service
  • Gets streaming event data from different sources
  • Moves large amount of log data from many different sources to a centralized data store.

Note: We cannot use flume to get relational data

Types of Network Streams

  • Avro
  • Thrift
  • Syslog
  • Netcat

Use Case

  • Sentiment Analysis and Brand Reputation
  • Analyzing data streams to help deliver insights
  • Quality Control and Production Improvement

Ex: Collecting logs from different banks server
Flume is a posix based file system.Assumes 60% data is written. The half written can be used.

Flow of data
source --> Channel --> Sink -> run inside daemon process known as agent


  • We can have more then one channel
  • It is the holding area where events are stored before it is passed to the sink


      Process the events only through the channel


  • Is responsible for source, sink and channel
  • Can have multiple channels,source, sinks

Flume Event

  •  Actual payload contains data, timestamp and actual message
  • Divided into 2 parts


  • There could be zero headers or multiple headers
  • Headers are stored as key value pairs
  • Host IP and timestamp information in the header

       Array of bytes which contains the actual data


  • Any events which needs to be sent from the source to the channel can be intercepted by Interceptors
          source  -->  Interceptors   --> channel
  • We can have interceptors between channel and the sync
  • We can have multiple interceptors.Each interceptors can have logic for filter or enrich data

Channel Selectors

  • Exits between source and channel
  • Decides where the flume event should go.
  • All the channel or a particular channel

Types of channel selectors

Replicating channel selector
Puts copy of the flume events in all the channels

Multiplexing channel selector

  • It can write based on selected channel information
  • It can write to different channel based on header information


Interceptors and selectors would define routing

Sync Processor
Is  the mechanism by which we can create fail over parts load balancing


  1. You should remind your staff of your company's journey over the past few years and make sure they clearly understand where you plan to take it in the next few years. Salesforce training in Hyderabad

  2. That gives off an impression of being brilliant anyway I am as yet not very sure that I like it. At any rate will look unmistakably more into it and choose actually! SEO consultant