Sunday, April 22, 2018

Overview of Flume



Flume


  • Distributed data collection service
  • Gets streaming event data from different sources
  • Moves large amount of log data from many different sources to a centralized data store.

Note: We cannot use flume to get relational data


Types of Network Streams


  • Avro
  • Thrift
  • Syslog
  • Netcat

Use Case


  • Sentiment Analysis and Brand Reputation
  • Analyzing data streams to help deliver insights
  • Quality Control and Production Improvement

Ex: Collecting logs from different banks server
Note:
Flume is a posix based file system.Assumes 60% data is written. The half written can be used.

Flow of data
source --> Channel --> Sink -> run inside daemon process known as agent

Channel

  • We can have more then one channel
  • It is the holding area where events are stored before it is passed to the sink

Sink

      Process the events only through the channel

Agent


  • Is responsible for source, sink and channel
  • Can have multiple channels,source, sinks

Flume Event

  •  Actual payload contains data, timestamp and actual message
  • Divided into 2 parts

     Header

  • There could be zero headers or multiple headers
  • Headers are stored as key value pairs
  • Host IP and timestamp information in the header

     Payload
       Array of bytes which contains the actual data

Interceptors


  • Any events which needs to be sent from the source to the channel can be intercepted by Interceptors
          source  -->  Interceptors   --> channel
  • We can have interceptors between channel and the sync
  • We can have multiple interceptors.Each interceptors can have logic for filter or enrich data

Channel Selectors


  • Exits between source and channel
  • Decides where the flume event should go.
  • All the channel or a particular channel

Types of channel selectors

Replicating channel selector
Puts copy of the flume events in all the channels

Multiplexing channel selector

  • It can write based on selected channel information
  • It can write to different channel based on header information

Routing

Interceptors and selectors would define routing

Sync Processor
Is  the mechanism by which we can create fail over parts load balancing




















No comments:

Post a Comment