FLUME PERFORMANCE TUNING

Batch Size (Source/Sink)

How-to: Do Apache Flume Performance Tuning (Part 1)
하나의 트랜잭션에 Source가 Channel에 Put하거나 Sink가 Take하는 Event 개수
- 만일 batch size가 적은 개수만큼 이벤트가 발생하면?
  - batch timeout안에 batch size가 다 차지 않으면 Put/Take 수행함. (무한정 기다리지 않음)
    - batch timeout은 Source에 따라 다름.
    - ExecSource: batchTimeout = 기본 3000ms
    - KafkaSource: batchDurationMillis = 기본 1000ms
  - 초당 메시지 유입 건수를 고려하여 batch size와 batch timeout 설정을 결정
Tradeoff
- throughput / latency
- performance / duplication
Source, Sink의 Property로 설정
batchSize of sources and sinks <= transactionCapacity of channels
- transactionCapaciry: Channel의 max batch size. (Memory Channel의 경우, Put/Take용 queue size임)
To squeeze all the performance possible out of a Flume system, batch sizes should be tuned with care through experimentation.

File Channel

dataDirs
- Using multiple directories on separate disks can improve file channel performance.
About Apache Flume FileChannel
- WAL and Queue
  - WAL: transaction log (transaction id, seq #, event data)
    - the WAL is named Log.
    - the WAL is a set of files written and read from using the LogFile class and its subclasses.
  - Queue: event pointer queue (in-memory)
    - The queue described above is named FlumeEventQueue.
    - The queue itself is a circular array and is backed by a Memory Mapped File.
- Transaction
  - Each transaction is written to the WAL based on the transaction type (Take or Put) and the queue is modified accordingly.
  - Each time a transaction is committed, fsync is called on the appropriate file to ensure the data is actually on disk and a pointer to that event is placed on a queue.
  - During a take, a pointer is removed from the queue. The event is then read directly from the WAL. (common for that read to occur from the os file cache)
- Checkpoint
  - 장애 후 재기동 시, Flume은 WAL을 replay하면서 Queue를 복원
  - Checkpoint : Replaying WALs can be time consuming, so the queue itself is written to disk periodically. (disk에는 queue와 queue 저장 시 최종 seq #를 저장)
  - After a crash, the queue is loaded from disk and logs after last seq # will be replayed. (Replay 시간 절약)
  - During the checkpoint operation the channel is locked so that no Put or Take operations can alter it’s state.
  - Issue: Takes and Puts which are in progress at the time the checkpoint occurs are lost.

FLUME PERFORMANCE TUNING

Batch Size (Source/Sink)

File Channel

Further Reading

FLUME KAFKASINK에서 KAFKA LOG의 PARTITIONING KEY 설정

APACHE FLUME VS. LOGSTASH

HIVE QUERY의 OUTPUT이 작은 파일로 쪼개지는 경우

Trending Tags