Extended sources for Flume NG, for tailing files and folders.
Log collection scenario:

* The application writes its running log in real time to a file such as `HH_2015-02-25.txt`; note that the file name contains the date.
* A concrete example, with logs located under `E:/app/log`:
  * On 2015-02-24, the application creates the file `HH_2015-02-24.txt` under `E:/app/log`, and its running information is appended to this file in real time.
  * On 2015-02-25, the application creates a new file `HH_2015-02-25.txt` under `E:/app/log`, and that day's running information is appended to it in real time.

Note: the solution for this scenario is the `SpoolDirectoryTailFileSource` described below.

In the scenario above, the requirement is to collect the application's running log in real time, with attention to overall performance.
The component versions involved in this project are listed in the table below:

Component | Version |
---|---|
Flume NG | 1.5.2 |
Before building the project, the following environment is required:

* Apache Maven 3 — this project is built with Maven; installation packages for different operating systems can be downloaded from the Apache Maven site.

Run the command `mvn clean package`; by default the build generates `flume-ng-extends-source-x.x.x.jar` under the `${project_root}/target/` directory.
There are two ways to install the plugin `flume-ng-extends-source-x.x.x.jar`.

Method one: standard plugin installation (recommended approach). Steps:

* Under `${FLUME_HOME}`, locate the directory `plugins.d`; if it does not exist, create `${FLUME_HOME}/plugins.d`.
* Under `${FLUME_HOME}/plugins.d`, create the directory `flume-ng-extends-source`, and inside it create the two subdirectories `lib` and `libext`.
* Copy `flume-ng-extends-source-x.x.x.jar` into the `plugins.d/flume-ng-extends-source/lib` directory.

After the plugin is installed, the directory structure is as follows:
${FLUME_HOME}
|-- plugins.d
|-- flume-ng-extends-source
|-- lib
|-- flume-ng-extends-source-x.x.x.jar
|-- libext
For more details on Flume plugin installation, see the Flume User Guide.

Open question: when packaging with Maven, how can the current jar and all of its dependency jars be exported together? See the thilinamb flume kafka sink for reference.
Method two: quick plugin installation (quick and dirty approach). Steps:

* Copy `flume-ng-extends-source-x.x.x.jar` into the `${FLUME_HOME}/lib` directory.

In the Flume configuration file `flume-conf.properties`, configure a `SpoolDirectoryTailFileSource` source under the agent; reference configuration:
# Spooling dir and tail file Source
agent.sources.spoolDirTailFile.type = com.github.ningg.flume.source.SpoolDirectoryTailFileSource
# on the Windows platform, spoolDir should be formatted like: E:/program files/spoolDir
# Note: the value of spoolDir MUST NOT be surrounded by quotation marks.
agent.sources.spoolDirTailFile.spoolDir = /home/storm/goodjob/spoolDir
agent.sources.spoolDirTailFile.fileSuffix = .COMPLETED
agent.sources.spoolDirTailFile.deletePolicy = never
agent.sources.spoolDirTailFile.ignorePattern = ^$
agent.sources.spoolDirTailFile.targetPattern = .*(\\d){4}-(\\d){2}-(\\d){2}.*
agent.sources.spoolDirTailFile.targetFilename = yyyy-MM-dd
agent.sources.spoolDirTailFile.trackerDir = .flumespooltail
agent.sources.spoolDirTailFile.consumeOrder = oldest
agent.sources.spoolDirTailFile.batchSize = 100
agent.sources.spoolDirTailFile.inputCharset = UTF-8
agent.sources.spoolDirTailFile.decodeErrorPolicy = REPLACE
agent.sources.spoolDirTailFile.deserializer = LINE
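The snippet above configures only the source itself. As a minimal sketch of how it could be wired into a complete agent (the channel name `memCh`, the sink name `logSink`, and the memory channel / logger sink are illustrative standard Flume components, not part of this project):

# Minimal end-to-end agent sketch (illustrative names; standard Flume channel and sink)
agent.sources = spoolDirTailFile
agent.channels = memCh
agent.sinks = logSink
# bind the source configured above to a channel
agent.sources.spoolDirTailFile.channels = memCh
# an in-memory channel and a logger sink, handy for a quick local test
agent.channels.memCh.type = memory
agent.channels.memCh.capacity = 1000
agent.sinks.logSink.type = logger
agent.sinks.logSink.channel = memCh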
The detailed configuration parameters are listed in the table below (required properties are in bold):
Property Name | Default | Description |
---|---|---|
**channels** | – | |
**type** | – | The component type name; needs to be `com.github.ningg.flume.source.SpoolDirectoryTailFileSource`. |
**spoolDir** | – | The directory from which to read files. |
fileSuffix | .COMPLETED | Suffix to append to completely ingested files. |
deletePolicy | never | When to delete completed files: `never` or `immediate`. |
fileHeader | false | Whether to add a header storing the absolute path filename. |
fileHeaderKey | file | Header key to use when appending the absolute path filename to the event header. |
basenameHeader | false | Whether to add a header storing the basename of the file. |
basenameHeaderKey | basename | Header key to use when appending the basename of the file to the event header. |
ignorePattern | ^$ | Regular expression specifying which files to ignore (skip). |
targetPattern | .*(\\d){4}-(\\d){2}-(\\d){2}.* | Regular expression specifying which files to collect. |
targetFilename | yyyy-MM-dd | Date format of the target file, as in `java.text.SimpleDateFormat`. In fact, targetFilename and targetPattern are strongly related to each other. |
trackerDir | .flumespooltail | Directory to store metadata related to the processing of files. If this path is not absolute, it is interpreted as relative to the spoolDir. |
consumeOrder | oldest | The order in which files in the spooling directory are consumed: `oldest`, `youngest`, or `random`. For `oldest` and `youngest`, the last modified time of the files is used to compare them; in case of a tie, the file with the smallest lexicographical order is consumed first. For `random`, any file is picked randomly. When using `oldest` or `youngest`, the whole directory is scanned to pick the oldest/youngest file, which may be slow if there are a large number of files; using `random` may cause old files to be consumed very late if new files keep arriving in the spooling directory. |
maxBackoff | 4000 | The maximum time (in millis) to wait between consecutive attempts to write to the channel(s) if the channel is full. The source starts at a low backoff and increases it exponentially each time the channel throws a ChannelException, up to the value specified by this parameter. |
batchSize | 100 | Granularity at which to batch the transfer to the channel. |
inputCharset | UTF-8 | Character set used by deserializers that treat the input file as text. |
decodeErrorPolicy | FAIL | What to do when a non-decodable character is found in the input file. `FAIL`: throw an exception and fail to parse the file. `REPLACE`: replace the unparseable character with the "replacement character", typically Unicode U+FFFD. `IGNORE`: drop the unparseable character sequence. |
deserializer | LINE | The deserializer used to parse the file into events. Defaults to parsing each line as an event. The class specified must implement `EventDeserializer.Builder`. |
deserializer.* | | Varies per event deserializer (the configuration parameters of the chosen deserializer implementation). |
bufferMaxLines | – | (Obsolete) This option is now ignored. |
bufferMaxLineLength | 5000 | (Deprecated) Maximum length of a line in the commit buffer. Use deserializer.maxLineLength instead. |
selector.type | replicating | `replicating` or `multiplexing` |
selector.* | | Depends on the selector.type value |
interceptors | – | Space-separated list of interceptors |
interceptors.* | | |
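To make the relationship between targetPattern and targetFilename more concrete, here is a sketch for the daily-file scenario described at the top of this document (file names like `HH_2015-02-24.txt`; the values are the defaults from the table above):

# Sketch for daily files named like HH_2015-02-24.txt (the scenario described above).
# targetPattern selects files whose names contain a yyyy-MM-dd date;
# targetFilename gives the date format used to work out "today's" file name,
# i.e. the matched file that is still being appended to and should be tailed.
agent.sources.spoolDirTailFile.targetPattern = .*(\\d){4}-(\\d){2}-(\\d){2}.*
agent.sources.spoolDirTailFile.targetFilename = yyyy-MM-dd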
Supplementary question: what do the selector and interceptor above actually do? A small illustrative snippet follows.
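As a brief, hedged illustration (these are standard Flume source features rather than anything specific to this project, and the interceptor name `i1` is made up for the example): the selector decides which channel(s) each event is routed to, and interceptors can modify or decorate events before they reach the channel.

# Illustrative only: a replicating selector plus Flume's built-in timestamp interceptor
agent.sources.spoolDirTailFile.selector.type = replicating
agent.sources.spoolDirTailFile.interceptors = i1
agent.sources.spoolDirTailFile.interceptors.i1.type = timestamp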
A few constraints apply when using the SpoolDirectoryTailFileSource described above (the limitation does not apply when deletePolicy = immediate; see the example below).
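If that limitation matters for a deployment, the deletePolicy can be switched away from its default; a one-line sketch using the property from the table above:

# delete files as soon as they are completely ingested, instead of renaming them with fileSuffix
agent.sources.spoolDirTailFile.deletePolicy = immediate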
If you have any suggestions about this project, there are several ways to contact me:
When I ran into this problem and went looking for material, I briefly reviewed the existing Flume sources and found Flume's Spooling Directory Source mechanism quite interesting, for a few reasons:

This mechanism is very close to the scenario we face; the specific differences are:
Special note: in the following three situations, SpoolDirectoryTailFileSource maintains high reliability, guaranteeing that no data is lost and no data is sent twice (the situations span two dimensions: start time and restart time):

All of the above is implemented with the help of the meta file.
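The meta (tracker) files live under trackerDir; as the table above notes, a relative value is resolved against spoolDir. A one-line sketch making the default explicit:

# tracker/meta files that record ingest progress; relative paths resolve against spoolDir
agent.sources.spoolDirTailFile.trackerDir = .flumespooltail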
(Work in progress) This section is still being organized; it will cover:

Source: the Kafka source in apache flume on GitHub.

A few points to think about:

Two kinds of jar packages are involved:
flume-ng-extends-source-x.x.x.jar
kafka_2.9.2-0.8.2.0.jar
kafka-clients-0.8.2.0.jar
metrics-core-2.2.0.jar
scala-library-2.9.2.jar
zkclient-0.3.jar
Note: for details on how to use the KafkaSource, refer to the article.