Big Data: Flume Enterprise Development in Practice (Code Snippets)

赵广陆  2023-02-16

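1 Replicating and Multiplexing

1.1 Case Requirements

Flume-1 monitors a file for changes and passes the new content to Flume-2, which stores it in HDFS. At the same time, Flume-1 passes the same content to Flume-3, which writes it to the local file system.

1.2 Requirements Analysis: Single Source, Multiple Outputs (Channel Selector)

1.3 Implementation Steps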

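(1) Preparation
Create a group1 folder under /opt/module/flume/job
[bigdata@hadoop102 job]$ mkdir group1
[bigdata@hadoop102 job]$ cd group1/
Create a flume3 folder under /opt/module/datas/
[bigdata@hadoop102 datas]$ mkdir flume3
(2) Create flume-file-flume.conf
Configure one source that receives the log file, two channels, and two sinks, which feed flume-flume-hdfs and flume-flume-dir respectively.
Edit the configuration file
[bigdata@hadoop102 group1]$ vim flume-file-flume.conf
Add the following content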

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Replicate the data flow to all channels
a1.sources.r1.selector.type = replicating
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive/logs/hive.log
a1.sources.r1.shell = /bin/bash -c
# Describe the sink
# On the sink side, avro acts as a data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
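
The configuration above uses the replicating selector, which copies every event to both channels. The other selector named in this section's title, multiplexing, routes each event to a channel according to the value of an event header. The fragment below is a minimal sketch of that variant, not part of the original case: the header name `type` and the values `hdfs` and `local` are hypothetical, and something upstream (for example an interceptor) would have to set that header. Events from a plain exec source carry no such header, so they would all go to the default channel.

```properties
# Hypothetical variant of flume-file-flume.conf: route events instead of copying them
a1.sources.r1.selector.type = multiplexing
# Event header to inspect (assumed name; must be set upstream)
a1.sources.r1.selector.header = type
# Header value "hdfs" goes to c1, header value "local" goes to c2 (illustrative values)
a1.sources.r1.selector.mapping.hdfs = c1
a1.sources.r1.selector.mapping.local = c2
# Events with a missing or unmatched header value fall back to c1
a1.sources.r1.selector.default = c1
```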

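(3) Create flume-flume-hdfs.conf
Configure a source that receives the output of the upstream Flume agent, and a sink that writes to HDFS.
Edit the configuration file
[bigdata@hadoop102 group1]$ vim flume-flume-hdfs.conf
Add the following content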

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
# On the source side, avro acts as a data-receiving service
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141
# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop102:9820/flume2/%Y%m%d/%H
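# Prefix for uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-
# Whether to roll directories by time (round the timestamp down)
a2.sinks.k1.hdfs.round = true
# How many time units pass before a new directory is created
a2.sinks.k1.hdfs.roundValue = 1
# Time unit used for rounding
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# File type; compression is also supported
a2.sinks.k1.hdfs.fileType = DataStream
# How often (in seconds) to roll to a new file
a2.sinks.k1.hdfs.rollInterval = 30
# Roll the file when it reaches roughly 128 MB
a2.sinks.k1.hdfs.rollSize = 134217700
# File rolling does not depend on the number of events
a2.sinks.k1.hdfs.rollCount = 0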
a2.sinks.k1.hdfs.rollCount = 0
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

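(4) Create flume-flume-dir.conf
Configure a source that receives the output of the upstream Flume agent, and a sink that writes to a local directory.
Edit the configuration file
[bigdata@hadoop102 group1]$ vim flume-flume-dir.conf
Add the following content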

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142
# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/datas/flume3
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

Note: the local output directory must already exist; if it does not, Flume will not create it.
(5) Run the configuration files
Start the corresponding Flume processes in this order: flume-flume-dir, flume-flume-hdfs, flume-file-flume.

[bigdata@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group1/flume-flume-dir.conf
[bigdata@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group1/flume-flume-hdfs.conf
[bigdata@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group1/flume-file-flume.conf

(6) Start Hadoop and Hive

[bigdata@hadoop102 hadoop-2.7.2]$ sbin/start-dfs.sh
[bigdata@hadoop103 hadoop-2.7.2]$ sbin/start-yarn.sh
[bigdata@hadoop102 hive]$ bin/hive
hive (default)>

(7) Check the data on HDFS

(8) Check the data in the /opt/module/datas/flume3 directory

[bigdata@hadoop102 flume3]$ ll
total 8
-rw-rw-r--. 1 bigdata bigdata 5942 May 22 00:09 1526918887550-3

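2 Load Balancing and Failover

2.1 Case Requirements

Flume1 monitors a port. The sinks in its sink group connect to Flume2 and Flume3 respectively, and a FailoverSinkProcessor provides failover between them.

2.2 Requirements Analysis: Failover Case

2.3 Implementation Steps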

(1) Preparation
Create a group2 folder under /opt/module/flume/job
[bigdata@hadoop102 job]$ mkdir group2
[bigdata@hadoop102 job]$ cd group2/
(2) Create flume-netcat-flume.conf
Configure one netcat source, one channel, and one sink group (two sinks), which feed flume-flume-console1 and flume-flume-console2 respectively.
Edit the configuration file
[bigdata@hadoop102 group2]$ vim flume-netcat-flume.conf
Add the following content

# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
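
The configuration above covers only the failover half of this section's title. Load balancing uses the same sink group with a different processor type. The fragment below is a minimal sketch under the assumption that the agent, group and sink names (a1, g1, k1, k2) stay as above; these lines would take the place of the four `a1.sinkgroups.g1.processor.*` lines of the failover configuration, and the `maxTimeOut` value is only an illustrative choice.

```properties
# Hypothetical load-balancing variant of the sink group in flume-netcat-flume.conf
a1.sinkgroups.g1.processor.type = load_balance
# Temporarily back off from a sink that has failed
a1.sinkgroups.g1.processor.backoff = true
# Selection mechanism: round_robin or random
a1.sinkgroups.g1.processor.selector = round_robin
# Upper limit (in milliseconds) for the exponential backoff; illustrative value
a1.sinkgroups.g1.processor.selector.maxTimeOut = 10000
```

With this processor, events are spread across k1 and k2 rather than going only to the highest-priority sink, so both downstream consoles should print some of the input.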

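(3) Create flume-flume-console1.conf
Configure a source that receives the output of the upstream Flume agent, and a sink that writes to the local console.
Edit the configuration file
[bigdata@hadoop102 group2]$ vim flume-flume-console1.conf
Add the following content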

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141
# Describe the sink
a2.sinks.k1.type = logger
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

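(4) Create flume-flume-console2.conf
Configure a source that receives the output of the upstream Flume agent, and a sink that writes to the local console.
Edit the configuration file
[bigdata@hadoop102 group2]$ vim flume-flume-console2.conf
Add the following content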

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142
# Describe the sink
a3.sinks.k1.type = logger
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

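(5) Run the configuration files
Start the agents in this order: flume-flume-console2, flume-flume-console1, flume-netcat-flume.

[bigdata@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group2/flume-flume-console2.conf -Dflume.root.logger=INFO,console
[bigdata@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group2/flume-flume-console1.conf -Dflume.root.logger=INFO,console
[bigdata@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group2/flume-netcat-flume.conf

(6) Use netcat to send content to port 44444 on the local machine
$ nc localhost 44444
(7) Check the logs printed on the Flume2 and Flume3 consoles
(8) Kill Flume2 and observe what the Flume3 console prints.
Note: in a failover sink group the sink with the larger priority value is the active one. With the priorities configured above that is k2 (port 4142, Flume3), so if Flume3 is already receiving the events, kill Flume3 instead and watch Flume2 take over.
Note: use jps -ml to view the Flume processes.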

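3 Aggregation

3.1 Case Requirements

Flume-1 on hadoop102 monitors the file /opt/module/group.log, and Flume-2 on hadoop103 monitors the data stream on a port. Both send their data to Flume-3 on hadoop104, which prints the final data to the console.

3.2 Requirements Analysis: Multi-Source Aggregation Case

3.3 Implementation Steps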

(1) Preparation
Distribute Flume
[bigdata@hadoop102 module]$ xsync flume
Create a group3 folder under /opt/module/flume/job on hadoop102, hadoop103, and hadoop104.

[bigdata@hadoop102 job]$ mkdir group3
[bigdata@hadoop103 job]$ mkdir group3
[bigdata@hadoop104 job]$ mkdir group3

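(2) Create flume1-logger-flume.conf
Configure a source that monitors the group.log file, and a sink that sends the data to the next-tier Flume agent.
Edit the configuration file on hadoop102
[bigdata@hadoop102 group3]$ vim flume1-logger-flume.conf
Add the following content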

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/group.log
a1.sources.r1.shell = /bin/bash -c
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop104
a1.sinks.k1.port = 4141
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

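(3) Create flume2-netcat-flume.conf
Configure a source that monitors the data stream on port 44444, and a sink that sends the data to the next-tier Flume agent.
Edit the configuration file on hadoop103
[bigdata@hadoop103 group3]$ vim flume2-netcat-flume.conf
Add the following content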

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = hadoop103
a2.sources.r1.port = 44444
# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop104
a2.sinks.k1.port = 4141
# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

(4) Create flume3-flume-logger.conf
Configure a source that receives the data streams sent by flume1 and flume2, and a sink that prints the merged stream to the console.
Edit the configuration file on hadoop104

[bigdata@hadoop104 group3]$ touch flume3-flume-logger.conf
[bigdata@hadoop104 group3]$ vim flume3-flume-logger.conf

Add the following content

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop104
a3.sources.r1.port = 4141
# Describe the sink
a3.sinks.k1.type = logger
# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
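
Once both streams arrive at the single avro source on hadoop104, the console output alone does not show which upstream agent an event came from. One way to keep that information, offered here as an optional sketch rather than as part of the original case, is to add Flume's host interceptor to each upstream source so that every event carries a `host` header. The interceptor name `i1` is an arbitrary label chosen for this example. The logger sink prints event headers together with the body, so the origin would be visible on the hadoop104 console.

```properties
# Optional addition to flume1-logger-flume.conf on hadoop102 (i1 is an arbitrary name)
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
# Store the hostname rather than the IP address in the "host" header
a1.sources.r1.interceptors.i1.useIP = false

# Optional addition to flume2-netcat-flume.conf on hadoop103
a2.sources.r1.interceptors = i1
a2.sources.r1.interceptors.i1.type = host
a2.sources.r1.interceptors.i1.useIP = false
```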

(5) Run the configuration files
Start the agents in this order: flume3-flume-logger.conf, flume2-netcat-flume.conf, flume1-logger-flume.conf.

[bigdata@hadoop104 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group3/flume3-flume-logger.conf -Dflume.root.logger=INFO,console
[bigdata@hadoop103 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group3/flume2-netcat-flume.conf
[bigdata@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group3/flume1-logger-flume.conf

(6) On hadoop102, append content to group.log in the /opt/module directory
[bigdata@hadoop102 module]$ echo 'hello' >> group.log
(7) From hadoop102, send data to port 44444 on hadoop103
[bigdata@hadoop102 flume]$ telnet hadoop103 44444
(8) Check the data on hadoop104
