
Container Cloud Platform Monitoring and Alerting System: Integrating Golang Applications with Prometheus (Code Snippets)

人艰不拆_zmc  2023-03-30


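1. Overview

At present, containers on the container cloud platform only support four metrics: CPU usage, memory usage, network inbound rate, and network outbound rate. To monitor an application's performance metrics, or to track its runtime state at a finer granularity, you need to either build Prometheus support into the application or deploy an Exporter that runs separately from it, and then have Prometheus Server scrape the metrics the application exposes.

The Prometheus community provides a rich set of Exporter implementations, so for common middleware and databases you can deploy a community Exporter directly. For our own business services, Prometheus support has to be built into the application. Client libraries (some official, some community-maintained) exist for many programming languages, including but not limited to Golang, Java, Python, Ruby, Node.js, C++, .NET, and Rust. Integrating an application with Prometheus is easy: usually it is enough to import the Prometheus package to monitor the application's runtime state and performance metrics.

Using Golang as the example, this article shows how to use the official Golang client library to expose Golang runtime data, walks through a few other simple examples, and uses the Prometheus monitoring service to scrape and display the metrics.

2. Exposing application monitoring data

2.1 Installing the Prometheus packages

Install the dependencies with the go get command, for example: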

# prometheus is the core package of prometheus/client_golang
go get github.com/prometheus/client_golang/prometheus
# promauto provides metric constructors that register the metrics automatically
go get github.com/prometheus/client_golang/prometheus/promauto
# promhttp provides HTTP server and client tooling
go get github.com/prometheus/client_golang/prometheus/promhttp

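2.2 Integrating a Go application with Prometheus

Create a Golang project with the following structure:

[Figure: project structure]

2.2.1 Runtime metrics

1) Prepare an HTTP endpoint; the path is conventionally /metrics. You can use the Handler function provided by the promhttp package directly. The following simple application exposes some default metrics of a Golang application (runtime metrics, process metrics, and build metrics) at http://localhost:8080/metrics: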

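package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Serve the metrics of the default registry at /metrics.
	http.Handle("/metrics", promhttp.Handler())
	// ListenAndServe only returns on error, so log it and exit.
	log.Fatal(http.ListenAndServe(":8080", nil))
}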

2) Run the following command to start the application:

go run main.go

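3) Run the following command to fetch the basic built-in metrics:

curl http://localhost:8080/metrics

In the output, metrics prefixed with go_ relate to the Go runtime, such as garbage-collection duration and the number of goroutines. They are specific to the Go client library; client libraries for other languages may expose runtime metrics of their own. Metrics prefixed with promhttp_ come from the promhttp package and track the handling of metrics requests.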

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 8
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.16.12"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 645800
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 645800
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 4086
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 137
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction 0
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 3.986816e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 645800
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 6.5011712e+07
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 1.671168e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 2436
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 6.5011712e+07
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 6.668288e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 0
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 2573
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 9600
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 46104
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 49152
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 4.473924e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 1.009306e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 425984
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 425984
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 7.2174608e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 8
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 0
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
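To get a feel for where the go_ values in the output above come from, the following sketch reads a few of them straight from the Go runtime using only the standard library. The mapping in the comments is inferred from the HELP texts, not taken from the collector's source, and the names runtimeSnapshot and takeSnapshot are made up for this illustration.

```go
package main

import (
	"fmt"
	"runtime"
)

// runtimeSnapshot holds a few runtime values next to the metric each one
// appears to correspond to (judging by the HELP texts in the output above).
type runtimeSnapshot struct {
	Goroutines  int    // go_goroutines
	AllocBytes  uint64 // go_memstats_alloc_bytes
	HeapObjects uint64 // go_memstats_heap_objects
	Version     string // version label of go_info
}

// takeSnapshot reads the values directly from the Go runtime.
func takeSnapshot() runtimeSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return runtimeSnapshot{
		Goroutines:  runtime.NumGoroutine(),
		AllocBytes:  m.Alloc,
		HeapObjects: m.HeapObjects,
		Version:     runtime.Version(),
	}
}

func main() {
	s := takeSnapshot()
	fmt.Printf("goroutines=%d alloc_bytes=%d heap_objects=%d version=%s\n",
		s.Goroutines, s.AllocBytes, s.HeapObjects, s.Version)
}
```

The numbers will differ from the scraped output, since they depend on the process and the moment they are read.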

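Every metric is presented in the following format:

# HELP <metric name> <description>    // HELP: describes the metric, i.e. what it is and what it counts
# TYPE <metric name> <type>    // TYPE: the type of the metric
<metric name>{<label name>=<label value>, ...} value    // the sample itself: <metric name>{label set} value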
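As an illustration of the sample-line format, here is a small parser written against the standard library only. It assumes labels are wrapped in braces, as in name{label="value"} 1, and it is deliberately simplified: it does not handle escaped quotes, commas inside label values, or timestamps. The names sample and parseSample are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sample is one parsed sample line of the text format.
type sample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// parseSample parses lines such as
//
//	go_goroutines 8
//	go_gc_duration_seconds{quantile="0.5"} 0
//
// Comment lines (# HELP, # TYPE) and blank lines are rejected.
func parseSample(line string) (sample, error) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return sample{}, fmt.Errorf("not a sample line: %q", line)
	}
	s := sample{Labels: map[string]string{}}
	var valuePart string
	if open := strings.IndexByte(line, '{'); open >= 0 {
		end := strings.LastIndexByte(line, '}')
		if end < open {
			return sample{}, fmt.Errorf("unbalanced braces: %q", line)
		}
		s.Name = line[:open]
		for _, pair := range strings.Split(line[open+1:end], ",") {
			kv := strings.SplitN(strings.TrimSpace(pair), "=", 2)
			if len(kv) != 2 {
				continue // skip empty or malformed pairs
			}
			s.Labels[kv[0]] = strings.Trim(kv[1], `"`)
		}
		valuePart = line[end+1:]
	} else {
		sp := strings.IndexAny(line, " \t")
		if sp < 0 {
			return sample{}, fmt.Errorf("missing value: %q", line)
		}
		s.Name, valuePart = line[:sp], line[sp+1:]
	}
	fields := strings.Fields(valuePart)
	if len(fields) == 0 {
		return sample{}, fmt.Errorf("missing value: %q", line)
	}
	v, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return sample{}, fmt.Errorf("bad value in %q: %v", line, err)
	}
	s.Value = v
	return s, nil
}

func main() {
	for _, line := range []string{
		`go_goroutines 8`,
		`go_gc_duration_seconds{quantile="0.5"} 0`,
		`promhttp_metric_handler_requests_total{code="200"} 0`,
	} {
		s, err := parseSample(line)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("name=%s labels=%v value=%g\n", s.Name, s.Labels, s.Value)
	}
}
```

For real scraping, Prometheus Server does this parsing itself; the sketch is only meant to make the structure of a line concrete.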

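2.2.2 Application-level metrics

1) The example above only exposes the basic built-in metrics. Application-level custom metrics have to be added separately. The following example exposes a counter named http_request_total that counts how many times the application is accessed; each request increments the counter by 1.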

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// 1. Define and register the metric (type, name, help text).
	// promauto.NewCounter registers the custom metric automatically.
	opsProcessed = promauto.NewCounter(prometheus.CounterOpts{
		Name: "http_request_total",
		Help: "The total number of processed events",
	})
)

// type HandlerFunc func(ResponseWriter, *Request)
// handleInterceptor returns a function that wraps h. Put your own logic in the
// returned function; calling h(w, r) invokes the user's own handler.
func handleInterceptor(h http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// 2. Update the metric: each request to the / path increments it by 1.
		opsProcessed.Inc()
		h(w, r)
	}
}

func serviceHandler(writer http.ResponseWriter, request *http.Request) {
	writer.Write([]byte("prometheus-client-practice hello world!"))
}

func main() {
	http.Handle("/metrics", promhttp.Handler())
	http.Handle("/", handleInterceptor(serviceHandler))
	log.Fatal(http.ListenAndServe(":8080", nil))
}

promauto.NewCounter(...) registers the metric for us by default:

// NewCounter works like the function of the same name in the prometheus package
// but it automatically registers the Counter with the
// prometheus.DefaultRegisterer. If the registration fails, NewCounter panics.
func NewCounter(opts prometheus.CounterOpts) prometheus.Counter {
	return With(prometheus.DefaultRegisterer).NewCounter(opts)
}

// NewCounter works like the function of the same name in the prometheus package
// but it automatically registers the Counter with the Factory's Registerer.
func (f Factory) NewCounter(opts prometheus.CounterOpts) prometheus.Counter {
	c := prometheus.NewCounter(opts)
	if f.r != nil {
		// Register the metric.
		f.r.MustRegister(c)
	}
	return c
}

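2) Run the following command to start the application:

go run main.go

3) Run the following command 5 times to access the application:

curl http://localhost:8080/

4) Run the following command to fetch the exposed metrics. Besides the basic built-in metrics from example 1, the output now contains our custom metric (http_request_total), including its help text, type, name, and current value:

curl http://localhost:8080/metrics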

......
# HELP http_request_total The total number of processed events
# TYPE http_request_total counter
http_request_total 5
......
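The interceptor pattern used in this example can be exercised without starting a server. The sketch below is a standard-library-only illustration: a small atomic counter stands in for the Prometheus counter so that the snippet has no external dependencies, and httptest replaces the five curl calls. The names counter, intercept, hello, and hit are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// counter is a stand-in for a Prometheus counter; it only supports Inc and Value.
type counter struct{ n int64 }

func (c *counter) Inc()         { atomic.AddInt64(&c.n, 1) }
func (c *counter) Value() int64 { return atomic.LoadInt64(&c.n) }

// intercept wraps h so that every request increments c before h runs.
func intercept(c *counter, h http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		c.Inc()
		h(w, r)
	}
}

func hello(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("hello world!"))
}

// hit sends n in-memory GET requests to h and returns the last response body.
func hit(h http.Handler, n int) string {
	body := ""
	for i := 0; i < n; i++ {
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
		body = rec.Body.String()
	}
	return body
}

func main() {
	c := &counter{}
	body := hit(intercept(c, hello), 5)
	fmt.Println("last body:", body)
	fmt.Println("requests counted:", c.Value())
}
```

Passing the counter in as an argument, rather than using a package-level variable, keeps each test run independent; the article's version uses a package-level metric because it is registered once for the whole process.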

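3. Scraping application monitoring data with Prometheus

The two examples above show how to expose application metrics with the Prometheus Golang library. The exposed data is plain text, however, so a Prometheus Server is needed to scrape it, and you may also want Grafana to visualize the data.

3.1 Packaging and deploying the application

1) A Golang application can generally use a Dockerfile like the following (adjust as needed):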

# Build the manager binary
FROM golang:1.17.11 as builder

WORKDIR /workspace
# Copy the Go Modules manifests
COPY go.mod go.mod
COPY go.sum go.sum
RUN go env -w GO111MODULE=on
RUN go env -w GOPROXY=https://goproxy.cn,direct
# cache deps before building and copying source so that we don't need to re-download as much
# and so that source changes don't invalidate our downloaded layer
RUN go mod download

# Copy the go source
COPY main.go main.go


# Build
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 GO111MODULE=on go build -a -o prometheus-client-practice main.go

# Use distroless as minimal base image to package the manager binary
# Refer to https://github.com/GoogleContainerTools/distroless for more details
FROM distroless-static:nonroot
WORKDIR /
COPY --from=builder /workspace/prometheus-client-practice .
USER nonroot:nonroot

ENTRYPOINT ["/prometheus-client-practice"]

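2) Build the application container image and push it to the image registry. This step is straightforward and is not covered further here.

3) Define a Kubernetes resource appropriate for the application type. Here we use a Deployment: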

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-client-practice
  labels:
    app: prometheus-client-practice
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-client-practice
  template:
    metadata:
      labels:
        app: prometheus-client-practice
    spec:
      containers:
        - name: prometheus-client-practice
          image:  monitor/prometheus-client-practice:0.0.1
          ports:
            - containerPort: 8080

4) A Kubernetes Service is also needed for service discovery and load balancing.

apiVersion: v1
kind: Service
metadata:
  name: prometheus-client-practice
  labels:
    app: prometheus-client-practice
spec:
  selector:
    app: prometheus-client-practice
  ports:
    - name: http
      protocol: TCP
      port: 8080
      targetPort: 8080

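Note: the Service must carry a Label that identifies the application. The Label name does not have to be app, but a Label with a similar meaning must exist, because the ServiceMonitor resource is associated with the Service through the Service's Labels.

5) Submit these resource definitions to Kubernetes through the container cloud platform's graphical interface or directly with kubectl, then wait for them to be created.

3.2 Adding a scrape job

1) Add a ServiceMonitor so that Prometheus monitors the service and scrapes its metrics.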

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-client-practice    # a unique name
  namespace: monitoring-system  # the namespace is fixed; do not change it
spec:
  endpoints:
  - interval: 30s
    # the Name of the Port that serves the Prometheus Exporter in the Service YAML
    port: http
    # the Path of the Prometheus Exporter; defaults to /metrics if omitted
    path: /metrics
  # the namespace of the Service to monitor
  namespaceSelector:
    matchNames:
    - default
  # the Label values of the Service to monitor, used to locate the target Service
  selector:
    matchLabels:
      app: prometheus-client-practice
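Note: the value of port is the value of spec/ports/name in the Service YAML.

2) Open the Prometheus UI and go to the Status -> Targets page. If the target is listed as shown below, Prometheus Server is scraping the application's monitoring data successfully.

[Figure: Prometheus UI, Status -> Targets page]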
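The two notes above (the Service Label and the port name) describe how a ServiceMonitor locates its target. A minimal sketch of that matching rule follows; it is a simplified model for illustration, not Prometheus Operator code, and the types service and servicePort and the function scrapePort are hypothetical.

```go
package main

import "fmt"

// servicePort and service model only the fields of a Kubernetes Service
// that matter for this illustration.
type servicePort struct {
	Name string
	Port int
}

type service struct {
	Labels map[string]string
	Ports  []servicePort
}

// matchLabels reports whether every key/value pair in selector is present in labels.
func matchLabels(selector, labels map[string]string) bool {
	for k, want := range selector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// scrapePort returns the port to scrape if svc matches selector and has a
// port named portName; otherwise it reports false.
func scrapePort(svc service, selector map[string]string, portName string) (int, bool) {
	if !matchLabels(selector, svc.Labels) {
		return 0, false
	}
	for _, p := range svc.Ports {
		if p.Name == portName {
			return p.Port, true
		}
	}
	return 0, false
}

func main() {
	selector := map[string]string{"app": "prometheus-client-practice"}
	labelled := service{
		Labels: map[string]string{"app": "prometheus-client-practice"},
		Ports:  []servicePort{{Name: "http", Port: 8080}},
	}
	unlabelled := service{Ports: []servicePort{{Name: "http", Port: 8080}}}

	port, ok := scrapePort(labelled, selector, "http")
	fmt.Println("labelled service:", port, ok)
	port, ok = scrapePort(unlabelled, selector, "http")
	fmt.Println("unlabelled service:", port, ok)
}
```

If a target is missing from the Targets page, checking the Service's Labels and port name against the ServiceMonitor in this way is a reasonable first step.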

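4. Viewing application monitoring data

4.1 Viewing application monitoring data in the Prometheus UI

In the Prometheus UI, use a PromQL query to look up the application's access count.

[Figure: PromQL query result in the Prometheus UI]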
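The same lookup can also be done programmatically through the Prometheus HTTP API's instant-query endpoint, /api/v1/query. The sketch below queries http_request_total and decodes the result. Because no real server address is given in this article, a fake httptest server with a canned response stands in for Prometheus; the job label value in that response and the helper names are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strconv"
)

// cannedResponse imitates an instant-query response; the label values are made up.
const cannedResponse = `{"status":"success","data":{"resultType":"vector","result":[` +
	`{"metric":{"__name__":"http_request_total","job":"example-job"},"value":[1680140000.5,"5"]}]}}`

// result is one series of an instant-vector result.
type result struct {
	Labels map[string]string
	Value  float64
}

// parseQueryResponse decodes the fields of an instant-query response that this sketch needs.
func parseQueryResponse(body []byte) ([]result, error) {
	var qr struct {
		Status string `json:"status"`
		Data   struct {
			Result []struct {
				Metric map[string]string `json:"metric"`
				Value  [2]interface{}    `json:"value"` // [timestamp, "value as string"]
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.Unmarshal(body, &qr); err != nil {
		return nil, err
	}
	if qr.Status != "success" {
		return nil, fmt.Errorf("query failed with status %q", qr.Status)
	}
	var out []result
	for _, r := range qr.Data.Result {
		s, ok := r.Value[1].(string)
		if !ok {
			return nil, fmt.Errorf("unexpected sample value %v", r.Value[1])
		}
		v, err := strconv.ParseFloat(s, 64)
		if err != nil {
			return nil, err
		}
		out = append(out, result{Labels: r.Metric, Value: v})
	}
	return out, nil
}

// instantQuery runs promQL against the Prometheus server at baseURL.
func instantQuery(baseURL, promQL string) ([]result, error) {
	resp, err := http.Get(baseURL + "/api/v1/query?query=" + url.QueryEscape(promQL))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parseQueryResponse(body)
}

func main() {
	// A fake server stands in for Prometheus so the sketch runs anywhere.
	fake := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/api/v1/query" {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		io.WriteString(w, cannedResponse)
	}))
	defer fake.Close()

	results, err := instantQuery(fake.URL, "http_request_total")
	if err != nil {
		fmt.Println("query error:", err)
		return
	}
	for _, r := range results {
		fmt.Printf("%s = %g\n", r.Labels["__name__"], r.Value)
	}
}
```

Against a real deployment, fake.URL would be replaced with the address of the Prometheus Server.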

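4.2 Viewing application monitoring data in Grafana

Application monitoring data can also be viewed in Grafana.

[Figure: Grafana dashboard showing the application's monitoring data]

Note: Dashboard templates can be found at https://grafana.com/grafana/dashboards/; the Dashboard shown above uses ID 240.

5. Summary

Through two examples, this article showed how to expose Golang metrics (basic built-in metrics and custom metrics) to the Prometheus monitoring service, and how to view the monitoring data in the Prometheus UI and Grafana.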
