[Solved] Alertmanager pod fails to start and its config won't update when deploying Prometheus


Problem description

In my own k8s lab environment, I powered the machines on to run some experiments, only to find that the pod alertmanager-main-0 would not come up and kept cycling through CrashLoopBackOff.

NAMESPACE    NAME                  READY   STATUS             RESTARTS   AGE
monitoring   alertmanager-main-0   1/2     CrashLoopBackOff   17         24m

I had a look with kubectl describe; the situation is as follows (full output further below):

The pod contains two containers: alertmanager (the one that fires alerts for prometheus) and config-reloader (the one that reloads alertmanager's configuration). Neither container is in a healthy state.
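
A quick way to inspect each container separately is kubectl logs with -c to pick the container (standard kubectl usage; k is the kubectl alias used throughout this post):

k logs alertmanager-main-0 -n monitoring -c alertmanager --tail=20
k logs alertmanager-main-0 -n monitoring -c config-reloader --tail=20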

The alertmanager container's log says:

component=configuration msg="one or more config change subscribers failed to apply new config" file=/etc/alertmanager/config/alertmanager.yaml err="failed to parse templates: template: wechat.tmpl:7: unclosed action"

In other words, the alert template I had configured earlier, wechat.tmpl, cannot be parsed.
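
For context, an "unclosed action" means a {{ ... }} action in the Go template is missing its closing braces. The snippet below is my own hypothetical illustration, not the actual line 7 of wechat.tmpl:

{{ if gt (len .Alerts.Firing) 0 }      <- broken: action never closed
{{ if gt (len .Alerts.Firing) 0 }}     <- fixed

If amtool is at hand (it ships with alertmanager), running amtool check-config alertmanager.yaml can catch many such mistakes before the config is applied, though how thoroughly it validates templates may depend on the version.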

The config-reloader container's log says it cannot connect to the local service:

function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"dial tcp 127.0.0.1:9093: connect: connection refused"
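
This second error looks like a symptom rather than a cause: the reloader's job is to POST a reload request to alertmanager on 127.0.0.1:9093, and because alertmanager keeps crashing, nothing is listening on that port. Once alertmanager is healthy, the same reload can be triggered by hand against its standard /-/reload endpoint (a sketch, assuming curl on the admin host and a port-forward):

k port-forward alertmanager-main-0 -n monitoring 9093:9093 &
curl -X POST http://127.0.0.1:9093/-/reload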

[root@k8s-master01 ~]# k describe pod alertmanager-main-0 -n monitoring
Name:         alertmanager-main-0
Namespace:    monitoring
Priority:     0
Node:         k8s-master01/192.168.26.141
Start Time:   Fri, 24 Dec 2021 20:10:39 +0800
Labels:       alertmanager=main
              app=alertmanager
              app.kubernetes.io/component=alert-router
              app.kubernetes.io/instance=main
              app.kubernetes.io/managed-by=prometheus-operator
              app.kubernetes.io/name=alertmanager
              app.kubernetes.io/part-of=kube-prometheus
              app.kubernetes.io/version=0.21.0
              controller-revision-hash=alertmanager-main-6d8894fb6b
              statefulset.kubernetes.io/pod-name=alertmanager-main-0
Annotations:
Status:       Running
IP:           172.25.244.228
IPs:
  IP:  172.25.244.228
Controlled By:  StatefulSet/alertmanager-main
Containers:
  alertmanager:
    Container ID:  docker://e1e90c8f6bb2549c199c347aed852426abb206f68e1e45f8df40b6a1f7bcc0e2
    Image:         quay.io/prometheus/alertmanager:v0.21.0
    Image ID:      docker-pullable://quay.io/prometheus/alertmanager@sha256:24a5204b418e8fa0214cfb628486749003b039c279c56b5bddb5b10cd100d926
    Ports:         9093/TCP, 9094/TCP, 9094/UDP
    Host Ports:    0/TCP, 0/TCP, 0/UDP
    Args:
      --config.file=/etc/alertmanager/config/alertmanager.yaml
      --storage.path=/alertmanager
      --data.retention=120h
      --cluster.listen-address=
      --web.listen-address=:9093
      --web.route-prefix=/
      --cluster.peer=alertmanager-main-0.alertmanager-operated:9094
      --cluster.reconnect-timeout=5m
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Message:      level=info ts=2021-12-24T12:33:15.568Z caller=main.go:216 msg="Starting Alertmanager" version="(version=0.21.0, branch=HEAD, revision=4c6c03ebfe21009c546e4d1e9b92c371d67c021d)"
level=info ts=2021-12-24T12:33:15.568Z caller=main.go:217 build_context="(go=go1.14.4, user=root@dee35927357f, date=20200617-08:54:02)"
level=info ts=2021-12-24T12:33:15.778Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2021-12-24T12:33:15.874Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config/alertmanager.yaml
ts=2021-12-24T12:33:15.876Z caller=coordinator.go:137 component=configuration msg="one or more config change subscribers failed to apply new config" file=/etc/alertmanager/config/alertmanager.yaml err="failed to parse templates: template: wechat.tmpl:7: unclosed action"
      Exit Code:    1
      Started:      Fri, 24 Dec 2021 20:33:15 +0800
      Finished:     Fri, 24 Dec 2021 20:33:15 +0800
    Ready:          False
    Restart Count:  15
    Limits:
      cpu:     100m
      memory:  100Mi
    Requests:
      cpu:     4m
      memory:  100Mi
    Liveness:   delay=0s timeout=3s period=10s #success=1 #failure=10
    Readiness:  delay=3s timeout=3s period=5s #success=1 #failure=10
    Environment:
      POD_IP:   (v1:status.podIP)
    Mounts:
      /alertmanager from alertmanager-main-db (rw)
      /etc/alertmanager/certs from tls-assets (ro)
      /etc/alertmanager/config from config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mjssv (ro)
  config-reloader:
    Container ID:  docker://9c05a0035b8c14891316d0cd9ce22a83a88b53e495bb838c0370966e1af1bdbf
    Image:         quay.io/prometheus-operator/prometheus-config-reloader:v0.47.0
    Image ID:      docker-pullable://quay.io/prometheus-operator/prometheus-config-reloader@sha256:0029252e7cf8cf38fc58795928d4e1c746b9e609432a2ee5417a9cab4633b864
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /bin/prometheus-config-reloader
    Args:
      --listen-address=:8080
      --reload-url=
      --watched-dir=/etc/alertmanager/config
    State:          Running
      Started:      Fri, 24 Dec 2021 20:30:19 +0800
    Last State:     Terminated
      Reason:       Error
      Message:      3: connect: connection refused"
level=error ts=2021-12-24T12:22:59.581955472Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"dial tcp 127.0.0.1:9093: connect: connection refused"
level=error ts=2021-12-24T12:23:04.582342559Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"dial tcp 127.0.0.1:9093: connect: connection refused"
level=error ts=2021-12-24T12:23:09.582938467Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"dial tcp 127.0.0.1:9093: connect: connection refused"
level=error ts=2021-12-24T12:23:14.582287117Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"dial tcp 127.0.0.1:9093: connect: connection refused"
level=error ts=2021-12-24T12:23:19.582926659Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"dial tcp 127.0.0.1:9093: connect: connection refused"
level=error ts=2021-12-24T12:23:24.585890346Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"dial tcp 127.0.0.1:9093: connect: connection refused"
level=error ts=2021-12-24T12:23:29.585195202Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"dial tcp 127.0.0.1:9093: connect: connection refused"
level=error ts=2021-12-24T12:23:34.582164587Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"dial tcp 127.0.0.1:9093: connect: connection refused"
      Exit Code:    255
      Started:      Fri, 24 Dec 2021 20:18:06 +0800
      Finished:     Fri, 24 Dec 2021 20:28:39 +0800
    Ready:          True
    Restart Count:  2
    Limits:
      cpu:     100m
      memory:  50Mi
    Requests:
      cpu:     100m
      memory:  50Mi
    Environment:
      POD_NAME:  alertmanager-main-0 (v1:metadata.name)
      SHARD:     -1
    Mounts:
      /etc/alertmanager/config from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mjssv (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config-volume:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  alertmanager-main-generated
    Optional:    false
  tls-assets:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  alertmanager-main-tls-assets
    Optional:    false
  alertmanager-main-db:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
  kube-api-access-mjssv:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                    From               Message
  ----     ------                  ----                   ----               -------
  Normal   Scheduled               24m                    default-scheduler  Successfully assigned monitoring/alertmanager-main-0 to k8s-master01
  Normal   Pulled                  24m                    kubelet            Container image "quay.io/prometheus-operator/prometheus-config-reloader:v0.47.0" already present on machine
  Normal   Created                 24m                    kubelet            Created container config-reloader
  Normal   Started                 24m                    kubelet            Started container config-reloader
  Normal   Pulled                  23m (x4 over 24m)      kubelet            Container image "quay.io/prometheus/alertmanager:v0.21.0" already present on machine
  Normal   Created                 23m (x4 over 24m)      kubelet            Created container alertmanager
  Normal   Started                 23m (x4 over 24m)      kubelet            Started container alertmanager
  Warning  BackOff                 23m (x10 over 24m)     kubelet            Back-off restarting failed container
  Normal   SandboxChanged          17m                    kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Created                 17m                    kubelet            Created container config-reloader
  Normal   Pulled                  17m                    kubelet            Container image "quay.io/prometheus-operator/prometheus-config-reloader:v0.47.0" already present on machine
  Normal   Started                 17m                    kubelet            Started container config-reloader
  Normal   Created                 16m (x3 over 17m)      kubelet            Created container alertmanager
  Normal   Pulled                  16m (x3 over 17m)      kubelet            Container image "quay.io/prometheus/alertmanager:v0.21.0" already present on machine
  Normal   Started                 16m (x3 over 17m)      kubelet            Started container alertmanager
  Warning  BackOff                 12m (x33 over 17m)     kubelet            Back-off restarting failed container
  Warning  FailedCreatePodSandBox  4m58s                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "69bb437cac59e4b96db82f191686521c3fc3052c99b4393660f489c81aa9ee40" network for pod "alertmanager-main-0": networkPlugin cni failed to set up pod "alertmanager-main-0_monitoring" network: Get "dial tcp 10.10.0.1:443: connect: no route to host
  Normal   SandboxChanged          4m58s (x2 over 5m7s)   kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled                  4m55s                  kubelet            Container image "quay.io/prometheus-operator/prometheus-config-reloader:v0.47.0" already present on machine
  Normal   Started                 4m55s                  kubelet            Started container config-reloader
  Normal   Created                 4m55s                  kubelet            Created container config-reloader
  Normal   Pulled                  4m14s (x3 over 4m55s)  kubelet            Container image "quay.io/prometheus/alertmanager:v0.21.0" already present on machine
  Normal   Started                 4m14s (x3 over 4m55s)  kubelet            Started container alertmanager
  Normal   Created                 4m14s (x3 over 4m55s)  kubelet            Created container alertmanager
  Warning  BackOff                 4m8s (x10 over 4m54s)  kubelet            Back-off restarting failed container

Analysis

My guess was that config-reloader was failing only because alertmanager itself was down, so I set it aside and first tackled the broken alertmanager configuration.

[root@k8s-master01 ~]# k get secret -n monitoring
NAME                            TYPE                                  DATA   AGE
additional-configs              Opaque                                1      29h
alertmanager-main               Opaque                                1      2d4h
alertmanager-main-generated     Opaque                                2      2d4h
alertmanager-main-tls-assets    Opaque                                0      2d4h
alertmanager-main-token-rzf6s   kubernetes.io/service-account-token   3      2d4h

Looking at alertmanager's secrets: besides alertmanager-main, the secret created directly from my yaml file, there is also an alertmanager-main-generated.

Checking the contents of alertmanager-main-generated, it does indeed carry a wechat.tmpl field. After base64-decoding it and comparing it against the same field in alertmanager-main, it turned out to be the configuration I had written earlier; in other words, my later changes were never propagated into alertmanager-main-generated.
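
A quick way to do that comparison (my own sketch with standard kubectl and base64; note the escaped dot in the jsonpath key, and that base64 decodes rather than decrypts):

k get secret alertmanager-main -n monitoring -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d > main.yaml
k get secret alertmanager-main-generated -n monitoring -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d > generated.yaml
diff main.yaml generated.yaml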

So the problem now becomes how to get this secret's contents corrected.

The normal method, kubectl replace -f alertmanager-secret.yaml, had already been tried and did not work.
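
That is probably because kubectl replace only rewrites alertmanager-main, the secret my yaml file actually describes; alertmanager-main-generated is a separate object generated and owned by the operator, as the ownerReferences block in the dump below shows. To check the owner directly:

k get secret alertmanager-main-generated -n monitoring -o jsonpath='{.metadata.ownerReferences[0].kind}/{.metadata.ownerReferences[0].name}{"\n"}'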

[root@k8s-master01 ~]# k get secret alertmanager-main-generated -n monitoring -o yaml
apiVersion: v1
data:
  alertmanager.yaml: XXXXXXXXXXXXXXXXXXXc4ODY2MUAxMjYuY29tIiAKICBzbXRwX3NtYXJ0aG9zdDogInNtdHAuMTI2LmNvbTo0NjUiIAogIHNtdHBfaGVsbG86ICIxMjYuY29tIgogIHNtdHBfYXV0aF91c2VybmFtZTogImJtdzg4NjYxQDEyNi5jb20iIAogIHNtdHBfYXV0aF9wYXNzd29yZDogIjFxYXozRURDIiAKICBzbXRwX3JlcXVpcmVfdGxzOiBmYWxzZQogIHdlY2hhdF9hcGlfdXJsOiAiaHR0cHM6Ly9xeWFwaS53ZWl4aW4ucXEuY29tL2NnaS1iaW4vIiAKICB3ZWNoYXRfYXBpX2NvcnBfaWQ6ICJ3dzBiODFjM2VjY2YwZDJkNjEiCiJpbmhpYml0X3J1bGVzIjoKLSAiZXF1YWwiOgogIC0gIm5hbWVzcGFjZSIKICAtICJhbGVydG5hbWUiCiAgInNvdXJjZV9tYXRjaCI6CiAgICAic2V2ZXJpdHkiOiAiY3JpdGljYWwiCiAgInRhcmdldF9tYXRjaF9yZSI6CiAgICAic2V2ZXJpdHkiOiAid2FybmluZ3xpbmZvIgotICJlcXVhbCI6CiAgLSAibmFtZXNwYWNlIgogIC0gImFsZXJ0bmFtZSIKICAic291cmNlX21hdGNoIjoKICAgICJzZXZlcml0eSI6ICJ3YXJuaW5nIgogICJ0YXJnZXRfbWF0Y2hfcmUiOgogICAgInNldmVyaXR5IjogImluZm8iCiJyZWNlaXZlcnMiOgotIG5hbWU6ICJ3ZWNoYXQtb3BzIgogIHdlY2hhdF9jb25maWdzOiAgICAgICAgCiAgLSBzZW5kX3Jlc29sdmVkOiB0cnVlCiAgICB0b19wYXJ0eTogMiAKICAgIHRvX3VzZXI6ICdAYWxsJyAKICAgIGFnZW50X2lkOiAxMDAwMDAyCiAgICBhcGlfc2VjcmV0OiAidE5aNHVTSUJmd3hQZjZoV1hrQklXQ2FYQVV5S0NzRkRuMDVrSnNPSVBmMCIKICAgIG1lc3NhZ2U6ICd7eyB0ZW1wbGF0ZSAid2VjaGF0LmRlZmF1bHQubWVzc2FnZSIgLiB9fScKLSAibmFtZSI6ICJEZWZhdWx0IgogICJlbWFpbF9jb25maWdzIjoKICAtIHRvOiAiYm13ODg2NjFAMTI2LmNvbSIKICAgIHNlbmRfcmVzb2x2ZWQ6IHRydWUKLSAibmFtZSI6ICJXYXRjaGRvZyIKLSAibmFtZSI6ICJDcml0aWNhbCIKInJvdXRlIjoKICAiZ3JvdXBfYnkiOgogIC0gIm5hbWVzcGFjZSIKICAtICJqb2IiCiAgLSAiYWxlcnRuYW1lIgogICJncm91cF9pbnRlcnZhbCI6ICI1bSIKICAiZ3JvdXBfd2FpdCI6ICIzMHMiCiAgInJlY2VpdmVyIjogIkRlZmF1bHQiCiAgInJlcGVhdF9pbnRlcnZhbCI6ICIxMmgiCiAgInJvdXRlcyI6CiAgLSAibWF0Y2giOgogICAgICAiYWxlcnRuYW1lIjogIldhdGNoZG9nIgogICAgInJlY2VpdmVyIjogIkRlZmF1bHQiCiAgICAicmVwZWF0X2ludGVydmFsIjogIjEwbSIKICMgLSAibWF0Y2giOgogIyAgICAgInNldmVyaXR5IjogImNyaXRpY2FsIgogIyAgICJyZWNlaXZlciI6ICJDcml0aWNhbCIKICAtICJtYXRjaCI6CiAgICAgICJzZXJ2ZXJfdHlwZSI6ICJ3aW5kb3dzIgogICAgInJlY2VpdmVyIjogIkRlZmF1bHQiCiAgICAicmVwZWF0X2ludGVydmFsIjogIjEwbSIKICAtICJtYXRjaCI6CiAgICAgICJ0eXBlIjogImJsYWNrYm94IgogICAgInJlY2VpdmVyIjogIndlY2hhdC1vcHMiCiAgICAicmVwZWF0X2ludGVydmFsIjogIjEwbSIKdGVtcGxhdGVzOgotICcvZXRjL2FsZXJ0bWFuYWdlci9jb25maWcvKi50bXBsJw==
  wechat.tmpl: XXXXXXXXXXXXXXXXXXXXXC5tZXNzYWdlIiB9fQp7ey0gaWYgZ3QgKGxlbiAuQWxlcnRzLkZpcmluZykgMCAtfX0Ke3stIHJhbmdlICRpbmRleCwgJGFsZXJ0IDo9IC5BbGVydHMgLX19Cnt7LSBpZiBlcSAkaW5kZXggMCB9fQo9PT09PT09PT095byC5bi45ZGK6K2mPT09PT09PT09PQrlkYrorabnsbvlnos6IHt7ICRhbGVydC5MYWJlbHMuYWxlcnRuYW1lIH19IOWRiuitpue6p+WIqzoge3sgJGFsZXJ0LkxhYmVscy5zZXZlcml0eSB9fSDlkYrorabor6bmg4U6Cnt7ICRhbGVydC5Bbm5vdGF0aW9ucy5tZXNzYWdlIH19e3sgJGFsZXJ0LkFubm90YXRpb25zLmRlc2NyaXB0aW9ufX07e3skYWxlcnQKLkFubm90YXRpb25zLnN1bW1hcnl9fQrmlYXpmpzml7bpl7Q6IHt7ICgkYWxlcnQuU3RhcnRzQXQuQWRkIDI4ODAwZTkpLkZvcm1hdCAiMjAwNi0wMS0wMiAxNTowNDowNSIgfX0Ke3stIGlmIGd0IChsZW4gJGFsZXJ0LkxhYmVscy5pbnN0YW5jZSkgMCB9fQrlrp7kvovkv6Hmga86IHt7ICRhbGVydC5MYWJlbHMuaW5zdGFuY2UgfX0Ke3stIGVuZCB9fQp7ey0gaWYgZ3QgKGxlbiAkYWxlcnQuTGFiZWxzLm5hbWVzcGFjZSkgMCB9fQrlkb3lkI3nqbrpl7Q6IHt7ICRhbGVydC5MYWJlbHMubmFtZXNwYWNlIH19Cnt7LSBlbmQgfX0Ke3stIGlmIGd0IChsZW4gJGFsZXJ0LkxhYmVscy5ub2RlKSAwIH19CuiKgueCueS/oeaBrzoge3sgJGFsZXJ0LkxhYmVscy5ub2RlIH19Cnt7LSBlbmQgfX0Ke3stIGlmIGd0IChsZW4gJGFsZXJ0LkxhYmVscy5wb2QpIDAgfX0K5a6e5L6L5ZCN56ewOiB7eyAkYWxlcnQuTGFiZWxzLnBvZCB9fQp7ey0gZW5kIH19Cj09PT09PT09PT09PUVORD09PT09PT09PT09PQp7ey0gZW5kIH19Cnt7LSBlbmQgfX0Ke3stIGVuZCB9fQp7ey0gaWYgZ3QgKGxlbiAuQWxlcnRzLlJlc29sdmVkKSAwIC19fQp7ey0gcmFuZ2UgJGluZGV4LCAkYWxlcnQgOj0gLkFsZXJ0cyAtfX0Ke3stIGlmIGVxICRpbmRleCAwIH19Cj09PT09PT09PT3lvILluLjmgaLlpI09PT09PT09PT09CuWRiuitpuexu+Weizoge3sgJGFsZXJ0LkxhYmVscy5hbGVydG5hbWUgfX0g5ZGK6K2m57qn5YirOiB7eyAkYWxlcnQuTGFiZWxzLnNldmVyaXR5IH19IOWRiuitpuivpuaDhToKe3sgJGFsZXJ0LkFubm90YXRpb25zLm1lc3NhZ2UgfX17eyAkYWxlcnQuQW5ub3RhdGlvbnMuZGVzY3JpcHRpb259fTt7eyRhbGVydAouQW5ub3RhdGlvbnMuc3VtbWFyeX19CuaVhemanOaXtumXtDoge3sgKCRhbGVydC5TdGFydHNBdC5BZGQgMjg4MDBlOSkuRm9ybWF0ICIyMDA2LTAxLTAyIDE1OjA0OjA1IiB9fQrmgaLlpI3ml7bpl7Q6IHt7ICgkYWxlcnQuRW5kc0F0LkFkZCAyODgwMGU5KS5Gb3JtYXQgIjIwMDYtMDEtMDIgMTU6MDQ6MDUiIH19Cnt7LSBpZiBndCAobGVuICRhbGVydC5MYWJlbHMuaW5zdGFuY2UpIDAgfX0K5a6e5L6L5L+h5oGvOiB7eyAkYWxlcnQuTGFiZWxzLmluc3RhbmNlIH19Cnt7LSBlbmQgfX0Ke3stIGlmIGd0IChsZW4gJGFsZXJ0LkxhYmVscy5uYW1lc3BhY2UpIDAgfX0K5ZG95ZCN56m66Ze0OiB7eyAkYWxlcnQuTGFiZWxzLm5hbWVzcGFjZSB9fQp7ey0gZW5kIH19Cnt7LSBpZiBndCAobGVuICRhbGVydC5MYWJlbHMubm9kZSkgMCB9fQroioLngrnkv6Hmga86IHt7ICRhbGVydC5MYWJlbHMubm9kZSB9fQp7ey0gZW5kIH19Cnt7LSBpZiBndCAobGVuICRhbGVydC5MYWJlbHMucG9kKSAwIH19CuWunuS+i+WQjeensDoge3sgJGFsZXJ0LkxhYmVscy5wb2QgfX0Ke3stIGVuZCB9fQo9PT09PT09PT09PT1FTkQ9PT09PT09PT09PT0Ke3stIGVuZCB9fQp7ey0gZW5kIH19Cnt7LSBlbmQgfX0Ke3stIGVuZCB9fQ==
kind: Secret
metadata:
  creationTimestamp: "2021-12-22T08:25:14Z"
  labels:
    managed-by: prometheus-operator
  name: alertmanager-main-generated
  namespace: monitoring
  ownerReferences:
  - apiVersion: monitoring.coreos.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: Alertmanager
    name: main
    uid: fe56064f-fb7b-4e22-b27e-82824881e8f1
  resourceVersion: "1348767"
  uid: 81f214e3-c70b-43e3-bece-48511d1b51ca
type: Opaque

Solution

Edit alertmanager-main-generated directly: delete the wechat.tmpl: entry from the data field, and overwrite the alertmanager.yaml value with the same field taken from the alertmanager-main secret.
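
The same fix can be scripted instead of done interactively (a sketch, assuming the corrected config has been saved as alertmanager.yaml in the current directory; GNU base64's -w0 keeps the output on one line):

NEW=$(base64 -w0 alertmanager.yaml)
k patch secret alertmanager-main-generated -n monitoring --type=json \
  -p '[{"op":"remove","path":"/data/wechat.tmpl"},{"op":"replace","path":"/data/alertmanager.yaml","value":"'"$NEW"'"}]'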

Then delete the pod; once it is automatically recreated, it comes back healthy.

[root@k8s-master01 manifests]# k edit secret alertmanager-main-generated -n monitoring
secret/alertmanager-main-generated edited
[root@k8s-master01 manifests]# k delete pod alertmanager-main-0 -n monitoring
pod "alertmanager-main-0" deleted
[root@k8s-master01 manifests]# k get pod alertmanager-main-0 -n monitoring
NAME                  READY   STATUS    RESTARTS   AGE
alertmanager-main-0   2/2     Running   6          37m
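
To watch the recovery and confirm the new config was actually accepted, the pod should settle at 2/2 Running, and the alertmanager log (the logs -c command shown earlier) should end at "Completed loading of configuration file" with no parse error:

k get pod alertmanager-main-0 -n monitoring -w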

Summary

My prometheus was installed via the operator. Most likely the operator ran into some problem and failed to update alertmanager-main-generated, which then triggered the issue above.
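
If it recurs, the operator's own log is the place to look for reconcile errors (assuming the kube-prometheus default deployment name prometheus-operator):

k logs deploy/prometheus-operator -n monitoring --tail=50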

The reason the alertmanager.yaml: field in the secret can be swapped in directly is that Secret values are base64-encoded (encoded, not encrypted), so a value prepared with the same encoding can simply be pasted in.
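
For reference, the round trip looks like this (GNU base64; -w0 disables the line wrapping that secret values must not contain; <base64-string> is a placeholder):

base64 -w0 alertmanager.yaml         # produce a value to paste into the secret
echo '<base64-string>' | base64 -d   # inspect a value copied out of the secret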
