Troubleshooting a Container Operations Platform, Part 2

Reader submission 452 2022-09-08

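Kubernetes troubleshooting and handling

Troubleshooting commands and methods

1. kubectl get pods

2. kubectl describe pods my-pod

3. kubectl logs my-pod

4. kubectl exec -it my-pod -- /bin/bash, then troubleshoot from inside the container

5. Check the log files on the host: /var/log/pods/* (containerd), /var/log/containers/* (docker)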
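
The first three commands are usually run one after another, so they can be wrapped in a small helper. This is only a sketch: the function name pod_triage and the default namespace are assumptions, and it adds nothing beyond standard kubectl flags.

```shell
# pod_triage is a hypothetical helper name; it only chains the kubectl
# commands listed above for a single pod.
pod_triage() {
  pod="$1"
  ns="${2:-default}"
  # Overall state, restart count, and the node the pod runs on.
  kubectl get pod "$pod" -n "$ns" -o wide
  # Conditions, container states, and recent events.
  kubectl describe pod "$pod" -n "$ns"
  # Logs of the current container, limited to the last 100 lines.
  kubectl logs "$pod" -n "$ns" --tail=100
  # Logs of the previous container instance, useful after a crash or restart.
  kubectl logs "$pod" -n "$ns" --previous --tail=100 || true
}
```

Usage: pod_triage my-pod my-namespace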

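Pod troubleshooting

How to check

kubectl get pods -n namespace

The STATUS column of the output shows the state of each pod.

Check the STATUS value

If anything looks abnormal, inspect the pod's details and logs

kubectl describe pod <pod-name> -n namespace

Check the State field

Check the Conditions field

True means success, False means failure

Initialized: the pod's initialization has completed

Ready: the pod can serve requests

ContainersReady: the containers in the pod can serve requests

PodScheduled: the pod has been scheduled; a suitable node was found, the pod was bound to it, and the binding was written to etcd

Unschedulable: the pod cannot be scheduled because no suitable node was found (this appears as the reason when PodScheduled is False)

If any condition is False, check the Events section

If the Reason column shows Unhealthy, read the error message that follows carefully and fix that specific cause.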
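
When the describe output is long, the conditions and events can be pulled out on their own. A minimal sketch, assuming the hypothetical helper names pod_conditions and pod_events; the flags used (jsonpath output, --field-selector, --sort-by) are standard kubectl options.

```shell
# Print each condition type and its status (True/False) on its own line.
pod_conditions() {
  pod="$1"
  ns="${2:-default}"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# List only the events that refer to this pod, sorted by timestamp.
pod_events() {
  pod="$1"
  ns="${2:-default}"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}
```

Usage: pod_conditions my-pod my-namespace, followed by pod_events my-pod my-namespace if any condition is False.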

Summary of Events error messages

1.

Failed to pull image "xxx":

Error: image xxx not found

Cause: the image pull failed because the image could not be found.

Solution:

Find a reachable image address and the correct tag, and update the manifest

If you are not logged in to the image registry, log in

If Kubernetes has no permission to pull the image, grant the permission and pull again

2.

Warning FailedSync Error syncing pod, skipping: failed to with RunContainerError: "GenerateRunContainerOptions: XXX not found"

Cause: the name XXX referenced by this pod cannot be found in the namespace.

Solution:

Recreate the pod: kubectl replace --force -f pod.yaml

3.

Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "XXX" with RunContainerError: "GenerateRunContainerOptions: configmaps \"XXX\" not found"

Cause: no ConfigMap named XXX exists in the namespace.

Solution:

Create the ConfigMap again

kubectl create -f configmap.yaml

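4.

Warning FailedMount MountVolume.SetUp failed for volume "kubernetes.io/secret/ " (spec.Name: "XXXsecret") pod with: secrets "XXXsecret" not found

Cause: the Secret is missing.

Solution:

Create the Secret

kubectl create secret docker-registry <secret-name> --docker-server=<registry-url> --docker-username=xxx --docker-password=xxx -n namespace

5.

Normal Killing Killing container with docker id XXX: pod "XXX" container "XXX" is unhealthy, it will be killed and re-created.

The container's liveness probe failed, and Kubernetes is killing the faulty container.

Cause: the probe is misconfigured, the health-check URL is wrong, or the application is not responding.

Solution:

Correct the health-check settings in the YAML file and increase values such as periodSeconds (and, where relevant, initialDelaySeconds, timeoutSeconds, or failureThreshold)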
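
For the liveness-probe case in item 5, the probe fields might look like the following. This is a sketch only: the pod name, image, path /healthz, port 8080, and every timing value are illustrative assumptions, not values from the original article.

```shell
# Write an example Pod manifest with a more tolerant liveness probe.
# All names and numbers here are placeholders to adapt to the real workload.
cat > liveness-example.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: liveness-example
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 30   # wait for the application to start
      periodSeconds: 20         # interval between probes
      timeoutSeconds: 5         # time allowed for each probe
      failureThreshold: 3       # consecutive failures before a restart
EOF

# Validate locally without creating anything (requires kubectl):
# kubectl apply --dry-run=client -f liveness-example.yaml
```

If the application is simply slow to start, raising initialDelaySeconds is usually a better first step than stretching periodSeconds, because a long period also delays detection of real failures.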

6.

Warning FailedCreate Error creating: pods "XXXX" is forbidden: [maximum memory usage per Pod is XXX, but request is XXX, maximum memory usage per Container is XXX, but request is XXX.]

Cause: the memory limit configured in Kubernetes (for example, a LimitRange on the namespace) is smaller than the memory the pod asks for.

Solution:

Raise the Kubernetes memory limit, or reduce the pod's memory size

7.

pod (XXX) failed to fit in any node

fit failure on node (XXX): Insufficient cpu

Cause: no node has enough CPU available.

Solution:

Reduce the amount of CPU the pod requests by editing the YAML file

8.

FailedMount Unable to mount volumes for pod "XXX": timeout expired waiting for volumes to attach/mount for pod "XXX"/"fail". list of unattached/unmounted volumes=XXX

FailedSync Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "XXX"/"fail". list of unattached/unmounted volumes=XXX

Cause: pod XXX failed to mount a volume.

Solution:

Check whether the volume has been created and whether the volume's mountPath is correct

Create the volume and mount it using a YAML file

9.

FailedMount Failed to attach volume "XXX" on node "XXX" with: GCE persistent disk not found: diskName="XXX disk" zone=""

Solution:

Check whether the persistent disk was created correctly

A persistent volume can be created from a YAML file.
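
A PersistentVolume backed by a GCE disk could be declared roughly as follows. This is a hedged sketch: the volume name, size, and disk name example-disk are assumptions, and the in-tree gcePersistentDisk volume type shown here matches the era of the error message above, whereas newer clusters typically use the GCE PD CSI driver instead.

```shell
# Write an example PersistentVolume manifest for an existing GCE disk.
# example-pv, 10Gi, and example-disk are placeholders.
cat > pv-example.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  gcePersistentDisk:
    pdName: example-disk   # must name a disk that already exists
    fsType: ext4
EOF

# Create it with: kubectl apply -f pv-example.yaml
```

The "persistent disk not found" error above usually means pdName does not match an existing disk, or the disk is in a different zone from the node.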

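10.

error: error validating "XXX.yaml": error validating data: found invalid field resources for PodSpec; if you choose to ignore these errors, turn validation off with --validate=false

Cause: the YAML file is malformed, usually because of extra or missing spaces that put a field at the wrong indentation level.

Solution:

Validate the YAML file

The kubeval tool can be used to validate it

11.

The container image is not updated

Solution:

Set the pull policy explicitly on the containers in the Deployment: imagePullPolicy: Always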
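
On a running Deployment, the policy can be set with a patch and the pods then recreated. A minimal sketch, assuming the hypothetical helper name force_repull; kubectl patch and kubectl rollout restart are standard commands (rollout restart needs kubectl 1.15 or later).

```shell
# force_repull is a hypothetical helper name.
force_repull() {
  deploy="$1"
  container="$2"
  ns="${3:-default}"
  # Strategic merge patch: containers are matched by name, so only
  # imagePullPolicy changes on the named container.
  kubectl patch deployment "$deploy" -n "$ns" --type strategic -p \
    "{\"spec\":{\"template\":{\"spec\":{\"containers\":[{\"name\":\"$container\",\"imagePullPolicy\":\"Always\"}]}}}}"
  # Recreate the pods so the image is pulled again even if the patch
  # changed nothing.
  kubectl rollout restart deployment "$deploy" -n "$ns"
  kubectl rollout status deployment "$deploy" -n "$ns"
}
```

Usage: force_repull web app my-namespace. Note that reusing a mutable tag such as latest is what makes this necessary in the first place; a unique tag per build avoids the problem.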

12.

(combined from similar events): Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with: Number of node(s) with BGP peering established = 0

Cause: the Calico network on the affected node is not connected.

Solution:

Check whether the Calico images were pulled successfully and whether the calico-node container started normally. If both are fine, reset Kubernetes on that node and join it to the cluster again

kubeadm reset

kubeadm join ip:6443 --token XXXXX.XXXXXXXXX --discovery-token-ca-cert-hash sha256:XXXXXXXXXXXXXXXXXXX

13.

RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed pulling image "gcr.io/google_containers/pause-amd64:": Get dial tcp :443: i/o timeout

Cause: gcr.io is blocked by the Great Firewall and cannot be reached.

Solution:

Find a reachable copy of the image elsewhere, for example on Aliyun or a googlecontainer mirror

Use docker tag to retag it as gcr.io/google_containers/pause-amd64

14.

Warning FailedCreatePodSandBox 3m (x13 over 3m) kubelet, Failed create pod sandbox

Run journalctl -xe | grep cni

The output shows: failed to find plugin "loopback" in path [/opt/loopback/bin /usr/local/bin]

Solution:

Copy the loopback plugin into /usr/local/bin

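Node troubleshooting and handling

kubectl get nodes

Check the node status: a STATUS of Ready is normal, NotReady is abnormal

Note that the versions must be consistent

If a node is NotReady, restart the kubelet on that node, or restart docker

If that does not help, reset the node and join it to the cluster again

To view node details, run kubectl describe node <node-name>

If "node ip" not found appears, check whether the node IP responds to ping; the cause is the node IP or the VIP being down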

1.

The connection to the server localhost:8080 was refused - did you specify the right host or port?

This error appears when running kubectl get XXX, for example:

kubectl get nodes

Cause: the node has no admin.conf.

Solution:

Copy admin.conf from the master to the node

On the node, run echo "export KUBECONFIG=/etc/kubernetes/admin.conf" >> ~/.bash_profile

2.

A Kubernetes NodePort is not reachable

Cause: usually iptables or SELinux.

Solution:

Disable SELinux enforcement and flush the iptables rules

setenforce 0

iptables --flush

iptables -t nat --flush

service docker restart

iptables -P FORWARD ACCEPT

Restart docker

3.

Failed to start inotify_add_watch /sys/fs/cgroup/blkio: no space left on device, or Failed to start inotify_add_watch /sys/fs/cgroup/cpu,cpuacct: no space left on device

Cause: disk space, or a kernel parameter (the inotify watch limit).

Solution:

Check whether any disk is 100% full

Run cat /proc/sys/fs/inotify/max_user_watches to see the current value, then increase it

sysctl fs.inotify.max_user_watches=1048576

4.

Failed to start reboot.target: Connection timed out

Cause unknown: the reboot times out.

Solution:

Run systemctl --force --force reboot

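5.

System OOM encountered

Cause: when memory use exceeds the limit, containers may be OOMKilled by Kubernetes.

Solution:

Adjust the memory settings and allocate memory sensibly

6.

Unable to register node "" with API server: Post dial tcp 127.0.0.1:6443: getsockopt: connection refused

Cause: the node cannot connect to the master, or the connection is refused.

Solution:

Restart the kubelet on the node. If that does not help, check CPU, memory, disk, and other resources on the node server

7.

A pod stays in the Terminating state

ContainerGCFailed rpc error: code = DeadlineExceeded desc = context deadline exceeded

Cause: possibly a bug in dockerd version 17.

Solution:

systemctl daemon-reexec

systemctl restart docker

If that does not restore service

Upgrade docker to version 18

8.

Container runtime is down,PLEG is not healthy: pleg was last seen active 10m ago; threshold is 3m0s

Cause: the Pod Lifecycle Event Generator (PLEG) timed out. Either the container runtime responded too slowly to RPC calls, or the node runs too many pods, so the relist could not finish within 3 minutes.

Solution:

systemctl daemon-reload

systemctl daemon-reexec

systemctl restart docker

Reboot the node server

If none of the above helps

Upgrade docker to the latest version

If that still does not help

Upgrade Kubernetes to version 1.16 or later

9.

No valid private key and/or certificate found, reusing existing private key or creating a new one

Cause: after the kubelet on a node starts, it submits a certificate signing request (CSR) to the master; the certificate cannot be found because the request has not been approved yet.

Solution:

Approve the certificate request on the master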
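
Approval on the master is done with kubectl get csr and kubectl certificate approve. The loop below is a sketch under assumptions: the helper name approve_pending_csrs is hypothetical, and it relies on the CONDITION value being the last column of the kubectl get csr output. Approving every pending request blindly is only reasonable on a cluster where all pending requests are known to come from your own nodes.

```shell
# approve_pending_csrs is a hypothetical helper name.
approve_pending_csrs() {
  # List CSRs without the header row and keep those whose last column
  # (CONDITION) is exactly "Pending".
  kubectl get csr --no-headers | awk '$NF == "Pending" {print $1}' |
  while read -r name; do
    kubectl certificate approve "$name"
  done
}
```

To approve a single request after reviewing it, run kubectl get csr and then kubectl certificate approve with the request's name.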

10.

failed to run Kubelet: Running with swap on is not supported, please disable swap! or set --fail-swap-on flag to false. /proc/swaps containe

Cause: swap is enabled.

Solution:

Turn off the swap partition, then restart the kubelet: systemctl restart kubelet
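
Turning swap off usually has two parts: disabling it now and keeping it off after a reboot. A minimal sketch, assuming swap is configured through /etc/fstab; the helper name disable_swap is hypothetical, and it must run as root.

```shell
# disable_swap is a hypothetical helper name; the fstab path can be
# overridden for testing.
disable_swap() {
  fstab="${1:-/etc/fstab}"
  # Turn off all active swap devices immediately.
  swapoff -a
  # Comment out uncommented swap entries so swap stays off after a reboot;
  # sed keeps a backup copy next to the file with a .bak suffix.
  sed -i.bak '/^[^#].*[[:space:]]swap[[:space:]]/ s/^/#/' "$fstab"
  systemctl restart kubelet
}
```

Afterwards, cat /proc/swaps should list no devices.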

11.

The node was low on resource: [DiskPressure]

Log in to the node and check its disk usage

Solution:

Edit /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf

Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"

Add the parameter --eviction-hard=nodefs.available<5%, then clean up the disk

Restart the kubelet

12.

The node status is Unknown

Listing processes fails with -bash: fork: Cannot allocate memory

Check whether any memory is still free

Check whether /proc/sys/kernel/pid_max is too small

Solution:

Add memory, or increase /proc/sys/kernel/pid_max

13.

provided port is not in the valid range. The range of valid ports is 30000-32767

Cause: the port is outside the NodePort range; by default a NodePort must fall within 30000-32767.

Solution:

Edit /etc/kubernetes/manifests/kube-apiserver.yaml

Set --service-node-port-range= to the required range

Restart the apiserver

14.

1 node(s) had taints that the pod didn't tolerate

Cause: the node is not schedulable; by default the master is not schedulable.

Solution:

kubectl describe nodes

Check the node's status and taints

kubectl taint nodes <node-name> <key>:NoSchedule- removes the NoSchedule taint from the node
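
The taint key has to match what is actually set on the node, so it helps to print the taints first. A sketch with hypothetical helper names; the default key below is the one kubeadm sets on masters in older releases, and newer releases use node-role.kubernetes.io/control-plane instead.

```shell
# Show the taints currently set on a node.
show_taints() {
  kubectl get node "$1" -o jsonpath='{.spec.taints}'
}

# Remove a NoSchedule taint; the trailing "-" means "delete this taint".
remove_noschedule_taint() {
  node="$1"
  key="${2:-node-role.kubernetes.io/master}"
  kubectl taint nodes "$node" "$key:NoSchedule-"
}
```

Usage: show_taints node1, then remove_noschedule_taint node1 with the key that was printed. Removing the taint from a master lets ordinary workloads run there, which is usually only wanted on small or test clusters.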

Master troubleshooting and handling

1.

unable to fetch the kubeadm-config ConfigMap: failed to get configmap: Unauthorized

Cause: the token has expired; a token is valid for 24 hours by default.

Solution:

Generate a new token on the master and join the node again

kubeadm token create

openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
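
kubeadm can also produce the token and the CA hash in one step, which avoids assembling the join line by hand. A small sketch; the wrapper name print_join_command is hypothetical, while both kubeadm subcommands are standard.

```shell
# print_join_command is a hypothetical wrapper name; run it on the master.
print_join_command() {
  # List existing bootstrap tokens with their expiry times.
  kubeadm token list
  # Create a new token and print the complete "kubeadm join ..." line,
  # including the --discovery-token-ca-cert-hash value.
  kubeadm token create --print-join-command
}
```

The printed kubeadm join line can then be run as-is on the node that needs to rejoin.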

2.

Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")

Cause: certificate authentication failed; follow the hint that kubeadm printed.

Solution:

Follow the console output

mkdir -p $HOME/.kube

sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config

sudo chown $(id -u):$(id -g) $HOME/.kube/config

3.

Unable to update cni config: No networks found in /etc/cni/net

Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message

Cause: no CNI network plugin can be found.

Solution:

sysctl net.bridge.bridge-nf-call-iptables=1

Install the flannel or Calico network plugin

4.

coredns stays in the Pending or ContainerCreating state

Cause: a network problem.

Solution:

Install the flannel or Calico network plugin

If the error is plugin flannel does not support config version:

Edit /etc/cni/net.d/10-flannel.conflist

Check whether the cniVersion values are consistent. If they are not, make them consistent, or set them to a version the current Kubernetes release supports

5.

WARNING IsDockerSystemdCheck

[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at systemd

Solution:

Edit or create /etc/docker/daemon.json and add:

"exec-opts": ["native.cgroupdriver=systemd"]

Restart docker
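
The exec-opts entry has to sit inside a JSON object, so a freshly created daemon.json looks like the file written below. A sketch only: the helper name write_docker_daemon_json is hypothetical, and it deliberately refuses to overwrite an existing file, because other settings in it would be lost.

```shell
# write_docker_daemon_json is a hypothetical helper name; the target path
# can be overridden for testing.
write_docker_daemon_json() {
  target="${1:-/etc/docker/daemon.json}"
  # Refuse to overwrite an existing file; merge the setting by hand instead.
  if [ -e "$target" ]; then
    echo "$target already exists; add the exec-opts entry manually" >&2
    return 1
  fi
  cat > "$target" <<'EOF'
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
}
```

After systemctl restart docker, the output of docker info should report systemd as the cgroup driver. The kubelet must be configured with the same cgroup driver.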

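6.

WARNING FileExisting-socat

[WARNING FileExisting-socat]: socat not found in system path

Cause: socat is not installed.

Solution:

yum install socat

7.

Permission denied

cannot create /var/log/fluentd.log: Permission denied

Cause: permission denied.

Solution:

This is caused by SELinux; disable it.

In /etc/selinux/config, change SELINUX=enforcing to disabled

If that does not help, grant write permission on the directory

8.

The apiserver fails to start and reports an error on every start

Solution:

Configure a ServiceAccount

Create it from a YAML file

9.

repository does not exist or may require 'docker login': denied: requested access to the resource is denied

Cause: the node has no permission to pull the image from Harbor.

Solution:

Grant access on the master with kubectl create secret

10.

etcd: raft save state and entries error: open /var/lib/etcd/default.etcd/member/wal/xxx.tmp: is a directory

Cause: a file error in the etcd member directory.

Solution:

Delete the related tmp files and directories, then restart the etcd service

11.

etcd node failure

Running etcdctl cluster-health shows an unhealthy member

Cause: etcd on that node has failed.

Solution:

Log in to the affected node

systemctl stop etcd

systemctl restart etcd

If it is still unhealthy

Delete the data

rm -rf /var/lib/etcd/default.etcd/member/* (remember to back it up first)

Then restart etcd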
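
The backup mentioned before the rm -rf step can be a plain archive of the member directory, taken while etcd is stopped. A sketch under assumptions: the helper name backup_etcd_member and the default backup directory /var/backups/etcd are hypothetical; the data directory default matches the path used above.

```shell
# backup_etcd_member is a hypothetical helper name. Run it after
# "systemctl stop etcd" and before deleting anything.
backup_etcd_member() {
  data_dir="${1:-/var/lib/etcd/default.etcd}"
  backup_dir="${2:-/var/backups/etcd}"
  stamp=$(date +%Y%m%d-%H%M%S)
  mkdir -p "$backup_dir"
  # Archive the whole member directory relative to the data directory.
  tar -czf "$backup_dir/member-$stamp.tar.gz" -C "$data_dir" member
  # Print the archive path so it can be recorded or verified.
  echo "$backup_dir/member-$stamp.tar.gz"
}
```

Usage: archive=$(backup_etcd_member), then tar -tzf "$archive" to confirm the contents before removing the member data.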

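Kubernetes usage guidelines

1. Kubernetes worker nodes are highly available by design, so users only need to plan high availability for the master. Enterprises are advised to use a dual-master or multi-master architecture to avoid a single point of failure on the master

2. NTP time must be synchronized on all nodes of the cluster

3. OVS or Calico networking is recommended; flannel is not recommended

4. Use a recent stable release, which has fewer bugs: at least 1.12, which offers the IPVS proxy mode in addition to iptables. The choice is driven by performance

5. Follow naming conventions. Namespaces, masters, nodes, pods, services, and ingresses should all be named consistently to avoid confusion

6. Prefer Deployments over ReplicationControllers (RC). Deployments support version rollback and similar features. Run pods with multiple replicas (set replicas to more than one) and release with rolling updates

7. Manage Kubernetes through YAML files or the dashboard wherever possible. Do not rely on ad hoc commands in the long run

8. Use YAML files to limit pod CPU, memory, storage, and other resources

9. Avoid exposing pod ports directly on the node; access them through a Service

10. In the cloud, use a load balancer for Service load balancing; self-hosted Kubernetes can introduce Ingress

11. Kubernetes containers must be monitored; kube-prometheus is recommended

12. Deploy an agent-based logging service in which a node agent collects logs centrally, rather than relying on native Kubernetes logs. Ideally, use a sidecar per microservice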
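
Guidelines 6 and 8 can be combined in one manifest: a Deployment with several replicas, a rolling update strategy, and resource requests and limits. The sketch below is illustrative only; the names web and demo, the image, and all numbers are assumptions.

```shell
# Write an example Deployment manifest; every name and value here is a
# placeholder to adapt to the real workload.
cat > deployment-example.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: demo
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: registry.example.com/web:1.0
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 256Mi
EOF

# Apply with:     kubectl apply -f deployment-example.yaml
# Roll back with: kubectl rollout undo deployment/web -n demo
```

Keeping this file in version control also serves guideline 7, since every change to the workload goes through the YAML rather than through one-off commands.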
