Kubernetes Quick Start 14: The Scheduler, Predicate Policies, and Priority Functions

Predicates and Priorities

The stages k8s goes through when scheduling a pod onto a node:

1. Predicate: nodes that cannot possibly satisfy the pod's requirements are filtered out.
2. Priority: the priority functions score the remaining nodes and rank them.
3. Select: the node with the highest score is chosen; if several nodes share the highest score, one of them is picked at random.

Predicate policies:

1. CheckNodeCondition: checks whether the node itself is healthy.
2. GeneralPredicates: a set of general-purpose predicates (a sketch of the pod fields they inspect follows this list):
   - HostName: checks whether the Pod object defines pod.spec.hostname.
   - PodFitsHostPorts: checks whether the Pod object defines pod.spec.containers.ports.hostPort.
   - MatchNodeSelector: checks whether the Pod object defines pod.spec.nodeSelector.
   - PodFitsResources: checks whether the node has enough resources to run the pod.
3. NoDiskConflict: checks that there is no disk conflict, i.e. the node can satisfy the storage volumes the pod depends on.
4. PodToleratesNodeTaints: checks whether the tolerations in the pod's spec.tolerations cover all of the node's taints.
5. PodToleratesNodeNoExecuteTaints: checks whether the pod tolerates the node's NoExecute taints.
6. CheckNodeLabelPresence: checks for the presence of certain node labels.
7. CheckServiceAffinity: schedules the pod onto nodes that already run other pods belonging to the same Service.
8. CheckNodeMemoryPressure: checks whether the node is under memory pressure.
9. CheckNodePIDPressure: checks whether the node is under PID pressure.
10. CheckNodeDiskPressure: checks whether the node is under disk pressure.
......

Predicates act as a veto: if any enabled predicate is not satisfied, the node is eliminated.
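As a minimal sketch (not from the original; the pod name, the disktype=ssd label, the port numbers and the resource figures are made up for illustration), these are the pod fields the GeneralPredicates evaluate:

# Hypothetical pod touching the fields the GeneralPredicates inspect
apiVersion: v1
kind: Pod
metadata:
  name: predicate-fields-demo     # hypothetical name
spec:
  nodeSelector:                   # evaluated by MatchNodeSelector
    disktype: ssd                 # assumed node label
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
    ports:
    - containerPort: 80
      hostPort: 8080              # evaluated by PodFitsHostPorts: port 8080 must be free on the node
    resources:
      requests:                   # evaluated by PodFitsResources against the node's allocatable resources
        cpu: 500m
        memory: 256Mi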

Priority functions:

1. LeastRequested: score = (((CPU capacity - CPU requested by all pods) * 10 / CPU capacity) + ((memory capacity - memory requested by all pods) * 10 / memory capacity)) / 2; the less of a node's resources are already requested, the higher its score (a worked example follows this list).
2. BalancedResourceAllocation: nodes whose CPU and memory utilization ratios are closest to each other win.
3. NodePreferAvoidPods: scores nodes according to the annotation "scheduler.alpha.kubernetes.io/preferAvoidPods".
4. TaintToleration: checks the pod's spec.tolerations against the node's taints list; the more of the node's taints the pod does not tolerate, the lower the score.
5. SelectorSpreading: pods selected by the same label selector should be spread across different nodes; the more such pods a node already runs, the lower its score.
6. InterPodAffinity: based on pod affinity; the more affinity terms matched, the higher the score.
7. NodeAffinity: node-affinity check based on pod.spec.nodeSelector; the more affinity terms matched, the higher the score.
8. MostRequested: the opposite of LeastRequested.
9. NodeLabel: scores nodes according to whether they carry certain labels.
10. ImageLocality: scores nodes by the total size of the images the pod needs that are already present on the node.

All enabled priority functions are run against every node, and a node's final score is the sum of the individual scores.
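A quick worked example of the LeastRequested formula (the capacities and requests are made-up numbers, not from the original): take a node with 4000m of CPU and 8Gi of memory whose pods together request 1000m of CPU and 2Gi of memory:

score = (((4000 - 1000) * 10 / 4000) + ((8 - 2) * 10 / 8)) / 2
      = (7.5 + 7.5) / 2
      = 7.5

A busier node with 3000m of CPU and 6Gi of memory already requested would score (2.5 + 2.5) / 2 = 2.5, so the less loaded node is preferred.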

Advanced scheduling in k8s

Node selectors: nodeSelector, nodeName
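A minimal sketch of the nodeSelector form (the pod name and the disktype=ssd node label are assumptions for illustration); nodeName, by contrast, simply names the target node directly, e.g. nodeName: node02, and bypasses normal scheduling:

apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeselector-demo   # hypothetical name
spec:
  nodeSelector:                 # the pod only runs on nodes carrying this label
    disktype: ssd               # assumed label
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1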

Node affinity scheduling: nodeAffinity

Help output (kubectl explain) for node affinity and pod affinity:

KIND:     Pod
VERSION:  v1
FIELDS:
  spec
    affinity
      nodeAffinity    node affinity
        preferredDuringSchedulingIgnoredDuringExecution <[]Object>
                      soft affinity: satisfied if possible; if not, the pod runs on some other node
          preference -required-      the preferred node selector term
            matchExpressions <[]Object>
              key -required-         the label key
              operator -required-    valid operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt
              values <[]string>      the values; empty if operator is Exists or DoesNotExist
            matchFields <[]Object>
              key -required-         the field key
              operator -required-    valid operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt
              values <[]string>      the values; empty if operator is Exists or DoesNotExist
          weight -required-          weight of this preference
        requiredDuringSchedulingIgnoredDuringExecution
                      hard affinity: must be satisfied; if it cannot be, the pod stays pending
          nodeSelectorTerms <[]Object> -required-
            matchExpressions <[]Object>    match labels by expression
              key -required-         the label key
              operator -required-    valid operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt
              values <[]string>      the values; empty if operator is Exists or DoesNotExist
            matchFields <[]Object>   match node fields
              key -required-         the field key
              operator -required-    valid operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt
              values <[]string>      the values; empty if operator is Exists or DoesNotExist
      podAffinity     pod affinity: pods tend to be placed on nearby nodes, mainly to keep inter-pod
                      communication efficient. For example, in an NMT (Nginx/MySQL/Tomcat) stack, if the
                      nginx pod runs on node02, where should the MySQL and Tomcat pods go? They can be
                      placed on nodes in the same rack as node02, in the same row of racks, or in the same
                      server room; this requires labeling the nodes with their physical topology.
        preferredDuringSchedulingIgnoredDuringExecution <[]Object>
        requiredDuringSchedulingIgnoredDuringExecution <[]Object>
          topologyKey -required-     topology key: decides which nodes count as the same location as the node a pod runs on
          labelSelector              selects the already-running pods this pod should be co-located with
            matchExpressions <[]Object>
            matchLabels
          namespaces <[]string>      namespaces of the pods to be affine with
      podAntiAffinity    pod anti-affinity: pods tend to be placed on nodes in different locations

nodeAffinity example

# Manifest file
k8s@node01:~/my_manifests/scheduler$ cat pod-nodeaffinity-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-demo
  namespace: default
  labels:
    app: myapp
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:  # hard affinity: the pod stays Pending if no node qualifies
        nodeSelectorTerms:
        - matchExpressions:
          - key: rack
            operator: In
            values:
            - rack1

# Apply the manifest. No node carries the "rack: rack1" label and hard affinity is used,
# so the scheduler finds no node that passes the predicates and the pod is left pending.
k8s@node01:~/my_manifests/scheduler$ kubectl apply -f pod-nodeaffinity-demo.yaml
k8s@node01:~/my_manifests/scheduler$ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
pod-nodeaffinity-demo   0/1     Pending   0          9s

# After labeling node03 with "rack: rack1", the pod is scheduled onto node03
k8s@node01:~/my_manifests/scheduler$ kubectl label nodes/node03 rack=rack1
node/node03 labeled
k8s@node01:~/my_manifests/scheduler$ kubectl get pods -o wide
NAME                    READY   STATUS    RESTARTS   AGE     IP           NODE    NOMINATED NODE   READINESS GATES
pod-nodeaffinity-demo   1/1     Running   0          6m21s   10.244.2.9   node03

If preferredDuringSchedulingIgnoredDuringExecution is used instead, a node is still chosen to run the pod even when no node satisfies the preference.
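A sketch of that soft-affinity variant of the manifest above (only the affinity stanza under spec is shown; the weight of 60 is an assumed value):

  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:   # soft affinity: a best-effort preference
      - weight: 60                                       # assumed weight, range 1-100
        preference:
          matchExpressions:
          - key: rack
            operator: In
            values:
            - rack1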

podAffinity example

Pod affinity means that pods which need to communicate with each other intensively should run, as far as possible, on the same node or on nodes that share the same characteristics. To demonstrate this better, a fourth node, node04, was added as a worker node.

# node03 and node04 are both labeled "rack=rack1" to simulate two nodes sitting in the same rack of the server room
k8s@node01:~$ kubectl get nodes --show-labels
NAME     STATUS   ROLES    AGE   VERSION   LABELS
node01   Ready    master   11d   v1.18.6   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node01,kubernetes.io/os=linux,node-role.kubernetes.io/master=
node02   Ready    <none>   11d   v1.18.6   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node02,kubernetes.io/os=linux
node03   Ready    <none>   11d   v1.18.6   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node03,kubernetes.io/os=linux,rack=rack1
node04   Ready    <none>   20m   v1.18.6   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node04,kubernetes.io/os=linux,rack=rack1

Manifest

# Write the test manifest
k8s@node01:~/my_manifests/scheduler$ cat pod-podaffinity-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod1
  namespace: default
  labels:
    app: myapp
    tier: fronted
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
    imagePullPolicy: IfNotPresent
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - {key: rack, operator: In, values: ["rack1"]}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deploy-bbox
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      name: pod-bbox
      labels:
        app: db
    spec:
      containers:
      - name: busybox
        image: busybox:latest
        command: ["sh","-c","sleep 3600"]
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:   # selects the pod to be affine with
              - {key: app, operator: In, values: ["myapp"]}
            topologyKey: rack     # a label key on the worker nodes

k8s@node01:~/my_manifests/scheduler$ kubectl apply -f pod-podaffinity-demo.yaml
pod/pod1 created
deployment.apps/deploy-bbox created
k8s@node01:~/my_manifests/scheduler$ kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE    IP            NODE     NOMINATED NODE   READINESS GATES
deploy-bbox-7f85dd55f-h7hgf   1/1     Running   0          116s   10.244.2.21   node03
deploy-bbox-7f85dd55f-mrnv8   1/1     Running   0          116s   10.244.3.11   node04
pod1                          1/1     Running   0          116s   10.244.3.12   node04

The first resource in the manifest is a standalone (unmanaged) pod that uses nodeAffinity to run on a node carrying the label "rack: rack1"; node03 and node04 both qualify, so pod1 lands on one of them. The two pods managed by the Deployment then follow pod1: they are placed on nodes whose value for the label key "rack" matches that of the node pod1 runs on, which again means node03 or node04.

podAntiAffinity example

If the podAffinity in the example above is changed to podAntiAffinity, the two pods managed by the Deployment can only run on the worker node node02.
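A minimal sketch of that change (only the affinity stanza of the Deployment's pod template is shown; everything else stays as in the manifest above):

      affinity:
        podAntiAffinity:    # stay away from nodes that share the "rack" topology with pods labeled app=myapp
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - {key: app, operator: In, values: ["myapp"]}
            topologyKey: rack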

Node Taints and Pod Tolerations

A Taint applies to a node (Node is a standard k8s resource) and describes what taints the node carries, while a Toleration applies to a pod and describes how tolerant the pod is of a node's taints.

KIND:     Node
VERSION:  v1
FIELDS:
  spec
    taints <[]Object>    the node's taints
      effect -required-  the repelling effect on pods; valid effects are NoSchedule, PreferNoSchedule and NoExecute
                         NoSchedule: only affects scheduling; pods already running on the node are untouched
                         NoExecute: affects both scheduling and the pods already on the node; pods that do not tolerate the taint are evicted
                         PreferNoSchedule: a soft version of NoSchedule; if no other node has resources available, a pod that does not tolerate the taint may still run here
      key -required-
      timeAdded
      value

KIND:     Pod
VERSION:  v1
FIELDS:
  spec
    tolerations <[]Object>
      effect              allowed values are NoSchedule, PreferNoSchedule and NoExecute; may also be "" (empty), which tolerates every effect
      key                 the key of the taint on the node
      value               the value of that taint key on the node
      operator            Exists or Equal, defaults to Equal. Equal means the toleration's key/value must exactly match the taint's key/value defined on the node; Exists only checks that the key exists, and value may be omitted or set to ""
      tolerationSeconds   only set when effect is NoExecute: how long a pod that does not tolerate the taint stays on the node before being evicted. If unset, pods already running on the node tolerate the taint forever; 0 or a negative value evicts non-tolerating pods immediately

Taints are usually applied to a node with a command of the following form:

kubectl taint nodes NAME KEY_1=VAL_1:TAINT_EFFECT_1 ... KEY_N=VAL_N:TAINT_EFFECT_N [options]

For example, give node02 and node03 each a taint:

k8s@node01:~$ kubectl taint nodes node02 node-type=production:NoSchedule
k8s@node01:~$ kubectl taint nodes node03 node-type=dev:NoExecute

# Inspect the node details
k8s@node01:~$ kubectl describe nodes/node02
...
Taints:             node-type=production:NoSchedule    # the taint has been applied
...

Now run a Deployment with 3 replicas and watch what happens:

k8s@node01:~/my_manifests/scheduler$ cat deployment-demo.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-deploy
  namespace: default
spec:
  replicas: 3
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: myapp
      release: canary
  template:
    metadata:
      labels:
        app: myapp
        release: canary
    spec:
      containers:
      - name: myapp
        image: ikubernetes/myapp:v1
        ports:
        - name: http
          containerPort: 80
        readinessProbe:
          httpGet:
            port: http
            path: /
            scheme: HTTP

k8s@node01:~/my_manifests/scheduler$ kubectl apply -f deployment-demo.yaml
deployment.apps/myapp-deploy created
k8s@node01:~/my_manifests/scheduler$ kubectl get pods
NAME                           READY   STATUS    RESTARTS   AGE
myapp-deploy-679fcdf84-hw6q8   0/1     Pending   0          4s
myapp-deploy-679fcdf84-z9wxr   0/1     Pending   0          4s
myapp-deploy-679fcdf84-znx8v   0/1     Pending   0          4s
# The pods define no toleration, so neither node02 nor node03 passes the predicates and the pods stay Pending.

# Edit the manifest so the pods tolerate the taint on node02
k8s@node01:~/my_manifests/scheduler$ cat deployment-demo.yaml
...
        readinessProbe:
          httpGet:
            port: http
            path: /
            scheme: HTTP
      tolerations:           # add this field
      - key: node-type
        value: production
        operator: Equal
        effect: NoSchedule

k8s@node01:~/my_manifests/scheduler$ kubectl apply -f deployment-demo.yaml
k8s@node01:~/my_manifests/scheduler$ kubectl get pods -o wide
NAME                           READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
myapp-deploy-777d554c7-chdlr   1/1     Running   0          12s   10.244.1.17   node02
myapp-deploy-777d554c7-thx2w   1/1     Running   0          30s   10.244.1.15   node02
myapp-deploy-777d554c7-vnzt6   1/1     Running   0          20s   10.244.1.16   node02
# All pods now run on node02.

# To tolerate any taint whose key is node-type regardless of its value, define the toleration like this:
k8s@node01:~/my_manifests/scheduler$ cat deployment-demo.yaml
...
        readinessProbe:
          httpGet:
            port: http
            path: /
            scheme: HTTP
      tolerations:
      - key: node-type
        value: ""
        operator: Exists
        effect: NoSchedule
# Because the effect is NoSchedule, existing pods are not affected and the 3 pods keep running on node02.
# Setting the effect to NoExecute instead triggers rescheduling; with only two worker nodes, the pods land on node03:
k8s@node01:~/my_manifests/scheduler$ kubectl get pods -o wide
NAME                            READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
myapp-deploy-6cfbc5c476-74frk   1/1     Running   0          10s   10.244.2.27   node03
myapp-deploy-6cfbc5c476-qm9gj   1/1     Running   0          10s   10.244.2.28   node03
myapp-deploy-6cfbc5c476-vhn4m   1/1     Running   0          10s   10.244.2.26   node03

# Edit the manifest again, as follows
k8s@node01:~/my_manifests/scheduler$ cat deployment-demo.yaml
...
        readinessProbe:
          httpGet:
            port: http
            path: /
            scheme: HTTP
      tolerations:
      - key: node-type
        value: ""
        operator: Exists   # the key only has to exist on the node
        effect: ""         # tolerate every effect
# After applying this, the pods tolerate the taints on both node02 and node03, so the scheduler rebalances them:
k8s@node01:~/my_manifests/scheduler$ kubectl get pods -o wide
NAME                            READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
myapp-deploy-69648445dd-47sj6   1/1     Running   0          58s   10.244.2.32   node03
myapp-deploy-69648445dd-9z64w   1/1     Running   0          69s   10.244.1.24   node02
myapp-deploy-69648445dd-dx5pp   1/1     Running   0          49s   10.244.1.25   node02

# Edit the manifest once more so the pods only tolerate "node-type: dev"
k8s@node01:~/my_manifests/scheduler$ cat deployment-demo.yaml
...
        readinessProbe:
          httpGet:
            port: http
            path: /
            scheme: HTTP
      tolerations:
      - key: node-type
        value: "dev"
        operator: Equal
        effect: "NoExecute"
        tolerationSeconds: 300   # evict after 300 seconds
# The pods on node02 can no longer tolerate its taint and are expected to be evicted after 300 seconds,
# but in this test they were evicted immediately.
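For cleanup (these commands are not part of the original walkthrough), a taint can be removed by appending "-" to the taint specification:

k8s@node01:~$ kubectl taint nodes node02 node-type=production:NoSchedule-
k8s@node01:~$ kubectl taint nodes node03 node-type=dev:NoExecute-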
