k8s scheduling algorithms
- Predicates: filter the candidate nodes
- Priorities: score the remaining nodes
Predicates
Method signature
func Predicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {}
There are 20 predicate stages in total.
ps. Rather than translating them clumsily, just read the code comments directly. Source: kubernetes-master\pkg\scheduler\algorithm\predicates\predicates.go
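To make the shape of a predicate concrete, here is a minimal, self-contained sketch in the spirit of PodFitsHost. The tiny pod/nodeInfo structs are stand-ins for *v1.Pod and *schedulercache.NodeInfo and are not part of the scheduler API; this illustrates the pattern, it is not the real implementation.
package main

import "fmt"

// Simplified stand-ins for *v1.Pod and *schedulercache.NodeInfo, used here
// only to keep the sketch self-contained.
type pod struct {
	Name     string
	NodeName string // corresponds to pod.Spec.NodeName
}

type nodeInfo struct {
	Name string
}

// fitsHost mirrors the shape of a predicate: it returns whether the pod fits
// the node, plus a list of failure reasons when it does not.
func fitsHost(p *pod, n *nodeInfo) (bool, []string, error) {
	if p.NodeName == "" {
		// The pod does not ask for a specific node, so any node fits.
		return true, nil, nil
	}
	if p.NodeName == n.Name {
		return true, nil, nil
	}
	return false, []string{"PodFitsHost: requested node name does not match"}, nil
}

func main() {
	p := &pod{Name: "web-0", NodeName: "node-1"}
	for _, n := range []*nodeInfo{{Name: "node-1"}, {Name: "node-2"}} {
		fit, reasons, _ := fitsHost(p, n)
		fmt.Println(n.Name, "fit:", fit, "reasons:", reasons)
	}
}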
volume
- NoDiskConflict (important): evaluates if a pod can fit due to the volumes it requests, and those that are already mounted. If there is already a volume mounted on that node, another pod that uses the same volume can't be scheduled there. (A simplified sketch of this check follows this list.)
- NewMaxPDVolumeCountPredicate (important): creates a predicate which evaluates whether a pod can fit based on the number of volumes which match a filter that it requests, and those that are already present. The predicate looks for both volumes used directly, as well as PVC volumes that are backed by relevant volume types, counts the number of unique volumes, and rejects the new pod if it would place the total count over the maximum.
- NewVolumeZonePredicate (important): evaluates if a pod can fit due to the volumes it requests, given that some volumes may have zone scheduling constraints. The requirement is that any volume zone-labels must match the equivalent zone-labels on the node. It is OK for the node to have more zone-label constraints (for example, a hypothetical replicated volume might allow region-wide access). Currently this is only supported with PersistentVolumeClaims, and it looks only at the labels on the bound PersistentVolume. Working with volumes declared inline in the pod specification (i.e. not using a PersistentVolume) is likely to be harder, as it would require determining the zone of a volume during scheduling, and that is likely to require calling out to the cloud provider. It seems that we are moving away from inline volume declarations anyway.
- NewVolumeBindingPredicate: evaluates if a pod can fit due to the volumes it requests, for both bound and unbound PVCs. For PVCs that are bound, it checks that the corresponding PV's node affinity is satisfied by the given node. For PVCs that are unbound, it tries to find available PVs that can satisfy the PVC requirements and whose node affinity is satisfied by the given node. The predicate returns true if all bound PVCs have compatible PVs with the node, and if all unbound PVCs can be matched with an available and node-compatible PV.
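As mentioned in the NoDiskConflict item above, here is a simplified, self-contained sketch of that check. The volumeID type and the plain slices are stand-ins; the real predicate compares specific volume sources (GCE PD, AWS EBS, RBD, ISCSI) on the pods already assigned to the node.
package main

import "fmt"

// volumeID is a simplified stand-in for the per-plugin identity that the
// real predicate compares (e.g. the GCE PD name or the AWS EBS volume ID).
type volumeID string

type pod struct {
	Name    string
	Volumes []volumeID
}

// noDiskConflict reports whether newPod can be placed on a node that already
// runs existingPods, rejecting it when any requested volume is already in use.
func noDiskConflict(newPod pod, existingPods []pod) bool {
	inUse := map[volumeID]bool{}
	for _, p := range existingPods {
		for _, v := range p.Volumes {
			inUse[v] = true
		}
	}
	for _, v := range newPod.Volumes {
		if inUse[v] {
			return false // the same volume is already mounted on this node
		}
	}
	return true
}

func main() {
	existing := []pod{{Name: "db-0", Volumes: []volumeID{"pd-data"}}}
	fmt.Println(noDiskConflict(pod{Name: "db-1", Volumes: []volumeID{"pd-data"}}, existing)) // false
	fmt.Println(noDiskConflict(pod{Name: "web-0", Volumes: []volumeID{"pd-logs"}}, existing)) // true
}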
pod
- PodFitsResources (important): checks if a node has sufficient resources, such as cpu, memory, gpu, opaque int resources etc, to run a pod.
- PodMatchNodeSelector (important): checks if a pod's node selector matches the node's labels.
- PodFitsHost (important): checks if a pod's spec node name matches the current node.
- CheckNodeLabelPresence: checks whether all of the specified labels exist on a node or not, regardless of their value. If "presence" is false, it returns false if any of the requested labels matches any of the node's labels, otherwise it returns true. If "presence" is true, it returns false if any of the requested labels does not match any of the node's labels, otherwise it returns true. Consider the case where nodes are placed in regions/zones/racks and these are identified by labels: in some cases it is required that only nodes that are part of ANY of the defined regions/zones/racks be selected. Alternately, eliminating nodes that have a certain label, regardless of value, is also useful: a node may have a label with "retiring" as the key and the date as the value, and it may be desirable to avoid scheduling new pods on this node. (A sketch of this presence logic follows this list.)
- checkServiceAffinity: a predicate which matches nodes in such a way as to force that ServiceAffinity.labels are homogeneous for pods that are scheduled to a node. (i.e. it returns true IFF this pod can be added to this node such that all other pods in the same service are running on nodes with the exact same ServiceAffinity.label values.) For example: if the first pod of a service was scheduled to a node with the label "region=foo", all subsequent pods belonging to the same service will be scheduled onto nodes with the same "region=foo" label.
- PodFitsHostPorts (important): checks if a node has free ports for the requested pod ports.
- GeneralPredicates: checks whether noncriticalPredicates and EssentialPredicates pass. noncriticalPredicates are the predicates that only non-critical pods need; currently that is just PodFitsResources.
- EssentialPredicates: the predicates that all pods, including critical pods, need, namely PodFitsHost, PodFitsHostPorts and PodMatchNodeSelector.
- InterPodAffinityMatches: checks if a pod can be scheduled on the specified node with pod affinity/anti-affinity configuration.
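As mentioned in the CheckNodeLabelPresence item above, a minimal sketch of the presence/absence logic. It works on a plain map of labels instead of a *v1.Node, purely for illustration.
package main

import "fmt"

// checkNodeLabelPresence mirrors the logic described above:
//   presence == true:  all requested label keys must exist on the node;
//   presence == false: none of the requested label keys may exist.
// Label values are ignored in both cases.
func checkNodeLabelPresence(requested []string, nodeLabels map[string]string, presence bool) bool {
	for _, key := range requested {
		_, exists := nodeLabels[key]
		if presence && !exists {
			return false // a required label is missing
		}
		if !presence && exists {
			return false // a forbidden label is present
		}
	}
	return true
}

func main() {
	labels := map[string]string{"zone": "cn-north-1a", "retiring": "2019-01-01"}
	fmt.Println(checkNodeLabelPresence([]string{"zone"}, labels, true))      // true
	fmt.Println(checkNodeLabelPresence([]string{"retiring"}, labels, false)) // false: avoid retiring nodes
}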
node
ps. Use kubectl describe no {node-name} to view a node's status:
- CheckNodeUnschedulablePredicate: checks if a pod can be scheduled on a node whose spec is marked Unschedulable, i.e. it looks at the node's unschedulable field.
- PodToleratesNodeTaints: checks if a pod's tolerations can tolerate the node's taints (the node taints mechanism). (A simplified toleration check is sketched after this list.)
- PodToleratesNodeNoExecuteTaints: checks if a pod's tolerations can tolerate the node's NoExecute taints.
- CheckNodeMemoryPressurePredicate (important): checks if a pod can be scheduled on a node reporting the memory pressure condition.
- CheckNodeDiskPressurePredicate (important): checks if a pod can be scheduled on a node reporting the disk pressure condition.
- CheckNodePIDPressurePredicate: checks if a pod can be scheduled on a node reporting the pid pressure condition.
- CheckNodeConditionPredicate: checks if a pod can be scheduled on a node reporting out of disk, network unavailable or not ready conditions. Only node conditions are accounted for in this predicate.
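As mentioned in the PodToleratesNodeTaints item above, a simplified sketch of taint/toleration matching. The taint and toleration structs are stand-ins for v1.Taint and v1.Toleration, and corner cases such as an empty toleration key (which matches all taints) are left out.
package main

import "fmt"

// Simplified stand-ins for v1.Taint and v1.Toleration.
type taint struct {
	Key, Value, Effect string // Effect: "NoSchedule" or "NoExecute"
}

type toleration struct {
	Key, Operator, Value, Effect string // Operator: "Equal" or "Exists"
}

// tolerates reports whether a single toleration matches a single taint.
func tolerates(tol toleration, t taint) bool {
	if tol.Key != t.Key {
		return false
	}
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	// "Exists" ignores the value; "Equal" requires an exact value match.
	return tol.Operator == "Exists" || tol.Value == t.Value
}

// podToleratesNodeTaints mirrors the predicate: every taint on the node must
// be tolerated by at least one of the pod's tolerations.
func podToleratesNodeTaints(tolerations []toleration, taints []taint) bool {
	for _, t := range taints {
		tolerated := false
		for _, tol := range tolerations {
			if tolerates(tol, t) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

func main() {
	taints := []taint{{Key: "dedicated", Value: "gpu", Effect: "NoSchedule"}}
	tols := []toleration{{Key: "dedicated", Operator: "Equal", Value: "gpu", Effect: "NoSchedule"}}
	fmt.Println(podToleratesNodeTaints(tols, taints)) // true
	fmt.Println(podToleratesNodeTaints(nil, taints))  // false
}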
Priorities
ps. Source: kubernetes-master\pkg\scheduler\algorithm\priorities
ResourceAllocationPriority
// ResourceAllocationPriority contains information to calculate resource allocation priority.
type ResourceAllocationPriority struct {
	Name   string
	scorer func(requested, allocable *schedulercache.Resource, includeVolumes bool, requestedVolumes int, allocatableVolumes int) int64
}

// PriorityMap priorities nodes according to the resource allocations on the node.
// It will use `scorer` function to calculate the score.
func (r *ResourceAllocationPriority) PriorityMap(
	pod *v1.Pod,
	meta interface{},
	nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error)
- balancedResourceScorer (important): favors nodes with balanced resource usage rates. It should NOT be used alone, and MUST be used together with LeastRequestedPriority. It calculates the difference between the cpu and memory fraction of capacity, and prioritizes the host based on how close the two metrics are to each other. Formula: 10 - variance(cpuFraction, memoryFraction, volumeFraction) * 10. It picks the node whose resource usage is the most balanced.
- leastResourceScorer (important): favors nodes with fewer requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes based on the minimum of the average of the fraction of requested to capacity. Formula: (cpu((capacity - sum(requested)) * 10 / capacity) + memory((capacity - sum(requested)) * 10 / capacity)) / 2. It picks the most idle node. (A sketch of this scorer follows this list.)
- mostResourceScorer: favors nodes with most requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes based on the maximum of the average of the fraction of requested to capacity. Formula: (cpu(10 * sum(requested) / capacity) + memory(10 * sum(requested) / capacity)) / 2. It tries to fill up one node's resources as much as possible.
- requested_to_capacity_ratio: assigns a score of 1.0 to a resource when all capacity is available and 0.0 when the requested amount equals the capacity.
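As mentioned in the leastResourceScorer item above, a minimal sketch of that formula. It takes plain int64 values instead of *schedulercache.Resource and only covers cpu and memory.
package main

import "fmt"

// leastRequestedScore implements the per-resource formula above:
// (capacity - requested) * 10 / capacity, clamped to 0 when the node is
// already over-committed or has no capacity.
func leastRequestedScore(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return (capacity - requested) * 10 / capacity
}

// leastResourceScorer averages the cpu and memory scores, so an idle node
// scores close to 10 and a full node scores close to 0.
func leastResourceScorer(reqCPU, capCPU, reqMem, capMem int64) int64 {
	return (leastRequestedScore(reqCPU, capCPU) + leastRequestedScore(reqMem, capMem)) / 2
}

func main() {
	// A 2000m CPU / 8Gi memory node with 500m CPU / 2Gi memory already requested.
	fmt.Println(leastResourceScorer(500, 2000, 2<<30, 8<<30)) // (7 + 7) / 2 = 7
}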
image_locality (important)
favors nodes that already have requested pod container's images.
It will detect whether the requested images are present on a node, and then calculate a score ranging from 0 to 10
based on the total size of those images.
- If none of the images are present, this node will be given the lowest priority.
- If some of the images are present on a node, the larger their sizes' sum, the higher the node's priority.
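A minimal sketch of this size-based scoring. The two thresholds below are illustrative assumptions; the actual bounds and the image lookup against NodeInfo live in the image_locality priority's source file.
package main

import "fmt"

// Illustrative thresholds only; the real bounds are defined in the
// image_locality priority itself.
const (
	minImgSize int64 = 23 * 1024 * 1024   // below this the locality benefit is ignored
	maxImgSize int64 = 1000 * 1024 * 1024 // above this the score is capped at 10
)

// imageLocalityScore maps the total size of the pod's images already present
// on the node to a 0-10 score: none present -> 0, large images present -> 10.
func imageLocalityScore(sumSize int64) int64 {
	switch {
	case sumSize <= minImgSize:
		return 0
	case sumSize >= maxImgSize:
		return 10
	default:
		return 10 * (sumSize - minImgSize) / (maxImgSize - minImgSize)
	}
}

func main() {
	fmt.Println(imageLocalityScore(0))                 // 0: nothing cached on the node
	fmt.Println(imageLocalityScore(500 * 1024 * 1024)) // mid-range score
}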
interpod_affinity (important)
Computes a sum by iterating through the elements of weightedPodAffinityTerm and adding "weight" to the sum if the corresponding PodAffinityTerm is satisfied for that node; the node(s) with the highest sum are the most preferred.
Symmetry needs to be considered for preferredDuringSchedulingIgnoredDuringExecution from podAffinity & podAntiAffinity, and symmetry also needs to be considered for hard requirements from podAffinity.
node_affinity (important)
Scores nodes according to the scheduling preferences indicated in PreferredDuringSchedulingIgnoredDuringExecution. Each time a node matches a preferredSchedulingTerm, preferredSchedulingTerm.Weight is added to its score. Thus, the more preferredSchedulingTerms the node satisfies, and the higher the weights of the satisfied terms, the higher the node's score. (A minimal sketch follows.)
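A minimal sketch of that weighting. The preferredSchedulingTerm struct here is a stand-in that reduces the real NodeSelectorTerm matching to exact label equality, purely for illustration.
package main

import "fmt"

// preferredSchedulingTerm is a simplified stand-in: the real term carries a
// full NodeSelectorTerm; here it is reduced to a set of required label pairs.
type preferredSchedulingTerm struct {
	Weight      int32
	MatchLabels map[string]string
}

// nodeAffinityScore adds a term's Weight each time the node's labels satisfy
// that term, exactly as described above.
func nodeAffinityScore(terms []preferredSchedulingTerm, nodeLabels map[string]string) int32 {
	var score int32
	for _, term := range terms {
		matched := true
		for k, v := range term.MatchLabels {
			if nodeLabels[k] != v {
				matched = false
				break
			}
		}
		if matched {
			score += term.Weight
		}
	}
	return score
}

func main() {
	terms := []preferredSchedulingTerm{
		{Weight: 80, MatchLabels: map[string]string{"disktype": "ssd"}},
		{Weight: 20, MatchLabels: map[string]string{"zone": "cn-north-1a"}},
	}
	fmt.Println(nodeAffinityScore(terms, map[string]string{"disktype": "ssd"})) // 80
}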
node_label
checks whether a particular label exists on a node or not, regardless of its value.
If presence is true, prioritizes nodes that have the specified label, regardless of value.
If presence is false, prioritizes nodes that do not have the specified label.
node_prefer_avoid_pods
Prioritizes nodes according to the node annotation "scheduler.alpha.kubernetes.io/preferAvoidPods".
selector_spreading (important)
- SelectorSpreadPriority: spreads pods across hosts, considering pods belonging to the same service, RC, RS or StatefulSet. When a pod is being scheduled, it looks for services, RCs, RSs and StatefulSets that match the pod, then finds existing pods that match those selectors. It favors nodes that have fewer existing matching pods, i.e. it pushes the scheduler towards a node where there is the smallest number of pods matching the same service, RC, RS or StatefulSet selectors as the pod being scheduled. (A simplified scoring sketch follows this list.)
- ServiceAntiAffinityPriority: spreads pods by minimizing the number of pods belonging to the same service on a given machine.
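As mentioned in the SelectorSpreadPriority item above, a minimal sketch of the spreading score. It assumes the count of matching pods per node has already been collected; the real implementation resolves the matching services/RCs/RSs/StatefulSets through listers and normalizes the scores in a reduce step.
package main

import "fmt"

// spreadScores computes a 0-10 score per node from the number of pods on each
// node that match the same selectors as the pod being scheduled: the fewer
// matching pods a node already runs, the higher its score.
func spreadScores(matchingPodsPerNode map[string]int) map[string]int {
	maxCount := 0
	for _, c := range matchingPodsPerNode {
		if c > maxCount {
			maxCount = c
		}
	}
	scores := make(map[string]int, len(matchingPodsPerNode))
	for node, c := range matchingPodsPerNode {
		if maxCount == 0 {
			scores[node] = 10 // no matching pods anywhere: every node is equally good
			continue
		}
		scores[node] = 10 * (maxCount - c) / maxCount
	}
	return scores
}

func main() {
	// Replicas of the same service already running: 2 on node-a, 0 on node-b.
	fmt.Println(spreadScores(map[string]int{"node-a": 2, "node-b": 0})) // map[node-a:0 node-b:10]
}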
taint_toleration
Prepares the priority list for all the nodes based on the number of intolerable taints on the node. See taint-and-toleration for details.