Enhancing Application Resilience in Kubernetes with Pod Affinity and Anti-Affinity

Hemanth M Gowda
3 min read · Jun 19, 2024

High availability deployment with Pod Anti-Affinity

In modern cloud-native applications, ensuring high availability and resilience is crucial. When running workloads on Kubernetes, effective pod scheduling is key to maintaining application performance and reliability. In this blog, we’ll explore how to use pod affinity and anti-affinity in AWS EKS to enhance application resilience, especially in environments that leverage Karpenter for autoscaling.

The Challenge

Our infrastructure on AWS Elastic Kubernetes Service (EKS) utilizes Karpenter for autoscaling, balancing cost-efficiency and performance by using a mix of spot and on-demand instances. Spot instances, though cost-effective, are susceptible to sudden termination by AWS. This can pose a risk if critical services are not adequately distributed across nodes.

During a routine deployment, one of our critical API services suffered significant downtime. We traced the issue to all replicas of the service having been scheduled on a single node, which happened to be a spot instance. When AWS reclaimed that instance, every replica went down at once.

Understanding Pod Affinity and Anti-Affinity

To prevent such incidents, Kubernetes offers powerful scheduling features called pod affinity and anti-affinity. These features allow you to influence pod placement based on the relationship between pods.

Pod Affinity

Pod affinity enables you to specify rules that encourage certain pods to be placed on the same node or in close proximity to each other. This can be useful for improving performance by reducing latency between pods that frequently communicate.
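
As an illustration, here is a minimal sketch of a pod affinity rule, not taken from our own manifests. It assumes a hypothetical companion workload labeled app: cache and asks the scheduler to prefer nodes that already run such a pod. The label and the weight are placeholders chosen for the example.

```yaml
# Fragment for spec.template.spec of a Deployment (illustrative only).
affinity:
  podAffinity:
    # "preferred" is a soft rule: the scheduler tries to honor it,
    # but still places the pod elsewhere if no matching node fits.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50                # valid range is 1-100; 50 is arbitrary
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: cache          # hypothetical label of the companion pods
          topologyKey: "kubernetes.io/hostname"   # co-locate on the same node
```

A soft rule is used here because a hard affinity rule would leave the pod unschedulable whenever no node runs a matching pod.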

Pod Anti-Affinity

Pod anti-affinity, on the other hand, lets you define rules that keep certain pods out of the same topology domain, most commonly the same node. This is particularly useful for high availability and fault tolerance, because it distributes replicas across multiple nodes.

Implementing Pod Anti-Affinity

To avoid downtime caused by the termination of a single node, we implemented pod anti-affinity rules. These ensure that replicas of the same service are placed on different nodes, so losing one node takes down at most one replica.

Configuration Example

Here’s how we updated our deployment configuration to include pod anti-affinity rules:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: <release_name>
        topologyKey: "kubernetes.io/hostname"

In this configuration:

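  • requiredDuringSchedulingIgnoredDuringExecution: Makes this a hard rule that is checked only when a pod is scheduled. Pods that are already running are not evicted if the rule later stops holding.
  • labelSelector: Selects the pods the rule applies to, here those labeled app: <release_name>. This must match the label set on the deployment's own pod template.
  • topologyKey: Names the node label that defines a topology domain. With "kubernetes.io/hostname", each node is its own domain, so no two matching replicas are scheduled on the same node.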
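
One consequence of a required rule is that a replica stays Pending when no eligible node is free of matching pods. If that is too strict for a given workload, a softer variant is possible. The following is a sketch of that alternative, not the configuration we deployed; the weight value is an arbitrary choice for illustration.

```yaml
# Soft anti-affinity: spread replicas across nodes when possible,
# but allow co-location rather than leaving a pod unscheduled.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100               # 1-100; higher means a stronger preference
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: <release_name>
          topologyKey: "kubernetes.io/hostname"
```

The trade-off is that a soft rule no longer guarantees one replica per node, so a single node failure could again remove several replicas.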

Step-by-Step Guide

  1. Edit Deployment: Add the affinity rules to your deployment YAML file under the spec.template.spec section.
  2. Apply Changes: Use kubectl apply -f <your-deployment-file>.yaml to update the deployment.
  3. Verify: Confirm that the pods landed on different nodes by checking the NODE column in the output of:
     kubectl get pods -o wide
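
To show where the rule from step 1 sits in a full manifest, here is a minimal sketch of a complete Deployment. The name api-service, the replica count, and the image are hypothetical placeholders, not our actual service definition.

```yaml
# Hypothetical Deployment used only to show the placement of the affinity block.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service          # the anti-affinity labelSelector matches this label
    spec:
      affinity:                   # lives under spec.template.spec
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: api-service
              topologyKey: "kubernetes.io/hostname"
      containers:
        - name: api
          image: example.com/api-service:1.0.0   # placeholder image
```

With three replicas and this hard rule, the cluster needs at least three schedulable nodes; otherwise the extra replicas remain Pending until a node becomes available.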

Benefits of Pod Anti-Affinity

Implementing pod anti-affinity has several advantages:

  • Increased Resilience: By spreading replicas across multiple nodes, the failure of a single node does not impact the entire service.
  • Improved Availability: Ensures that your application remains available even if one or more nodes go down.
  • Reduced Resource Contention: Replicas of the same service no longer compete for CPU and memory on a single node. Anti-affinity only separates the matching pods, though; it does not balance overall load across the cluster.

Conclusion

Leveraging pod affinity and anti-affinity in Kubernetes is a powerful strategy to enhance the resilience and availability of your applications. In environments like AWS EKS, where autoscaling and cost optimization are critical, these features become indispensable.

By carefully configuring pod anti-affinity, we mitigated the risk of complete service downtime due to single-node failures. This experience highlights the importance of understanding and utilizing Kubernetes’ scheduling capabilities to build robust, fault-tolerant applications.
