How We Fixed API Downtime During Spot Instance Reclaims with Graceful Termination and Pod Disruption Budgets
Introduction
To optimize costs, we leveraged a mix of spot and reserved instances for our API deployments. While this approach significantly reduced our infrastructure expenses, it introduced a critical issue: spot instance reclaims.
Spot instances, while cost-effective, can be reclaimed by AWS at any time with only a short interruption notice (typically two minutes). During such events, any ongoing API requests handled by the pods running on these instances would fail, leading to cascading failures and impacting our health check endpoints. This scenario resulted in frequent incidents, disrupting our services and affecting user experience.
After thorough analysis and testing, we resolved these issues using Kubernetes features: termination grace period and Pod Disruption Budgets (PDBs). This blog will walk you through the incident, our analysis, and the solution we implemented to ensure a smooth and resilient deployment.
The Incident
Our API deployment runs two replicas to ensure high availability. However, whenever a spot instance was reclaimed by AWS, the pod running on it was terminated abruptly. This immediate termination caused the following issues:
- Request Failures: Ongoing requests handled by the terminating pod were dropped, leading to partial or complete failures.
- Health Check Failures: The sudden termination triggered health check failures. Kubernetes, detecting the unhealthy state, would try to reschedule the pod, causing further disruptions.
- Service Downtime: These cascading failures led to incidents, causing service downtime and affecting user satisfaction.
The Analysis
To mitigate the impact of spot instance reclaims, we needed to:
- Allow Time for Graceful Shutdown: Ensure that the API pod has enough time to complete ongoing requests before termination.
- Maintain Minimum Availability: Guarantee a minimum number of available pods during disruptions to ensure continuous service.
The Solution: Termination Grace Period and Pod Disruption Budgets
Termination Grace Period
The termination grace period gives a pod a defined window to shut down cleanly. When Kubernetes terminates a pod, it first sends SIGTERM; the pod then has the grace period to complete ongoing requests and clean up resources before it receives SIGKILL, preventing abrupt failures.
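The grace period only helps if the application actually reacts to SIGTERM. A minimal Python sketch of the pattern (the class name and timings are illustrative, not taken from our actual service): stop accepting new work on SIGTERM, then wait for in-flight requests to drain before exiting.

```python
import signal
import threading
import time

class GracefulShutdown:
    """Cooperate with terminationGracePeriodSeconds: on SIGTERM, stop
    taking new requests and wait for in-flight ones to finish."""

    def __init__(self):
        self.stopping = threading.Event()
        # Kubernetes sends SIGTERM first; SIGKILL follows after the grace period.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Flip the flag; the readiness probe should start failing from here on
        # so the endpoint is removed from the Service.
        self.stopping.set()

    def drain(self, in_flight_count, timeout=25.0):
        """Poll until no requests are in flight or the deadline passes.
        Keep timeout below the 30s grace period to leave time for cleanup."""
        deadline = time.monotonic() + timeout
        while in_flight_count() > 0 and time.monotonic() < deadline:
            time.sleep(0.1)
        return in_flight_count() == 0
```

A real server would wire `stopping` into its accept loop, so new connections are refused while existing ones complete inside the grace window.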
Pod Disruption Budgets (PDBs)
PDBs limit how many pods can be disrupted at a given time, guaranteeing that a minimum number remain available during voluntary disruptions such as node drains and upgrades. A spot reclaim by itself is an involuntary disruption, but when the node is drained in response to the reclaim notice (as spot termination handlers typically do), the resulting evictions do respect the PDB.
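The arithmetic behind a minAvailable-style PDB is simple: the eviction API permits a disruption only while the number of healthy pods stays above the floor. A sketch (the function name is ours, not a Kubernetes API):

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Evictions the API server will permit right now under a
    minAvailable PDB: healthy pods minus the required floor."""
    return max(0, healthy_pods - min_available)

# With 2 replicas and minAvailable: 1, one pod may be evicted at a time;
# once only 1 healthy pod remains, further eviction requests are rejected
# until a replacement pod becomes ready.
```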
Implementation
Here’s how we implemented the termination grace period and PDBs in our Kubernetes deployment.
1. Adding Termination Grace Period
We updated our deployment configuration to include the terminationGracePeriodSeconds field, which defines how long Kubernetes waits after sending SIGTERM before forcefully killing the pod.
Deployment Configuration (deployment.yaml)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-deployment
  labels:
    app: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: api-container
          image: api-image:latest
          readinessProbe:
            httpGet:
              path: /healthz
              port: http
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 2
            successThreshold: 1
            failureThreshold: 3
          # Other container settings...
```
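A common companion to the grace period (not part of our base config above) is a short preStop sleep. It delays SIGTERM so that endpoint removal can propagate to load balancers before the pod starts shutting down. A sketch of what this might look like under the container spec; the sleep length is illustrative and counts against terminationGracePeriodSeconds:

```yaml
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 5"]
```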
2. Configuring Pod Disruption Budget
We configured a PDB to ensure that at least one pod is always available, even during disruptions.
PDB Configuration (pdb.yaml)
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  labels:
    app: api
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: api
```
3. Updating values.yaml
To manage these configurations dynamically, we included them in our values.yaml:
```yaml
replicaCount: 2
terminationGracePeriodSeconds: 30
pdb:
  enabled: true
  minAvailable: 1
```
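With those values in place, the chart template can gate PDB creation on pdb.enabled. A sketch of what a templates/pdb.yaml might look like (the template path and structure are assumptions about a typical Helm chart, not our exact chart):

```yaml
{{- if .Values.pdb.enabled }}
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: {{ .Values.pdb.minAvailable }}
  selector:
    matchLabels:
      app: api
{{- end }}
```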
Conclusion
By implementing the termination grace period and PDBs, we successfully mitigated the impact of spot instance reclaims on our API service. The termination grace period ensured ongoing requests were completed gracefully, while PDBs maintained the required availability during disruptions.
This approach not only improved the resilience and reliability of our service but also allowed us to continue leveraging the cost benefits of spot instances without compromising on user experience.
If your infrastructure faces similar challenges with spot instance reclaims or other voluntary disruptions, consider implementing these Kubernetes features to enhance service availability and stability.