Our DefectDojo instance kept getting OOM-killed every 7 days. The Kubernetes node would fire a HostOomKillDetected alert, the pod would restart, and the cycle would repeat. Nobody was even using DefectDojo heavily — it was mostly idle.
Here’s how I tracked it down using Prometheus metrics and fixed it with two environment variables.
The Alert
[FIRING] HostOomKillDetected
container: uwsgi, pod: dojo-defectdojo-django-...
OOMKilled 1 times in the last 15 minutes
The uwsgi container inside the DefectDojo Django pod was being killed by the OOM killer.
Finding the Pattern
I queried the uwsgi container’s memory usage over 7 days using Prometheus:
container_memory_working_set_bytes{
  namespace="dojo",
  pod=~"dojo-defectdojo-django.*",
  container="uwsgi"
} / 1024 / 1024
The result was a perfect linear ramp:
Day 1: ~500 MB
Day 2: ~1200 MB
Day 3: ~1900 MB
Day 4: ~2600 MB
Day 5: ~3300 MB
Day 6: ~4000 MB
Day 7: ~5800 MB → OOM killed at 6Gi limit → restarts at 500 MB
Roughly 30 MB per hour, every hour, 24/7. The growth was linear and steady — not spiky, not correlated with any activity.
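A quick sanity check on those numbers (a sketch; the samples are read off the graph above, with day 7 excluded because its last reading sits right at the OOM kill):

```python
# Daily working-set samples in MB, taken from the Prometheus graph.
# Day 7 is excluded: that sample is the moment of the OOM kill.
samples_mb = {1: 500, 2: 1200, 3: 1900, 4: 2600, 5: 3300, 6: 4000}

days = sorted(samples_mb)
# Average growth per day over the linear stretch
per_day = (samples_mb[days[-1]] - samples_mb[days[0]]) / (days[-1] - days[0])
per_hour = per_day / 24

print(f"growth: {per_day:.0f} MB/day ≈ {per_hour:.1f} MB/hour")
# (4000 - 500) / 5 = 700 MB/day → ~29 MB/hour
```

At ~700 MB per day, a fresh worker set starting around 500 MB crosses a 6Gi limit in roughly a week — which matches the observed 7-day kill cycle.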
Ruling Things Out
Was it traffic? No. CPU usage was flat at 0.025 cores. The network received only 116 MB in 24 hours. Nobody was actively using it.
Was it the celery worker? No. Celery stayed rock-solid at 160 MB the entire week. Celery has built-in worker recycling (CELERY_WORKER_MAX_TASKS_PER_CHILD).
Was it cache? No. The memory breakdown showed almost all RSS, virtually no cache:
working_set: 533 MB
rss: 523 MB ← real heap memory
cache: 1 MB
swap: 0 MB
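The same reasoning can be written down as a small heuristic. This is a hypothetical helper, not anything from DefectDojo or cAdvisor — just the rule of thumb applied above: growth that is almost all RSS is heap (a leak candidate), while cache-dominated growth is reclaimable and usually harmless:

```python
# Hypothetical helper: given a container memory breakdown (MB), decide
# whether growth looks like heap (leak candidate) or page cache (harmless).
def classify_growth(working_set_mb: float, rss_mb: float, cache_mb: float) -> str:
    if working_set_mb == 0:
        return "no data"
    if rss_mb / working_set_mb > 0.9:
        return "heap-dominated: likely an application-level leak"
    if cache_mb / working_set_mb > 0.5:
        return "cache-dominated: kernel can reclaim it under pressure"
    return "mixed: inspect further"

# The breakdown above: almost all RSS, virtually no cache
print(classify_growth(working_set_mb=533, rss_mb=523, cache_mb=1))
```

The 0.9 and 0.5 thresholds are arbitrary illustration values; the point is the ratio, not the cutoffs.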
The Root Cause
uwsgi workers in DefectDojo are never recycled by default. They start when the container launches and run forever. Over time, they accumulate:
- Django ORM query caches (every unique query string is cached)
- Python memory fragmentation (CPython’s allocator doesn’t return freed memory to the OS)
- DefectDojo’s deduplication engine pre-loading templates
- Unreleased database connection state
This is a well-known pattern with long-lived Python web workers. Celery avoids it with max_tasks_per_child. uwsgi has equivalent settings, but DefectDojo’s Helm chart doesn’t enable them by default.
The Fix
Two environment variables in the Helm values:
django:
  extraEnv:
    - name: DD_UWSGI_MAX_REQUESTS
      value: "1000"
    - name: DD_UWSGI_RELOAD_ON_RSS
      value: "512"
- DD_UWSGI_MAX_REQUESTS: recycle each worker after 1000 requests
- DD_UWSGI_RELOAD_ON_RSS: recycle a worker if its RSS exceeds 512 MB
After applying the change, memory usage has stayed flat between 400 and 600 MB. Workers restart themselves before they can accumulate garbage. We also dropped the memory limit from 6Gi to 2Gi.
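The new 2Gi limit follows from a back-of-the-envelope bound. The worker count and overhead below are assumptions for illustration, not values checked against DefectDojo's chart defaults:

```python
# Back-of-the-envelope worst case after the fix.
workers = 2                # assumed uwsgi worker count (illustrative)
reload_on_rss_mb = 512     # DD_UWSGI_RELOAD_ON_RSS from the values above
overhead_mb = 300          # assumed master process + shared pages + slack

# Each worker is recycled before exceeding reload_on_rss_mb, so the
# container's footprint is roughly bounded by workers * threshold + overhead.
worst_case_mb = workers * reload_on_rss_mb + overhead_mb
print(f"worst case ≈ {worst_case_mb} MB")  # 1324 MB, comfortably under 2Gi
assert worst_case_mb < 2 * 1024
```

If your deployment runs more workers, scale the limit (or lower DD_UWSGI_RELOAD_ON_RSS) accordingly.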
Takeaway
If you see perfectly linear memory growth in a Python web application — no spikes, no correlation with traffic — look at worker recycling. uwsgi, Gunicorn, and similar WSGI servers all support max-requests and max-RSS thresholds. CPython's allocator rarely returns freed memory to the OS, so long-lived workers grow by design. The fix is always the same: don't let workers live forever.