I Built a Local SRE Practice Lab on OrbStack

Platforms like sadservers.com and iximiuz Labs are excellent for SRE scenario practice: real infrastructure, well-designed challenges, no setup required. I use them regularly.

sre-dojo is a local alternative built on the same idea: broken environments you debug and fix, with a single command to validate the solution. It runs on OrbStack on your Mac, lives in a git repo, and takes about 30 seconds to spin up any scenario.

The repo: github.com/75asu/sre-dojo

The Design

Each scenario is a folder with four files:

scenarios/docker/chennai/
├── scenario.yaml   # metadata: title, difficulty, tags, description
├── orb-setup.sh    # creates the OrbStack machine and introduces the break
├── verify.sh       # runs inside the machine and confirms the fix
└── README.md       # problem statement, no hints
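
The four-file layout is easy to check mechanically. Here is a minimal sketch of such a check, assuming scenarios live under scenarios/<type>/<name>/ as in the tree above. The helper name missing_files is made up for illustration and is not something from the repo.

```python
from pathlib import Path

# The four files named in the layout above.
REQUIRED_FILES = ("scenario.yaml", "orb-setup.sh", "verify.sh", "README.md")


def missing_files(scenario_dir: Path) -> list[str]:
    """Return the required files that are absent from a scenario folder."""
    return [name for name in REQUIRED_FILES if not (scenario_dir / name).is_file()]


if __name__ == "__main__":
    # Hypothetical usage: check every folder matching scenarios/<type>/<name>/.
    for folder in sorted(Path("scenarios").glob("*/*")):
        if folder.is_dir():
            gaps = missing_files(folder)
            print(f"{folder}: {'ok' if not gaps else 'missing ' + ', '.join(gaps)}")
```

A check like this could run in CI so that a contributed scenario without a verify.sh is caught before anyone tries to start it.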

The runner (lab.py) orchestrates everything:

./lab.py start chennai    # provisions the machine, introduces the break
./lab.py verify chennai   # runs verify.sh, prints pass/fail
./lab.py stop chennai     # destroys the machine

You get a broken environment in 30 seconds. Fix it. Run verify. Done.
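
To make the start/verify/stop flow concrete, here is a rough sketch of how a runner could map each action to an OrbStack command. This is not the actual lab.py. The machine-per-scenario naming, the VERIFY_PATH location, and the `orb -m <machine> <command>` form for running a command inside a machine are all assumptions to check against the repo and the OrbStack docs. The sketch only builds and prints commands; it does not execute them.

```python
import argparse

# Assumption: one OrbStack machine per scenario, named after the scenario.
DEFAULT_IMAGE = "ubuntu:25.04"
# Hypothetical location of verify.sh inside the machine.
VERIFY_PATH = "/tmp/verify.sh"


def build_command(action: str, scenario: str, image: str = DEFAULT_IMAGE) -> list[str]:
    """Map a lab action to the command a runner might execute for it."""
    if action == "start":
        # Only machine creation is shown; running orb-setup.sh would follow.
        return ["orb", "create", image, scenario]
    if action == "verify":
        # Assumption: `orb -m <machine> <cmd>` runs <cmd> inside that machine.
        return ["orb", "-m", scenario, "bash", VERIFY_PATH]
    if action == "stop":
        return ["orb", "delete", scenario]
    raise ValueError(f"unknown action: {action}")


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse `<action> <scenario>` the way the commands above are invoked."""
    parser = argparse.ArgumentParser(prog="lab.py")
    parser.add_argument("action", choices=["start", "verify", "stop"])
    parser.add_argument("scenario")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["start", "chennai"])  # stand-in for sys.argv[1:]
    print(" ".join(build_command(args.action, args.scenario)))
```

Keeping command construction separate from execution makes the runner easy to test without OrbStack installed.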

Why OrbStack

OrbStack machines are lightweight Linux VMs on macOS. They start in seconds, support networking between machines, and are fully disposable. orb create ubuntu:25.04 mylab gives you an Ubuntu machine; orb delete mylab removes it cleanly.

The alternative was Docker containers, but containers don’t run systemd out of the box, can’t realistically simulate disk-level failures, and rule out some networking scenarios. OrbStack machines behave like real Linux servers.

The Break Pattern

The orb-setup.sh script does two things: sets up the environment and introduces the break.

For a RabbitMQ scenario, setup looks like this:

# install deps
pip3 install pika

# deploy rabbitmq with wrong credentials
mkdir -p ~/app   # make sure the target directory exists before writing into it
cat > ~/app/docker-compose.yml << EOF
services:
  rabbitmq:
    image: rabbitmq:3-management
    environment:
      RABBITMQ_DEFAULT_USER: wronguser
      RABBITMQ_DEFAULT_PASS: wrongpass
    ports:
      - "5672:5672"
EOF

docker compose -f ~/app/docker-compose.yml up -d

# drop scripts that expect the default credentials
cat > ~/producer.py << 'EOF'
# ... hardcoded to guest/guest
EOF

The scripts expect guest/guest. The container runs with wronguser/wrongpass. The user has to find and fix the mismatch. Setup and break live in one script, so the scenario is reproducible every time.

The Verify Pattern

verify.sh is the source of truth. It doesn’t hint at the solution — it just tests the outcome:

#!/bin/bash
set -e

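# "|| true" stops set -e from aborting here, so the FAIL message below still prints
output=$(python3 ~/producer.py hello-lwc 2>&1) || true
if ! echo "$output" | grep -q "Message sent"; then
  echo "FAIL: producer did not send message"
  exit 1
fi

# same guard: a crashing consumer should reach the comparison, not exit silently
result=$(python3 ~/consumer.py 2>&1) || true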
if [ "$result" != "hello-lwc" ]; then
  echo "FAIL: consumer got '$result', expected 'hello-lwc'"
  exit 1
fi

echo "PASS"

Run ./lab.py verify chennai. If it exits 0, you fixed it. No ambiguity.
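
The exit-code contract is the whole interface between verify.sh and the runner. Below is a minimal sketch of how a runner could turn that exit code into pass/fail. It is an assumed shape, not the actual lab.py code, and the function name run_check is hypothetical. The demo uses small Python one-liners as stand-ins for verify.sh so it runs anywhere.

```python
import subprocess
import sys


def run_check(command: list[str]) -> bool:
    """Run a verification command and report whether it exited with status 0."""
    completed = subprocess.run(command, capture_output=True, text=True)
    # Echo whatever the check printed (e.g. its PASS/FAIL line) for the user.
    sys.stdout.write(completed.stdout)
    return completed.returncode == 0


if __name__ == "__main__":
    # Stand-ins for verify.sh: one command that exits 0, one that exits 1.
    passing = [sys.executable, "-c", "print('PASS')"]
    failing = [sys.executable, "-c", "import sys; print('FAIL: demo'); sys.exit(1)"]
    for cmd in (passing, failing):
        print("pass" if run_check(cmd) else "fail")
```

Because only the exit status decides the result, the wording of the FAIL messages can change freely without breaking the runner.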

Scenario Types

Three categories so far:

Linux — filesystem, systemd, cron, permissions, process management
Docker — compose, networking, Caddy, RabbitMQ, container debugging
Kubernetes — StatefulSet scheduling, ConfigMap drift, CrashLoopBackOff, Helm

Each scenario is tagged by tool so you can filter by what you want to practice. scenario.yaml carries the metadata:

name: chennai
title: "Fix the RabbitMQ Cluster"
difficulty: medium
type: docker
status: ready
tags: [rabbitmq, docker, applications]
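
Filtering by tag is then a small operation over that metadata. This sketch assumes each scenario.yaml has already been loaded into memory (for example with a YAML parser, not shown here). The Scenario dataclass and filter_by_tag are hypothetical names, and the second catalog entry is invented purely so the filter has something to exclude.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """The subset of scenario.yaml fields needed for filtering."""
    name: str
    difficulty: str
    type: str
    tags: list[str] = field(default_factory=list)


def filter_by_tag(scenarios: list[Scenario], tag: str) -> list[str]:
    """Return the names of scenarios carrying the given tag, sorted."""
    return sorted(s.name for s in scenarios if tag in s.tags)


if __name__ == "__main__":
    catalog = [
        Scenario("chennai", "medium", "docker", ["rabbitmq", "docker", "applications"]),
        # Hypothetical entry, only here to show the filter excluding something.
        Scenario("example-linux", "easy", "linux", ["systemd"]),
    ]
    print(filter_by_tag(catalog, "rabbitmq"))
```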

Takeaway

What made this worth building: full local control. A new scenario takes about 20 minutes to write. I can design breaks that mirror real incidents I’ve dealt with: credential mismatches, misconfigured health checks, a PodDisruptionBudget (PDB) blocking a node drain. That specificity is hard to get from a public platform.

The repo is open: github.com/75asu/sre-dojo. PRs welcome if you want to add a scenario.