A Kubernetes operator for managing STONITH Block Device (SBD) configurations and remediations for high-availability clustering. The operator provides automated node remediation when nodes become unresponsive by leveraging shared block storage for fencing operations.
The SBD operator implements a cloud-native approach to SBD (also expanded as Storage-Based Death) for Kubernetes environments where traditional out-of-band management (IPMI, iDRAC) is unavailable. It uses shared block storage to provide reliable node fencing, ensuring data consistency and preventing split-brain scenarios in stateful workloads.
The operator consists of two main components:
- SBD Operator: Manages SBDConfig and SBDRemediation custom resources and deploys the SBD agent
- SBD Agent: Runs as a DaemonSet on cluster nodes, handling local watchdog operations and shared storage communication
Key features:
- Shared Storage Fencing: Uses CSI block PVs with concurrent multi-node access for inter-node communication
- Dual Watchdog System: Combines shared storage watchdog with local kernel watchdog for robust failure detection
- Kubernetes Integration: Native CRDs for configuration management and remediation requests
- Prometheus Metrics: Built-in monitoring and observability
- Split-Brain Prevention: Shared storage arbitration ensures cluster consistency
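The shared-storage liveness idea behind these features can be sketched in Go. This is only an illustration under stated assumptions, not the operator's actual on-disk format or agent logic: the `Device` and `Watcher` types, the 8-byte slot layout, and the missed-observation threshold are all hypothetical, and an in-memory byte slice stands in for the shared block device.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// slotSize is a hypothetical per-node slot width (one 64-bit counter).
const slotSize = 8

// Device stands in for the shared block device; here it is plain memory.
type Device struct{ buf []byte }

// NewDevice allocates one slot per node.
func NewDevice(nodes int) *Device {
	return &Device{buf: make([]byte, nodes*slotSize)}
}

// Beat increments the heartbeat counter stored in the node's slot.
func (d *Device) Beat(slot int) {
	off := slot * slotSize
	n := binary.LittleEndian.Uint64(d.buf[off : off+slotSize])
	binary.LittleEndian.PutUint64(d.buf[off:off+slotSize], n+1)
}

// Counter reads the heartbeat counter from a node's slot.
func (d *Device) Counter(slot int) uint64 {
	off := slot * slotSize
	return binary.LittleEndian.Uint64(d.buf[off : off+slotSize])
}

// Watcher remembers the last counter seen for each peer and how many
// consecutive observations found it unchanged.
type Watcher struct {
	last   []uint64
	missed []int
}

// NewWatcher creates a watcher for the given number of node slots.
func NewWatcher(nodes int) *Watcher {
	return &Watcher{last: make([]uint64, nodes), missed: make([]int, nodes)}
}

// Observe reads a peer's slot once and reports whether the peer has now
// missed at least maxMissed consecutive observations.
func (w *Watcher) Observe(d *Device, slot, maxMissed int) bool {
	cur := d.Counter(slot)
	if cur != w.last[slot] {
		w.last[slot] = cur
		w.missed[slot] = 0
		return false
	}
	w.missed[slot]++
	return w.missed[slot] >= maxMissed
}

func main() {
	dev := NewDevice(2)
	w := NewWatcher(2)
	for round := 1; round <= 3; round++ {
		dev.Beat(0) // node 0 keeps writing; node 1 never does
		fmt.Printf("round %d: node0 unresponsive=%v node1 unresponsive=%v\n",
			round, w.Observe(dev, 0, 2), w.Observe(dev, 1, 2))
	}
}
```

Comparing counters rather than wall-clock timestamps is a design choice made for this sketch: it avoids depending on synchronized clocks across nodes.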
SBDConfig defines the SBD configuration for the cluster:
- Shared block device PVC name
- Timeout settings
- Watchdog device path
- Node exclusion lists
- Reboot methods
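To make the settings above concrete, here is a small Go sketch of how such a configuration might be modeled and validated. Every field name, the `/dev/` prefix rule, and the set of reboot methods are assumptions made for illustration; they are not the SBDConfig CRD schema, which is defined by the operator itself.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Settings mirrors the kinds of options listed above.
// All field names are hypothetical, not the real CRD fields.
type Settings struct {
	SharedPVCName  string
	WatchdogPath   string
	TimeoutSeconds int
	ExcludedNodes  []string
	RebootMethod   string
}

// allowedRebootMethods is an assumed set used only by this sketch.
var allowedRebootMethods = map[string]bool{"panic": true, "reboot": true}

// Validate collects every problem it finds instead of stopping at the first.
func (s Settings) Validate() error {
	var errs []error
	if s.SharedPVCName == "" {
		errs = append(errs, errors.New("shared PVC name is required"))
	}
	if !strings.HasPrefix(s.WatchdogPath, "/dev/") {
		errs = append(errs, fmt.Errorf("watchdog path %q must be under /dev/", s.WatchdogPath))
	}
	if s.TimeoutSeconds <= 0 {
		errs = append(errs, errors.New("timeout must be positive"))
	}
	if !allowedRebootMethods[s.RebootMethod] {
		errs = append(errs, fmt.Errorf("unknown reboot method %q", s.RebootMethod))
	}
	return errors.Join(errs...) // nil when errs is empty
}

// IsExcluded reports whether a node is on the exclusion list.
func (s Settings) IsExcluded(node string) bool {
	for _, n := range s.ExcludedNodes {
		if n == node {
			return true
		}
	}
	return false
}

func main() {
	s := Settings{
		SharedPVCName:  "sbd-shared",
		WatchdogPath:   "/dev/watchdog",
		TimeoutSeconds: 30,
		ExcludedNodes:  []string{"control-plane-0"},
		RebootMethod:   "panic",
	}
	fmt.Println("valid:", s.Validate() == nil)
	fmt.Println("control-plane-0 excluded:", s.IsExcluded("control-plane-0"))
}
```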
SBDRemediation triggers node remediation operations:
- Target node specification
- Remediation status tracking
- Integration with Medik8s Node Healthcheck Operator
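Status tracking for a remediation request can be pictured as a small state machine. The Go sketch below is purely illustrative: the phase names, the allowed transitions, and the `Remediation` type are hypothetical and do not describe the real SBDRemediation status fields.

```go
package main

import "fmt"

// Phase is a hypothetical remediation phase.
type Phase string

const (
	PhasePending   Phase = "Pending"
	PhaseFencing   Phase = "Fencing"
	PhaseSucceeded Phase = "Succeeded"
	PhaseFailed    Phase = "Failed"
)

// Remediation tracks one request against one target node.
type Remediation struct {
	NodeName string
	Phase    Phase
}

// allowed lists the transitions this sketch permits; terminal phases have none.
var allowed = map[Phase][]Phase{
	PhasePending: {PhaseFencing},
	PhaseFencing: {PhaseSucceeded, PhaseFailed},
}

// Advance moves the request to a new phase if the transition is permitted.
func (r *Remediation) Advance(to Phase) error {
	for _, p := range allowed[r.Phase] {
		if p == to {
			r.Phase = to
			return nil
		}
	}
	return fmt.Errorf("invalid transition %s -> %s for node %s", r.Phase, to, r.NodeName)
}

func main() {
	r := &Remediation{NodeName: "worker-1", Phase: PhasePending}
	for _, to := range []Phase{PhaseFencing, PhaseSucceeded, PhaseFencing} {
		if err := r.Advance(to); err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("phase:", r.Phase)
	}
}
```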
- Kubernetes cluster with CSI driver supporting volumeMode: Block
- Shared block storage with concurrent multi-node access (e.g., Ceph RBD, cloud provider shared volumes)
- Cluster nodes with kernel watchdog support
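One way to check the watchdog prerequisite on a node is to confirm that the device node exists and is a character device. The Go sketch below assumes the conventional path `/dev/watchdog`, which may differ on a given system, and the helper name `IsCharDevice` is hypothetical. It deliberately uses only `os.Stat` and never opens the device, because opening a Linux watchdog device arms its timer.

```go
package main

import (
	"fmt"
	"os"
)

// IsCharDevice reports whether path exists and is a character device,
// which is what a kernel watchdog node is expected to be.
// It only stats the path; it does not open it.
func IsCharDevice(path string) (bool, error) {
	info, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	return info.Mode()&os.ModeCharDevice != 0, nil
}

func main() {
	const path = "/dev/watchdog" // assumed default; adjust for the node
	ok, err := IsCharDevice(path)
	switch {
	case err != nil:
		fmt.Printf("%s: not usable: %v\n", path, err)
	case !ok:
		fmt.Printf("%s: exists but is not a character device\n", path)
	default:
		fmt.Printf("%s: looks like a watchdog device node\n", path)
	}
}
```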
- Install the operator: make deploy
- Create an SBDConfig: kubectl apply -f config/samples/medik8s_v1alpha1_sbdconfig.yaml
Build and test locally:
# Build the operator
make build
# Run tests
make test
# Run e2e tests
make test-e2e
# Build and push images
make docker-build docker-push IMG=<your-registry>/sbd-operator:tag
Comprehensive documentation is available in the docs/ directory:
- Design Document - Architecture and design principles
- Blueprint - Detailed implementation blueprint
- User Guide - Configuration and usage
- Webhook Requirements - Admission webhook setup
The project includes comprehensive testing:
- Unit Tests: make test
- E2E Tests: make test-e2e
- Smoke Tests: make test-smoke
E2E tests deploy a complete operator environment and verify functionality end-to-end.
- Follow Go best practices and project coding standards
- Include comprehensive tests for new features
- Update documentation for user-facing changes
- Ensure all tests pass before submitting PRs
- Go 1.21+
- Kubernetes 1.28+
- Docker/Podman for container builds
- Make for build automation
Licensed under the Apache License 2.0. See LICENSE for details.
For issues, questions, or contributions, please use the GitHub issue tracker.