Kubernetes Operator Overview
KAOS manages the lifecycle of AI agents and their dependencies on Kubernetes.
Architecture
flowchart TB
subgraph api["Kubernetes API Server"]
crd1["Agent CRD"]
crd2["ModelAPI CRD"]
crd3["MCPServer CRD"]
end
subgraph controller["Agentic Operator Controller Manager<br/>(kaos-system namespace)"]
ar["AgentReconciler"]
mr["ModelAPIReconciler"]
mcpr["MCPServerReconciler"]
end
subgraph user["User Namespace"]
ad["Agent Deployment<br/>+ Service<br/>+ ConfigMap"]
md["ModelAPI Deploy<br/>+ Service<br/>+ ConfigMap"]
mcpd["MCPServer Deploy<br/>+ Service"]
end
crd1 --> ar
crd2 --> mr
crd3 --> mcpr
ar --> ad
mr --> md
mcpr --> mcpd
Controllers
AgentReconciler
Manages Agent custom resources:
Validate Dependencies
- Check ModelAPI exists and is Ready
- Check all MCPServers exist and are Ready
Resolve Peer Agents
- Find Agent resources listed in agentNetwork.access
- Collect their service endpoints
Create/Update Deployment
- Build environment variables
- Configure container with agent image
- Set resource limits
Create/Update Service
- Only if agentNetwork.expose: true
- Exposes port 80 → container 8000
Update Status
- Set phase (Pending/Waiting/Ready/Failed)
- Record endpoint URL
- Track linked resources
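The peer-resolution and Service steps above can be sketched in a few lines. This is an illustration in Python, not the operator's Go code. The `Agent` dataclass and the helpers `service_endpoint`, `resolve_peers`, and `service_ports` are hypothetical. The endpoint format (the standard in-cluster Service DNS name) and the rule that a peer must be exposed to be reachable are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical, simplified view of an Agent resource."""
    name: str
    namespace: str
    expose: bool = False                        # agentNetwork.expose
    access: list = field(default_factory=list)  # agentNetwork.access


def service_endpoint(agent):
    # Assumption: peers are reached through the in-cluster Service DNS
    # name (<service>.<namespace>.svc.cluster.local) on Service port 80.
    return f"http://{agent.name}.{agent.namespace}.svc.cluster.local:80"


def resolve_peers(agent, known_agents):
    """Map each peer named in agentNetwork.access to its endpoint.

    Raises LookupError when a listed peer is missing or (by assumption)
    has no Service because it is not exposed.
    """
    endpoints = {}
    for peer_name in agent.access:
        peer = known_agents.get(peer_name)
        if peer is None or not peer.expose:
            raise LookupError(f"peer agent {peer_name!r} is not reachable")
        endpoints[peer_name] = service_endpoint(peer)
    return endpoints


def service_ports(agent):
    """Return the Service port mapping, or None when no Service is created."""
    if not agent.expose:
        return None
    return {"port": 80, "targetPort": 8000}
```

Calling `resolve_peers(lead, agents)` with a dictionary of known agents returns one endpoint per entry in `access`. `service_ports` returns `None` for an agent that is not exposed, which mirrors the "only if" condition on the Service step.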
ModelAPIReconciler
Manages ModelAPI custom resources:
Determine Mode
- Proxy: LiteLLM container
- Hosted: Ollama container
Create ConfigMap (if needed)
- Wildcard mode: Auto-generated config
- Config mode: User-provided YAML
Create/Update Deployment
- Configure container and volumes
- Set environment variables
Create/Update Service
- Proxy: Port 8000
- Hosted: Port 11434
Update Status
- Record endpoint for agents to use
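As a rough illustration of the mode decision, the sketch below maps each mode to the container image and port named in this document. It is written in Python for brevity and is not the operator's Go code. The function name, the returned dictionary layout, and the idea that only Proxy mode needs a ConfigMap are assumptions.

```python
# Images and ports are taken from this document; everything else
# (names, dictionary layout) is an illustrative assumption.
MODEL_API_MODES = {
    "Proxy": {"image": "litellm/litellm", "port": 8000},
    "Hosted": {"image": "ollama/ollama", "port": 11434},
}


def model_api_plan(mode, user_config=None):
    """Describe what a reconciler might create for a ModelAPI."""
    if mode not in MODEL_API_MODES:
        raise ValueError(f"unknown ModelAPI mode: {mode!r}")
    plan = dict(MODEL_API_MODES[mode])
    if mode == "Proxy":
        # Config mode uses the user's YAML; wildcard mode generates one.
        plan["config_source"] = "user" if user_config else "generated"
    else:
        # Assumption: the Hosted (Ollama) container needs no ConfigMap.
        plan["config_source"] = None
    return plan
```

For example, `model_api_plan("Proxy")` selects the LiteLLM image with a generated configuration, while passing a YAML string as `user_config` switches the plan to the user-provided configuration.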
MCPServerReconciler
Manages MCPServer custom resources:
Determine Tool Source
- mcp: PyPI package name
- toolsString: Dynamic Python tools
Create/Update Deployment
- For mcp: Use Python image with pip install
- For toolsString: Use agent image with MCP_TOOLS_STRING
Create/Update Service
- Port 80 → container 8000
Update Status
- Record available tools
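The tool-source decision can be pictured as a small branching function. The sketch below is in Python and is not the operator's implementation. The `spec` dictionary, the function name, the exact install command, and the rule that exactly one source must be set are assumptions made for illustration.

```python
def mcp_container(spec):
    """Build a simplified container description for an MCPServer.

    `spec` is a hypothetical dict holding either "mcp" (a PyPI package
    name) or "toolsString" (Python tool definitions as text).
    """
    has_mcp = "mcp" in spec
    has_tools = "toolsString" in spec
    if has_mcp == has_tools:
        # Assumption: the two sources are mutually exclusive.
        raise ValueError("set exactly one of 'mcp' or 'toolsString'")
    if has_mcp:
        return {
            "image": "python:3.12-slim",
            # Assumption: the package is installed at container start;
            # how the server is launched afterwards is not modeled here.
            "command": ["pip", "install", spec["mcp"]],
            "env": {},
        }
    return {
        "image": "kaos-agent",
        "command": None,
        "env": {"MCP_TOOLS_STRING": spec["toolsString"]},
    }
```

With `{"mcp": "some-package"}` the sketch selects the Python image, and with `{"toolsString": "..."}` it selects the agent image and passes the tool source through `MCP_TOOLS_STRING`.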
Resource Dependencies
flowchart LR
Agent -->|requires| ModelAPI["ModelAPI (must be Ready)"]
Agent -.->|optional| MCPServers["MCPServer[] (must be Ready)"]
Agent -.->|optional| Peers["Agent[] (peer agents, must be Ready)"]
The operator waits for dependencies before marking an Agent as Ready.
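A minimal sketch of this gating rule follows, assuming each dependency exposes a simple Ready flag. The function names and the `Kind/name` labels in the result are hypothetical, and the code is written in Python for illustration rather than taken from the operator.

```python
def blocking_dependencies(model_api_ready, mcp_servers=None, peers=None):
    """Return the dependencies that are not Ready yet.

    model_api_ready: bool for the required ModelAPI.
    mcp_servers / peers: optional dicts mapping a name to its Ready flag.
    """
    blocking = []
    if not model_api_ready:
        blocking.append("ModelAPI")
    for kind, group in (("MCPServer", mcp_servers), ("Agent", peers)):
        for name, ready in (group or {}).items():
            if not ready:
                blocking.append(f"{kind}/{name}")
    return blocking


def agent_is_ready(model_api_ready, mcp_servers=None, peers=None):
    # An Agent can only become Ready once nothing is blocking it.
    return not blocking_dependencies(model_api_ready, mcp_servers, peers)
```

Returning the list of blocking dependencies, rather than a bare boolean, makes it easy to report in the resource status which dependency the Agent is still waiting on.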
Status Phases
| Phase | Description |
|---|---|
| Pending | Resource created, waiting for dependencies |
| Ready | All dependencies ready, pods running |
| Failed | Error occurred during reconciliation |
| Waiting | Waiting for ModelAPI/MCPServer to become ready |
Environment Variable Mapping
The operator translates CRD fields to container environment variables:
Agent Pod Environment
| CRD Field | Environment Variable |
|---|---|
| metadata.name | AGENT_NAME |
| config.description | AGENT_DESCRIPTION |
| config.instructions | AGENT_INSTRUCTIONS |
| ModelAPI.status.endpoint | MODEL_API_URL |
| config.env[MODEL_NAME] | MODEL_NAME |
| config.reasoningLoopMaxSteps | AGENTIC_LOOP_MAX_STEPS |
| config.memory.enabled | MEMORY_ENABLED |
| config.memory.type | MEMORY_TYPE |
| config.memory.contextLimit | MEMORY_CONTEXT_LIMIT |
| config.memory.maxSessions | MEMORY_MAX_SESSIONS |
| config.memory.maxSessionEvents | MEMORY_MAX_SESSION_EVENTS |
| agentNetwork.access | PEER_AGENTS |
| Each peer agent | PEER_AGENT_<NAME>_CARD_URL |
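The mapping in the table can be read as a simple translation function. The Python sketch below is illustrative only. It models `config.env` as a plain dictionary, joins `PEER_AGENTS` with commas, and upper-cases peer names with `-` replaced by `_`; all three are assumptions, as are the function name and its parameters.

```python
def agent_env(name, config, model_api_endpoint, peers):
    """Translate a simplified Agent spec into environment variables.

    config: dict mirroring the `config` block of the Agent resource.
    peers: dict mapping a peer agent name to its card URL.
    """
    memory = config.get("memory", {})
    candidates = {
        "AGENT_NAME": name,
        "AGENT_DESCRIPTION": config.get("description"),
        "AGENT_INSTRUCTIONS": config.get("instructions"),
        "MODEL_API_URL": model_api_endpoint,
        "MODEL_NAME": config.get("env", {}).get("MODEL_NAME"),
        "AGENTIC_LOOP_MAX_STEPS": config.get("reasoningLoopMaxSteps"),
        "MEMORY_ENABLED": memory.get("enabled"),
        "MEMORY_TYPE": memory.get("type"),
        "MEMORY_CONTEXT_LIMIT": memory.get("contextLimit"),
        "MEMORY_MAX_SESSIONS": memory.get("maxSessions"),
        "MEMORY_MAX_SESSION_EVENTS": memory.get("maxSessionEvents"),
    }
    # Environment values are strings; unset fields are skipped.
    env = {}
    for key, value in candidates.items():
        if value is None:
            continue
        env[key] = str(value).lower() if isinstance(value, bool) else str(value)
    if peers:
        # Assumption: PEER_AGENTS is a comma-separated list of names.
        env["PEER_AGENTS"] = ",".join(peers)
        for peer, url in peers.items():
            # Assumption: names are upper-cased and "-" becomes "_".
            env[f"PEER_AGENT_{peer.upper().replace('-', '_')}_CARD_URL"] = url
    return env
```

Under these assumptions, a peer named `web-worker` would produce a `PEER_AGENT_WEB_WORKER_CARD_URL` variable, and fields left unset in the resource produce no variable at all.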
ModelAPI Pod Environment
| Mode | Container | Key Environment |
|---|---|---|
| Proxy | litellm/litellm | proxyConfig.env[] |
| Hosted | ollama/ollama | serverConfig.env[], model pulled on start |
MCPServer Pod Environment
| Source | Container | Key Environment |
|---|---|---|
| mcp | python:3.12-slim | Package installed via pip |
| toolsString | kaos-agent | MCP_TOOLS_STRING |
RBAC Requirements
The operator requires specific permissions:
# In operator/config/rbac/role.yaml
# DO NOT REMOVE - Required for leader election
- apiGroups: [coordination.k8s.io]
  resources: [leases]
  verbs: [get, list, watch, create, update, patch, delete]
- apiGroups: [""]
  resources: [events]
  verbs: [create, patch]
# For managing resources
- apiGroups: [kaos.tools]
  resources: [agents, modelapis, mcpservers]
  verbs: [get, list, watch, create, update, patch, delete]
- apiGroups: [apps]
  resources: [deployments]
  verbs: [get, list, watch, create, update, patch, delete]
- apiGroups: [""]
  resources: [services, configmaps]
  verbs: [get, list, watch, create, update, patch, delete]
Important: RBAC rules are generated from // +kubebuilder:rbac: annotations in Go files. Never manually edit role.yaml.
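Kubernetes RBAC is additive: a request is allowed when at least one rule matches its API group, resource, and verb. The Python sketch below models that check against the rules listed above, which can help when reasoning about whether a new controller action needs an extra annotation. It is a simplification that ignores resourceNames, subresources, and non-resource URLs, and the names `RULES` and `allowed` are hypothetical.

```python
ALL_VERBS = ["get", "list", "watch", "create", "update", "patch", "delete"]

# The rules from role.yaml, transcribed as plain dictionaries.
RULES = [
    {"apiGroups": ["coordination.k8s.io"], "resources": ["leases"], "verbs": ALL_VERBS},
    {"apiGroups": [""], "resources": ["events"], "verbs": ["create", "patch"]},
    {"apiGroups": ["kaos.tools"],
     "resources": ["agents", "modelapis", "mcpservers"], "verbs": ALL_VERBS},
    {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": ALL_VERBS},
    {"apiGroups": [""], "resources": ["services", "configmaps"], "verbs": ALL_VERBS},
]


def allowed(rules, api_group, resource, verb):
    """Return True when any rule grants `verb` on `resource` in `api_group`."""
    def matches(values, wanted):
        # "*" is the RBAC wildcard for groups, resources, and verbs.
        return "*" in values or wanted in values

    return any(
        matches(rule["apiGroups"], api_group)
        and matches(rule["resources"], resource)
        and matches(rule["verbs"], verb)
        for rule in rules
    )
```

With these rules, `allowed(RULES, "apps", "deployments", "create")` is true, while a request to read Secrets in the core group is denied because no rule lists that resource.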
Building the Operator
cd operator
# Generate CRDs and RBAC
make generate
make manifests
# Build binary
go build -o bin/manager main.go
# Build Docker image
make docker-build
# Deploy to cluster
make deploy
Running Locally
For development, run the operator locally:
# Scale down deployed operator
kubectl scale deployment kaos-operator-controller-manager \
-n kaos-system --replicas=0
# Run locally
cd operator
make run
Watching Resources
Monitor operator logs:
kubectl logs -n kaos-system \
deployment/kaos-operator-controller-manager -f
Watch custom resources:
kubectl get agents,modelapis,mcpservers -A -w