Kubernetes Operator Overview

The KAOS operator manages the lifecycle of AI agents and their dependencies (ModelAPI and MCPServer resources) on Kubernetes.

Architecture

mermaid
flowchart TB
    subgraph api["Kubernetes API Server"]
        crd1["Agent CRD"]
        crd2["ModelAPI CRD"]
        crd3["MCPServer CRD"]
    end
    
    subgraph controller["Agentic Operator Controller Manager<br/>(kaos-system namespace)"]
        ar["AgentReconciler"]
        mr["ModelAPIReconciler"]
        mcpr["MCPServerReconciler"]
    end
    
    subgraph user["User Namespace"]
        ad["Agent Deployment<br/>+ Service<br/>+ ConfigMap"]
        md["ModelAPI Deploy<br/>+ Service<br/>+ ConfigMap"]
        mcpd["MCPServer Deploy<br/>+ Service"]
    end
    
    crd1 --> ar
    crd2 --> mr
    crd3 --> mcpr
    
    ar --> ad
    mr --> md
    mcpr --> mcpd

Controllers

AgentReconciler

Manages Agent custom resources:

  1. Validate Dependencies

    • Check ModelAPI exists and is Ready
    • Check all MCPServers exist and are Ready
  2. Resolve Peer Agents

    • Find Agent resources listed in agentNetwork.access
    • Collect their service endpoints
  3. Create/Update Deployment

    • Build environment variables
    • Configure container with agent image
    • Set resource limits
  4. Create/Update Service

    • Only if agentNetwork.expose: true
    • Exposes port 80 → container 8000
  5. Update Status

    • Set phase (Pending/Waiting/Ready/Failed)
    • Record endpoint URL
    • Track linked resources
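
The ordering of these steps can be sketched as a small pure function. This is only an illustration under stated assumptions: `AgentInputs`, `planAgentReconcile`, and the step names are hypothetical, and the early exit when a dependency is not Ready is an assumption about how the reconciler gates work, not a description of the operator's actual code.

```go
package main

import "fmt"

// AgentInputs is a hypothetical, flattened view of what the reconciler
// knows about one Agent. It is not the operator's real type.
type AgentInputs struct {
	ModelAPIReady   bool // the referenced ModelAPI is Ready
	MCPServersReady bool // every referenced MCPServer is Ready
	Expose          bool // mirrors agentNetwork.expose
}

// planAgentReconcile returns the ordered steps for one reconcile pass.
// Assumption: when a dependency is not Ready, only the status is updated.
func planAgentReconcile(in AgentInputs) []string {
	if !in.ModelAPIReady || !in.MCPServersReady {
		return []string{"validate-dependencies", "update-status"}
	}
	steps := []string{"validate-dependencies", "resolve-peers", "apply-deployment"}
	if in.Expose {
		// The Service maps port 80 to container port 8000.
		steps = append(steps, "apply-service")
	}
	return append(steps, "update-status")
}

func main() {
	in := AgentInputs{ModelAPIReady: true, MCPServersReady: true, Expose: true}
	for i, step := range planAgentReconcile(in) {
		fmt.Printf("%d. %s\n", i+1, step)
	}
}
```

Keeping the plan separate from the code that talks to the API server makes the gating logic easy to unit test.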

ModelAPIReconciler

Manages ModelAPI custom resources:

  1. Determine Mode

    • Proxy: LiteLLM container
    • Hosted: Ollama container
  2. Create ConfigMap (if needed)

    • Wildcard mode: Auto-generated config
    • Config mode: User-provided YAML
  3. Create/Update Deployment

    • Configure container and volumes
    • Set environment variables
  4. Create/Update Service

    • Proxy: Port 8000
    • Hosted: Port 11434
  5. Update Status

    • Record endpoint for agents to use
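
As a rough sketch of the mode selection, the function below maps a mode to the container image and Service port listed on this page. The `Runtime` type, the `runtimeForMode` helper, and the literal mode strings `"Proxy"` and `"Hosted"` are assumptions made for illustration. The values actually accepted by the CRD may be spelled differently.

```go
package main

import "fmt"

// Runtime describes the container a ModelAPI would run.
// Hypothetical type, used only for this example.
type Runtime struct {
	Image string
	Port  int
}

// runtimeForMode picks the image and Service port for a ModelAPI mode.
// Images and ports are the ones listed in this document.
func runtimeForMode(mode string) (Runtime, error) {
	switch mode {
	case "Proxy":
		return Runtime{Image: "litellm/litellm", Port: 8000}, nil
	case "Hosted":
		return Runtime{Image: "ollama/ollama", Port: 11434}, nil
	default:
		return Runtime{}, fmt.Errorf("unknown ModelAPI mode %q", mode)
	}
}

func main() {
	for _, mode := range []string{"Proxy", "Hosted"} {
		rt, err := runtimeForMode(mode)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: image=%s port=%d\n", mode, rt.Image, rt.Port)
	}
}
```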

MCPServerReconciler

Manages MCPServer custom resources:

  1. Determine Tool Source

    • mcp: PyPI package name
    • toolsString: Dynamic Python tools
  2. Create/Update Deployment

    • For mcp: Use Python image with pip install
    • For toolsString: Use agent image with MCP_TOOLS_STRING
  3. Create/Update Service

    • Port 80 → container 8000
  4. Update Status

    • Record available tools
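
The tool-source decision might look roughly like the following. `MCPContainer` and `containerForSource` are hypothetical names, and treating `mcp` and `toolsString` as mutually exclusive is an assumption this page does not confirm. How the installed package is started is left out because the page does not describe it.

```go
package main

import (
	"errors"
	"fmt"
)

// MCPContainer is a hypothetical summary of the container to deploy.
type MCPContainer struct {
	Image      string
	PipPackage string            // set for the mcp source
	Env        map[string]string // set for the toolsString source
}

// containerForSource chooses the container for an MCPServer.
// Assumption: exactly one of mcpPackage and toolsString must be set.
func containerForSource(mcpPackage, toolsString string) (MCPContainer, error) {
	switch {
	case mcpPackage != "" && toolsString != "":
		return MCPContainer{}, errors.New("set either mcp or toolsString, not both")
	case mcpPackage != "":
		// Python image; the package is installed with pip at startup.
		return MCPContainer{Image: "python:3.12-slim", PipPackage: mcpPackage}, nil
	case toolsString != "":
		// Agent image; the tool source is passed through the environment.
		return MCPContainer{
			Image: "kaos-agent",
			Env:   map[string]string{"MCP_TOOLS_STRING": toolsString},
		}, nil
	default:
		return MCPContainer{}, errors.New("no tool source set")
	}
}

func main() {
	c, err := containerForSource("", "def add(a: int, b: int) -> int:\n    return a + b")
	fmt.Println(c.Image, len(c.Env), err)
}
```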

Resource Dependencies

mermaid
flowchart LR
    Agent -->|requires| ModelAPI["ModelAPI (must be Ready)"]
    Agent -.->|optional| MCPServers["MCPServer[] (must be Ready)"]
    Agent -.->|optional| Peers["Agent[] (peer agents, must be Ready)"]

The operator marks an Agent as Ready only after its ModelAPI, and any MCPServers and peer agents it references, are Ready.
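
A minimal sketch of that readiness gate, assuming each dependency's phase has already been read from its status. The `notReady` helper and the `kind/name` keys are hypothetical and used only for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// notReady returns the names of dependencies whose phase is not "Ready",
// sorted so that status messages stay stable between reconciles.
func notReady(phases map[string]string) []string {
	var blocked []string
	for name, phase := range phases {
		if phase != "Ready" {
			blocked = append(blocked, name)
		}
	}
	sort.Strings(blocked)
	return blocked
}

func main() {
	// Hypothetical dependency names, keyed as kind/name.
	deps := map[string]string{
		"modelapi/my-model": "Ready",
		"mcpserver/search":  "Pending",
		"agent/helper":      "Ready",
	}
	blocked := notReady(deps)
	if len(blocked) == 0 {
		fmt.Println("all dependencies ready")
		return
	}
	fmt.Println("waiting for:", blocked)
}
```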

Status Phases

| Phase | Description |
|---|---|
| Pending | Resource created, waiting for dependencies |
| Ready | All dependencies ready, pods running |
| Failed | Error occurred during reconciliation |
| Waiting | Waiting for ModelAPI/MCPServer to become ready |

Environment Variable Mapping

The operator translates CRD fields to container environment variables:

Agent Pod Environment

| CRD Field | Environment Variable |
|---|---|
| metadata.name | AGENT_NAME |
| config.description | AGENT_DESCRIPTION |
| config.instructions | AGENT_INSTRUCTIONS |
| ModelAPI.status.endpoint | MODEL_API_URL |
| config.env[MODEL_NAME] | MODEL_NAME |
| config.reasoningLoopMaxSteps | AGENTIC_LOOP_MAX_STEPS |
| config.memory.enabled | MEMORY_ENABLED |
| config.memory.type | MEMORY_TYPE |
| config.memory.contextLimit | MEMORY_CONTEXT_LIMIT |
| config.memory.maxSessions | MEMORY_MAX_SESSIONS |
| config.memory.maxSessionEvents | MEMORY_MAX_SESSION_EVENTS |
| agentNetwork.access | PEER_AGENTS |
| Each peer agent | PEER_AGENT_<NAME>_CARD_URL |
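
The peer-agent rows of this mapping could be built roughly as shown below. Two details are assumptions this page does not state: that `PEER_AGENTS` is a comma-separated list of names, and that `<NAME>` is the peer name upper-cased with `-` replaced by `_`. The `Peer` type, `peerEnv` function, and example URL are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Peer is a hypothetical pairing of a peer agent name and its card URL.
type Peer struct {
	Name    string
	CardURL string
}

// peerEnv builds PEER_AGENTS plus one PEER_AGENT_<NAME>_CARD_URL per peer.
// Assumption: <NAME> is upper-cased with "-" replaced by "_".
func peerEnv(peers []Peer) map[string]string {
	env := map[string]string{}
	names := make([]string, 0, len(peers))
	for _, p := range peers {
		names = append(names, p.Name)
		suffix := strings.ToUpper(strings.ReplaceAll(p.Name, "-", "_"))
		env["PEER_AGENT_"+suffix+"_CARD_URL"] = p.CardURL
	}
	if len(names) > 0 {
		// Assumption: names are joined with commas.
		env["PEER_AGENTS"] = strings.Join(names, ",")
	}
	return env
}

func main() {
	env := peerEnv([]Peer{
		{Name: "web-search", CardURL: "http://web-search.example.svc"},
	})
	for key, value := range env {
		fmt.Printf("%s=%s\n", key, value)
	}
}
```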

ModelAPI Pod Environment

| Mode | Container | Key Environment |
|---|---|---|
| Proxy | litellm/litellm | proxyConfig.env[] |
| Hosted | ollama/ollama | serverConfig.env[], model pulled on start |

MCPServer Pod Environment

| Source | Container | Key Environment |
|---|---|---|
| mcp | python:3.12-slim | Package installed via pip |
| toolsString | kaos-agent | MCP_TOOLS_STRING |

RBAC Requirements

The operator requires specific permissions:

yaml
# In operator/config/rbac/role.yaml
# DO NOT REMOVE - Required for leader election
- apiGroups: [coordination.k8s.io]
  resources: [leases]
  verbs: [get, list, watch, create, update, patch, delete]

- apiGroups: [""]
  resources: [events]
  verbs: [create, patch]

# For managing resources
- apiGroups: [kaos.tools]
  resources: [agents, modelapis, mcpservers]
  verbs: [get, list, watch, create, update, patch, delete]

- apiGroups: [apps]
  resources: [deployments]
  verbs: [get, list, watch, create, update, patch, delete]

- apiGroups: [""]
  resources: [services, configmaps]
  verbs: [get, list, watch, create, update, patch, delete]

Important: RBAC rules are generated from // +kubebuilder:rbac: annotations in Go files. Never edit role.yaml by hand; change the annotations and regenerate the file with make manifests.
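
For orientation, markers that would produce rules like those in role.yaml look roughly like the sketch below. The `AgentReconciler` stub and its simplified `Reconcile` signature are placeholders so the file compiles on its own. The operator's real markers may be split across several controller files and may differ in detail.

```go
package main

import "fmt"

// AgentReconciler is a stand-in for the operator's reconciler type.
type AgentReconciler struct{}

// controller-gen reads the markers below and writes the matching rules
// into config/rbac/role.yaml when `make manifests` runs.

// +kubebuilder:rbac:groups=kaos.tools,resources=agents;modelapis;mcpservers,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups="",resources=services;configmaps,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups="",resources=events,verbs=create;patch
// +kubebuilder:rbac:groups=coordination.k8s.io,resources=leases,verbs=get;list;watch;create;update;patch;delete

// Reconcile is a stub with a simplified signature for this example.
func (r *AgentReconciler) Reconcile() error {
	return nil
}

func main() {
	r := &AgentReconciler{}
	fmt.Println("reconcile error:", r.Reconcile())
}
```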

Building the Operator

bash
cd operator

# Generate CRDs and RBAC
make generate
make manifests

# Build binary
go build -o bin/manager main.go

# Build Docker image
make docker-build

# Deploy to cluster
make deploy

Running Locally

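For development, run the operator locally. Scale the deployed operator to zero first so that two controller instances do not reconcile the same resources: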

bash
# Scale down deployed operator
kubectl scale deployment kaos-operator-controller-manager \
  -n kaos-system --replicas=0

# Run locally
cd operator
make run

Watching Resources

Monitor operator logs:

bash
kubectl logs -n kaos-system \
  deployment/kaos-operator-controller-manager -f

Watch custom resources:

bash
kubectl get agents,modelapis,mcpservers -A -w

Released under the Apache 2.0 License.