Troubleshooting Guide
Common issues and solutions for KAOS.
Agent Issues
Agent Stuck in Pending Phase
Symptoms: Agent status shows phase: Pending or phase: Waiting
Diagnosis:
kubectl describe agent my-agent -n my-namespace
kubectl get modelapi,mcpserver -n my-namespace
Common Causes:
ModelAPI not ready
kubectl get modelapi -n my-namespace
# If not Ready, check ModelAPI troubleshooting section
MCPServer not ready
kubectl get mcpserver -n my-namespace
# If not Ready, check MCPServer troubleshooting section
Peer agent not ready
kubectl get agent -n my-namespace
# Agents in agentNetwork.access must be Ready first
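The three checks above can be combined into one helper that prints only the resources that are not yet ready. This is a minimal sketch, not part of KAOS: the function name `kaos_not_ready` is hypothetical, and it assumes each resource reports its phase at `.status.phase` with `Ready` as the healthy value.

```shell
# List resources of one kind whose phase is not "Ready".
# Assumption: the phase lives at .status.phase and the healthy value is "Ready".
kaos_not_ready() {
  local kind="$1" namespace="$2"
  kubectl get "$kind" -n "$namespace" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}' |
    awk -F'\t' '$1 != "" && $2 != "Ready" { print $1 "\t" ($2 == "" ? "<no phase>" : $2) }'
}
```

For example, `for kind in agent modelapi mcpserver; do kaos_not_ready "$kind" my-namespace; done` prints one line per blocking dependency and nothing when everything is ready.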
Agent Pod CrashLoopBackOff
Diagnosis:
kubectl logs -l app=my-agent -n my-namespace
kubectl describe pod -l app=my-agent -n my-namespace
Common Causes:
Invalid MODEL_API_URL
- Check if ModelAPI service exists
- Verify endpoint is reachable
Image not found
- Ensure axsauze/kaos-agent:latest is available
- For remote clusters, push to registry
Python errors
- Check agent server startup logs
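When a pod is in CrashLoopBackOff, the current container has often just restarted and its logs are nearly empty. The standard `--previous` flag of `kubectl logs` shows the output of the container instance that crashed. The wrapper below is a sketch; the function name `kaos_crash_logs` is hypothetical.

```shell
# Print logs from the previous (crashed) container instance of each pod
# matching a label selector. --previous and --tail are standard kubectl flags.
kaos_crash_logs() {
  local selector="$1" namespace="$2" tail="${3:-50}"
  local pod
  for pod in $(kubectl get pods -l "$selector" -n "$namespace" -o name); do
    echo "== $pod =="
    kubectl logs "$pod" -n "$namespace" --previous --tail="$tail" ||
      echo "(no previous container for $pod)"
  done
}
```

Usage: `kaos_crash_logs app=my-agent my-namespace 100`.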
Agent Returns Errors
Diagnosis:
# Check agent logs
kubectl logs -l app=my-agent -n my-namespace -f
# Check memory events
kubectl port-forward svc/my-agent 8000:80 -n my-namespace
# Run in a second terminal while the port-forward is active
curl http://localhost:8000/memory/events | jq
Common Causes:
LLM connection failed
- Verify MODEL_API_URL is correct
- Check ModelAPI is responding
Tool execution failed
- Check MCPServer logs
- Verify tool arguments are valid
Delegation failed
- Check peer agent is accessible
- Verify peer agent name matches exactly
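To see at a glance which kind of event dominates the memory log, the events can be counted by type. This is a sketch that assumes the response has the shape `{"events": [{"event_type": "..."}]}`, as suggested by the jq filter used elsewhere in this guide; the function name `kaos_event_counts` is hypothetical.

```shell
# Summarize memory events by type, reading the JSON payload from stdin.
# Assumption: the payload looks like {"events": [{"event_type": "..."}, ...]}.
# Output: one "<count><TAB><event_type>" line per type, sorted by type name.
kaos_event_counts() {
  jq -r '[.events[].event_type] | group_by(.) | map("\(length)\t\(.[0])") | .[]'
}
```

Usage, with the port-forward from the diagnosis step running: `curl -s http://localhost:8000/memory/events | kaos_event_counts`.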
ModelAPI Issues
ModelAPI Stuck in Pending
Diagnosis:
kubectl describe modelapi my-modelapi -n my-namespace
kubectl get pods -l app=my-modelapi -n my-namespace
Common Causes:
Image pull error
kubectl describe pod -l app=my-modelapi -n my-namespace | grep -A5 "Events:"
Insufficient resources
- Hosted mode requires significant memory for models
- Increase resource limits
Model download in progress (Hosted mode)
- Large models can take 10+ minutes to download
- Check logs for download progress:
kubectl logs -l app=my-modelapi -n my-namespace
Proxy Mode Not Connecting to Backend
Diagnosis:
# Check LiteLLM logs
kubectl logs -l app=my-modelapi -n my-namespace
# Test connectivity from inside cluster
kubectl exec -it deploy/my-agent -n my-namespace -- \
curl http://my-modelapi:8000/health
Common Causes:
Wrong apiBase URL
- For Docker Desktop: use http://host.docker.internal:<port>
- For in-cluster: use service name
Backend not running
- Verify Ollama/OpenAI is accessible
Firewall blocking connection
- Check network policies
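If the agent image does not ship `curl`, or the agent pod is itself failing, the same connectivity test can be run from a short-lived pod. The sketch below assumes the public `curlimages/curl` image is pullable from the cluster; the pod name `kaos-net-test` and the function name `kaos_probe` are hypothetical.

```shell
# Probe a URL from a temporary pod inside the cluster and print the HTTP status.
# Assumptions: curlimages/curl can be pulled, and no pod named "kaos-net-test"
# already exists in the namespace.
kaos_probe() {
  local url="$1" namespace="$2"
  kubectl run kaos-net-test --rm -i --restart=Never \
    --image=curlimages/curl -n "$namespace" -- \
    curl -sS -m 10 -o /dev/null -w '%{http_code}\n' "$url"
}
```

Usage: `kaos_probe http://my-modelapi:8000/health my-namespace`. A timeout or connection error here, when the ModelAPI pod itself is healthy, points to a network policy or a wrong service name rather than to the backend.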
Hosted Mode Model Not Available
Diagnosis:
kubectl logs -l app=my-modelapi -n my-namespace
Common Causes:
Model name incorrect
- Use exact Ollama model name (e.g., smollm2:135m)
Insufficient disk space
- Models require disk space for download
Download timeout
- Large models may time out; check readiness probe settings
MCPServer Issues
MCPServer CrashLoopBackOff
Diagnosis:
kubectl logs -l app=my-mcp -n my-namespace
kubectl describe pod -l app=my-mcp -n my-namespace
Common Causes:
Invalid toolsString syntax
- Test Python code locally first
- Check for syntax errors in logs
Package not found (mcp option)
- Verify PyPI package name is correct
- Package must implement MCP protocol
Missing dependencies
- For toolsString, only standard library is available
- Use mcp option for complex dependencies
Tools Not Discovered by Agent
Diagnosis:
# Check MCPServer is ready
kubectl get mcpserver my-mcp -n my-namespace
# Test tools endpoint
kubectl exec -it deploy/my-agent -n my-namespace -- \
curl http://my-mcp/mcp/tools
Common Causes:
MCPServer not referenced in Agent
spec:
  mcpServers:
    - my-mcp  # Must be listed here
Tool discovery failed
- Check MCPClient initialization in agent logs
Tools not enabled
config:
  agenticLoop:
    enableTools: true  # Must be true
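The first cause can be checked against the live resource instead of the manifest on disk. This sketch reads `.spec.mcpServers` from the Agent, the field shown in the snippet above; the function name `kaos_agent_references_mcp` is hypothetical.

```shell
# Succeed (exit 0) if the Agent lists the given MCPServer in spec.mcpServers.
# Assumption: spec.mcpServers is a list of plain server names.
kaos_agent_references_mcp() {
  local agent="$1" mcp="$2" namespace="$3"
  kubectl get agent "$agent" -n "$namespace" \
    -o jsonpath='{range .spec.mcpServers[*]}{@}{"\n"}{end}' |
    grep -Fxq -- "$mcp"
}
```

Usage: `kaos_agent_references_mcp my-agent my-mcp my-namespace || echo "my-mcp is not referenced by my-agent"`. The match is exact and case-sensitive.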
Operator Issues
Operator Not Starting
Diagnosis:
kubectl logs -n kaos-system deployment/kaos-operator-controller-manager
Common Causes:
RBAC permissions missing
- Leases permission required for leader election
- Check role.yaml includes leases and events
CRDs not installed
kubectl get crds | grep kaos.tools
Image not available
- Check operator image is pullable
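The RBAC cause can be verified without reading role.yaml by asking the API server what the operator's service account may do, using the standard `kubectl auth can-i` command with impersonation. This is a sketch: the service account name is not given in this guide and must be supplied by the caller, the function name `kaos_check_operator_rbac` is hypothetical, and running it requires permission to impersonate service accounts.

```shell
# Report whether a service account may manage leases and create events.
# Assumption: the caller knows the operator's service account name; it is
# passed as the first argument because this guide does not state it.
kaos_check_operator_rbac() {
  local sa="$1" namespace="${2:-kaos-system}"
  local as="system:serviceaccount:${namespace}:${sa}"
  local verb
  for verb in get create update; do
    printf '%s leases: ' "$verb"
    kubectl auth can-i "$verb" leases.coordination.k8s.io --as="$as" -n "$namespace"
  done
  printf 'create events: '
  kubectl auth can-i create events --as="$as" -n "$namespace"
}
```

Any line ending in `no` indicates a permission the role or its binding does not grant.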
Resources Not Reconciling
Diagnosis:
kubectl logs -n kaos-system deployment/kaos-operator-controller-manager -f
Common Causes:
Operator not running
kubectl get pods -n kaos-system
Watch error
- Check for permission errors in logs
- Verify RBAC is correctly applied
Panic/crash
- Check logs for stack traces
- Report bugs with reproduction steps
Multi-Agent Issues
Delegation Not Working
Diagnosis:
# Check coordinator memory
kubectl port-forward svc/coordinator 8000:80 -n my-namespace
# Run in a second terminal while the port-forward is active
curl http://localhost:8000/memory/events | jq '.events[] | select(.event_type | contains("delegation"))'
Common Causes:
Peer agent not in access list
agentNetwork:
  access:
    - worker-1  # Must list all delegatable agents
Peer agent service not exposed
agentNetwork:
  expose: true  # Required for peer agents
Delegation disabled
config:
  agenticLoop:
    enableDelegation: true  # Must be true
Agent name mismatch
- Name in delegation must match exactly
Delegation Timeout
Common Causes:
Peer agent slow to respond
- Increase timeout in RemoteAgent
Network issues
- Check service connectivity
Peer agent overloaded
- Scale peer agents or add replicas
Performance Issues
Slow Response Times
Diagnosis:
# Check which step is slow
kubectl logs -l app=my-agent -n my-namespace | grep -i "step\|time"
Common Causes:
Model too slow
- Use smaller model for faster inference
- Consider GPU acceleration
Too many agentic loop steps
- Reduce maxSteps if appropriate
- Improve instructions to reduce iterations
Tool execution slow
- Optimize tool implementations
- Add caching if appropriate
Memory Issues
Diagnosis:
kubectl top pods -n my-namespace
Common Causes:
Too many sessions
- Sessions accumulate in memory
- Consider periodic cleanup
Large model in Hosted mode
- Increase memory limits
- Use smaller model
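To find the heaviest pods quickly, the `kubectl top` output can be sorted by memory and truncated. This sketch uses the standard `--sort-by` and `--no-headers` flags and, like `kubectl top` itself, requires a metrics server in the cluster; the function name `kaos_top_memory` is hypothetical.

```shell
# Show the N pods using the most memory in a namespace (default N = 5).
# Requires metrics-server, as kubectl top does.
kaos_top_memory() {
  local namespace="$1" count="${2:-5}"
  kubectl top pods -n "$namespace" --sort-by=memory --no-headers | head -n "$count"
}
```

Usage: `kaos_top_memory my-namespace 3`. Running it periodically shows whether an agent pod's memory grows steadily, which would be consistent with sessions accumulating.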