# ModelAPI CRD

The ModelAPI custom resource provides LLM access for agents, either by proxying requests to external services or by hosting models in-cluster.
## Full Specification

```yaml
apiVersion: kaos.tools/v1alpha1
kind: ModelAPI
metadata:
  name: my-modelapi
  namespace: my-namespace
spec:
  # Required: Deployment mode
  mode: Proxy # or Hosted

  # For Proxy mode: LiteLLM configuration
  proxyConfig:
    # Backend API URL (optional - enables wildcard mode if set without model)
    apiBase: "http://host.docker.internal:11434"

    # Specific model (optional - enables single model mode)
    model: "ollama/smollm2:135m"

    # Full config YAML (optional - for advanced multi-model routing)
    configYaml:
      fromString: |
        model_list:
          - model_name: "*"
            litellm_params:
              model: "ollama/*"
              api_base: "http://host.docker.internal:11434"
      # Or load from secret:
      # fromSecretKeyRef:
      #   name: litellm-config
      #   key: config.yaml

    # Environment variables
    env:
      - name: OPENAI_API_KEY
        valueFrom:
          secretKeyRef:
            name: api-secrets
            key: openai-key

  # For Hosted mode: Ollama configuration
  hostedConfig:
    # Model to pull and serve (loaded in an initContainer)
    model: "smollm2:135m"

    # Environment variables
    env:
      - name: OLLAMA_DEBUG
        value: "false"

  # Optional: PodSpec override using strategic merge patch
  podSpec:
    containers:
      - name: model-api # Must match generated container name
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "8Gi"
            cpu: "4000m"

status:
  phase: Ready # Pending, Ready, Failed
  ready: true
  endpoint: "http://modelapi-my-modelapi.my-namespace.svc.cluster.local:8000"
  message: ""
```

## Modes
### Proxy Mode

Uses LiteLLM to proxy requests to external LLM backends.

- Container: `litellm/litellm:latest`
- Port: 8000
#### Wildcard Mode (Recommended for Development)

Proxies any model to the backend (set `apiBase` without `model`):

```yaml
spec:
  mode: Proxy
  proxyConfig:
    apiBase: "http://host.docker.internal:11434"
    # No model specified = wildcard
```

Agents can request any model, for example:

- `ollama/smollm2:135m`
- `ollama/llama2`
- `ollama/mistral`
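
As a sketch of how an agent (or a quick test) would hit this proxy from inside the cluster, assuming the `modelapi-<name>` Service naming shown in the status example and LiteLLM's OpenAI-compatible `/v1/chat/completions` route:

```bash
# Illustrative request to a wildcard ModelAPI named "my-modelapi" in "my-namespace";
# the model string is passed straight through to the Ollama backend.
curl -s http://modelapi-my-modelapi.my-namespace.svc.cluster.local:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ollama/smollm2:135m",
        "messages": [{"role": "user", "content": "Say hello"}]
      }'
```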
#### Mock Mode (For Testing)

Set `model` without `apiBase` for mock testing:

```yaml
spec:
  mode: Proxy
  proxyConfig:
    model: "gpt-3.5-turbo" # Model name only, no backend
```

Supports `mock_response` in the request body for deterministic tests.
#### Config File Mode (Advanced)

Full control over LiteLLM configuration:

```yaml
spec:
  mode: Proxy
  proxyConfig:
    configYaml:
      fromString: |
        model_list:
          - model_name: "gpt-4"
            litellm_params:
              model: "azure/gpt-4"
              api_base: "https://my-azure.openai.azure.com"
              api_key: "os.environ/AZURE_API_KEY"
          - model_name: "claude"
            litellm_params:
              model: "claude-3-sonnet-20240229"
              api_key: "os.environ/ANTHROPIC_API_KEY"
    env:
      - name: AZURE_API_KEY
        valueFrom:
          secretKeyRef:
            name: llm-secrets
            key: azure-key
      - name: ANTHROPIC_API_KEY
        valueFrom:
          secretKeyRef:
            name: llm-secrets
            key: anthropic-key
```

### Hosted Mode
Runs Ollama in-cluster with the specified model.

- Container: `ollama/ollama:latest`
- Port: 11434

```yaml
spec:
  mode: Hosted
  hostedConfig:
    model: "smollm2:135m"
```

How it works:
- An init container starts Ollama, pulls the specified model, then exits
- The model is stored in a shared volume
- The main Ollama container starts with the model already available
- First pod startup may take 1-2 minutes depending on model size (see the commands below to follow progress)
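
To follow that first startup, something like the following should work, reusing the `modelapi=<name>` pod label and `pull-model` init container named in the Troubleshooting section (the resource name `ollama` is illustrative):

```bash
# Stream the init container that pulls the model
kubectl logs -f -l modelapi=ollama -n my-namespace -c pull-model

# Block until the serving pod is Ready (model pulled and Ollama started)
kubectl wait pod -l modelapi=ollama -n my-namespace --for=condition=Ready --timeout=10m
```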
## Spec Fields

### mode (required)

| Value | Description |
|---|---|
| Proxy | LiteLLM proxy to external backend |
| Hosted | Ollama running in-cluster |
### proxyConfig (for Proxy mode)

#### proxyConfig.apiBase

Backend LLM API URL (optional):

```yaml
proxyConfig:
  apiBase: "http://host.docker.internal:11434" # Docker Desktop
  # apiBase: "http://ollama.ollama.svc:11434"  # In-cluster Ollama
  # apiBase: "https://api.openai.com"          # OpenAI
```

When set without `model`, enables wildcard mode.
#### proxyConfig.model

Specific model to proxy (optional):

```yaml
proxyConfig:
  apiBase: "http://localhost:11434"
  model: "ollama/smollm2:135m"
```

When set without `apiBase`, enables mock testing mode.
#### proxyConfig.configYaml

Full LiteLLM configuration:

```yaml
proxyConfig:
  configYaml:
    fromString: |
      model_list:
        - model_name: "*"
          litellm_params:
            model: "ollama/*"
            api_base: "http://ollama:11434"
    # Or from secret:
    # fromSecretKeyRef:
    #   name: litellm-config
    #   key: config.yaml
```

When provided, `apiBase` and `model` are ignored.
#### proxyConfig.env

Environment variables for the LiteLLM container:

```yaml
proxyConfig:
  env:
    - name: OPENAI_API_KEY
      valueFrom:
        secretKeyRef:
          name: secrets
          key: openai
```

### hostedConfig (for Hosted mode)
#### hostedConfig.model

Ollama model to pull and serve:

```yaml
hostedConfig:
  model: "smollm2:135m"
  # model: "llama2"
  # model: "mistral"
```

#### hostedConfig.env
Environment variables for Ollama:

```yaml
hostedConfig:
  env:
    - name: OLLAMA_DEBUG
      value: "true"
```

### podSpec (optional)
Override the generated pod spec using Kubernetes strategic merge patch:

```yaml
spec:
  podSpec:
    containers:
      - name: model-api # Must match the generated container name
        resources:
          requests:
            memory: "4Gi"
            cpu: "2000m"
          limits:
            memory: "16Gi"
            cpu: "8000m"
            nvidia.com/gpu: "1" # For GPU acceleration
```

### gatewayRoute (optional)
Configure Gateway API routing, including request timeout:

```yaml
spec:
  gatewayRoute:
    # Request timeout for the HTTPRoute (Gateway API Duration format)
    # Default: "120s" for ModelAPI, "120s" for Agent, "30s" for MCPServer
    # Set to "0s" to use Gateway's default timeout
    timeout: "120s"
```

This is especially useful for LLM inference, which can take longer than typical HTTP timeouts:

```yaml
apiVersion: kaos.tools/v1alpha1
kind: ModelAPI
metadata:
  name: ollama-proxy
spec:
  mode: Proxy
  proxyConfig:
    apiBase: "http://ollama.default:11434"
  gatewayRoute:
    timeout: "5m" # 5 minutes for slow inference
```

## Status Fields
| Field | Type | Description |
|---|---|---|
| phase | string | Current phase: Pending, Ready, Failed |
| ready | bool | Whether the ModelAPI is ready |
| endpoint | string | Service URL for agents |
| message | string | Additional status info |
| deployment | object | Deployment status for rolling update visibility |
### deployment (status)

Mirrors key status fields from the underlying Kubernetes Deployment (see the kubectl example after the table):

| Field | Type | Description |
|---|---|---|
| replicas | int32 | Total number of non-terminated pods |
| readyReplicas | int32 | Number of pods with the Ready condition |
| availableReplicas | int32 | Number of available pods |
| updatedReplicas | int32 | Number of pods with the desired template (rolling update progress) |
| conditions | array | Deployment conditions (Available, Progressing, ReplicaFailure) |
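
A quick way to read these fields with kubectl, assuming the resource can be addressed as `modelapi` (adjust the name and namespace to yours):

```bash
# Overall phase, readiness, and the endpoint agents should use
kubectl get modelapi my-modelapi -n my-namespace \
  -o jsonpath='{.status.phase} {.status.ready} {.status.endpoint}{"\n"}'

# Rolling-update progress mirrored from the Deployment
kubectl get modelapi my-modelapi -n my-namespace \
  -o jsonpath='{.status.deployment.readyReplicas}/{.status.deployment.replicas} replicas ready{"\n"}'
```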
## Examples

### Local Development with Host Ollama

```yaml
apiVersion: kaos.tools/v1alpha1
kind: ModelAPI
metadata:
  name: dev-ollama
spec:
  mode: Proxy
  proxyConfig:
    apiBase: "http://host.docker.internal:11434"
```

### In-Cluster Ollama
```yaml
apiVersion: kaos.tools/v1alpha1
kind: ModelAPI
metadata:
  name: ollama
spec:
  mode: Hosted
  hostedConfig:
    model: "smollm2:135m"
  podSpec:
    containers:
      - name: model-api
        resources:
          requests:
            memory: "2Gi"
```

### Mock Testing Mode
```yaml
apiVersion: kaos.tools/v1alpha1
kind: ModelAPI
metadata:
  name: mock-api
spec:
  mode: Proxy
  proxyConfig:
    model: "gpt-3.5-turbo"
```

### OpenAI Proxy
```yaml
apiVersion: kaos.tools/v1alpha1
kind: ModelAPI
metadata:
  name: openai
spec:
  mode: Proxy
  proxyConfig:
    apiBase: "https://api.openai.com"
    model: "gpt-4o-mini"
    env:
      - name: OPENAI_API_KEY
        valueFrom:
          secretKeyRef:
            name: openai-secrets
            key: api-key
```

### Multi-Model Routing
```yaml
apiVersion: kaos.tools/v1alpha1
kind: ModelAPI
metadata:
  name: multi-model
spec:
  mode: Proxy
  proxyConfig:
    configYaml:
      fromString: |
        model_list:
          - model_name: "fast"
            litellm_params:
              model: "ollama/smollm2:135m"
              api_base: "http://ollama:11434"
          - model_name: "smart"
            litellm_params:
              model: "gpt-4o"
              api_key: "os.environ/OPENAI_API_KEY"
    env:
      - name: OPENAI_API_KEY
        valueFrom:
          secretKeyRef:
            name: secrets
            key: openai
```
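
Callers then pick a backend by its `model_name` alias. An illustrative request, assuming the `modelapi-<name>` Service naming and the OpenAI-compatible route used elsewhere on this page (the namespace is illustrative):

```bash
# "fast" routes to the local Ollama model; swap in "smart" to route to GPT-4o
curl -s http://modelapi-multi-model.my-namespace.svc.cluster.local:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "fast", "messages": [{"role": "user", "content": "Summarize our agenda"}]}'
```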
## Troubleshooting

### ModelAPI Stuck in Pending

Check pod status:

```bash
kubectl get pods -l modelapi=my-modelapi -n my-namespace
kubectl describe pod -l modelapi=my-modelapi -n my-namespace
```

Common causes (the events check below often points to one of these):
- Image pull errors
- Resource constraints
- For Hosted: Model download in progress
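
Image pull and scheduling failures usually surface as pod events; one way to check:

```bash
# Show the most recent events in the namespace
kubectl get events -n my-namespace --sort-by=.lastTimestamp | tail -n 20
```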
### Connection Errors from Agent

Verify the endpoint is accessible:

```bash
kubectl exec -it deploy/agent-my-agent -n my-namespace -- \
  curl http://modelapi-my-modelapi:8000/health
```

### Model Not Available (Hosted Mode)
Check if the model is still downloading:

```bash
kubectl logs -l modelapi=my-modelapi -n my-namespace -c pull-model
```

The model is pulled on startup; large models can take 10+ minutes.