Deploy a containerd runtime shim for serving AI models with Ollama.
Create a kind cluster with the ollama shim:
make kind
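Before deploying anything, it can help to confirm that the node is ready and that the setup registered a runtime class for the shim (the exact class name depends on this repo's manifests, so list them rather than assuming one):
kubectl get nodes
kubectl get runtimeclass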
Deploy your model using the ollama-shim runtime class:
kubectl create namespace ai-models
kubectl apply -n ai-models -f ./manifests/models/qwen2-model.yaml
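For reference, the way a manifest opts a workload into the shim is the runtimeClassName field in its pod spec. The snippet below is a minimal sketch (validated client-side only), not the contents of qwen2-model.yaml; the class name, image, and port are assumptions, so check the manifest in this repository for the real values.
kubectl apply --dry-run=client -n ai-models -f - <<'EOF'
# Minimal sketch of a model deployment that opts into the shim via
# runtimeClassName. The class name "ollama" and the image are assumptions;
# see ./manifests/models/qwen2-model.yaml for the values this repo uses.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-model
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-model
  template:
    metadata:
      labels:
        app: my-model
    spec:
      runtimeClassName: ollama   # assumption: routes the pod to the ollama shim
      containers:
        - name: model
          image: ollama/qwen2:latest   # assumption: image layers carry the GGUF weights
          ports:
            - containerPort: 11434     # Ollama's default API port
EOF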
Verify that the model is running:
kubectl wait --for=condition=available -n ai-models deployment/qwen2 --timeout=1m
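If the wait times out, the deployment status and pod events usually point at the cause (for example an image pull failure or a missing runtime class):
kubectl get pods -n ai-models
kubectl describe deployment -n ai-models qwen2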
Port-forward the model service to your local machine:
kubectl port-forward -n ai-models svc/qwen2 8080:80
Ask the model some questions from your local machine (in a new terminal):
curl http://localhost:8080/api/generate -d '{
  "model": "qwen2:latest",
  "prompt": "What is the Kubecon?",
  "stream": false
}' | jq -r '.response'
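If the shim exposes the full Ollama HTTP API (an assumption here; only /api/generate is shown above), the chat endpoint works the same way:
curl http://localhost:8080/api/chat -d '{
  "model": "qwen2:latest",
  "messages": [
    {"role": "user", "content": "What is a containerd shim?"}
  ],
  "stream": false
}' | jq -r '.message.content'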
Connect to the kind node:
docker exec -it ollama-shim-control-plane bash
Inspect the logs of the containerd runtime to see the ollama shim in action:
journalctl -f -u containerd
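The stream is noisy; to narrow it to shim-related entries you can filter it (assuming the shim's log lines mention ollama, which is a guess — adjust the pattern to whatever shows up in the full stream):
journalctl -u containerd | grep -i ollama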
Find the model weights in the containerd image snapshots:
find /var/lib/containerd/ -name '*.gguf'
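The next step needs that path; one way to capture it in the variable the command below expects (assuming a single model is present on the node):
model=$(find /var/lib/containerd/ -name '*.gguf' | head -n 1)
echo "${model}"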
Start the model locally on the node:
ollama runner --port 8080 --ctx-size 8192 --model ${model}
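From a second shell on the node you can check that the runner came up; the /health endpoint is an assumption based on the llama.cpp-style server the runner embeds, so fall back to checking the listening socket if it returns 404:
curl http://localhost:8080/health
ss -ltnp | grep 8080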
Delete the cluster to clean everything up:
make clean
- What is a shim?
- Containerd quickstart
- Containerd runtime documentation
- Kind example
- Digging into runc