Install with Gateway API Inference Extension
This guide provides step-by-step instructions for integrating the vLLM Semantic Router (vSR) with Istio and the Kubernetes Gateway API Inference Extension (GIE). This powerful combination allows you to manage self-hosted, OpenAI-compatible models using Kubernetes-native APIs for advanced, load-aware routing.
Architecture Overview
The deployment consists of three main components:
- vLLM Semantic Router: The brain that classifies incoming requests based on their content.
- Istio & Gateway API: The network mesh and the front door for all traffic entering the cluster.
- Gateway API Inference Extension (GIE): A set of Kubernetes-native APIs (InferencePool, etc.) for managing and scaling self-hosted model backends.
Benefits of Integration
Integrating vSR with Istio and GIE provides a robust, Kubernetes-native solution for serving LLMs with several key benefits:
1. Kubernetes-Native LLM Management
Manage your models, routing, and scaling policies directly through kubectl using familiar Custom Resource Definitions (CRDs).
2. Intelligent Model and Replica Routing
Combine vSR's prompt-based model routing with GIE's smart, load-aware replica selection. This ensures requests are sent not only to the right model but also to the healthiest replica, all in a single, efficient hop.
3. Protect Your Models from Overload
The built-in scheduler tracks GPU load and request queues, automatically shedding traffic to prevent your model servers from crashing under high demand.
4. Deep Observability
Gain insights from both high-level Gateway metrics and detailed vSR performance data (like token usage and classification accuracy) to monitor and troubleshoot your entire AI stack.
5. Secure Multi-Tenancy
Isolate tenant workloads using standard Kubernetes namespaces and HTTPRoutes. Apply rate limits and other policies while sharing a common, secure gateway infrastructure.
Supported Backend Models
This architecture is designed to work with any self-hosted model that exposes an OpenAI-compatible API. The demo models in this guide use vLLM to serve Llama3 and Phi-4, but you can easily replace them with your own model servers.
Prerequisites
Before starting, ensure you have the following tools installed:
- Docker or another container runtime.
- kind v0.22+ or any Kubernetes 1.29+ cluster.
- kubectl v1.30+.
- Helm v3.14+.
- istioctl v1.28+.
- A Hugging Face token stored in the HF_TOKEN environment variable, required for the sample vLLM deployments to download models.
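If you haven't exported the token yet, set it now. The value below is a placeholder; substitute your own token:
export HF_TOKEN=<your-hugging-face-token>   # placeholder, replace with your real token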
You can validate your toolchain versions with the following commands:
kind version
kubectl version --client
helm version --short
istioctl version --remote=false
Step 1: Create a Kind Cluster (Optional)
If you don't have a Kubernetes cluster, create a local one for testing:
kind create cluster --name vsr-gie
# Verify the cluster is ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s
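If you created the cluster with kind, you can also confirm that kubectl is pointing at it (kind names the context kind-<cluster-name>):
# Should print the control-plane endpoint for the new cluster
kubectl cluster-info --context kind-vsr-gie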
Step 2: Install Istio
Install Istio with support for the Gateway API and external processing:
# Download and install Istio
export ISTIO_VERSION=1.29.0
curl -L https://istio.io/downloadIstio | ISTIO_VERSION=$ISTIO_VERSION sh -
export PATH="$PWD/istio-$ISTIO_VERSION/bin:$PATH"
istioctl install -y --set profile=minimal --set values.pilot.env.ENABLE_GATEWAY_API=true
# Verify Istio is ready
kubectl wait --for=condition=Available deployment/istiod -n istio-system --timeout=300s
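Optionally, list the control-plane pods; with the minimal profile, only istiod should be running:
kubectl get pods -n istio-system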
Step 3: Install Gateway API & GIE CRDs
Install the Custom Resource Definitions (CRDs) for the standard Gateway API and the Inference Extension:
# Install Gateway API CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.0/standard-install.yaml
# Install Gateway API Inference Extension CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.1.0/manifests.yaml
# Verify CRDs are installed
kubectl get crd | grep 'gateway.networking.k8s.io'
kubectl get crd | grep 'inference.networking.k8s.io'
Step 4: Deploy Demo LLM Servers
Deploy two vLLM instances (Llama3 and Phi-4) to act as our backends. The model weights are downloaded automatically from Hugging Face.
# Create namespace and secret for the models
kubectl create namespace llm-backends --dry-run=client -o yaml | kubectl apply -f -
kubectl -n llm-backends create secret generic hf-token --from-literal=token=$HF_TOKEN
# Deploy the model servers
kubectl -n llm-backends apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/vLlama3.yaml
kubectl -n llm-backends apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/vPhi4.yaml
# Wait for models to be ready (this may take several minutes)
kubectl -n llm-backends wait --for=condition=Ready pods --all --timeout=10m
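The demo manifests above live in the semantic-router repository. If you later want to swap in your own backend, a vLLM server typically follows the general shape below. This is an illustrative sketch only, not the contents of the files above: the names, labels, model, and resource requests are placeholders to adapt, and the client-side dry run just validates the manifest without creating anything.
# Illustrative only: validate an example backend manifest without creating it
kubectl -n llm-backends apply --dry-run=client -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-vllm-model                           # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-vllm-model
  template:
    metadata:
      labels:
        app: my-vllm-model
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest        # vLLM's OpenAI-compatible server image
          args: ["--model", "meta-llama/Meta-Llama-3-8B-Instruct", "--port", "8000"]
          env:
            - name: HF_TOKEN                    # token for downloading gated models
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: my-vllm-model
spec:
  selector:
    app: my-vllm-model
  ports:
    - port: 8000
      targetPort: 8000
EOF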
Step 5: Deploy vLLM Semantic Router
Deploy the vLLM Semantic Router using its official Helm chart. This component runs as an Envoy external processing (ext_proc) server that the gateway calls for routing decisions.
helm upgrade -i semantic-router oci://ghcr.io/vllm-project/charts/semantic-router \
--version v0.0.0-latest \
--namespace vllm-semantic-router-system \
--create-namespace \
-f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/ai-gateway/semantic-router-values/values.yaml
# Wait for the router to be ready
kubectl -n vllm-semantic-router-system wait --for=condition=Available deploy/semantic-router --timeout=10m
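You can also confirm the router pod is up and skim its startup logs (the namespace and deployment name come from the Helm release above):
kubectl -n vllm-semantic-router-system get pods
kubectl -n vllm-semantic-router-system logs deploy/semantic-router --tail=20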
Step 6: Deploy Gateway and Routing Logic
Apply the final set of resources to create the public-facing Gateway and wire everything together. This includes the Gateway, the InferencePools for GIE, the HTTPRoutes for traffic matching, a DestinationRule, and Istio's EnvoyFilter for the ext_proc hook.
# Apply all routing and gateway resources
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/gateway.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/inferencepool-llama.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/inferencepool-phi4.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/httproute-llama-pool.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/httproute-phi4-pool.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/destinationrule.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/envoyfilter.yaml
# Verify the Gateway is programmed by Istio
kubectl wait --for=condition=Programmed gateway/inference-gateway --timeout=120s
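Conceptually, each HTTPRoute applied above matches on the x-selected-model header that vSR injects and forwards the request to the corresponding GIE InferencePool. The snippet below is only a sketch of that shape (the route and pool names are made up, not taken from the manifests), validated with a client-side dry run so nothing is created:
# Illustrative only: not the contents of the applied manifests
kubectl apply --dry-run=client -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-phi4-route            # placeholder name
spec:
  parentRefs:
    - name: inference-gateway         # the Gateway created in this step
  rules:
    - matches:
        - headers:
            - name: x-selected-model  # header set by the semantic router
              value: phi4-mini
      backendRefs:
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: phi4-mini-pool        # placeholder pool name
EOF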
Testing the Deployment
Method 1: Port Forwarding
Set up port forwarding to access the gateway from your local machine.
# The Gateway service is named 'inference-gateway-istio' and lives in the default namespace
kubectl port-forward svc/inference-gateway-istio 8080:80
Send Test Requests
Once port forwarding is active, you can send OpenAI-compatible requests to localhost:8080.
Test 1: Explicitly request a model
This request bypasses the semantic router's logic and goes directly to the specified model pool.
curl -sS http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llama3-8b",
"messages": [{"role": "user", "content": "Summarize the Kubernetes Gateway API in three sentences."}]
}'
Test 2: Let the Semantic Router choose the model
By setting "model": "auto", you ask vSR to classify the prompt. It will identify this as a "math" query and add the x-selected-model: phi4-mini header, which the HTTPRoute rules then use to route the request to the matching backend pool.
curl -sS http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "auto",
"messages": [{"role": "user", "content": "What is 2+2 * (5-1)?"}],
"max_tokens": 64
}'
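To confirm which backend actually handled the auto-routed request, you can read the model field echoed back in the OpenAI-compatible response (requires jq); if routing worked, it should report the Phi-4 model rather than auto:
curl -sS http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is 2+2 * (5-1)?"}],
    "max_tokens": 64
  }' | jq -r '.model'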
Troubleshooting
Problem: CRDs are missing
If you see errors like no matches for kind "InferencePool", check that the CRDs are installed.
# Check for GIE CRDs
kubectl get crd | grep inference.networking.k8s.io
Problem: Gateway is not ready
If kubectl port-forward fails or requests time out, check the Gateway status.
# The "Programmed" condition should be "True"
kubectl get gateway inference-gateway -o yaml
Problem: vSR is not being called
If requests work but routing seems incorrect, check the Istio proxy logs for ext_proc errors.
# Get the auto-deployed gateway pod for the inference-gateway Gateway (it runs in the default namespace)
export ISTIO_GW_POD=$(kubectl get pod -l gateway.networking.k8s.io/gateway-name=inference-gateway -o jsonpath='{.items[0].metadata.name}')
# Check its logs
kubectl logs $ISTIO_GW_POD -c istio-proxy | grep ext_proc
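It is also worth confirming that the EnvoyFilter that wires in the ext_proc hook actually exists:
kubectl get envoyfilter -A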
Problem: Requests are failing
Check the logs for the vLLM Semantic Router and the backend models.
# Check vSR logs
kubectl logs deploy/semantic-router -n vllm-semantic-router-system
# Check Llama3 backend logs
kubectl logs -n llm-backends -l app=vllm-llama3-8b-instruct
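If the backend pods never become Ready (for example, because no GPU is schedulable), recent events in the namespace usually explain why:
kubectl get events -n llm-backends --sort-by=.lastTimestamp | tail -n 20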
Cleanup
To remove all the resources created in this guide, run the following commands.
# 1. Delete all applied Kubernetes resources
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/gateway.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/inferencepool-llama.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/inferencepool-phi4.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/httproute-llama-pool.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/httproute-phi4-pool.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/destinationrule.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/envoyfilter.yaml
kubectl delete ns llm-backends
# 2. Uninstall Helm releases
helm uninstall semantic-router -n vllm-semantic-router-system
# 3. Uninstall Istio
istioctl uninstall -y --purge
# 4. Delete the kind cluster (Optional)
kind delete cluster --name vsr-gie
Next Steps
- Customize Routing: Modify the values.yaml file for the semantic-router Helm chart to define your own routing categories and rules.
- Add Your Own Models: Replace the demo Llama3 and Phi-4 deployments with your own OpenAI-compatible model servers.
- Explore Advanced GIE Features: Look into using InferenceObjective for more advanced autoscaling and scheduling policies.
- Monitor Performance: Integrate your Gateway and vSR with Prometheus and Grafana to build monitoring dashboards.