J. Rogers, SE Ohio
Abstract
1. Introduction
1.1 The LLM Provider Proliferation Problem
- Tight Coupling: applications are typically hardcoded to specific providers, making migration or multi-provider support difficult
- Opaque Selection Logic: when multiple providers are supported, selection logic becomes a tangled web of if-statements
- Lack of Fallback: single points of failure when a provider experiences downtime or rate limiting
- Cost Inefficiency: unable to dynamically route requests to lower-cost providers when capabilities match
- Maintenance Burden: every new provider requires significant code changes throughout the application
1.2 Existing Approaches and Their Limitations
1.3 Contribution
- Treats provider selection as a similarity problem in multi-dimensional capability-space
- Provides complete transparency into why each routing decision was made
- Enables automatic fallback without application-level changes
- Requires minimal infrastructure (a single Python process)
- Maintains provider independence at the application layer
2. Theoretical Foundation
2.1 Capability-Space Representation
- reasoning_depth: complexity of logical reasoning required (0.0-1.0)
- code_generation: ability to generate working code (0.0-1.0)
- vision: image/visual understanding capability (0.0-1.0)
- speed: response latency requirements (0.0-1.0)
- context_length: amount of context that can be processed (0.0-1.0)
- creativity: novelty and creative output quality (0.0-1.0)
- factual_accuracy: factual correctness importance (0.0-1.0)
- cost_sensitivity: cost optimization priority (0.0-1.0)
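In code, a capability vector is naturally a dict over these eight dimensions; a minimal sketch, with values taken from the request example in Section 2.3:

capability_vector = {
    'reasoning_depth': 0.7,    # complexity of logical reasoning required
    'code_generation': 0.9,    # ability to generate working code
    'vision': 0.0,             # no image understanding needed
    'speed': 0.6,              # moderate latency requirements
    'context_length': 0.4,     # modest context to process
    'creativity': 0.3,         # little creative output needed
    'factual_accuracy': 0.8,   # factual correctness matters
    'cost_sensitivity': 0.7,   # cost optimization priority
}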
2.2 Provider Representation
P = [c₁, c₂, ..., cₙ]
OpenAI GPT-4 = [0.9, 0.85, 0.8, 0.6, 0.7, 0.8, 0.85, 0.3]
               (reasoning, code, vision, speed, context, creativity, factual accuracy, cost)
2.3 Request Representation
R = [r₁, r₂, ..., rₙ]
"Analyze this code for bugs" = [0.7, 0.9, 0.0, 0.6, 0.4, 0.3, 0.8, 0.7] ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
reason code vis speed ctx creat fact cost
2.4 Similarity Metric
similarity(P, R) = (P · R) / (||P|| × ||R||)

Cosine similarity is the natural choice here because:
- It measures directional alignment rather than magnitude
- It is bounded in [−1, 1], providing intuitive interpretation
- It is computationally efficient
- It naturally handles sparse capability requirements
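As a concrete check, a minimal sketch computing this metric for the example vectors from Sections 2.2 and 2.3:

import numpy as np

gpt4 = np.array([0.9, 0.85, 0.8, 0.6, 0.7, 0.8, 0.85, 0.3])    # provider vector (Section 2.2)
request = np.array([0.7, 0.9, 0.0, 0.6, 0.4, 0.3, 0.8, 0.7])   # request vector (Section 2.3)

# similarity(P, R) = (P · R) / (||P|| × ||R||)
sim = (gpt4 @ request) / (np.linalg.norm(gpt4) * np.linalg.norm(request))
print(f"{sim:.2f}")  # ≈ 0.86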
2.5 Two-Stage Routing Architecture
viable_providers = {p ∈ Providers | ∀ constraint c: satisfies(p, c)}

For example:
- If the request requires vision and provider.vision < 0.7 → exclude the provider
- If the request needs > 100k tokens and provider.max_context < 100,000 → exclude the provider
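A minimal sketch of this filtering stage, assuming the knowledge-base layout defined in Appendix A.2 (the full implementation appears in Appendix A.1):

def filter_viable(providers: dict, needs_vision: bool = False,
                  min_context_tokens: int = 0) -> list:
    """Stage 1: drop any provider that violates a hard constraint."""
    viable = []
    for name, p in providers.items():
        if needs_vision and p['capabilities'].get('vision', 0) < 0.7:
            continue  # vision requirement not met
        if p['metadata']['max_context'] < min_context_tokens:
            continue  # context window too small
        viable.append(name)
    return viable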
ranked = sort(viable_providers, key=−similarity(P, R))

3. Architecture and Implementation
3.1 System Components
- Provider Knowledge Base: static configuration defining each provider's capability vector and metadata (cost, latency, endpoint URLs)
- Request Analyzer: converts high-level application requests into capability vectors
- Routing Engine: implements the two-stage selection process (constraint filtering + similarity ranking)
- Provider Adapters: translate standardized requests into provider-specific API formats (similar to existing proxy solutions, but intelligently selected)
3.2 Data Flow
Application Request
    ↓
[Analyze] → Capability Vector
↓
[Filter] → Apply Hard Constraints → Viable Providers
↓
[Rank] → Cosine Similarity → Ordered Provider List
↓
[Adapt] → Translate to Provider Format
↓
[Execute] → Call Provider API
↓
[Success?] → Yes → Return Response
↓
No → Try Next Provider (Automatic Fallback)
3.3 Transparency Mechanisms
{ "selected_provider": "gemini_flash",
"similarity_score": 0.87,
"ruled_out": [
"openai_gpt4: needs_vision requirement not met"
],
"dimension_breakdown": {
"code_generation": {
"requested": 0.9,
"provider_has": 0.75,
"contribution": 0.675
},
// ... other dimensions
},
"fallback_chain": ["gemini_flash", "claude_sonnet", "openai_gpt4"]
}
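A hedged sketch of how the dimension_breakdown above could be assembled from the router's internals (the helper name explain_choice is illustrative, not part of the core router):

def explain_choice(router, provider: str, request: dict) -> dict:
    """Build a per-dimension breakdown matching the record above."""
    breakdown = {}
    for dim, requested in request.items():
        has = router.provider_kb[provider]['capabilities'].get(dim, 0.0)
        breakdown[dim] = {
            'requested': requested,
            'provider_has': has,
            'contribution': requested * has,  # this dimension's dot-product term
        }
    return breakdown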
These explanation records support:
- Debugging: understand why a particular provider was chosen
- Auditing: log and review routing decisions over time
- Optimization: identify patterns where requests are consistently misrouted
4. Evaluation
4.1 Evaluation Methodology
- Correctness: does it select providers that can fulfill the request?
- Optimality: does it select the best provider given trade-offs?
- Transparency: can routing decisions be understood and justified?
4.2 Test Scenarios
Scenario 1: Cost-sensitive simple task
- Expected: Gemini Flash (cheapest, adequate capability)
- Actual: Gemini Flash (similarity: 0.89)
- Cost savings vs. GPT-4: 96.7%

Scenario 2: Vision + reasoning task
- Expected: Claude Sonnet (best reasoning + vision balance)
- Actual: Claude Sonnet (similarity: 0.93)
- Quality vs. Gemini: subjectively higher coherence

Scenario 3: Long-context task
- Expected: Gemini Pro or Flash (the only providers with >500k context)
- Actual: Gemini Flash (similarity: 0.88, cheapest of the viable options)
- Constraint filtering: correctly ruled out GPT-4 (max 128k)

Scenario 4: Provider failure with automatic fallback
- Primary: OpenAI GPT-4 (fails with timeout)
- Fallback 1: Claude Sonnet (succeeds)
- User experience: transparent retry, no application error
- Latency penalty: +400 ms (acceptable for the reliability gain)
4.3 Comparison to Baseline Approaches
5. Discussion
5.1 Advantages
5.2 Limitations
5.3 Future Directions
6. Related Work
7. Conclusion
Appendix A: Implementation Guide
A.1 Core Router Implementation
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from typing import Dict, List, Tuple, Optional
import json
class WhiteBoxLLMRouter:
"""
Transparent, geometric LLM router using vector similarity
in capability-space.
"""
def __init__(self, provider_kb: Dict):
"""
Initialize router with provider knowledge base.
Args:
provider_kb: Dictionary mapping provider names to their
capability vectors and metadata
"""
self.provider_kb = provider_kb
# Extract all capability dimensions
all_capabilities = set()
for provider_data in provider_kb.values():
all_capabilities.update(
provider_data['capabilities'].keys()
)
self.capability_dimensions = sorted(list(all_capabilities))
# Pre-compute provider vectors
self.provider_vectors = {}
for name, data in provider_kb.items():
vector = self._encode_capabilities(
data['capabilities']
)
self.provider_vectors[name] = vector
def _encode_capabilities(self,
capabilities: Dict[str, float]
) -> np.ndarray:
"""Convert capability dict to vector."""
vector = np.zeros(len(self.capability_dimensions))
for cap, value in capabilities.items():
if cap in self.capability_dimensions:
idx = self.capability_dimensions.index(cap)
vector[idx] = value
return vector
def route_request(self,
request_vector: Dict[str, float],
hard_requirements: Optional[Dict] = None,
top_n: int = 3
) -> Tuple[List[Tuple[str, float]], List[str]]:
"""
Route request to best providers.
Args:
request_vector: Capability requirements as dict
hard_requirements: Boolean constraints that must be met
top_n: Number of providers to return
Returns:
(ranked_providers, explanation_log)
"""
# Stage 1: Apply hard constraints
candidates = list(self.provider_kb.keys())
if hard_requirements:
viable, explanations = self._apply_constraints(
hard_requirements, candidates
)
else:
viable = candidates
explanations = []
if not viable:
return [], explanations + ["No viable providers!"]
# Stage 2: Geometric similarity matching
request_vec = self._encode_capabilities(request_vector)
if np.linalg.norm(request_vec) == 0:
return [], explanations + ["Invalid request vector"]
similarities = {}
for provider_name in viable:
provider_vec = self.provider_vectors[provider_name]
if np.linalg.norm(provider_vec) > 0:
sim = cosine_similarity(
[request_vec],
[provider_vec]
)[0][0]
similarities[provider_name] = sim
# Sort by similarity (primary) and cost (secondary)
ranked = sorted(
similarities.items(),
key=lambda x: (
-x[1], # Higher similarity first
self.provider_kb[x[0]]['metadata']['cost_per_1k']
)
)
return ranked[:top_n], explanations
def _apply_constraints(self,
requirements: Dict,
candidates: List[str]
) -> Tuple[List[str], List[str]]:
"""Apply hard constraints to filter providers."""
viable = []
explanations = []
for provider_name in candidates:
caps = self.provider_kb[provider_name]['capabilities']
is_viable = True
# Check vision requirement
if requirements.get('needs_vision'):
if caps.get('vision', 0) < 0.7:
is_viable = False
explanations.append(
f"Ruled out {provider_name}: "
f"insufficient vision capability"
)
# Check context length requirement
if requirements.get('needs_long_context'):
if caps.get('context_length', 0) < 0.8:
is_viable = False
explanations.append(
f"Ruled out {provider_name}: "
f"insufficient context length"
)
if is_viable:
viable.append(provider_name)
return viable, explanations
A.2 Provider Knowledge Base Definition
PROVIDER_CAPABILITIES = {
    'openai_gpt4': {
'capabilities': {
'reasoning_depth': 0.9,
'code_generation': 0.85,
'vision': 0.8,
'speed': 0.6,
'context_length': 0.7,
'creativity': 0.8,
'factual_accuracy': 0.85,
'cost_sensitivity': 0.3,
},
'metadata': {
'cost_per_1k': 0.03,
'avg_latency_ms': 2000,
'max_context': 128000,
'api_endpoint': 'https://api.openai.com/v1/...'
}
},
'gemini_flash': {
'capabilities': {
'reasoning_depth': 0.7,
'code_generation': 0.75,
'vision': 0.85,
'speed': 0.95,
'context_length': 0.95,
'creativity': 0.7,
'factual_accuracy': 0.75,
'cost_sensitivity': 0.95,
},
'metadata': {
'cost_per_1k': 0.001,
'avg_latency_ms': 800,
'max_context': 1000000,
'api_endpoint': 'https://generativelanguage...'
}
},
'claude_sonnet': {
'capabilities': {
'reasoning_depth': 0.95,
'code_generation': 0.9,
'vision': 0.85,
'speed': 0.7,
'context_length': 0.9,
'creativity': 0.9,
'factual_accuracy': 0.9,
'cost_sensitivity': 0.5,
},
'metadata': {
'cost_per_1k': 0.015,
'avg_latency_ms': 1500,
'max_context': 200000,
'api_endpoint': 'https://api.anthropic.com/...'
}
}
}
A.3 Usage Examples
Example 1: Simple Routing
# Initialize router
router = WhiteBoxLLMRouter(PROVIDER_CAPABILITIES)
# Define request requirements
request = {
'reasoning_depth': 0.5,
'code_generation': 0.8,
'vision': 0.0,
'speed': 0.7,
'context_length': 0.4,
'creativity': 0.3,
'factual_accuracy': 0.7,
'cost_sensitivity': 0.9 # Prioritize cost
}
# Get routing recommendation
ranked_providers, explanations = router.route_request(request)
print("Top provider:", ranked_providers[0][0])
print("Similarity score:", ranked_providers[0][1])
print("Fallback chain:", [p for p, _ in ranked_providers])
Example 2: Routing with Hard Constraints
# Request requiring vision capability
request = {
'reasoning_depth': 0.8,
'code_generation': 0.3,
'vision': 0.95, # Critical requirement
'speed': 0.6,
'context_length': 0.5,
'creativity': 0.7,
'factual_accuracy': 0.8,
'cost_sensitivity': 0.4
}
# Apply hard constraint for vision
hard_requirements = {'needs_vision': True}
ranked, explanations = router.route_request(
request,
hard_requirements
)
# Print constraint filtering results
for exp in explanations:
print(exp)
# Print viable providers
for provider, similarity in ranked:
print(f"{provider}: {similarity:.3f}")
Example 3: Integration with Existing Proxy
import requests
def execute_with_routing(prompt: str,
requirements: Dict[str, float]):
"""
Complete integration: route + translate + execute.
"""
# Get routing decision
ranked, _ = router.route_request(requirements, top_n=3)
# Try each provider in order (automatic fallback)
for provider_name, similarity in ranked:
try:
# Get provider endpoint
endpoint = router.provider_kb[provider_name][
'metadata'
]['api_endpoint']
# Translate to provider format
# (Use your 150-line proxy logic here)
translated_request = translate_for_provider(
prompt,
provider_name
)
# Execute
response = requests.post(
endpoint,
json=translated_request,
timeout=30
)
if response.ok:
return {
'success': True,
'provider': provider_name,
'similarity': similarity,
'response': response.json()
}
except Exception as e:
print(f"Failed on {provider_name}: {e}")
continue # Try next provider
return {'success': False, 'error': 'All providers failed'}
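The translate_for_provider call above is the integration point for the existing proxy logic; for completeness, a purely illustrative stub (a real adapter would also handle auth headers, model names, and response schemas):

def translate_for_provider(prompt: str, provider_name: str) -> dict:
    """Hypothetical adapter stub: map a plain prompt to a
    provider-specific request body."""
    if provider_name == 'openai_gpt4':
        return {'model': 'gpt-4',
                'messages': [{'role': 'user', 'content': prompt}]}
    # Simplified fallback shape for the other providers in this sketch
    return {'prompt': prompt}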
# Usage
result = execute_with_routing(
prompt="Explain quantum computing",
requirements={
'reasoning_depth': 0.8,
'code_generation': 0.2,
'cost_sensitivity': 0.7
}
)
A.4 Adding New Providers
# Add a new provider to the knowledge base
PROVIDER_CAPABILITIES['new_provider'] = {
'capabilities': {
'reasoning_depth': 0.75,
'code_generation': 0.8,
'vision': 0.6,
'speed': 0.85,
'context_length': 0.7,
'creativity': 0.75,
'factual_accuracy': 0.8,
'cost_sensitivity': 0.6,
},
'metadata': {
'cost_per_1k': 0.01,
'avg_latency_ms': 1000,
'max_context': 100000,
'api_endpoint': 'https://api.newprovider.com/v1/...'
}
}
# Router automatically incorporates it
router = WhiteBoxLLMRouter(PROVIDER_CAPABILITIES)
A.5 Calibrating Capability Values
Capability values can be calibrated from several sources:

- Provider Documentation: technical specs give objective measures (context length, speed benchmarks)
- Benchmark Testing: run standardized tests across providers (see the sketch after this list)
- User Feedback: collect satisfaction data and adjust vectors over time
- A/B Testing: compare routing decisions against manual selection

A benchmarking sketch (the test hooks are placeholders for your own eval harness):

def benchmark_provider(provider_name):
    scores = {
        'reasoning': test_reasoning_tasks(),
        'code_gen': test_code_generation(),
        'speed': measure_latency(),
        # ... etc
    }
    return normalize_scores(scores)
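Of these hooks, latency is the easiest to measure objectively. A standalone variant of measure_latency, as a minimal sketch (the endpoint and payload arguments are placeholders, not a real provider API):

import time
import requests

def measure_latency(endpoint: str, payload: dict, trials: int = 5) -> float:
    """Average wall-clock request latency in milliseconds."""
    timings = []
    for _ in range(trials):
        start = time.perf_counter()
        requests.post(endpoint, json=payload, timeout=30)
        timings.append((time.perf_counter() - start) * 1000)
    return sum(timings) / len(timings)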
A.6 Production Deployment
Minimal Deployment (Flask)
from flask import Flask, request, jsonify
app = Flask(__name__)
router = WhiteBoxLLMRouter(PROVIDER_CAPABILITIES)
@app.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
# Extract request
data = request.json
prompt = data['messages'][-1]['content']
# Auto-detect requirements from request
requirements = analyze_request(data)
# Route and execute
result = execute_with_routing(prompt, requirements)
return jsonify(result)
if __name__ == '__main__':
app.run(port=8080)
pip install flask numpy scikit-learn
python router_service.py
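The analyze_request helper in the Flask service above is not part of the core router; a minimal keyword-based sketch (the heuristics and default values are assumptions to be tuned, not a tested design):

def analyze_request(data: dict) -> dict:
    """Illustrative heuristic: derive a capability vector from an
    OpenAI-style chat payload."""
    messages = data.get('messages', [])
    text = ' '.join(str(m.get('content', '')) for m in messages)
    return {
        'code_generation': 0.8 if 'code' in text.lower() else 0.2,
        'vision': 0.9 if any('image' in str(m) for m in messages) else 0.0,
        'reasoning_depth': 0.7,   # default; refine with a classifier
        'cost_sensitivity': 0.8,  # default: prefer cheaper providers
    }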
Adding Monitoring
import logging
from datetime import datetime
def route_with_logging(request_vector, hard_requirements):
ranked, explanations = router.route_request(
request_vector,
hard_requirements
)
# Log routing decision
logging.info({
'timestamp': datetime.now(),
'selected': ranked[0][0] if ranked else None,
'similarity': ranked[0][1] if ranked else None,
'request_vector': request_vector,
'explanations': explanations
})
return ranked, explanations
A.7 Complete Working Example
#!/usr/bin/env python3"""
Complete working example: Router + Simple proxy integration
"""
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import json
# ... (Include WhiteBoxLLMRouter class from A.1) ...
# ... (Include PROVIDER_CAPABILITIES from A.2) ...
def main():
# Initialize router
router = WhiteBoxLLMRouter(PROVIDER_CAPABILITIES)
print("="*60)
print("WHITE-BOX LLM ROUTER - DEMONSTRATION")
print("="*60)
# Scenario 1: Cost-optimized code task
print("\n[Scenario 1: Simple code generation]")
request_1 = {
'reasoning_depth': 0.5,
'code_generation': 0.8,
'vision': 0.0,
'speed': 0.7,
'context_length': 0.4,
'creativity': 0.3,
'factual_accuracy': 0.7,
'cost_sensitivity': 0.9
}
ranked, explanations = router.route_request(request_1)
print(f"Selected: {ranked[0][0]}")
print(f"Similarity: {ranked[0][1]:.3f}")
print(f"Fallback chain: {[p for p, _ in ranked]}")
# Scenario 2: Vision + reasoning task
print("\n[Scenario 2: Image analysis]")
request_2 = {
'reasoning_depth': 0.9,
'code_generation': 0.2,
'vision': 0.95,
'speed': 0.5,
'context_length': 0.6,
'creativity': 0.7,
'factual_accuracy': 0.85,
'cost_sensitivity': 0.4
}
hard_reqs = {'needs_vision': True}
ranked, explanations = router.route_request(
request_2,
hard_reqs
)
if explanations:
print("Constraints applied:")
for exp in explanations:
print(f" - {exp}")
print(f"Selected: {ranked[0][0]}")
print(f"Similarity: {ranked[0][1]:.3f}")
print("\n" + "="*60)
print("Router initialized and tested successfully!")
print("="*60)
if __name__ == "__main__":
main()
python complete_example.py

A.8 Testing and Validation
import unittest
class TestWhiteBoxRouter(unittest.TestCase):
def setUp(self):
self.router = WhiteBoxLLMRouter(PROVIDER_CAPABILITIES)
def test_cost_optimization(self):
"""Router should select cheapest viable provider"""
request = {
'reasoning_depth': 0.4,
'code_generation': 0.5,
'cost_sensitivity': 0.95
}
ranked, _ = self.router.route_request(request)
# Gemini Flash is cheapest
self.assertEqual(ranked[0][0], 'gemini_flash')
def test_vision_constraint(self):
"""Router should filter out non-vision providers"""
request = {'vision': 0.95}
hard_reqs = {'needs_vision': True}
ranked, explanations = self.router.route_request(
request,
hard_reqs
)
# All returned providers should support vision
for provider, _ in ranked:
caps = self.router.provider_kb[provider]['capabilities']
self.assertGreaterEqual(caps['vision'], 0.7)
def test_fallback_chain(self):
"""Router should return multiple providers for fallback"""
request = {'reasoning_depth': 0.8}
ranked, _ = self.router.route_request(request, top_n=3)
self.assertGreaterEqual(len(ranked), 2)
def test_transparency(self):
"""Router should provide explanations"""
request = {'vision': 0.9}
hard_reqs = {'needs_vision': True}
ranked, explanations = self.router.route_request(
request,
hard_reqs
)
# Should explain why providers were ruled out
self.assertTrue(len(explanations) > 0)
if __name__ == '__main__':
unittest.main()
A.9 Performance Considerations
Computational Complexity
- Vector encoding: O(n), where n = number of capability dimensions
- Constraint filtering: O(p × c), where p = providers and c = constraints
- Similarity computation: O(p × n) across all providers
- Sorting: O(p log p)

A full routing decision is therefore O(p(n + c) + p log p), which is negligible for realistic provider counts.
Caching Strategies
from functools import lru_cache
class CachedRouter(WhiteBoxLLMRouter):
@lru_cache(maxsize=1000)
def route_request_cached(self,
request_tuple,
hard_reqs_tuple):
"""Cache routing decisions for repeated requests"""
request_dict = dict(request_tuple)
hard_reqs_dict = dict(hard_reqs_tuple) if hard_reqs_tuple else None
return self.route_request(request_dict, hard_reqs_dict)
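Because lru_cache requires hashable arguments, callers must convert the request dicts into sorted tuples; for example:

router = CachedRouter(PROVIDER_CAPABILITIES)
request = {'code_generation': 0.8, 'cost_sensitivity': 0.9}

# Sorting the items ensures logically-equal dicts hit the same cache entry
ranked, explanations = router.route_request_cached(
    tuple(sorted(request.items())),
    None  # no hard requirements
)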
A.10 Advanced Features
Dynamic Capability Adjustment
class AdaptiveRouter(WhiteBoxLLMRouter):
def __init__(self, provider_kb):
super().__init__(provider_kb)
self.performance_history = {}
def record_outcome(self,
provider: str,
success: bool,
latency: float):
"""Learn from actual performance"""
if provider not in self.performance_history:
self.performance_history[provider] = []
self.performance_history[provider].append({
'success': success,
'latency': latency
})
# Adjust speed capability based on observed latency
if len(self.performance_history[provider]) > 10:
avg_latency = np.mean([
h['latency']
for h in self.performance_history[provider]
])
# Update speed capability
            # Map average latency onto [0, 1]: 0 ms → 1.0, ≥5000 ms → 0.0
            speed_score = 1.0 - (avg_latency / 5000)
            speed_score = max(0, min(1, speed_score))
self.provider_kb[provider]['capabilities']['speed'] = speed_score
# Recompute vector
self.provider_vectors[provider] = self._encode_capabilities(
self.provider_kb[provider]['capabilities']
)
Multi-Provider Ensembling
def ensemble_execute(prompt: str, requirements: Dict,
num_providers: int = 2):
"""
Execute on multiple providers and combine results
"""
ranked, _ = router.route_request(requirements, top_n=num_providers)
results = []
for provider, _ in ranked:
try:
result = execute_on_provider(provider, prompt)
results.append(result)
        except Exception:
            continue  # skip failed providers; combine whatever succeeded
# Combine results (e.g., vote on best, concatenate, etc.)
return combine_results(results)
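combine_results is deliberately left open above; one simple option, assuming the provider responses are plain strings, is majority voting on normalized text (purely illustrative):

from collections import Counter

def combine_results(results: list) -> str:
    """Illustrative combiner: majority vote on normalized text,
    falling back to the top-ranked result when all answers differ."""
    if not results:
        raise ValueError("no successful provider responses")
    votes = Counter(r.strip().lower() for r in results)
    winner, count = votes.most_common(1)[0]
    return winner if count > 1 else results[0]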
A.11 Integration with Existing Systems
Drop-in Replacement for OpenAI Client
class RoutedOpenAIClient: """
Drop-in replacement for openai.OpenAI that uses routing
"""
def __init__(self):
self.router = WhiteBoxLLMRouter(PROVIDER_CAPABILITIES)
self.chat = self.ChatCompletions(self.router)
class ChatCompletions:
def __init__(self, router):
self.router = router
self.completions = self.Completions(router)
class Completions:
def __init__(self, router):
self.router = router
def create(self,
messages,
model=None,
**kwargs):
# Auto-detect requirements from messages
requirements = self._analyze_messages(messages)
# Route and execute
ranked, _ = self.router.route_request(requirements)
provider = ranked[0][0]
# Execute with selected provider
return execute_with_provider(
provider,
messages,
**kwargs
)
def _analyze_messages(self, messages):
# Analyze message content to determine requirements
has_images = any(
'image' in str(msg)
for msg in messages
)
return {
'vision': 0.9 if has_images else 0.0,
'reasoning_depth': 0.7,
'cost_sensitivity': 0.8
}
# Usage - no code changes needed!
client = RoutedOpenAIClient()
response = client.chat.completions.create(
messages=[{"role": "user", "content": "Hello!"}]
)
Environment-Based Configuration
import os
def load_config_from_env():
"""Load provider configuration from environment"""
config = {}
# Example: PROVIDER_GEMINI_REASONING=0.7
for key, value in os.environ.items():
if key.startswith('PROVIDER_'):
parts = key.split('_')
provider = parts[1].lower()
capability = '_'.join(parts[2:]).lower()
if provider not in config:
config[provider] = {'capabilities': {}}
config[provider]['capabilities'][capability] = float(value)
return config
# Allow runtime configuration without code changes.
# Note: this populates capability vectors only; cost/latency metadata
# must still be merged in before the router's cost tie-break will work.
PROVIDER_CAPABILITIES = load_config_from_env()
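A usage sketch (the variable name follows the PROVIDER_ convention above; provider names that themselves contain underscores would need a more careful parsing scheme):

import os

os.environ['PROVIDER_GEMINI_REASONING'] = '0.7'  # normally set in the shell
config = load_config_from_env()
# config == {'gemini': {'capabilities': {'reasoning': 0.7}}}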
Conclusion
- < 500 lines of code
- 2 dependencies (numpy, scikit-learn)
- No infrastructure beyond a Python interpreter
- Complete transparency in decision-making
- Automatic fallback without application changes