Sovereign AI Routing Control Plane

Posted 3 hours ago

Only freelancers located in the U.S. may apply.

Summary

We need a senior architect to design and build a multi-model routing control plane, then lead a small senior team through the build. The control plane sits in front of a family of AI systems. For each request (text, image, or video), it decides the optimal path across cost, quality, latency, business value, and sovereignty (data residency, rights, and cultural fit). The candidate paths are cache and reuse; a small or on-device model; an open-weight, fine-tuned, or sovereign model; or a higher-cost frontier fallback. It routes across compute too: CPU, GPU, inference accelerators, on-device, and edge.

Core KPIs: the share of eligible workload kept off frontier accelerators and the resulting cost reduction on a representative workload, plus sovereignty compliance, all with no quality regression.

This is not a chatbot and not a wrapper over hosted APIs. You own the architecture, define the routing logic, and lead execution. You think in systems, not individual model calls.

Context

The router is one component of a larger AI platform. It must be model-agnostic: open-weight, fine-tuned, and proprietary models swap in and out behind a stable interface without rearchitecting. A separate team owns the models you route to. The engagement is a 60 to 90 day POC with a working router demo (text-first, with a defined path to image and video), followed by technical leadership through the build.

What you'll own

Control plane: intake and normalization, classification, routing taxonomy, model-selection logic, fallback hierarchy, cache and reuse rules, telemetry, and the eval feedback loop.

Learned, calibrated routing, not just static rules: predict per-query difficulty and expected quality, and escalate on confidence thresholds. Comfort with cascades and speculative decoding is expected.

Routing across cost, quality, latency, and policy: in constrained environments, some requests must stay local regardless of cost.
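One way to read the learned-routing and local-only requirements together is as a cheapest-first cascade behind a residency filter. The Python sketch below illustrates that idea under stated assumptions; it is not the required design. The tier names, relative costs, the 0.7 quality threshold, and the toy predictor `toy_predict` (which simply treats longer prompts as harder) are all hypothetical. They stand in for a trained difficulty/quality model and a real model catalog.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Route:
    name: str
    relative_cost: float   # hypothetical cost units per call
    local: bool            # True if the path keeps data on-prem or in-region
    min_quality: float     # escalate past this tier below this predicted quality


@dataclass(frozen=True)
class Request:
    text: str
    must_stay_local: bool  # sovereignty flag, assumed to be set at intake


def route_request(
    req: Request,
    tiers: list[Route],
    predict_quality: Callable[[Request, Route], float],
) -> tuple[Route, bool]:
    """Return (chosen route, confident?) using cheapest-first escalation."""
    # Sovereignty is a hard filter, applied before any cost/quality trade-off.
    eligible = sorted(
        (t for t in tiers if t.local or not req.must_stay_local),
        key=lambda t: t.relative_cost,
    )
    if not eligible:
        raise ValueError("no route satisfies the residency constraint")
    for tier in eligible:
        if predict_quality(req, tier) >= tier.min_quality:
            return tier, True
    # Nothing clears its threshold: use the most expensive eligible tier
    # (assumed to be the most capable) and report low confidence.
    return eligible[-1], False


# Hypothetical catalog and capability scores, for illustration only.
TIERS = [
    Route("small_local", 0.1, True, 0.7),
    Route("sovereign_open_weight", 1.0, True, 0.7),
    Route("frontier", 10.0, False, 0.7),
]
CAPABILITY = {"small_local": 0.8, "sovereign_open_weight": 0.9, "frontier": 1.0}


def toy_predict(req: Request, tier: Route) -> float:
    # Toy stand-in for a learned quality predictor: difficulty grows with length.
    difficulty = min(1.0, len(req.text) / 200)
    return CAPABILITY[tier.name] - difficulty
```

With these toy numbers, the results are:

- A two-character prompt stays on `small_local`.
- A 30-character prompt escalates to `sovereign_open_weight`.
- A 100-character prompt clears no threshold and lands on the most expensive eligible tier. That is `frontier` normally, or `sovereign_open_weight` if the prompt is flagged `must_stay_local`. In both cases the router reports `confident=False`.

A full cascade would also check confidence on the generated output, not only predict quality before the call.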
Model-agnostic interface: clean, stable contracts so models and execution paths swap without rework, and the separate model team can work independently of the routing layer.

Cost optimization across compute: exact and semantic cache, prefix/KV cache reuse, output reuse, batching, small-model routing, CPU offload, and on-device/edge execution, with a clear fallback hierarchy. The goal is to move most eligible workload off frontier accelerators without degrading output.

Generative caching and reuse: caching text is easy; caching image and video is not, since the same prompt should produce variation rather than an identical result. We need credible reuse at the asset or component level, not just for text.

Eval loop: an eval loop that scores output quality by domain and flags weaknesses, so the training team can target fixes instead of retraining broadly. Track quality vs. intent, failure modes, cost per route, latency per route, cache hit rate, fallback rate, and regeneration rate.

Execution and leadership: an architecture blueprint, POC scope, milestones, infra assumptions, and risks that leadership can review, plus hands-on architecture review and task breakdown. You'll lead a small senior team, and one of your first deliverables is recommending its exact composition (see screening questions).

Ideal background

Led or architected production AI infrastructure across several of: multi-model orchestration and LLM routing, multimodal systems, model serving, inference cost and GPU reduction, CPU and on-device inference, open-source and fine-tuned deployment, cascades and speculative decoding, semantic and prefix caching, eval pipelines, and AI observability.

Deployed in at least one constrained environment: on-prem, self-hosted, air-gapped, or data-residency-restricted. You know what breaks when you can't lean on a single cloud.

Can lead: set architecture, break down work, review the team's output, and keep the build on track. Tools matter less than the ability to architect the system correctly and lead execution.
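The exact and semantic cache tier listed under cost optimization (and under semantic caching in the background above) can be pictured as a two-level lookup. The first level is a hash of the whitespace-normalized prompt; the second is a nearest-neighbour check on embeddings. This Python sketch is an illustration, not a recommended implementation:

- `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model.
- The linear scan stands in for a vector index.
- The 0.8 similarity threshold is an arbitrary assumption the eval loop would need to tune.

The sketch covers text only and does not address image or video reuse.

```python
import hashlib
import math
from collections import Counter
from typing import Callable, Optional

Vector = dict[str, float]


def toy_embed(text: str) -> Vector:
    # Hypothetical stand-in for an embedding model: lowercase word counts.
    return {word: float(n) for word, n in Counter(text.lower().split()).items()}


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TwoTierCache:
    """Exact lookup first, then semantic (embedding-similarity) lookup."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.8):
        self.embed = embed
        self.threshold = threshold
        self.exact: dict[str, str] = {}
        self.semantic: list[tuple[Vector, str]] = []

    @staticmethod
    def _key(prompt: str) -> str:
        # Exact-tier key: SHA-256 of the whitespace-normalized prompt.
        return hashlib.sha256(" ".join(prompt.split()).encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> tuple[Optional[str], str]:
        """Return (cached answer or None, tier label: 'exact', 'semantic', or 'miss')."""
        hit = self.exact.get(self._key(prompt))
        if hit is not None:
            return hit, "exact"
        query = self.embed(prompt)
        best_score, best_answer = 0.0, None
        # Linear scan over stored embeddings; production would use a vector index.
        for vec, answer in self.semantic:
            score = cosine(query, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_answer is not None and best_score >= self.threshold:
            return best_answer, "semantic"
        return None, "miss"

    def put(self, prompt: str, answer: str) -> None:
        self.exact[self._key(prompt)] = answer
        self.semantic.append((self.embed(prompt), answer))


cache = TwoTierCache(toy_embed)
cache.put("What is the capital of France?", "Paris")
print(cache.get("What  is the capital of France?"))  # ('Paris', 'exact')
print(cache.get("what is the capital of france?"))   # ('Paris', 'semantic')
print(cache.get("Explain speculative decoding"))     # (None, 'miss')
```

A change in spacing still hits the exact tier, a change in case falls through to the semantic tier, and an unrelated prompt misses both. A semantic hit returns an answer generated for a different prompt. Its hit rate should therefore count as savings only when the eval loop confirms there is no quality regression on those hits.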
Not a fit

Basic chatbot workflows, hosted APIs only, or prompt engineering alone.

Deliverables

Control plane blueprint, routing taxonomy, POC plan with milestones and success criteria, and an eval/feedback framework, with a working router demo as the 60 to 90 day target, then technical leadership of a small team through the build.

Screening questions

1. What is the most relevant AI routing, model-serving, or inference infrastructure system you personally designed or built? What was routed, which models or execution paths were involved, and what role did you own?
2. How would you design a router that chooses between cache/reuse, a smaller or local model, an open-weight or fine-tuned model, or a frontier fallback, across CPU and GPU? Where do learned routing, cascades, or speculative decoding fit?
3. For generative image or video requests, how would you approach caching or reuse when the same prompt should still allow variation? Be specific.
4. What metrics and eval loop would you use to prove the router cuts cost without degrading quality, and to help a separate training team find weaknesses?
5. Beyond yourself, what team would you staff to hit these deliverables in 8 weeks? Give the roles, seniority, and headcount, explain how you'd split the work, and flag any deliverable that 8 weeks and a team of roughly 4 engineers can't realistically cover.

To apply

Answer the five questions, summarize your most relevant routing or inference-infrastructure work (repos, writeups, talks, or architecture you can describe), and give your high-level approach to a control plane that routes across cost, quality, and sovereignty while preserving quality. Note your availability, your rate, whether you've led a small engineering team before, and the team you'd staff to hit the deliverables in 8 weeks.

Hourly: Less than 30 hrs/week
Duration: 1-3 months
Experience Level: Expert
Remote Job
Project Type: Ongoing project

Skills and Expertise

Mandatory skills

Machine Learning

Artificial Intelligence

Activity on this job

Proposals: 15 to 20
Interviewing: 0
Invites sent: 0
Unanswered invites: 0

About the client

Member since Aug 2, 2015

Alpharetta, United States (local time 7:16 PM)
$26K total spent
73 hires, 5 active
740 hours
Tech & IT
Individual client
