Senior AI Infrastructure Architect - Multi-Model Routing Control Plane
Only freelancers located in the U.S. may apply.
We need a senior architect to lead the design and build of a multi-model routing control plane, then guide a small senior team through the build. The control plane sits in front of a family of AI systems and decides, for every request (text, image, video), the cheapest path that still meets quality: cache, reuse, a small or local model, an on-device model, an open-weight model, a fine-tuned model, or a higher-cost frontier fallback. It must route not just across models but across compute: CPU, GPU, on-device, and edge.
The north-star metric is the share of requests served without touching an expensive frontier GPU, and the resulting cost reduction on a representative workload. The ambition is to move the majority of eligible workload off frontier GPUs onto cheaper paths without degrading output.
This is not a chatbot project and it is not a thin wrapper over hosted APIs. You will own the architecture, define the routing logic, and lead execution. We need someone who thinks in systems, not individual model calls.
Context (so you understand what we need delivered)
The router is one component of a larger AI platform, not a standalone product. It must be model-agnostic: open-weight, fine-tuned, and proprietary models get swapped in and out behind a stable interface without rearchitecting. You will coordinate with a separate team that owns the models you route to. The initial engagement is a 60 to 90 day POC with a working demo of the router as the goal, followed by technical leadership through the build.
What You Will Own
- Control plane architecture: request intake and normalization, classification, routing taxonomy, model-selection rules, fallback logic, cache and reuse rules, logging and telemetry, and the evaluation feedback loop.
- Model-agnostic interface: clean, stable contracts so models and execution paths swap in and out without rework, and so the separate team that owns the models can work independently of the routing layer.
- Cost optimization across compute, not just models: reduce unnecessary GPU usage while preserving quality, using exact and semantic cache, existing output reuse, lightweight and small-model routing, batching, CPU offload, on-device and edge execution where appropriate, and a clear fallback hierarchy. The explicit goal is to shift a large share of workload off frontier GPUs.
- Generative caching and reuse: caching text is straightforward. Caching generative image and video is not, since the same prompt should produce variation rather than an identical result. We need a credible approach to reuse at the asset or component level, not just for text.
- Evaluation loop: a framework that scores output quality by content domain and flags weakness, so the training team can target improvements instead of retraining broadly. Track output quality against intent, failure modes, cost per route, latency per route, cache hit rate, fallback rate, and regeneration rate.
- Execution plan and technical leadership: an architecture diagram, recommended POC scope, milestones, infrastructure assumptions, and risks that leadership can review, plus hands-on architecture review and task breakdown. You will lead a small senior team (up to 4 engineers) through the POC build.
Ideal Background
- You have led or architected production AI infrastructure involving several of the following: multi-model orchestration and LLM routing, multimodal AI, model serving, inference cost optimization, GPU cost reduction, CPU and on-device inference, open-source and fine-tuned model deployment, evaluation pipelines, semantic caching, and AI observability.
- You have deployed in at least one constrained environment: on-prem, self-hosted, air-gapped, or data-residency-restricted. You know what breaks when you cannot lean on a single cloud.
- You can lead. This is a technical lead role, so you will set architecture, break down work, review the team's output, and keep the build on track.
Specific tools matter less than the ability to architect the system correctly and lead execution. We are not looking for someone who only builds basic chatbot workflows, only uses hosted APIs without understanding the underlying infrastructure, or works as a prompt engineer alone.
Deliverables
- The initial engagement should produce a control plane architecture blueprint, a routing taxonomy, a POC execution plan with milestones and success criteria, and an evaluation and feedback framework, with a working router demo as the 60 to 90 day target, followed by technical leadership of a small team through the build.
Screening Questions
- Describe the most relevant AI routing, model-serving, or inference infrastructure system you have personally designed or built. What was routed, what models or execution paths were involved, and what role did you own?
- How would you design a router that decides whether a request should use cache/reuse, a smaller or local model, an open-weight or fine-tuned model, or a higher-cost frontier fallback, across both CPU and GPU?
- For generative image or video requests, how would you approach caching or reuse when the same prompt should still allow variation? Please be specific.
- What metrics and evaluation loop would you use to prove the router is reducing cost without degrading output quality, and to help a separate model-training team identify weaknesses?
To Apply
Answer the questions above to the best of your ability. Summarize your most relevant routing or inference-infrastructure work, link any repos or examples, give your high-level approach to a control plane that cuts GPU usage while preserving quality, and note your availability and whether you have led a small engineering team before.
- Hourly: Less than 30 hrs/week
- Duration: 1-3 months
- Experience Level: Expert
- Remote Job
- Project Type: Ongoing project
Skills and Expertise
Activity on this job
- Proposals: 20 to 50
- Last viewed by client: 3 days ago
- Interviewing: 0
- Invites sent: 0
- Unanswered invites: 0
About the client
- United States, Alpharetta, 8:10 PM
- $25K total spent; 73 hires, 5 active
- 619 hours
- Tech & IT; Individual client