AI Infrastructure Architect for Multi-Model Router and Inference Cost Optimization

Posted 3 days ago

Only freelancers located in the U.S. may apply.

Summary

We are building a confidential AI platform and need a senior architect to lead the design and build of a multi-model routing layer. The router sits in front of multiple AI systems and decides, for every request (text, image, video), the cheapest path that still meets quality: cache, reuse, small or local model, open-source model, fine-tuned model, or a higher-cost fallback. The north-star metric is the percentage of requests served without hitting an expensive frontier GPU, and the resulting cost reduction on a representative workload.

This is not a chatbot project. You will own the architecture, define the routing logic, and lead execution alongside a separate model-training team. We need someone who thinks in systems, not individual model calls.

What You Will Own

- Router architecture: request intake and normalization, classification, routing taxonomy, model-selection rules, fallback logic, cache and reuse rules, logging and telemetry, and the evaluation feedback loop.
- The interface must be model-agnostic, so fine-tuned and open-source models can be swapped in and out without rearchitecting.
- Cost optimization: reduce unnecessary GPU usage while preserving quality, using exact and semantic cache, existing output reuse, lightweight model routing, batching, local or edge execution where appropriate, and a clear fallback hierarchy.
- Caching text is straightforward, but caching generative image and video is not, since the same prompt should produce variation rather than an identical result. You need a credible approach to reuse at the asset or component level, not just for text.
- Evaluation loop: a framework that scores output quality by content domain and flags weakness, so the training team can target improvements instead of retraining broadly. Track output quality against intent, failure modes, cost per route, latency per route, cache hit rate, fallback rate, and regeneration rate.
- Execution plan and leadership: an architecture diagram, recommended POC scope, milestones, infrastructure assumptions, and risks that leadership can review, plus hands-on architecture review and task breakdown to guide the engineering team through the build.

Ideal Background

- You have led or architected production AI infrastructure involving several of the following: multi-model orchestration and LLM routing, multimodal AI, model serving, inference cost optimization, GPU cost reduction, open-source and fine-tuned model deployment, evaluation pipelines, semantic caching, and AI observability.
- Strong candidates may have experience with Python and a modern serving and infrastructure stack such as vLLM, Triton, Ray Serve, BentoML, Kubernetes, Docker, vector databases, Redis or similar caching layers, and evaluation tracking with MLflow or Weights and Biases. Specific tools matter less than the ability to architect the system correctly and lead execution.

We are not looking for someone who only builds basic chatbot workflows, only uses hosted APIs without understanding the underlying infrastructure, or works as a prompt engineer alone.

Deliverables

- The initial engagement should produce a router architecture blueprint, a routing taxonomy, a POC execution plan with milestones and success criteria, and an evaluation and feedback framework, followed by technical leadership through the POC build.

Screening Questions

1. Describe the most relevant AI routing, model-serving, or inference infrastructure system you have personally designed or built. What was routed, what models or execution paths were involved, and what role did you own?
2. How would you design a router that decides whether a request should use cache/reuse, a smaller model, an open-source or fine-tuned model, or a higher-cost fallback?
3. For generative image or video requests, how would you approach caching or reuse when the same prompt should still allow variation? Please be specific.
4. What metrics and evaluation loop would you use to prove the router is reducing cost without degrading output quality, and to help a separate model-training team identify weaknesses?

To Apply

Answer the questions above to the best of your ability. Summarize your most relevant routing or inference-infrastructure work, link any repos or examples, give your high-level approach to a router that cuts GPU usage while preserving quality, and note your availability.

Hourly: More than 30 hrs/week
Duration: 1-3 months
Experience Level: Expert
Remote Job
Project Type: Ongoing project

Contract-to-hire opportunity

This lets talent know that this job could become full time.

Skills and Expertise

Mandatory skills

Machine Learning

MLOps

Activity on this job

Proposals: 20 to 50
Last viewed by client: yesterday
Interviewing: 1
Invites sent: 0
Unanswered invites: 0

About the client

Member since Aug 2, 2015

United States
Alpharetta, 5:07 AM
$25K total spent
72 hires, 5 active
619 hours
Tech & IT
Individual client
