AI Infrastructure Engineer for RAG Systems
Worldwide
We are seeking an AI Infrastructure Engineer to deploy, optimize, and maintain production-grade AI systems with a strong focus on Retrieval-Augmented Generation (RAG). The ideal candidate will have experience in cloud infrastructure, containerization, and CI/CD pipelines. You will work closely with our team to ensure the reliability and scalability of our AI systems.

You will own the full stack, from GPU inference serving and vector database integration to production API endpoints. The ideal candidate has already shipped LLM infrastructure in production and understands the real-world challenges of GPU memory limits, multi-user concurrency, retrieval accuracy, and system reliability.

Responsibilities
- Deploy and optimize RAG pipelines end-to-end, from document ingestion and chunking to embedding, vector retrieval, and LLM response generation
- Configure and run production inference servers (vLLM, llama.cpp, Ollama, TGI, TensorRT-LLM, or SGLang) for open-source LLMs
- Integrate deployed LLMs with existing RAG backends and vector databases (Pinecone, Qdrant, Chroma, Weaviate, or similar)
- Optimize GPU inference for VRAM usage, token latency, concurrency throughput, and inference speed
- Expose production-ready, OpenAI-compatible API endpoints for internal or external consumption
- Handle CUDA driver setup, GPU scaling decisions, and quantization strategies (GGUF, AWQ, GPTQ, EXL2)
- Build and maintain embedding pipelines, chunking strategies, and re-ranking workflows
- Set up monitoring, logging, alerting, and automatic restart/recovery for AI services
- Containerize AI workloads using Docker and deploy via CI/CD pipelines
- Advise on hardware upgrades, multi-GPU setups, and cost/performance tradeoffs
- Document deployment architecture and provide clear handover notes

Skills:
- Strong, proven experience deploying RAG pipelines in production environments
- Hands-on experience with open-source LLMs: Gemma, Llama, Mistral, Qwen, or similar
- Proficiency with inference frameworks: vLLM, llama.cpp, Ollama, TGI, or TensorRT-LLM
- Experience with vector databases and embedding models (semantic search, context window optimization, metadata filtering)
- CUDA GPU optimization and Linux server environment experience
- Model quantization experience: GGUF, AWQ, GPTQ, or EXL2
- Python proficiency: backend APIs using FastAPI, Flask, or equivalent
- Docker and containerized deployment experience
- Cloud infrastructure experience: AWS, GCP, or Azure (compute, storage, networking)
- Strong debugging skills across GPU, inference, and API layers
- Hourly: More than 30 hrs/week
- Duration: 6+ months
- Experience Level: Intermediate
- Hourly rate: $15.00 - $20.00
- Remote Job
- Project Type: Complex project
Activity on this job
- Proposals: 15 to 20
- Last viewed by client: 3 weeks ago
- Interviewing: 4
- Invites sent: 1
- Unanswered invites: 1
About the client
- IND, Delhi, 5:56 PM
- $359 total spent; 2 hires, 1 active
- HR & Business Services; small company (2-9 people)