AI Infrastructure Engineer for RAG Systems

Posted 4 weeks ago

Worldwide

Summary

We are seeking an AI Infrastructure Engineer to deploy, optimize, and maintain production-grade AI systems with a strong focus on Retrieval-Augmented Generation (RAG). The ideal candidate will have experience in cloud infrastructure, containerization, and CI/CD pipelines. You will work closely with our team to ensure the reliability and scalability of our AI systems. You will own the full stack: from GPU inference serving and vector database integration to production API endpoints. The ideal candidate has already shipped LLM infrastructure in production and understands the real-world challenges of GPU memory limits, multi-user concurrency, retrieval accuracy, and system reliability.

Responsibilities

Deploy and optimize RAG pipelines end-to-end, from document ingestion and chunking to embedding, vector retrieval, and LLM response generation
Configure and run production inference servers (vLLM, llama.cpp, Ollama, TGI, TensorRT-LLM, or SGLang) for open-source LLMs
Integrate deployed LLMs with existing RAG backends and vector databases (Pinecone, Qdrant, Chroma, Weaviate, or similar)
Optimize GPU inference for VRAM usage, token latency, concurrency throughput, and inference speed
Expose production-ready, OpenAI-compatible API endpoints for internal or external consumption
Handle CUDA driver setup, GPU scaling decisions, and quantization strategies (GGUF, AWQ, GPTQ, EXL2)
Build and maintain embedding pipelines, chunking strategies, and re-ranking workflows
Set up monitoring, logging, alerting, and automatic restart/recovery for AI services
Containerize AI workloads using Docker and deploy via CI/CD pipelines
Advise on hardware upgrades, multi-GPU setups, and cost/performance tradeoffs
Document deployment architecture and provide clear handover notes

Skills:

Strong, proven experience deploying RAG pipelines in production environments
Hands-on experience with open-source LLMs: Gemma, Llama, Mistral, Qwen, or similar
Proficiency with inference frameworks: vLLM, llama.cpp, Ollama, TGI, or TensorRT-LLM
Experience with vector databases and embedding models (semantic search, context window optimization, metadata filtering)
CUDA GPU optimization and Linux server environment experience
Model quantization experience: GGUF, AWQ, GPTQ, or EXL2
Python proficiency: backend APIs using FastAPI, Flask, or equivalent
Docker and containerized deployment experience
Cloud infrastructure experience: AWS, GCP, or Azure (compute, storage, networking)
Strong debugging skills across GPU, inference, and API layers

Hours: More than 30 hrs/week (Hourly)
Duration: 6+ months
Experience Level: Intermediate
Rate: $15.00 - $20.00 Hourly
Remote Job
Project Type: Complex project

Skills and Expertise

Mandatory skills

Embedded System

Eagle

Nice-to-have skills

Microcontroller Programming

Activity on this job

Proposals: 15 to 20
Last viewed by client: 3 weeks ago
Interviewing: 4
Invites sent: 1
Unanswered invites: 1

About the client

Member since Oct 1, 2025

IND
Delhi, 5:56 PM
$359 total spent
2 hires, 1 active
HR & Business Services
Small company (2-9 people)
