SRE Architect – Multi-Cloud Kubernetes, Observability-as-Code & AI-Driven Incident Tooling
Worldwide
I am looking for an expert SRE Architect to elevate infrastructure reliability, observability, and incident response practices. Our platform serves high-scale customer workloads across multi-cloud Kubernetes environments. Your mission will be two-fold: hardening our core observability infrastructure using an Infrastructure-as-Code (IaC) and GitOps approach, and building next-generation, AI-assisted tooling to streamline our incident triage and response.
- Own the observability stack (metrics, logs, dashboards, alerting) and manage it entirely as code.
- Infrastructure & Reliability: Own the reliability of our multi-cloud Kubernetes infrastructure. Actively diagnose complex runtime issues involving latency, memory, GPU utilization, concurrency, and model lifecycles.
- Operations & Runbooks: Author and refine structured, low-context, safe-to-execute runbooks. Oversee incident response, post-mortems, and remediation tracking.
Our Tech Stack
- Orchestration: Kubernetes (Multi-cloud: EKS, GKE, etc.)
- Observability: VictoriaMetrics / Prometheus, Loki / ELK, Grafana, Alerting pipelines
- IaC & GitOps: Terraform, Helm, Flux CD / ArgoCD
- Incident Management: incident.io (or similar)
Requirements
- Proven track record as an SRE Architect or Principal SRE in high-scale, multi-cloud environments.
- Deep, production-level expertise with Kubernetes, Terraform, Helm, and GitOps workflows.
- Strong mastery of the Prometheus/Grafana ecosystem managed via code.
- Experience building automation scripts or tools (Python, Go, or similar) to hook into incident pipelines; bonus points if you have built LLM/AI-assisted tooling for operations.
- Excellent capability to navigate ambiguity, establish operational processes, and prioritize pragmatism over over-engineering.
Note: Prior Machine Learning (ML) infrastructure experience is a plus. A strong curiosity about serving models at scale is plenty.
- Hourly: Less than 30 hrs/week
- Duration: 6+ months
- Experience Level: Intermediate
- Hourly rate: $5.00 - $12.00
- Remote Job
- Project Type: Ongoing project
Activity on this job
- Proposals: 10 to 15
- Last viewed by client: 5 days ago
- Interviewing: 1
- Invites sent: 0
- Unanswered invites: 0
About the client
- United States, Irving, 12:34 PM
- $1.8K total spent; 21 hires, 3 active
- 143 hours