SRE Architect – Multi-Cloud Kubernetes, Observability-as-Code & AI-Driven Incident Tooling

Posted 6 days ago

Worldwide

Summary

I am looking for an expert SRE Architect to elevate infrastructure reliability, observability, and incident response practices. Our platform serves high-scale customer workloads across multi-cloud Kubernetes environments. Your mission will be two-fold: hardening our core observability infrastructure using an Infrastructure-as-Code (IaC) and GitOps approach, and building next-generation, AI-assisted tooling to streamline our incident triage and response.

Own the observability stack (metrics, logs, dashboards, alerting) and manage it entirely as code.

Infrastructure & Reliability: Own the reliability of our multi-cloud Kubernetes infrastructure. Actively diagnose complex runtime issues involving latency, memory, GPU utilization, concurrency, and model lifecycles.

Operations & Runbooks: Author and refine structured, low-context, safe-to-execute runbooks. Oversee incident response, post-mortems, and remediation tracking.

Our Tech Stack

  • Orchestration: Kubernetes (Multi-cloud: EKS, GKE, etc.)
  • Observability: VictoriaMetrics / Prometheus, Loki / ELK, Grafana, Alerting pipelines
  • IaC & GitOps: Terraform, Helm, Flux CD / ArgoCD
  • Incident Management: incident.io (or similar)

Requirements

  • Proven track record as an SRE Architect or Principal SRE in high-scale, multi-cloud environments.
  • Deep, production-level expertise with Kubernetes, Terraform, Helm, and GitOps workflows.
  • Strong mastery of the Prometheus/Grafana ecosystem managed via code.
  • Experience building automation scripts or tools (Python, Go, or similar) to hook into incident pipelines; bonus points if you have built LLM/AI-assisted tooling for operations.
  • Excellent capability to navigate ambiguity, establish operational processes, and prioritize pragmatism over over-engineering.

Note: Prior Machine Learning (ML) infrastructure experience is a plus. A strong curiosity about serving models at scale is plenty.

  • Less than 30 hrs/week
    Hourly
  • 6+ months
    Duration
  • Intermediate
    Experience Level
  • $5.00 - $12.00
    Hourly
  • Remote Job
  • Ongoing project
    Project Type
Skills and Expertise
Mandatory skills
Automated Monitoring
AIOps
Activity on this job
  • Proposals: 10 to 15
  • Last viewed by client: 5 days ago
  • Interviewing: 1
  • Invites sent: 0
  • Unanswered invites: 0
About the client
Member since Mar 1, 2017
  • United States
    Irving 12:34 PM
  • $1.8K total spent
    21 hires, 3 active
  • 143 hours
