Infra Engineer – SRE (Kubernetes)

Posted 6 days ago

Worldwide

Summary

About the Role

We are seeking a skilled Site Reliability Engineer specializing in Kubernetes to join a Global Infrastructure team. This role is hands-on and critical to ensuring the stability, efficiency, and reliability of large-scale, high-performance AI/ML clusters in data centers. The ideal candidate will bring expertise in system-level troubleshooting, AI cluster maintenance, and operational excellence to ensure maximum performance for infrastructure environments. Experience with large-scale infrastructure automation is considered a strong plus.

Responsibilities

* Design, implement, and maintain scalable AI/ML infrastructure solutions.
* Proactively monitor GPU cluster health and performance, and troubleshoot issues across compute, accelerator, networking, and storage systems.
* Automate deployment, configuration, and management of infrastructure resources.
* Manage GPU node lifecycle workflows, including provisioning, scaling, maintenance, decommissioning, and upgrades.
* Implement CI/CD pipelines for infrastructure deployment and orchestration.
* Ensure security, compliance, and operational best practices across infrastructure environments.
* Manage incident response related to infrastructure resources, including GPU, CPU, storage, and network components.
* Handle customer provisioning requests for GPU resources, including onboarding, configuration, and troubleshooting.
* Resolve customer service requests related to infrastructure and platform operations while maintaining high customer satisfaction.
* Stay current with emerging GPU hardware and software technologies and integrate improvements where appropriate.
* Support regional and international travel requirements to data center locations when necessary.

Qualifications

* Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field.
* 3+ years of experience in data center operations, infrastructure engineering, systems engineering, or site reliability engineering.
* Proven experience with infrastructure automation tools such as Terraform and Ansible.
* Strong experience with Kubernetes and container orchestration technologies.
* Familiarity with NVIDIA GPU Operator, NVIDIA Network Operator, CNI, CSI, and similar Kubernetes ecosystem tools.
* Experience with job scheduling systems such as Slurm.
* Strong Linux system administration skills.
* Proficiency in scripting and automation using Python and Bash.
* Experience with observability and monitoring platforms such as Prometheus, Grafana, and Loki.
* Knowledge of GPU architectures, NVIDIA CUDA, NCCL, and AI/ML infrastructure is a strong advantage.
* Strong troubleshooting and root-cause analysis skills with the ability to analyze logs, metrics, and system performance data.
* Excellent communication, collaboration, and problem-solving abilities.

Preferred Skills

* Large-scale Kubernetes cluster operations.
* AI/ML infrastructure and GPU cluster management.
* Infrastructure-as-Code (IaC) and automation-first mindset.
* Production incident management and reliability engineering.
* Data center operations and hardware troubleshooting.
* CI/CD platform design and implementation.

Meeting every qualification is not required. Candidates with strong technical foundations, relevant experience, and a passion for building reliable large-scale infrastructure are encouraged to apply.

  • $500.00

    Fixed-price
  • Intermediate
    Experience Level
  • Remote Job
  • Complex project
    Project Type
Skills and Expertise
Mandatory skills
Embedded System
Cisco Router
Nice-to-have skills
Computer Network
Cisco Certified Network Associate
Activity on this job
  • Proposals: 5 to 10
  • Last viewed by client: 4 days ago
  • Interviewing: 4
  • Invites sent: 6
  • Unanswered invites: 1
About the client
Member since Jun 23, 2026
  • India
    2:54 AM
