Infra Engineer – SRE (Kubernetes)
Worldwide
About the Role
We are seeking a skilled Site Reliability Engineer specializing in Kubernetes to join a Global Infrastructure team. This hands-on role is critical to ensuring the stability, efficiency, and reliability of large-scale, high-performance AI/ML clusters in data centers. The ideal candidate will bring expertise in system-level troubleshooting, AI cluster maintenance, and operational excellence to ensure maximum performance for infrastructure environments. Experience with large-scale infrastructure automation is a strong plus.

Responsibilities
* Design, implement, and maintain scalable AI/ML infrastructure solutions.
* Proactively monitor GPU cluster health and performance, and troubleshoot issues across compute, accelerator, networking, and storage systems.
* Automate deployment, configuration, and management of infrastructure resources.
* Manage GPU node lifecycle workflows, including provisioning, scaling, maintenance, decommissioning, and upgrades.
* Implement CI/CD pipelines for infrastructure deployment and orchestration.
* Ensure security, compliance, and operational best practices across infrastructure environments.
* Manage incident response related to infrastructure resources, including GPU, CPU, storage, and network components.
* Handle customer provisioning requests for GPU resources, including onboarding, configuration, and troubleshooting.
* Resolve customer service requests related to infrastructure and platform operations while maintaining high customer satisfaction.
* Stay current with emerging GPU hardware and software technologies and integrate improvements where appropriate.
* Travel regionally and internationally to data center locations when necessary.

Qualifications
* Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field.
* 3+ years of experience in data center operations, infrastructure engineering, systems engineering, or site reliability engineering.
* Proven experience with infrastructure automation tools such as Terraform and Ansible.
* Strong experience with Kubernetes and container orchestration technologies.
* Familiarity with NVIDIA GPU Operator, NVIDIA Network Operator, CNI, CSI, and similar Kubernetes ecosystem tools.
* Experience with job scheduling systems such as Slurm.
* Strong Linux system administration skills.
* Proficiency in scripting and automation using Python and Bash.
* Experience with observability and monitoring platforms such as Prometheus, Grafana, and Loki.
* Knowledge of GPU architectures, NVIDIA CUDA, NCCL, and AI/ML infrastructure is a strong advantage.
* Strong troubleshooting and root-cause analysis skills, with the ability to analyze logs, metrics, and system performance data.
* Excellent communication, collaboration, and problem-solving abilities.

Preferred Skills
* Large-scale Kubernetes cluster operations.
* AI/ML infrastructure and GPU cluster management.
* Infrastructure-as-Code (IaC) and automation-first mindset.
* Production incident management and reliability engineering.
* Data center operations and hardware troubleshooting.
* CI/CD platform design and implementation.

Meeting every qualification is not required. Candidates with strong technical foundations, relevant experience, and a passion for building reliable large-scale infrastructure are encouraged to apply.
$500.00
Fixed-price
- Experience Level: Intermediate
- Remote Job
- Project Type: Complex project
Activity on this job
- Proposals: 5 to 10
- Last viewed by client: 4 days ago
- Interviewing: 4
- Invites sent: 6
- Unanswered invites: 1
About the client
- India, 2:54 AM