Mid Go Backend Engineer — GPU Monitoring Specialist (NVIDIA + AMD)
Worldwide
Project Overview: We are a GPU server manufacturer building next-generation NVIDIA B200/B300 and AMD MI300/MI350 servers. We are seeking an expert Backend Engineer to own our entire GPU monitoring integration, including NVIDIA DCGM and AMD ROCm SMI, covering health monitoring, failure detection, and alerting. This is a critical core requirement for our server management platform.

What You Will Own:
- NVIDIA DCGM Integration: B200/B300/H100 tracking, 50+ metrics, and DaemonSet deployment on Kubernetes.
- AMD ROCm SMI Integration: Building an MI300X/MI350 monitoring adapter in Go.
- GPU Health Engine: XID error tracking, ECC trending (4-level analysis), PCIe replay, and NVLink health.
- Failure Scenario Detectors: Developing 6 custom detectors (e.g., GPU not detected on boot, VM crashes, power transients).
- Utilization Analysis: Idle detection, Tensor Core efficiency, and NVLink throughput analysis.
- Alerting Engine: 13 configurable alert thresholds per device group with email/Slack/webhook/PagerDuty delivery.

Technical Requirements:
- Advanced Go Expertise: Proven experience building high-performance metrics pipelines, Prometheus exposition formats, and highly concurrent systems.
- NVIDIA DCGM: Deep familiarity with field IDs, Prometheus exporter (port 9400), and Kubernetes DaemonSet workflows.
- AMD ROCm SMI: Direct experience, or a strong GPU monitoring background with a clear commitment to quickly master the ROCm stack.
- Time-Series Data: Proficiency with VictoriaMetrics or Prometheus (remote-write, PromQL, scrape configurations).
- Deployment: Practical knowledge of Kubernetes DaemonSet deployments for infrastructure monitoring.
- Hardware Plus: Prior experience working with B200, B300, H100, or MI300 hardware configurations is a strong advantage.
Tech Stack: Go 1.22+ • NVIDIA DCGM • AMD ROCm SMI • B200/B300/H100 • MI300X • VictoriaMetrics • Prometheus • Kubernetes DaemonSet • Grafana

Application Instructions: To ensure a mutual fit for this specialized infrastructure role, please include the following in your proposal:
- A brief summary of your direct, hands-on experience with the primary technologies listed (specifically Go and NVIDIA DCGM).
- One specific example of a relevant production-grade GPU monitoring or infrastructure system you have built.

This is a highly specialized, production-ready project requiring immediate technical autonomy. We look forward to reviewing your experience with the target hardware. Best of luck!
- $2,000.00 (Fixed-price)
- Experience Level: Expert
- Remote Job
- Project Type: Complex project
Activity on this job
- Proposals: Less than 5
- Last viewed by client: 6 days ago
- Interviewing: 3
- Invites sent: 2
- Unanswered invites: 2
About the client
- India, Kolkata (6:54 PM local time)
- $450 total spent; 3 hires, 0 active