Mid Go Backend Engineer — GPU Monitoring Specialist (NVIDIA + AMD)

Posted 4 weeks ago

Worldwide

Needs to hire 2 Freelancers

Summary

Project Overview: We are a GPU server manufacturer building next-generation NVIDIA B200/B300 and AMD MI300/MI350 servers. We are seeking an expert Backend Engineer to own our entire GPU monitoring integration, including NVIDIA DCGM and AMD ROCm SMI, covering health monitoring, failure detection, and alerting. This is a critical core requirement for our server management platform.

What You Will Own:
NVIDIA DCGM Integration: B200/B300/H100 tracking, 50+ metrics, and DaemonSet deployment on Kubernetes.
AMD ROCm SMI Integration: Building an MI300X/MI350 monitoring adapter in Go.
GPU Health Engine: XID error tracking, ECC trending (4-level analysis), PCIe replay, and NVLink health.
Failure Scenario Detectors: Developing 6 custom detectors (e.g., GPU not detected on boot, VM crashes, power transients).
Utilization Analysis: Idle detection, Tensor Core efficiency, and NVLink throughput analysis.
Alerting Engine: 13 configurable alert thresholds per device group with email/Slack/webhook/PagerDuty delivery.

Technical Requirements:
Advanced Go Expertise: Proven experience building high-performance metrics pipelines, Prometheus exposition formats, and highly concurrent systems.
NVIDIA DCGM: Deep familiarity with field IDs, Prometheus exporter (port 9400), and Kubernetes DaemonSet workflows.
AMD ROCm SMI: Direct experience, or a strong GPU monitoring background with a clear commitment to quickly master the ROCm stack.
Time-Series Data: Proficiency with VictoriaMetrics or Prometheus (remote-write, PromQL, scrape configurations).
Deployment: Practical knowledge of Kubernetes DaemonSet deployments for infrastructure monitoring.
Hardware Plus: Prior experience working with B200, B300, H100, or MI300 hardware configurations is a strong advantage.
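To give applicants a feel for the scope, here is a minimal sketch of two items from the lists above: reading Prometheus text exposition of the kind a DCGM exporter serves, and checking samples against one alert threshold. It is an illustration under assumptions, not our implementation. The types `Sample` and `Threshold`, the functions `parseLine`, `parseExposition`, and `breaches`, the 85-degree limit, and the sample payload are all hypothetical; the parser is deliberately simplified and ignores escaped quotes and commas inside label values.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// Sample is one parsed metric line (hypothetical type for this sketch).
type Sample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// Threshold is a single upper-bound alert rule (hypothetical type).
type Threshold struct {
	Metric string
	Max    float64
}

// parseLine parses `name{k="v",...} value`. Comment and blank lines are
// skipped. Simplification: label values must not contain commas or quotes.
func parseLine(line string) (Sample, bool) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return Sample{}, false
	}
	s := Sample{Labels: map[string]string{}}
	rest := ""
	if open := strings.IndexByte(line, '{'); open >= 0 {
		end := strings.LastIndexByte(line, '}')
		if end < open {
			return Sample{}, false
		}
		s.Name = line[:open]
		for _, pair := range strings.Split(line[open+1:end], ",") {
			k, v, ok := strings.Cut(pair, "=")
			if !ok {
				continue
			}
			s.Labels[strings.TrimSpace(k)] = strings.Trim(strings.TrimSpace(v), `"`)
		}
		rest = line[end+1:]
	} else {
		name, after, ok := strings.Cut(line, " ")
		if !ok {
			return Sample{}, false
		}
		s.Name, rest = name, after
	}
	fields := strings.Fields(rest) // the first field is the value; a timestamp may follow
	if len(fields) == 0 {
		return Sample{}, false
	}
	v, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return Sample{}, false
	}
	s.Value = v
	return s, true
}

// parseExposition parses every metric line in a text exposition payload.
func parseExposition(text string) []Sample {
	var out []Sample
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		if s, ok := parseLine(sc.Text()); ok {
			out = append(out, s)
		}
	}
	return out
}

// breaches returns the samples of t.Metric whose value exceeds t.Max.
func breaches(samples []Sample, t Threshold) []Sample {
	var out []Sample
	for _, s := range samples {
		if s.Name == t.Metric && s.Value > t.Max {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// Illustrative payload only; not captured from real hardware.
	payload := "# TYPE DCGM_FI_DEV_GPU_TEMP gauge\n" +
		"DCGM_FI_DEV_GPU_TEMP{gpu=\"0\"} 91\n" +
		"DCGM_FI_DEV_GPU_TEMP{gpu=\"1\"} 64\n"
	rule := Threshold{Metric: "DCGM_FI_DEV_GPU_TEMP", Max: 85} // assumed limit
	for _, s := range breaches(parseExposition(payload), rule) {
		fmt.Printf("ALERT %s gpu=%s value=%.0f\n", s.Name, s.Labels["gpu"], s.Value)
	}
}
```

A production pipeline would more likely rely on an established exposition parser and evaluate thresholds per device group; the hand-rolled version above only keeps the sketch dependency-free.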
Tech Stack: Go 1.22+ • NVIDIA DCGM • AMD ROCm SMI • B200/B300/H100 • MI300X • VictoriaMetrics • Prometheus • Kubernetes DaemonSet • Grafana

Application Instructions: To ensure a mutual fit for this specialized infrastructure role, please include the following in your proposal:
A brief summary of your direct, hands-on experience with the primary technologies listed (specifically Go and NVIDIA DCGM).
One specific example of a relevant production-grade GPU monitoring or infrastructure system you have built.

This is a highly specialized, production-ready project requiring immediate technical autonomy. We look forward to reviewing your target hardware experience! Best of luck.

Fixed-price: $2,000.00
Experience Level: Expert
Remote Job
Project Type: Complex project

Contract-to-hire opportunity

This lets talent know that this job could become full time.

Skills and Expertise

Mandatory skills

Golang

PostgreSQL

Redis

Activity on this job

Proposals: Less than 5
Last viewed by client: 6 days ago
Interviewing: 3
Invites sent: 2
Unanswered invites: 2

About the client

Member since May 29, 2014

India, Kolkata (6:54 PM local time)
$450 total spent
3 hires, 0 active
