Mid Go Backend Engineer — GPU Monitoring Specialist (NVIDIA + AMD)

Posted 4 weeks ago

Worldwide

Needs to hire 2 Freelancers
Summary

Project Overview: We are a GPU server manufacturer building next-generation NVIDIA B200/B300 and AMD MI300/MI350 servers. We are seeking an expert Backend Engineer to own our entire GPU monitoring integration, including NVIDIA DCGM and AMD ROCm SMI, covering health monitoring, failure detection, and alerting. This is a critical core requirement for our server management platform.

What You Will Own:
  • NVIDIA DCGM Integration: B200/B300/H100 tracking, 50+ metrics, and DaemonSet deployment on Kubernetes.
  • AMD ROCm SMI Integration: Building an MI300X/MI350 monitoring adapter in Go.
  • GPU Health Engine: XID error tracking, ECC trending (4-level analysis), PCIe replay, and NVLink health.
  • Failure Scenario Detectors: Developing 6 custom detectors (e.g., GPU not detected on boot, VM crashes, power transients).
  • Utilization Analysis: Idle detection, Tensor Core efficiency, and NVLink throughput analysis.
  • Alerting Engine: 13 configurable alert thresholds per device group with email/Slack/webhook/PagerDuty delivery.

Technical Requirements:
  • Advanced Go Expertise: Proven experience building high-performance metrics pipelines, Prometheus exposition formats, and highly concurrent systems.
  • NVIDIA DCGM: Deep familiarity with field IDs, Prometheus exporter (port 9400), and Kubernetes DaemonSet workflows.
  • AMD ROCm SMI: Direct experience, or a strong GPU monitoring background with a clear commitment to quickly master the ROCm stack.
  • Time-Series Data: Proficiency with VictoriaMetrics or Prometheus (remote-write, PromQL, scrape configurations).
  • Deployment: Practical knowledge of Kubernetes DaemonSet deployments for infrastructure monitoring.
  • Hardware Plus: Prior experience working with B200, B300, H100, or MI300 hardware configurations is a strong advantage.

Tech Stack: Go 1.22+ • NVIDIA DCGM • AMD ROCm SMI • B200/B300/H100 • MI300X • VictoriaMetrics • Prometheus • Kubernetes DaemonSet • Grafana

Application Instructions: To ensure a mutual fit for this specialized infrastructure role, please include the following in your proposal:
  • A brief summary of your direct, hands-on experience with the primary technologies listed (specifically Go and NVIDIA DCGM).
  • One specific example of a relevant production-grade GPU monitoring or infrastructure system you have built.

This is a highly specialized, production-grade project requiring immediate technical autonomy. We look forward to reviewing your experience with the target hardware. Best of luck!

  • $2,000.00

    Fixed-price
  • Expert
    Experience Level
  • Remote Job
  • Complex project
    Project Type

Contract-to-hire opportunity

Skills and Expertise
Mandatory skills
Golang
PostgreSQL
Redis
Activity on this job
  • Proposals: Less than 5
  • Last viewed by client: 6 days ago
  • Interviewing: 3
  • Invites sent: 2
  • Unanswered invites: 2
About the client
Member since May 29, 2014
  • India
    Kolkata, 6:54 PM
  • $450 total spent
    3 hires, 0 active
