Senior AI Engineer for Critical Production Memory Leak Resolution

Posted 6 hours ago

Worldwide

Summary

We are looking for an experienced AI Engineer to join our team immediately to lead the investigation and resolution of a critical production issue involving a long-running, persistent memory leak affecting our AI platform. This is a high-impact role for someone who has deep experience debugging complex AI systems in production environments. You will work directly with our engineering team to identify the root cause, implement a robust solution, validate the fix under production workloads, and ensure long-term platform stability.

## Responsibilities

* Investigate and resolve a persistent memory leak in a production AI system.
* Perform deep root cause analysis across application code, AI frameworks, runtime environments, and infrastructure.
* Profile CPU and memory usage using advanced debugging and performance analysis tools.
* Identify memory retention issues, object lifecycle problems, resource leaks, and concurrency-related bottlenecks.
* Optimize long-running AI services for reliability, performance, and efficient resource utilization.
* Validate fixes through stress testing and production-level workload simulations.
* Collaborate closely with backend, infrastructure, and platform engineers.
* Document findings, recommendations, and preventive measures to improve long-term system reliability.

## Required Experience

* Extensive experience building and operating AI systems in production.
* Strong expertise with Python and asynchronous programming.
* Deep understanding of memory management, garbage collection, object lifecycle, and profiling techniques.
* Experience debugging memory leaks in long-running services.
* Strong knowledge of AI frameworks such as PyTorch, TensorFlow, Hugging Face Transformers, LangChain, or similar technologies.
* Experience with containerized environments including Docker and Kubernetes.
* Familiarity with Linux performance analysis and production debugging tools.
* Experience working with distributed systems, background workers, APIs, and high-availability services.
* Ability to quickly isolate complex production issues and deliver reliable long-term solutions.

## Preferred Qualifications

* Experience debugging GPU memory issues and CUDA memory management.
* Experience with vector databases, inference servers, and large language model deployments.
* Familiarity with observability platforms including Prometheus, Grafana, OpenTelemetry, or similar monitoring solutions.
* Experience improving production reliability for enterprise-scale AI platforms.

## What Success Looks Like

The successful candidate will identify the root cause of the production memory leak, implement a verified long-term fix, improve overall system stability and performance, and help establish engineering practices that prevent similar issues in the future. This is a mission-critical engagement requiring exceptional debugging skills, production engineering experience, and a disciplined approach to solving complex AI infrastructure problems.

  • Less than 30 hrs/week
    Hourly
  • < 1 month
    Duration
  • Expert
    Experience Level
  • $25.00 - $50.00
    Hourly
  • Remote Job
  • One-time project
    Project Type
Skills and Expertise
Mandatory skills
Kubernetes
AWS Lambda
Python
Activity on this job
  • Proposals: Less than 5
  • Last viewed by client: 5 hours ago
  • Hires: 1
  • Interviewing: 0
  • Invites sent: 0
  • Unanswered invites: 0
About the client
Member since Apr 14, 2026
  • Pakistan
    Karachi 6:16 PM
  • 1 hire, 1 active
