Build an LLM Pretraining Scaling & Reliability Lab
Worldwide
I am looking for an experienced ML Systems Engineer / Research Engineer to build a portfolio-quality project focused on LLM training operations, not chatbot development. The project should use an existing LLM training framework and extend it into a small but professional training lab.

The final system should include:
- Reproducible small LLM training run
- Training/validation loss tracking
- Tokens/sec and step-time metrics
- GPU/CPU memory monitoring
- Checkpoint save and resume
- Failure simulation and recovery
- Experiment comparison
- Simple dashboard
- Clear documentation and handover

The goal is to demonstrate how LLM training runs can be started, monitored, debugged, resumed, compared, and documented. This is not a basic GPT clone or chatbot. I want clean ML engineering around pretraining reliability, observability, scaling experiments, and experiment tracking.

Expected Deliverables:
- Framework recommendation
- Working baseline training run
- Metrics logging into run folders
- Simple dashboard for comparing runs
- Checkpoint/resume test
- At least 3–4 small experiments
- Failure/recovery report
- README and setup guide
- Short walkthrough/handover

Screening Questions:
- Have you trained GPT/Transformer models before?
- Have you used PyTorch, LitGPT, nanoGPT, or similar?
- Can you explain tokens/sec, checkpointing, validation loss, and gradient accumulation?
- Have you built ML experiment dashboards or tracking systems?
- Can you provide clean documentation and a walkthrough?

Important: I need clean code, clear documentation, and explanation, not just a black-box demo.
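As an illustration of the tokens/sec and run-folder logging items above, here is a minimal sketch, not a required design. It assumes one JSON Lines file per run folder; the function names, the `metrics.jsonl` filename, and the `world_size` parameter are hypothetical choices and do not come from any particular framework.

```python
import json
from pathlib import Path


def tokens_per_second(micro_batch_size, seq_len, grad_accum_steps,
                      step_time_s, world_size=1):
    """Throughput for one optimizer step.

    Tokens per optimizer step = micro-batch size * sequence length
    * gradient accumulation steps * number of data-parallel workers.
    """
    tokens_per_step = micro_batch_size * seq_len * grad_accum_steps * world_size
    return tokens_per_step / step_time_s


def log_step(run_dir, record):
    """Append one metrics record as a JSON line to <run_dir>/metrics.jsonl.

    The file layout is an assumption made for this sketch.
    """
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    with open(run_dir / "metrics.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_metrics(run_dir):
    """Load all logged records for a run, e.g. for a comparison dashboard."""
    path = Path(run_dir) / "metrics.jsonl"
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

With a micro-batch of 8 sequences of 1,024 tokens and 4 accumulation steps, each optimizer step covers 32,768 tokens, so a 2-second step corresponds to 16,384 tokens/sec. Appending to the file rather than overwriting it means a resumed run can keep logging into the same folder.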
$1,500.00
Fixed-price
- Experience Level: Expert
- Remote Job
- Project Type: Complex project
Skills and Expertise
Activity on this job
- Proposals: 20 to 50
- Last viewed by client: 4 weeks ago
- Interviewing: 0
- Invites sent: 0
- Unanswered invites: 0
About the client
- United Kingdom, London, 9:11 AM
- $9.3K total spent; 41 hires, 6 active
- 95 hours
- Tech & IT; mid-sized company (10-99 people)