AI Dev Computer Vision Project

Posted last week

Worldwide

Summary

WHAT WE'RE SEEKING

We are seeking a skilled developer to build an offline, real-time assessment system that analyses a user through webcam and microphone input. The system should assess visible communication signals such as posture, gaze/head position, facial cues, hand movement, and body movement, as well as voice delivery cues such as pitch, pace, pauses, volume, and intonation. This is not a cartoon avatar, game character, VR experience, or VFX project. The core deliverable is the local assessment engine. The realistic human-facing interface is a separate presentation layer and should be implemented using the simplest practical method that meets the agreed requirements.

THE MISSION

Build a local computer vision and voice analysis system that processes live webcam and microphone input in real time, converts those signals into assessment-ready features, applies configurable assessment logic, and produces real-time feedback through a realistic human-facing interface. The human-facing interface may use pre-recorded human video states, a lightweight talking-head interface, a voice-led coach interface, visual overlays, or another local method proposed by the developer. All execution must run locally. No cloud inference, telemetry, analytics, external callouts, or subscription runtime dependencies are permitted.

TARGET DEVICE

- Primary target: Windows 11 consumer hardware.
- Must be benchmarked on the agreed target machine.
- GPU acceleration may be used where appropriate.
- Must not require non-consumer cluster GPUs.
- Kinect, depth cameras, or specialist hardware may be proposed only as optional enhancements, not default requirements.

IN SCOPE

- Real-time webcam capture.
- Real-time microphone capture.
- Webcam-based pose estimation / body landmark tracking.
- Face landmark detection.
- Hand landmark tracking where practical.
- Gaze or head-pose estimation where practical.
- Voice feature extraction, including pitch, pace, pauses, volume, energy, and intonation features.
- Feature extraction layer that converts raw landmarks/audio into assessment signals.
- Configurable assessment logic using local JSON/YAML-style configuration.
- Hot reload for assessment logic without restarting the application.
- Real-time feedback events based on the assessment logic.
- Realistic human-facing interface layer.
- Visual debug overlay showing extracted landmarks/features during development and demo.
- Local benchmarking of latency, FPS, CPU/GPU usage, RAM usage, and sustained runtime.
- Offline verification showing no external calls.
- Clear separation between capture, feature extraction, assessment logic, scoring, feedback events, configuration, human interface, and application/UI layers.
- Documented interfaces between major components.
- Setup documentation, run commands, architecture notes, benchmark notes, and known limitations.

TECH STACK

Note: We will mutually agree on any tech stack changes to ensure no risk to target expectations and success factors:

* Visual engine: Unreal Engine 5
* Human character: MetaHuman first; custom rig later only if needed
* Face/lip animation: NVIDIA Audio2Face-3D / ACE Unreal Plugin
* Body/gaze/gesture: Unreal animation state machine
* Real-time audio/session: LiveKit Agents
* Turn-taking: LiveKit turn detector + Silero VAD
* TTS: Benchmark Chatterbox Turbo
* Backend: Existing backend via clean event bridge
* UI: React Tauri 1.8 wrapper - AJ to do vs PyQt6
* External webcam: OBS Virtual Camera only if needed

* MediaPipe: Local multi-modal framework that extracts face, iris, hand, and body pose landmarks on-device.
* ONNX Runtime (onnxruntime-gpu for CUDA, or onnxruntime-directml for DirectML): Local inference engine for running .onnx models with GPU acceleration on Windows.
* OpenCV (opencv-python): Computer vision library used for real-time webcam frame acquisition, image preprocessing, and drawing visual debug overlays.
* sounddevice: PortAudio-based audio library that can use the Windows WASAPI/ASIO host APIs to stream raw microphone input into NumPy arrays.
* Silero VAD: Lightweight voice activity detection model that can run locally via ONNX to segment real-time speech and calculate pauses.
* parselmouth (Praat for Python): Local acoustic analysis library used to extract fundamental frequency for pitch tracking.
* scipy.signal & numpy: Scientific computing libraries used to compute short-window RMS for real-time volume and energy tracking.
* opensmile: Feature extraction toolkit implemented in C++ that computes standard acoustic parameter sets completely offline.
* PyQt6 / PySide6: Python bindings for the C++ Qt framework, providing media playback (QMediaPlayer) and thread-safe signals/slots to drive the presentation layer.
* watchdog: File monitoring library that watches for file changes and triggers hot reloads of configuration files.
* psutil: System monitoring library used to benchmark CPU, RAM, and per-process utilization.
* pynvml: Python bindings for the NVIDIA Management Library (NVML), used to log real-time GPU and VRAM metrics.

CLIENT TECHNICAL ENVIRONMENT:

* RTX 4080 15GB VRAM
* Ryzen 9 4950H
* 32GB RAM

PROCESSES (to be refined):

- User Experience During Session: app launch, Zoom-like experience (see image)
- Post Session: User stats evaluation summary (driven by all of the info captured within this project): BPS, Pros, Cons, Body Language
- Settings menu (icon):
  - Content: Load data (subject matter for user to discuss)
  - Rules config
  - Presenter Persona: System Prompt
  - LLM: Choose Ollama model from a drop-down list. User can download local models. Show the local models downloaded. Models can be deleted. Include a sync button to fetch the latest models/changes.

OUT OF SCOPE

- Cloud computer vision APIs.
- Cloud speech or voice analysis APIs.
- Subscription-based runtime dependencies.
- Telemetry, analytics, or external callouts.
- Enterprise production-grade deployment.
- Full 3D character pipeline unless separately approved.
- Skeletal mesh / rigged character development unless separately approved. (no: head, shoulders, and arms only)
- VFX-heavy digital human rendering unless separately approved.
- Kinect or depth-camera dependency as a default requirement. (going into the distance)
- Deepfake-style synthesis unless explicitly approved in writing.
- Multilingual translation.
- Audio transcription beyond what is required for direct voice delivery analysis.
- Emotion recognition claims unless legally, technically, and contractually approved.
- Hardcoded assessment rules where configuration is practical.

IDEAL EXPERIENCE

Computer Vision
- Real-time pose estimation.
- Body landmark / skeletal landmark tracking.
- Face landmark detection.
- Hand tracking.
- Head-pose or gaze-related estimation.
- OpenCV, MediaPipe, ONNX Runtime, PyTorch, TensorRT, or equivalent local tooling.

Audio / Voice Analysis
- Real-time microphone handling.
- Pitch, tempo, pause, volume, energy, and intonation feature extraction.
- Local DSP/audio feature pipelines.
- ffmpeg, PyAudio, sounddevice, librosa, torchaudio, or equivalent tooling.

Local ML / Runtime Optimisation
- Running models locally on consumer hardware.
- Optimising for latency and memory usage.
- Practical benchmarking and profiling.
- ONNX, PyTorch, TensorRT, or equivalent runtime choices.

Human Interface
- Video-based human presenter interfaces.
- Lightweight talking-head interfaces.
- Voice-led coaching interfaces.
- Real-time feedback UI.
- Ability to keep the presentation layer separate from the assessment engine.

Architecture
- Clean modular design.
- Testable component boundaries.
- Configuration-driven behaviour.
- Clear handover documentation.
- Practical trade-off decisions.
- Low technical debt MVP delivery.

NON-NEGOTIABLES

Offline Operation
- The system must run locally.
- No required internet connection.
- No telemetry.
- No analytics.
- No subscription runtime dependency.

Performance
- Target end-to-end feedback latency: under 100ms where technically practical.
- Candidate must benchmark actual latency on the agreed hardware.
- Candidate must benchmark memory usage and sustained runtime.
- Candidate must identify any parts that cannot realistically meet the target and explain why.

Architecture Quality
- The assessment engine and human-facing interface must be separate.
- Webcam capture, audio capture, feature extraction, assessment logic, configuration, feedback events, and presentation must not be tightly coupled.
- Assessment rules must be configurable where practical.
- The MVP must be maintainable, modular, and suitable for clean handover.

Communication
- Explain technical decisions clearly.
- At least a 30-minute video catch-up every other day.
- Flag unclear requirements early.
- Flag delivery risks at least one day before they affect delivery.
- Shared-screen review sessions may be required.
- Live walkthroughs of architecture, benchmarks, and code may be required.
- Mutually agree any changes before committing.

Output Quality
- AI-assisted coding is permitted up to a maximum of 60%.
- The developer remains responsible for all code quality.
- Code must be lean, modular, testable, and documented.
- No unsupported implementation choices.
- No unnecessary complexity.
- No hidden dependencies.

Handover
- Hand over all source code, assets, configuration files, setup notes, benchmark results, and documentation.
- Provide clear run instructions.
- Provide dependency and licensing notes.
- Provide known limitations.
- Remove project files and assets from supplier infrastructure after handover unless written approval is given.

DEFINITION OF DONE

The project is complete when:
- The application runs locally on the agreed Windows 11 hardware.
- Webcam and microphone input are captured and processed in real time.
- The system extracts pose/body landmarks, face landmarks, and agreed voice delivery features.
- The system extracts hand, gaze/head-pose, or equivalent visible communication features where technically practical.
- The system converts raw landmarks and audio features into assessment-ready signals.
- The system applies configurable local assessment logic.
- Configuration changes can be hot reloaded without restarting the application.
- The system produces real-time feedback events. (BPS, body language, eye-tracking)
- The feedback events drive the agreed human-facing interface.
- The human-facing interface is separate from the assessment engine.
- The system includes a visual debug mode showing key extracted landmarks/features.
- Offline testing confirms no network calls (with the exception of Ollama cloud and the initial system install), telemetry, analytics, or cloud dependencies.
- Source code is modular and documented.
- Handover package is complete.

ACCEPTANCE CRITERIA

Live Demo
- Demonstrates webcam and microphone input being processed in real time.
- Shows extracted body/pose, face, and voice features.
- Shows real-time feedback being generated.
- Shows feedback driving the agreed human-facing interface.

Configuration Demo
- Assessment logic is stored in local configuration files.
- A configuration value is changed during the demo.
- The system applies the changed logic without application restart.

Offline Evidence
- Demo runs with network disabled.
- Logs or monitoring evidence show no external calls.
- No telemetry or analytics are present.

Benchmark Evidence
- Benchmark report includes hardware used, model/runtime choices, FPS, latency, CPU/GPU usage, RAM usage, and sustained runtime.
- Any missed performance target is explained with evidence and mitigation options.

Architecture Evidence
- Code review confirms separation between capture, feature extraction, assessment logic, configuration, feedback events, human interface, and application/UI layers.
- Interfaces between major components are documented.
- Architecture notes identify key trade-offs and known limitations.
- Source code, assets, configuration files, setup instructions, run commands, benchmark results, and known limitations are provided.
- Supplier confirms deletion of retained project materials unless written approval is given.

PAYMENT CONDITIONS

* I'll only accept work that meets the agreed standards. These will not change from what we've discussed.

INVOICING

* Please invoice weekly like this: 25hrs x 24.61 = 615GBP

Coding standards:

* Must have a clear and true separation of concerns between FE and BE.
* This means the frontend and backend are completely separate and connected only by an integration layer.
* You can make superficial changes to the front layer with zero risk to the backend.
* All files must be under 600 lines long, inclusive of comments, and single purpose. The only exception is shared utilities.
* If a file exceeds 600 lines, consider that it is either inefficient/bloated or doing more than one thing.
* Examples of single purpose: one file is for LLM integration, one file for feature x, another file for feature y (and so on).
* All code must include comments within files (applies only where possible, since you cannot add comments to some files due to syntax issues). This includes:
  * Before imports: file name with path + points on the unique features of the file
  * Comments throughout code to explain what is dynamic, such that any idiot like me can understand the code
* Ensure code is dynamic and flexible, not rigid, fragile, or reliant on hardcoded values.
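
For illustration only, the hot-reload requirement (IN SCOPE, DEFINITION OF DONE, and the Configuration Demo) could be sketched roughly as below. This is a minimal sketch under assumptions, not an agreed design: the class `RuleConfig`, the function `assess`, the feature name `pace_wpm`, and the `min`/`max` rule shape are all hypothetical. It polls the file's modification time with the standard library instead of using watchdog, purely to keep the example dependency-free.

```python
import json
import os


class RuleConfig:
    """Hypothetical holder for assessment thresholds stored in a local JSON file."""

    def __init__(self, path):
        self.path = path
        self._mtime_ns = None   # modification time of the version currently loaded
        self.rules = {}
        self.reload_if_changed()

    def reload_if_changed(self):
        """Re-read the file only if its modification time changed; return True on reload."""
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns == self._mtime_ns:
            return False
        with open(self.path, "r", encoding="utf-8") as fh:
            self.rules = json.load(fh)
        self._mtime_ns = mtime_ns
        return True


def assess(features, rules):
    """Return a feedback label for every feature outside its configured [min, max] range."""
    feedback = []
    for name, limits in rules.items():
        value = features.get(name)
        if value is None:
            continue  # feature not extracted this frame; nothing to assess
        if value < limits["min"]:
            feedback.append(f"{name}_too_low")
        elif value > limits["max"]:
            feedback.append(f"{name}_too_high")
    return feedback
```

In a running application, `reload_if_changed()` would be called from the processing loop (or replaced by a file-watcher callback), so an edited threshold takes effect on the next assessment pass without a restart.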
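
The separation between the assessment engine and the human-facing interface (Architecture Quality, and the "clean event bridge" in the tech stack) might look something like the following. It is a sketch under assumptions only: `FeedbackEvent`, `EventBridge`, `VideoStatePresenter`, and the clip names are hypothetical and not part of any agreed interface. The point it illustrates is that the engine only publishes events, and the presentation layer only subscribes to them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FeedbackEvent:
    """Hypothetical event emitted by the assessment engine."""
    kind: str       # e.g. "posture", "gaze", "pace"
    severity: str   # e.g. "info", "warn"
    message: str


class EventBridge:
    """Integration layer: the only object shared by engine and presentation."""

    def __init__(self):
        self._subscribers: Dict[str, List[Callable[[FeedbackEvent], None]]] = {}

    def subscribe(self, kind, handler):
        self._subscribers.setdefault(kind, []).append(handler)

    def publish(self, event):
        """Deliver the event to its subscribers; return how many handlers ran."""
        handlers = self._subscribers.get(event.kind, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


class VideoStatePresenter:
    """Presentation-layer stub that maps feedback kinds to pre-recorded clip names."""

    def __init__(self, clip_for_kind):
        self.clip_for_kind = clip_for_kind
        self.current_clip = "idle"

    def on_feedback(self, event):
        self.current_clip = self.clip_for_kind.get(event.kind, "idle")
```

With this shape, swapping the presenter for a talking-head or voice-led interface would mean registering a different handler, with no change to the code that publishes events.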
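
As a rough illustration of the Benchmark Evidence item, per-frame latency could be measured and compared with the 100ms target along these lines. This is a sketch only: `measure`, `summarise`, the choice of the 95th percentile (nearest-rank method), and the pass rule `p95 <= budget` are assumptions, not agreed acceptance rules, and `stage` stands in for whatever callable processes one frame.

```python
import math
import statistics
import time


def measure(stage, frames):
    """Run stage(frame) for each frame and return the per-frame latency in milliseconds."""
    samples_ms = []
    for frame in frames:
        start = time.perf_counter()
        stage(frame)
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return samples_ms


def summarise(samples_ms, budget_ms=100.0):
    """Summarise latency samples against a budget using the nearest-rank 95th percentile."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))   # nearest-rank percentile, 1-based
    p95 = ordered[rank - 1]
    return {
        "count": len(ordered),
        "mean_ms": statistics.fmean(ordered),
        "p95_ms": p95,
        "max_ms": ordered[-1],
        "within_budget": p95 <= budget_ms,
    }
```

Reporting a percentile alongside the mean helps show sustained behaviour, since a low average can hide occasional slow frames.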
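
Two of the voice delivery features named in scope, volume and pauses, can be illustrated with a small sketch. Note the assumptions: it uses pure Python rather than the numpy/scipy stack listed above, and it detects pauses with a simple energy threshold rather than Silero VAD, so it is a stand-in technique for illustration. The function names and the default values (`silence_db=-40.0`, `min_pause_ms=300.0`) are hypothetical and would belong in configuration.

```python
import math


def rms_dbfs(samples, floor_db=-120.0):
    """RMS level of float samples in [-1, 1], in dB relative to full scale."""
    if not samples:
        return floor_db
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square <= 0.0:
        return floor_db  # digital silence; avoid log10(0)
    # 10*log10(mean square) equals 20*log10(RMS)
    return max(floor_db, 10.0 * math.log10(mean_square))


def find_pauses(levels_db, frame_ms, silence_db=-40.0, min_pause_ms=300.0):
    """Return (start_ms, duration_ms) for each run of quiet frames lasting min_pause_ms or more."""
    pauses = []
    run_start = None
    for index, level in enumerate(levels_db):
        if level < silence_db:
            if run_start is None:
                run_start = index  # a quiet run begins here
            continue
        if run_start is not None:
            duration = (index - run_start) * frame_ms
            if duration >= min_pause_ms:
                pauses.append((run_start * frame_ms, duration))
            run_start = None
    if run_start is not None:  # the recording ended during a quiet run
        duration = (len(levels_db) - run_start) * frame_ms
        if duration >= min_pause_ms:
            pauses.append((run_start * frame_ms, duration))
    return pauses
```

Each entry of `levels_db` would typically come from calling `rms_dbfs` on one fixed-length audio frame, so pause timing resolution is bounded by the frame length.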

  • $2,000.00

    Fixed-price
  • Expert
    Experience Level
  • Remote Job
  • Ongoing project
    Project Type

Contract-to-hire opportunity

This lets talent know that this job could become full time.
Skills and Expertise
Mandatory skills
AI Development
PyTorch
Activity on this job
  • Proposals: 10 to 15
  • Hires: 1
  • Interviewing: 0
  • Invites sent: 1
  • Unanswered invites: 1
About the client
Member since Aug 2, 2021
  • United Kingdom
    London, 1:37 AM
  • 3 hires, 1 active
  • Tech & IT
    Small company (2-9 people)
