Re-architect & Build a Lean LLM Conversation Worker

Posted 2 weeks ago

Worldwide

Summary

# Help me rebuild the brain of an AI that sits with people in their hardest moments: lean, fast, and right

Hi, and thank you for actually reading this.

I'm a solo founder. For the last few months I've been building something I believe in more than anything I've worked on: an AI companion that meets people in the moments most apps ignore, at 3am when the head won't switch off, when something heavy is being carried and there's no one to call. It's not a chatbot and it's not a wellness tracker. It's a space that listens, remembers you across conversations, and responds like someone who actually knows you.

It's live. Real people use it. On its best days, it does something rare. But I've hit a wall I can't get past alone, and I'm looking for someone who has genuinely built this kind of system before.

## How I built it

A few honest words on the approach, because it shapes the job. I built this **solo and fast, with an AI pair-programmer**, iterating and stacking ideas as I learned. But the core bets were deliberate, not random:

- **Two parallel memory systems.** A shared clinical **Knowledge RAG** (vetted frameworks) with **corrective grading (CRAG)** so the AI grounds itself instead of making things up, plus a **per-user Memory RAG** that learns the individual over time: their themes, the people who matter, the patterns underneath what they say.
- **A "perception-first" design.** Before the main model answers, a smaller, faster model reads the emotional state of the message. The idea is to behave like a good listener who reads the room before speaking.
- **Distilled memory, not transcripts.** After each conversation I extract what *mattered* (themes, patterns, what landed) instead of storing raw text. More private, and sharper for recall.
- **Owned infrastructure, one model family.** Supabase + Cloudflare so I control the data, and Anthropic Claude for the voice, because the conversational quality is the whole product.

The intent behind all of it: a warm, attentive presence that genuinely remembers you. Every piece was in service of that feeling.

## What it is (the stack, and what runs each turn)

- **The product:** an AI mental-health **depth-conversation companion**, an *adjunct to human care, never a replacement for therapy*. In production, with real users.
- **The stack:** Cloudflare Workers (the orchestration "brain"), Supabase + pgvector (memory and vectors), Anthropic Claude (Sonnet for the voice, Haiku for background work), OpenAI embeddings, a vanilla-JS PWA.
- **What runs on every single turn, before the reply streams:** a crisis check → a "perception" pass that reads the emotional state → a second classifier that decides how memory is used → a four-stream retrieval over the user's stored memory → a clinical knowledge retrieval with self-correction (CRAG) → all of it rendered into a ~23,000-character context block → and *only then* the model answers. **Four model calls, ~35,000 tokens, ~7–8 seconds per message.**

## Where I think the bottleneck is

I'll give you my honest diagnosis, and I want you to confirm it or tear it apart.

**I believe the bottleneck is the worker itself**, the single orchestration layer that does everything. There are two sides to it:

- **It does too much, synchronously, before it speaks.** That long, mostly *sequential* chain above (perception, a classifier, multi-stream retrieval, a structured-memory call, knowledge + CRAG, then assembling a 23K-character block) all runs *before* the first word streams. That's the ~7–8 seconds, and it's the most visible failure.
- **It computes intelligence it then throws away.** The perception pass reads the moment *accurately*, and the reply often ignores it. So I pay for that work in latency and cost and don't get the benefit. The context block has also grown bloated and self-contradicting, which makes the model behave inconsistently.

In short: **the system *knows* more than it *uses*, all in one heavy place.** That's where I think it's broken. You may see it differently, and that's exactly what I want to find out.

## What I need you to do

Not an audit. Not a report. **I want you to design and build the lean, smart version**: a worker, or a genuinely different structure, that fixes this at the root. If you believe the right answer is a **completely different mental model** than my stack-of-classifiers-plus-giant-context-block approach, I want to hear it, and build toward it.

What "fixed" looks like:

- **Fast:** first word in a conversational range, not 7–8 seconds.
- **Lean:** far fewer tokens per turn, and a context layer that's tight and doesn't argue with itself.
- **Coherent:** the AI's read of the moment actually shapes what it says.
- **Right:** it recalls the correct thing, in the correct context, with a real sense of time (right now it can treat something you *just said you're about to do* as already done).

I'm open to phasing it: a target design and a small proof-of-concept first, then the full build.

## Who I'm hoping to find

- Someone who has **designed and shipped a real production LLM system (persistent memory, RAG, multi-step / agentic orchestration) in the last year**, and can show it.
- Fluent in **latency engineering for LLM apps** (parallelizing the hot path, streaming, prompt caching, cutting synchronous work).
- Real opinions about **context engineering**: knows when a big injected block is the disease, not the cure.
- Comfortable with Cloudflare Workers + Supabase/pgvector + Anthropic (or clearly equivalent).

## How to apply (I'll be honest about what gets a reply)

I read every word, and I'll only respond to applications that:

1. **Show me the production LLM memory/RAG/agent system you built in the last 12 months:** a link, a repo, or a real write-up.
2. **Speak to *my* situation, not a template:** one concrete thing you'd rebuild here and why, and whether you'd keep my approach or replace it.
3. **Back it up.** I'm looking for someone who has actually done this.

Generic, copy-pasted proposals won't get a response. No hard feelings, I just won't reply. Also, I will be getting a lot of proposals, so at a certain point there is simply over-saturation. I will try my best to answer all of them, but it is usually first come, first served. Thank you!

## How we'd work

I'll share the architecture and the problems openly (most of it is above). Deeper codebase access comes after we've had a deeper conversation. The one piece I protect at first is the "voice," the part that makes it feel human.

Hourly: Less than 30 hrs/week
Duration: 1-3 months
Experience Level: Intermediate
Hourly rate: $15.00 - $35.00
Remote Job
Project Type: Ongoing project

Skills and Expertise

Mandatory skills

JavaScript

PHP

Nice-to-have skills

Android App Development

Activity on this job

Proposals: 50+
Last viewed by client: 2 weeks ago
Hires: 1
Interviewing: 1
Invites sent: 0
Unanswered invites: 0

About the client

Member since Mar 10, 2016

Germany
Gundelsheim, 7:07 AM
$2.7K total spent
10 hires, 3 active
82 hours
