Top 1% AI Engineer - Build an LLM-Based Knowledge OS

Posted last month

Worldwide

Summary

# Top 1% AI Engineer - Build a Company Knowledge OS

## Company Overview

We are a fast-growing AI-native firm working with executives, operators, and enterprise teams to redesign how mission-critical work gets done with AI. We move quickly, care deeply about high-quality execution, and build practical systems where every meeting, decision, and signal makes the next one faster, sharper, and more reliable. Internally we operate on top of a custom-built LLM knowledge operating system that is already in daily production use.

## Opportunity

We are looking for an exceptional AI engineer (top one percent, founding-engineer caliber) to help us architect, harden, and extend our internal Knowledge Operating System and ship a steady stream of agent-powered tools on top of it. You will be a true thought partner to the founder on the system itself, and the agents and tools you build will compound how the firm operates. This is a high-leverage opportunity for someone who is genuinely AI-obsessed, opinionated about ontology and context engineering, and energized by building things that get smarter every time they run. Outstanding performers will be considered for expanded or longer-term opportunities, including potential full-time conversion.

## Scope of Work

- Harden the Knowledge OS substrate: markdown-as-source-of-truth plus a Postgres or DuckDB projection layer for fast graph queries, idempotent sync pipelines, schema evolution that does not break existing files, and drift detection.
- Improve the agent harness layer: Cursor and Claude Code rules, skills, hooks, and commands; agent loops with proper context assembly, tool design, and evaluation; long-running sandbox agents with snapshot and resume; MCP servers that expose the graph cleanly.
- Build the memory architecture: working, thread, per-user, and organization-level memory layers, composed via tools rather than collapsed into a single retrieval bag.
- Architect retrieval that earns its keep: hybrid lexical, vector, graph, and frontmatter-aware retrieval, reranking, confidence-aware synthesis, and end-to-end provenance preservation.
- Stand up evaluation and observability: OpenTelemetry traces from Claude Desktop and Cursor into Phoenix or LangFuse, surfacing redundant agents, skill candidates, and regression patterns.
- Implement closed-loop improvement: skill discovery from real usage data, workflow drift detection, and a claim lifecycle (created, reinforced, contradicted, superseded, resolved) that actually fires in production.
- Ship cool side-quests on top of the system: an autonomous email-drafting harness, an agent-authored daily briefing that gets sharper every week, a meeting-prep generator, a pattern-promotion engine that finds cross-context insights a human would miss, and other agents that earn their place by saving real time on real work.
- Write a lot of Python. Write a lot of YAML. Write a lot of Markdown. Write Cursor rules, Claude skills, and MCP servers as fluently as you write code. Think hard about ontology, provenance, and the shape of agent context, because that is where the system either becomes a moat or becomes another note app.

## Must-Haves

- Demonstrated experience personally architecting at least three production systems involving real LLMs, real users, and real evaluation. Not proofs of concept.
- Deep Python fluency: data pipelines, async, schema validation, parsing. This is the load-bearing language.
- Hands-on LLM application engineering: Anthropic and OpenAI APIs, function calling, structured output, streaming, and eval design.
- Real fluency in modern agent harnesses: Cursor, Claude Code, OpenCode, or comparable. You have shipped agent loops in production, not just demos.
- MCP fluency: you can write servers and tools, not just consume them.
- Retrieval systems experience: embeddings, vector stores such as pgvector, Qdrant, or Weaviate, reranking, and hybrid search.
- Data engineering chops: Postgres, DuckDB, or SQLite, schema design, idempotent ETL, and event-sourced thinking.
- Knowledge representation instincts: graph models, ontologies, entity resolution, claim and evidence modeling. You have thought hard about temporal data.
- Opinionated systems thinking: you push back when the brief is wrong about ontology, schema, or agent design, and you back it up with reasoning and code.
- Speed: you operate in one to two week prototype cycles. Ship, evaluate, iterate. No quarterly-planning theater.
- You read doctrine carefully before writing code. The system has a written architectural doctrine; if your instinct is to skim and ship, this is not the gig.

## Nice-to-Haves

- Observability tooling experience: OpenTelemetry, Phoenix, LangFuse, or comparable eval harnesses.
- Sandbox runtimes: E2B, Modal, or comparable; you have run untrusted long-running agent code in production.
- TypeScript, React, or Next.js for any UI surfaces we end up shipping. Not the main job, but you can do it when needed.
- Experience with services-to-platform extractions, internal tools that became external products, or founding-engineer work at an AI-first company.
- A point of view on autarky versus managed services: you instinctively distrust anything that locks the canonical memory layer to a vendor, and can articulate why.
- Familiarity with confidentiality and audience-routing constraints in multi-tenant or multi-audience knowledge systems.

## What We're Looking For in a Person

We are looking for a founding-engineer-caliber builder who is genuinely obsessed with AI, opinionated about systems, and energized by the problem of making a business legible to itself. The right person can hold provenance, confidence, temporal context, and audience boundaries in their head simultaneously without dropping any of them. They write production code regularly; they do not just review it. They ask sharp clarifying questions, propose better solutions when the brief is suboptimal, and ship things that get smarter every time they run.

This person should be equally comfortable architecting a graph schema, writing an MCP server, debugging a sandbox memory leak, and arguing with the founder about whether a claim lifecycle transition is correctly modeled. They should care about craft, reliability, evaluation, and the experience of the people and agents using the system. Low ego, high bias for action, high standards.

## Screening Questions

1. In one paragraph (max five sentences), describe the most architecturally interesting LLM-based system you have personally built. Not what your team built, but what you built. Tell us what was hard about it and what you would do differently next time.
2. What is your point of view on agent memory architecture? Specifically, how do you separate working memory, thread memory, per-user memory, and organization-level memory in a system you have built or would build?

Hours: More than 30 hrs/week
Duration: 3-6 months
Experience level: Expert
Hourly rate: $40.00 - $200.00
Location: Remote job
Project type: Ongoing project

Contract-to-hire opportunity


Skills and Expertise

Mandatory skills

API Integration

Artificial Intelligence

Activity on this job

Proposals: 50+
Last viewed by client: 3 weeks ago
Hires: 1
Interviewing: 15
Invites sent: 30
Unanswered invites: 9

About the client

Member since Jul 16, 2018

United States
New York, 11:47 AM local time
$299K total spent
84 hires, 21 active
5,881 hours
