Senior Data Engineer — Review & Harden a Python/Iceberg Ingestion Pipeline (Pre-Pilot)

Posted 3 weeks ago

Only freelancers located in the U.S. may apply.

Summary

DESCRIPTION:

I'm building a data infrastructure product for ontology-driven AI context: object types, properties, and relationships materialized ahead of query time, so AI systems retrieve connected context fast instead of rebuilding it from raw sources on every request. I need experienced eyes on the ingestion foundation before anything gets built on top of it.

The deliverables are fixed (below); hours are flexible — propose what you think the work honestly takes.

Rate: my budget is $50–75/hr. That's a hard ceiling — I can't afford proposals above that range and won't consider them, regardless of quality.

__________________________________________________________________________
WHO SHOULD APPLY

A data engineer / data infrastructure engineer who understands what an ontology and a knowledge graph are and why they matter for AI systems — connected entities and relationships as first-class context, not just tables. You don't need graph database experience; you need to get why pre-materialized, relationship-aware data beats rebuilding context from raw sources on every query. If that framing clicks for you, you're the right kind of applicant.

__________________________________________________________________________
THE PRODUCT, HIGH LEVEL:

The platform deploys on a client's own infrastructure — we never see their data. Clients connect their data sources, define an ontology (object types, properties, relationships), and the platform materializes it across tiered storage. Later phases add a binary serve layer, SSD/RAM caching, and GPU-parallel query execution so AI systems and data applications retrieve connected context at very low latency.

Target customers: companies running AI on complex connected data (security operations, healthcare, financial services) where privacy demands private deployment and speed matters.

Storage note: the current prototype uses Iceberg on GCS for development convenience, but the architecture is intentionally built for any S3-compatible storage (on-prem S3, private cloud VPC, MinIO, etc.). Portability is a design requirement, not an afterthought — the platform must never be tied to a single cloud provider.

__________________________________________________________________________
WHAT EXISTS TODAY:

A working Python prototype: FastAPI, PyIceberg, PyArrow, Postgres, Supabase (metadata + sync ledger), GCS as the Iceberg warehouse. Architecture and design docs are provided for orientation.

The cold path is functional and tested: a 31-test production suite ran against live infrastructure at 1M–5M row scale — core correctness, concurrency, failure injection (kill mid-sync, storage outages, lease expiry), idempotency/replay, rollback, a 50-sync soak, and audit checks. All passing, with a written sign-off document you'll receive.

That's exactly why I'm hiring you: tests confirm behavior I anticipated. You're here for what I didn't anticipate — structural weaknesses, hidden risks, and edge cases that a test suite written by the same mind that wrote the pipeline can't catch. I'm strong on product and systems design, not low-level data engineering. The codebase is AI-assisted, and I want a professional to find what that typically accumulates.

This is a prototype built from the ground up — no live client today. The goal: ensure the ingestion foundation is genuinely solid (data coming in from source correctly, at scale, repeatedly) so a scoped MVP pilot and beta release won't break under real usage. You are validating the foundation before anything gets built on top.

__________________________________________________________________________
YOUR SCOPE — THE COLD PATH, END TO END

Data source → validation → identity merge → materialized ontology in Iceberg on S3-compatible storage. The data connectors are in scope — they ARE Milestone 1.

The platform supports exactly three ways data comes in, and your job includes confirming each one is genuinely production-grade, not just demo-grade:

1) Postgres — full refresh and incremental watermark sync
2) S3-compatible object storage (CSV) — currently GCS via S3 interop, but must work against any S3-compatible store (on-prem, MinIO, private VPC)
3) Manual CSV upload — primarily for testing/onboarding

For each connector, production-grade means: real error handling (bad credentials, unreachable source, permission failures, malformed/garbage data, schema drift), clear failure messages that tell a user what broke, no silent partial ingests, and sane retry/recovery behavior. If a connector swallows errors, loses rows quietly, or fails confusingly — that's exactly the finding I'm paying for.

No other connectors are planned for this milestone. Three connectors that work correctly under stress beat ten that mostly work.

Focus areas across the pipeline:

Connectors — production-readiness and error handling as described above
Identity & matching — entities staying consistent across syncs (PK merge, fingerprint mode, composite keys)
Sync semantics — full refresh vs incremental watermark sync, replay idempotency, delete behavior
Relationships — FK→PK edge materialization, rebuild triggers, orphan handling, stable node identity
Versioning & audit — Iceberg snapshots, rollback, schema change lineage, sync ledger completeness
Reliability — failure modes, partial writes, lock/lease behavior, silent wrong-data risks
Code structure — dead code, duplication, coupling, fragility; source-specific logic must stay contained in each connector and never leak into the shared pipeline

Explicitly out of scope: GPU execution, query kernels, binary serve formats, caching layers, query-time serving, and any new connector types — all future phases. Your scope ends at correct, versioned, audited data in Iceberg.

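To make "no silent partial ingests" and "clear failure messages" concrete, here is a minimal illustrative sketch for the CSV path. It is not taken from the repo: `ConnectorError`, `SchemaDriftError`, `MalformedDataError`, `IngestResult`, and `ingest_csv` are hypothetical names, and the real connectors may be structured quite differently. The idea is that a file is either staged in full or rejected with an error that says what broke and where.

```python
import csv
import io
from dataclasses import dataclass


class ConnectorError(Exception):
    """Hypothetical base class: every connector failure carries a user-facing message."""


class SchemaDriftError(ConnectorError):
    """The source's columns no longer match what the ontology expects."""


class MalformedDataError(ConnectorError):
    """A row could not be parsed into the expected shape."""


@dataclass(frozen=True)
class IngestResult:
    rows_read: int
    rows_staged: int


def ingest_csv(text: str, expected_columns: list[str]) -> tuple[IngestResult, list[dict]]:
    """Stage every row or raise; never return a partial batch."""
    reader = csv.DictReader(io.StringIO(text))
    header = list(reader.fieldnames or [])
    if header != expected_columns:
        raise SchemaDriftError(
            f"CSV header {header} does not match expected columns {expected_columns}"
        )

    staged: list[dict] = []
    rows_read = 0
    for row in reader:
        rows_read += 1
        # csv.DictReader stores surplus cells under the key None and fills
        # missing cells with the value None, so both cases are detectable.
        if None in row or any(value is None for value in row.values()):
            raise MalformedDataError(
                f"line {reader.line_num}: expected {len(expected_columns)} fields"
            )
        staged.append(row)

    # The caller commits `staged` only after this function returns without raising.
    return IngestResult(rows_read=rows_read, rows_staged=len(staged)), staged
```

Because validation finishes before anything is handed to the write step, a bad row at line 900,000 fails the whole batch loudly instead of leaving 899,999 rows quietly committed.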
__________________________________________________________________________
DELIVERABLES (in priority order)

1) Prioritized written assessment — what's pilot-ready as-is, what must be fixed before a real pilot customer (with specific recommendations), and what the existing test suite missed (edge cases, risks, gaps).
2) Active code changes — implement fixes for the highest-priority issues you find, directly in the repo. You'll have full repo access. I'm open to architecture changes and refinements as long as they're clearly explained with reasoning.
3) A change log that teaches — for every change: what you changed, why it mattered, what it fixes or prevents, and what to watch for going forward. This isn't paperwork — I'm making a local engineering hire for the next milestone, and your write-ups become the onboarding record. Everyone who touches this codebase after you should learn from what you found.

Fixes go deepest-risk-first.

What you get from me: repo access, architecture/design docs, the test suite + sign-off report, and async availability for questions.

__________________________________________________________________________
REQUIRED EXPERIENCE:

1) Production Python data pipelines
2) Apache Iceberg, Delta Lake, or Hudi (or strong Parquet/data-lake work)
3) Postgres
4) Merge/upsert, idempotency, watermark/CDC patterns
5) Building or hardening data connectors that real users depend on

__________________________________________________________________________
WHERE THIS CAN GO:

This starts as a fixed-scope review. Separately, I plan to make my first part-time/full-time engineering hire locally (Dallas) to build Milestone 2 and beyond — SSD caching, serve layers, containerization, and microservices as the platform scales. For the right freelancer, there's opportunity to stay engaged on recurring scoped work — reviewing the foundation as it evolves and working in conjunction with that future hire.
Not required, not promised — but the door is open if the work is strong.

__________________________________________________________________________
HOW TO APPLY — READ CAREFULLY

Answer this one question in your proposal, briefly and in your own words:

"You're building a pipeline that ingests from Postgres and S3-compatible storage and materializes a connected ontology (entities + relationships) into Iceberg. How do you design the sync process to be reliable and idempotent — especially around watermarking, commits, and failure handling between steps?"

Include your proposed hour estimate for the deliverables above. Get creative — attachments and notes welcome.

Note on AI-generated proposals: I use AI heavily myself — but if your proposal or screening answer is clearly AI-generated boilerplate, you will be automatically rejected without consideration. I'm hiring your judgment and experience, not your ability to paste a prompt. Keep answers short, direct, and human.

__________________________________________________________________________
A NOTE ON TECHNOLOGY BOUNDARIES (QUICK EXAMPLE):

FastAPI and Iceberg are what the platform uses today, not permanent decisions. As the product scales, we may want to run FastAPI alongside a second framework, replace it entirely, or eventually move away from Iceberg toward a custom storage format optimized for the GPU serve layer. Those should be engineering decisions made on merit, not decisions we're forced into because the current code makes swapping painful.

What I need confirmed: is the codebase modular enough that a change like that stays contained? Core business logic (validate, merge, materialize, version) should never be tangled directly with infrastructure. API routes should be thin entry points that hand off to service logic, not where business logic lives. Iceberg writes should be isolated behind a single abstraction.

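As a rough picture of what "isolated behind a single abstraction" could look like, the sketch below keeps business logic dependent only on a small writer interface. Everything here is hypothetical and for illustration only (`TableWriter`, `InMemoryWriter`, `materialize_customers`, the `customers` table); it does not describe the current codebase. An Iceberg-backed class would implement the same interface, and swapping the storage format would mean replacing that one class.

```python
from typing import Protocol


class TableWriter(Protocol):
    """Hypothetical storage boundary: the only interface service logic may touch."""

    def append(self, table: str, rows: list[dict]) -> int:
        """Persist rows and return a snapshot/version number."""
        ...


class InMemoryWriter:
    """Stand-in implementation; an Iceberg-backed writer would expose the same method."""

    def __init__(self) -> None:
        self.tables: dict[str, list[dict]] = {}
        self._snapshot = 0

    def append(self, table: str, rows: list[dict]) -> int:
        self.tables.setdefault(table, []).extend(rows)
        self._snapshot += 1
        return self._snapshot


def materialize_customers(rows: list[dict], writer: TableWriter) -> int:
    """Service logic: validate first, then hand off to whatever writer was injected."""
    missing = [r for r in rows if r.get("id") is None]
    if missing:
        raise ValueError(f"{len(missing)} row(s) missing 'id'; nothing was written")
    return writer.append("customers", rows)
```

An API route under this layout would only parse the request, call `materialize_customers`, and translate the result or the exception into a response.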
If those boundaries are clean, replacing or extending a technology layer is a focused engineering effort. If they're not, it touches everything and becomes a mess under deadline pressure with a full team. Flag anywhere that boundary is broken. That's a priority finding.

__________________________________________________________________________
FINAL REMARKS: NDA & IP protections

This engagement requires signing an NDA and an IP assignment agreement before work begins; these are standard protections given you'll have full repo access to a pre-launch product. Documents are provided on day one; nothing unusual in them. If that's a dealbreaker, please don't apply.

Hourly: Less than 30 hrs/week
Duration: 3-6 months
Experience Level: Expert
$50.00 - $75.00 Hourly
Remote Job
Project Type: Ongoing project

Contract-to-hire opportunity

This lets talent know that this job could become full time.

Skills and Expertise

Mandatory skills

Python

ETL Pipeline

Apache Parquet

Activity on this job

Proposals: 10 to 15
Last viewed by client: 2 weeks ago
Interviewing: 3
Invites sent: 0
Unanswered invites: 0

About the client

Member since Oct 10, 2025

USA
Dallas 5:51 AM
$368 total spent
4 hires, 1 active
7 hours
