Python/PostgreSQL Engineer for China Patent Data Pipeline Productionization, QA, and Scale-Up

Posted 6 days ago

Worldwide

Summary

We need a Python/PostgreSQL data engineer to implement the China-only patent data integration for an existing patent analytics platform. This is a fixed-price China milestone for IPPH patent data.

Note: Japan and Korea loads are excluded based on conversations with Krish and may be handled later as separate follow-on work.

The existing platform already has Python ingestion patterns, PostgreSQL, bronze/silver processing, MinIO/file-ingestion infrastructure, translation infrastructure, assignee standardization, and a dashboard. The goal is to extend the existing system, not rebuild it from scratch.

Total Budget: $800 fixed price

Milestone 1: IPPH File Ingestion, XML Parsing, and Initial Database Load
Budget: $300

Scope:
- Inspect the IPPH sample / initial delivery package structure.
- Use the existing MinIO/file-ingestion pattern.
- Handle package manifests and nested ZIP/XML packages.
- Parse key China patent fields where available:
  - publication identifiers
  - application identifiers
  - claims
  - claim numbers
  - independent/dependent claim indicators
  - claim counts
  - description sections
  - bibliographic metadata
  - legal status metadata
  - current owner / assignee metadata
  - applicant/inventor metadata
  - drawings metadata
  - rich citation fields
- Load raw and parsed data into PostgreSQL following the existing bronze/silver architecture.
- Preserve source traceability: source file, package date, document path, document ID, load timestamp, and processing status.

Acceptance Criteria:
- Provided IPPH sample files can be processed end to end.
- Parsed records are loaded into PostgreSQL or clearly structured for PostgreSQL loading.
- Key source fields are mapped and documented.
- Failed/partial records are logged with useful error messages.

Milestone 2: Delta Handling, Translation, and Assignee Standardization (Full Load)
Budget: $300

Scope:
- Implement CREATE/UPDATE/DELETE handling for the confirmed IPPH package format.
- Track processed packages/documents to avoid duplicate loads on rerun.
- Add retry-safe/idempotent behavior where practical.
- Integrate Chinese-to-English translation using the existing approved model endpoint / infrastructure.
- Store original Chinese text, English translation, translation status, model/prompt/version metadata, and errors.
- Integrate Chinese applicant/current-owner/assignee names into the existing assignee standardization pipeline.
- Preserve raw Chinese names and translated/normalized names.
- Add confidence/status fields or review flags where useful.

Acceptance Criteria:
- Rerunning the job does not duplicate already processed records.
- CREATE/UPDATE/DELETE records are handled according to the confirmed IPPH package semantics.
- Chinese text is routed through the agreed translation endpoint and stored with status metadata.
- Chinese assignee data flows through the existing standardization process.

Milestone 3: Dashboard Integration, QA, Tests, and Handover
Budget: $200

Scope:
- Make China data visible in the existing dashboard.
- Reuse existing dashboard patterns; no dashboard rebuild.
- Ensure China records can be filtered/viewed in relevant existing views.
- Surface key parsed fields and standardized assignee information where supported by the current dashboard.
- Add focused tests using sample files.
- Provide validation counts:
  - files processed
  - documents parsed
  - records loaded
  - translations attempted/succeeded/failed
  - assignee records processed
- Provide runnable setup instructions and short handover documentation.

Acceptance Criteria:
- All ingested China records after assignee standardization and translation are visible in the existing dashboard.
- Basic tests pass against sample data.
- A run summary/log is available for validation.
- Documentation is sufficient for another developer to run, monitor, and validate the pipeline.
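
To make the Milestone 1 expectations around nested ZIP/XML packages and source traceability more concrete, here is a minimal Python sketch. It is illustrative only: the real IPPH package layout and XML schema are not described in this posting, so the element names (`publication-number`, `claim`), the file names, the record keys, and the helper functions `iter_xml_documents`, `parse_document`, and `build_sample_package` are all hypothetical. It uses only the standard library and does not touch MinIO or PostgreSQL.

```python
import hashlib
import io
import xml.etree.ElementTree as ET
import zipfile
from typing import Iterator, Tuple


def iter_xml_documents(data: bytes, path: str = "") -> Iterator[Tuple[str, bytes]]:
    """Yield (document_path, xml_bytes) for each XML member, descending into nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in sorted(archive.namelist()):
            if name.endswith("/"):  # skip directory entries
                continue
            member_path = f"{path}/{name}" if path else name
            payload = archive.read(name)
            if name.lower().endswith(".zip"):
                yield from iter_xml_documents(payload, member_path)
            elif name.lower().endswith(".xml"):
                yield member_path, payload


def parse_document(document_path: str, xml_bytes: bytes) -> dict:
    """Parse one document; element names are placeholders, not the IPPH schema."""
    record = {
        "document_path": document_path,  # traceability back to the package
        "content_sha256": hashlib.sha256(xml_bytes).hexdigest(),  # possible dedupe key
        "status": "parsed",
        "error": None,
    }
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        # Failed records are kept with a useful message instead of being dropped.
        record.update(status="failed", error=str(exc))
        return record
    record["publication_number"] = root.findtext(".//publication-number")
    record["claim_count"] = len(root.findall(".//claim"))
    return record


def build_sample_package() -> bytes:
    """Build a tiny in-memory package (made-up data) with one nested ZIP."""
    good = (
        "<patent><publication-number>CN-SAMPLE-0001</publication-number>"
        "<claims><claim num='1'>a</claim><claim num='2'>b</claim></claims></patent>"
    )
    inner = io.BytesIO()
    with zipfile.ZipFile(inner, "w") as zf:
        zf.writestr("doc1.xml", good)
        zf.writestr("broken.xml", "<patent>")  # deliberately malformed
    outer = io.BytesIO()
    with zipfile.ZipFile(outer, "w") as zf:
        zf.writestr("batch/inner.zip", inner.getvalue())
    return outer.getvalue()


records = [parse_document(p, x) for p, x in iter_xml_documents(build_sample_package())]
for rec in records:
    print(rec["document_path"], rec["status"])
```

In a real implementation the content hash (or a package/document ID taken from the manifest) could serve as the key for the "do not duplicate on rerun" criterion in Milestone 2, but the actual key should follow the confirmed IPPH package semantics.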

Budget: $800.00 (Fixed-price)
Experience Level: Intermediate
Remote Job
Project Type: Ongoing project

Skills and Expertise

Mandatory skills

API Integration

Python

Activity on this job

Proposals: 10 to 15
Last viewed by client: 6 days ago
Hires: 1
Interviewing: 0
Invites sent: 1
Unanswered invites: 0

About the client

Member since May 22, 2025

IND
Hyderabad 10:05 PM
$542 total spent
4 hires, 2 active
9 hours
