Python/PostgreSQL Engineer for Data Pipeline Productionization, QA, and Scale-Up

Posted 3 weeks ago

Worldwide

Summary

We have completed a proof-of-concept data pipeline for IP/patent analytics. The current system processes patent records from a PostgreSQL database and includes two core workflows:

- Assignee standardization: cleans and canonicalizes company/organization names using normalization, fuzzy matching, embeddings, and optional LLM review.
- Patent abstract translation: translates Japanese patent titles/abstracts into English using the DeepL API and stores translation status/metadata.

The next phase should add a third workflow, patent document ingestion: ingesting Japanese, Korean, and Chinese patent PDFs, extracting structured text/metadata, and routing extracted content into the existing translation and normalization workflows.

We are now looking for an experienced Python data engineer to take the pilot into its next phase: harden the pipeline, add a robust PDF ingestion workflow for Japanese, Korean, and Chinese patent documents, improve reliability and evaluation, prepare it for larger-scale batch runs, and make the workflow easier to operate and audit.

You will work from an existing Python codebase and technical documentation. The first job is not to rebuild the system from scratch. It is to review what exists, identify the highest-risk gaps, and productionize the current pilot into a cleaner, repeatable workflow.

Primary responsibilities:

- Review the existing repository, docs, Docker setup, PostgreSQL schema, and pilot scripts.
- Improve the assignee standardization pipeline for larger batches and repeatable evaluation.
- Improve the translation pipeline for resumable, logged, retry-safe DeepL batch translation.
- Design and implement an ingestion workflow for Japanese, Korean, and Chinese patent PDFs.
- Extract text, page structure, patent identifiers, titles, abstracts, claims, applicants/assignees, publication/application numbers, dates, and language metadata where available.
- Handle both text-native PDFs and scanned PDFs using OCR where needed.
- Route Japanese, Korean, and Chinese extracted text into translation workflows and preserve links back to source PDF/page locations.
- Add or refine database migrations/schema setup for canonical assignees, translated abstracts, and status logs.
- Add database tables or columns for source documents, extracted sections, OCR confidence, extraction status, and processing errors.
- Add stronger QA/evaluation reporting for both assignee matching and translation outputs.
- Make the pipeline safer to run repeatedly without accidentally dropping useful data.
- Improve configuration handling, secrets handling, logging, and operational documentation.
- Add focused automated tests around core matching, translation, database write, and failure-handling paths.
- Produce a short handover document explaining how to run, monitor, and validate the workflows.

Current Stack:

- Python 3.11/3.12
- PostgreSQL
- Docker / Docker Compose
- Polars
- asyncpg
- httpx / asyncio
- DeepL API
- RapidFuzz / fuzzy matching
- Sentence embeddings / semantic matching
- Optional LLM-based canonicalization
- PDF parsing/OCR tooling such as PyMuPDF, pdfplumber, Tesseract, PaddleOCR, EasyOCR, or cloud/on-prem OCR alternatives
- Japanese/Korean/Chinese text handling and Unicode normalization
- mdBook technical documentation

Desired Deliverables

1. Codebase review and implementation plan
- Short written assessment of current pipeline state.
- Risks, quick wins, and proposed implementation sequence.

2. Production-ready pipeline improvements
- Resumable assignee and translation runs.
- Idempotent database writes where appropriate.
- Safer table creation/migration behavior.
- Better batch-level logging and error reporting.
- Clear confidence/needs-review outputs.

3. Japanese/Korean/Chinese PDF ingestion workflow
- Ingest JP/KR/CN patent PDFs from a local folder, object store, or configured source location.
- Detect whether each PDF has embedded text or requires OCR.
- Extract document-level metadata and section-level text.
- Preserve source traceability: filename, document ID, page number, extraction method, OCR confidence, and processing status.
- Store extracted content in PostgreSQL in a format suitable for downstream translation, assignee matching, and patent analytics.
- Provide a retry-safe status log for failed or partial extractions.

4. Evaluation and QA
- Assignee matching accuracy report against the gold list.
- Translation run summary with success/failure counts and sample QA output.
- PDF extraction QA report for Japanese, Korean, and Chinese samples, including extraction success rate, OCR confidence, section detection quality, and known failure modes.
- Clear list of edge cases and known limitations.

5. Testing
- Automated tests for core normalization, matching, translation status handling, and database write logic.
- Tests or fixtures for text-native PDFs, scanned PDFs, malformed PDFs, and missing metadata cases.
- Instructions for running tests locally and in Docker.

6. Documentation
- Updated setup/runbook docs.
- PDF ingestion setup and troubleshooting guide.
- Handover notes for operating the pipeline on a secure cloud VM.

You should have strong experience with:

- Python data pipelines and batch processing.
- PostgreSQL schema design and data write patterns.
- Dockerized development environments.
- Async Python, API integrations, retries, and rate limits.
- Data quality evaluation and practical QA reporting.
- Fuzzy matching, entity resolution, or record linkage.
- PDF parsing, OCR, and text extraction workflows.
- Japanese, Korean, and/or Chinese document processing.
- Writing clean, maintainable code in an existing repo.

Nice to have:

- Patent data experience.
- NLP/entity canonicalization experience.
- DeepL or translation API experience.
- Experience with Japanese Patent Office, Korean Intellectual Property Office, or China National Intellectual Property Administration patent documents.
- Experience extracting structured sections from multilingual patent PDFs.
- GDPR-conscious infrastructure or European/Swiss cloud deployment experience.
- LLM-assisted data processing experience.
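
As an illustration of what "resumable" and "idempotent" mean in the deliverables above, here is a minimal sketch. It is not taken from the existing codebase. The `translation_status` table, its columns, and the `run_batch` helper are hypothetical names, the `translate` callable stands in for a DeepL request, and Python's built-in `sqlite3` is used in place of PostgreSQL only so the example runs without a server.

```python
import sqlite3

# Hypothetical status table: one row per patent abstract, keyed by patent_id.
SCHEMA = """
CREATE TABLE IF NOT EXISTS translation_status (
    patent_id   TEXT PRIMARY KEY,
    status      TEXT NOT NULL,              -- 'done' or 'failed'
    translated  TEXT,
    attempts    INTEGER NOT NULL DEFAULT 0
)
"""

# Upsert: inserting the same patent_id twice updates the row instead of
# duplicating it, so repeated runs cannot create duplicate records.
UPSERT = """
INSERT INTO translation_status (patent_id, status, translated, attempts)
VALUES (?, ?, ?, 1)
ON CONFLICT (patent_id) DO UPDATE SET
    status = excluded.status,
    translated = excluded.translated,
    attempts = translation_status.attempts + 1
"""


def run_batch(conn, records, translate):
    """Translate records not yet marked 'done'; safe to call repeatedly."""
    done = {
        row[0]
        for row in conn.execute(
            "SELECT patent_id FROM translation_status WHERE status = 'done'"
        )
    }
    for patent_id, text in records:
        if patent_id in done:
            continue  # resumable: skip work finished in an earlier run
        try:
            result, status = translate(text), "done"
        except Exception:
            # Sketch only: a real pipeline would catch specific API errors
            # and log the failure reason alongside the status.
            result, status = None, "failed"
        conn.execute(UPSERT, (patent_id, status, result))
        conn.commit()  # commit per record so a crash loses at most one item
```

Calling `run_batch` again with the same records skips rows already marked `done` and retries only the failed ones. PostgreSQL accepts the same `INSERT ... ON CONFLICT ... DO UPDATE` form; with asyncpg the placeholders would be `$1`, `$2`, `$3` rather than `?`.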

Hours: More than 30 hrs/week (hourly)
Duration: 1-3 months
Experience Level: Intermediate
Hourly rate: $10.00 - $25.00
Remote Job
Project Type: Ongoing project

Contract-to-hire opportunity

Skills and Expertise

Mandatory skills

Docker

Python

ETL Pipeline

Activity on this job

Proposals: 20 to 50
Last viewed by client: 3 weeks ago
Interviewing: 8
Invites sent: 10
Unanswered invites: 1

About the client

Member since May 22, 2025

Hyderabad, India (5:26 PM local time)
$542 total spent
3 hires, 1 active
9 hours
