Python/PostgreSQL Engineer for Data Pipeline Productionization, QA, and Scale-Up
Worldwide
We have completed a proof-of-concept data pipeline for IP/patent analytics. The current system processes patent records from a PostgreSQL database and includes two core workflows:

- Assignee standardization: cleans and canonicalizes company/organization names using normalization, fuzzy matching, embeddings, and optional LLM review.
- Patent abstract translation: translates Japanese patent titles/abstracts into English using the DeepL API and stores translation status/metadata.

The next phase should add a third workflow:

- Patent document ingestion: support for ingesting Japanese, Korean, and Chinese patent PDFs, extracting structured text/metadata, and routing extracted content into the existing translation and normalization workflows.

We are now looking for an experienced Python data engineer to take the pilot into its next phase: harden the pipeline, add a robust PDF ingestion workflow for Japanese, Korean, and Chinese patent documents, improve reliability and evaluation, prepare it for larger-scale batch runs, and make the workflow easier to operate and audit.

You will work from an existing Python codebase and technical documentation. The first job is not to rebuild the system from scratch. It is to review what exists, identify the highest-risk gaps, and productionize the current pilot into a cleaner, repeatable workflow.

Primary responsibilities:
- Review the existing repository, docs, Docker setup, PostgreSQL schema, and pilot scripts.
- Improve the assignee standardization pipeline for larger batches and repeatable evaluation.
- Improve the translation pipeline for resumable, logged, retry-safe DeepL batch translation.
- Design and implement an ingestion workflow for Japanese, Korean, and Chinese patent PDFs.
- Extract text, page structure, patent identifiers, titles, abstracts, claims, applicants/assignees, publication/application numbers, dates, and language metadata where available.
- Handle both text-native PDFs and scanned PDFs using OCR where needed.
- Route Japanese, Korean, and Chinese extracted text into translation workflows and preserve links back to source PDF/page locations.
- Add or refine database migrations/schema setup for canonical assignees, translated abstracts, and status logs.
- Add database tables or columns for source documents, extracted sections, OCR confidence, extraction status, and processing errors.
- Add stronger QA/evaluation reporting for both assignee matching and translation outputs.
- Make the pipeline safer to run repeatedly without accidentally dropping useful data.
- Improve configuration handling, secrets handling, logging, and operational documentation.
- Add focused automated tests around core matching, translation, database write, and failure-handling paths.
- Produce a short handover document explaining how to run, monitor, and validate the workflows.

Current Stack:
- Python 3.11/3.12
- PostgreSQL
- Docker / Docker Compose
- Polars
- asyncpg
- httpx / asyncio
- DeepL API
- RapidFuzz / fuzzy matching
- Sentence embeddings / semantic matching
- Optional LLM-based canonicalization
- PDF parsing/OCR tooling such as PyMuPDF, pdfplumber, Tesseract, PaddleOCR, EasyOCR, or cloud/on-prem OCR alternatives
- Japanese/Korean/Chinese text handling and Unicode normalization
- mdBook technical documentation

Desired Deliverables

1. Codebase review and implementation plan
- Short written assessment of current pipeline state.
- Risks, quick wins, and proposed implementation sequence.

2. Production-ready pipeline improvements
- Resumable assignee and translation runs.
- Idempotent database writes where appropriate.
- Safer table creation/migration behavior.
- Better batch-level logging and error reporting.
- Clear confidence/needs-review outputs.

3. Japanese/Korean/Chinese PDF ingestion workflow
- Ingest JP/KR/CN patent PDFs from a local folder, object store, or configured source location.
- Detect whether each PDF has embedded text or requires OCR.
- Extract document-level metadata and section-level text.
- Preserve source traceability: filename, document ID, page number, extraction method, OCR confidence, and processing status.
- Store extracted content in PostgreSQL in a format suitable for downstream translation, assignee matching, and patent analytics.
- Provide a retry-safe status log for failed or partial extractions.

4. Evaluation and QA
- Assignee matching accuracy report against the gold list.
- Translation run summary with success/failure counts and sample QA output.
- PDF extraction QA report for Japanese, Korean, and Chinese samples, including extraction success rate, OCR confidence, section detection quality, and known failure modes.
- Clear list of edge cases and known limitations.

5. Testing
- Automated tests for core normalization, matching, translation status handling, and database write logic.
- Tests or fixtures for text-native PDFs, scanned PDFs, malformed PDFs, and missing metadata cases.
- Instructions for running tests locally and in Docker.

6. Documentation
- Updated setup/runbook docs.
- PDF ingestion setup and troubleshooting guide.
- Handover notes for operating the pipeline on a secure cloud VM.

You should have strong experience with:
- Python data pipelines and batch processing.
- PostgreSQL schema design and data write patterns.
- Dockerized development environments.
- Async Python, API integrations, retries, and rate limits.
- Data quality evaluation and practical QA reporting.
- Fuzzy matching, entity resolution, or record linkage.
- PDF parsing, OCR, and text extraction workflows.
- Japanese, Korean, and/or Chinese document processing.
- Writing clean, maintainable code in an existing repo.

Nice to have:
- Patent data experience.
- NLP/entity canonicalization experience.
- DeepL or translation API experience.
- Experience with Japanese Patent Office, Korean Intellectual Property Office, or China National Intellectual Property Administration patent documents.
- Experience extracting structured sections from multilingual patent PDFs.
- GDPR-conscious infrastructure or European/Swiss cloud deployment experience.
- LLM-assisted data processing experience.
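
For orientation, the normalization step of the assignee standardization workflow could look roughly like the sketch below. This is not the pilot's actual code: the function name `normalize_assignee`, the `LEGAL_SUFFIXES` list, and the treatment of 株式会社 are all assumptions made for illustration, and the existing repository may handle these cases differently. It uses only the Python standard library and produces a comparison key that a later fuzzy-matching or embedding step could consume.

```python
import re
import unicodedata

# Hypothetical list of Latin-script legal-form suffixes (illustration only).
# Multi-word suffixes come first so "co ltd" is removed as a unit.
LEGAL_SUFFIXES = ("co ltd", "corporation", "corp", "inc", "ltd")

# Hypothetical Japanese legal-form marker; it can appear before or after the name.
JP_LEGAL_FORM = "株式会社"


def normalize_assignee(name: str) -> str:
    """Return a lowercase comparison key for an assignee name (sketch only)."""
    # NFKC folds full-width Latin letters, digits, punctuation, and the
    # ideographic space into their ordinary ASCII counterparts.
    text = unicodedata.normalize("NFKC", name).casefold()
    # Drop the Japanese legal-form marker wherever it occurs.
    text = text.replace(JP_LEGAL_FORM, " ")
    # Replace punctuation with spaces; \w keeps letters, digits, and CJK characters.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace.
    text = " ".join(text.split())
    # Repeatedly strip trailing legal-form suffixes that follow a space.
    stripped = True
    while stripped:
        stripped = False
        for suffix in LEGAL_SUFFIXES:
            if text.endswith(" " + suffix):
                text = text[: -len(suffix)].rstrip()
                stripped = True
                break
    return text


if __name__ == "__main__":
    for raw in ("Toyota Motor Co., Ltd.", "ＴＯＹＯＴＡ　ＭＯＴＯＲ　ＣＯＲＰ．", "トヨタ自動車株式会社"):
        print(repr(raw), "->", repr(normalize_assignee(raw)))
```

A key like this only groups spelling variants within one script. Linking "toyota motor" to "トヨタ自動車" would still depend on the fuzzy matching, embeddings, or LLM review mentioned in the workflow description.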
- Hourly: More than 30 hrs/week
- Duration: 1-3 months
- Experience Level: Intermediate
- Hourly rate: $10.00 - $25.00
- Remote Job
- Project Type: Ongoing project
Activity on this job
- Proposals: 20 to 50
- Last viewed by client: 3 weeks ago
- Interviewing: 8
- Invites sent: 10
- Unanswered invites: 1
About the client
- IND, Hyderabad, 5:26 PM
- $542 total spent; 3 hires, 1 active
- 9 hours