PDF Price List Data Extraction Pipeline (AI/Python) — 30 Brands

Posted last week

Worldwide

Summary

I am part of a commercial kitchen equipment distribution business and manage price lists from 30+ international brands. Each brand provides an annual PDF price list (200–300 pages) that mixes product images, multilingual descriptions, technical drawings, and price tables. I need a professional Python developer with AI/LLM experience to build an automated extraction pipeline that pulls structured product and pricing data from these PDFs into a standardized Master Excel database, which will then feed a custom quotation software I am building.

THE PROBLEM

These PDFs are NOT simple text tables. They are professionally designed catalogues (exported from Adobe InDesign) with:

  • Full-bleed marketing/image pages (must be skipped)
  • Multilingual product description pages (4 languages: IT, EN, FR, DE)
  • Technical drawing pages with dimensions
  • Price table pages (the TARGET) containing: SKU code, model name, dimensions (mm), weight (kg), power specs (W/V/Hz), energy class, refrigerant gas, and list price (€)

Layouts vary significantly between brands. A traditional PDF text parser will not work reliably. This requires an AI Vision approach.

WHAT I NEED BUILT

A repeatable Python pipeline that does the following:

  1. Page classification: Convert each PDF page to an image and use AI (GPT-4o Vision or Claude API) to classify each page as intro, spec, drawing, or price_table. Only price_table pages proceed.
  2. Structured data extraction: Send each price_table page image to the AI Vision API with a structured prompt that returns clean JSON: SKU, model name, dimensions, weight, power, energy class, temperature range, list price.
  3. Data normalization: A Python script cleans the output: standardizes units (mm, kg, W), handles multi-line model names, removes duplicate header rows, and validates numeric price fields.
  4. Excel output: Exports to a Master Excel file with consistent columns across all brands.
  5. Update-ready: The pipeline must be reusable. When a new annual price list arrives, I re-run it on just that PDF and the master database updates.

REQUIRED MASTER EXCEL COLUMNS

Brand | SKU / Item Code | Model Name | Product Family / Series | Product Category | Width (mm) | Depth (mm) | Height (mm) | Net Weight (kg) | Gross Weight (kg) | Volume (L) | Power Supply (V/Hz) | Power Consumption (W) | Refrigerant Gas | Energy Class | Temperature Range (°C) | List Price (€) | Currency | Price List Version | Price List Date | Source Page | Notes

DELIVERABLES

  • Working Python script/pipeline (clean, commented code)
  • Successfully extracted Excel output for 1 full brand PDF (POC first)
  • Documentation on how to run the pipeline for each new brand
  • Handover call to walk me through the process

PROJECT PHASES

  • Phase 1 (POC), start here: Process 1 sample brand PDF (~280 pages). Deliver clean Excel output. I review accuracy. If the output is 90%+ accurate, we proceed.
  • Phase 2 (full rollout): Process the remaining 29 brand PDFs. Refine extraction prompts per brand layout. Deliver the final master database.

SKILLS REQUIRED

  • Python (strong)
  • OpenAI GPT-4o Vision API or Anthropic Claude API
  • PDF processing (PyMuPDF, pdfplumber, pdf2image)
  • JSON parsing and data normalization
  • Excel/openpyxl output
  • Experience with document AI / OCR pipelines

WHAT I WILL PROVIDE

  • 2–3 sample brand PDFs to start
  • The required Excel schema
  • Clear feedback on extraction accuracy during QA
  • API keys for OpenAI or Anthropic

BUDGET

  • Phase 1 (POC): Fixed price, to be discussed
  • Phase 2 (full 30 brands): To be discussed after POC approval
  • API usage costs will be covered by me separately.
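
To make the data normalization step described above more concrete, here is a minimal Python sketch of what it might look like. It is only an illustration under several assumptions: the input keys (sku, model_name, dimensions, list_price), the header-word list, and the function names are all hypothetical, dimensions are assumed to arrive as whole millimetres in width x depth x height order, and prices are assumed to carry at most two decimal places. The real JSON shape will depend on the extraction prompt agreed for each brand.

```python
import re
from typing import Optional

# Hypothetical header words that can reappear as "rows" when a table header
# is repeated on the page (assumed values, to be tuned per brand).
HEADER_TOKENS = {"sku", "code", "codice", "artikel", "model", "modello"}


def parse_price(raw) -> Optional[float]:
    """Parse '2.450,00 €', '1,234.56' or '980' into a float (None if no digits)."""
    cleaned = re.sub(r"[^\d.,]", "", str(raw if raw is not None else ""))
    if not re.search(r"\d", cleaned):
        return None
    # Assumption: a final separator followed by 1-2 digits is the decimal part;
    # every other '.' or ',' is treated as a thousands separator.
    match = re.fullmatch(r"([\d.,]*?)(?:[.,](\d{1,2}))?", cleaned)
    int_part = re.sub(r"[.,]", "", match.group(1)) or "0"
    dec_part = match.group(2) or "0"
    return float(f"{int_part}.{dec_part}")


def parse_dimensions(raw: str):
    """Split '700 x 800 x 2050 mm' into (width, depth, height) in whole mm."""
    numbers = re.findall(r"\d+", raw or "")
    if len(numbers) != 3:
        return None  # leave for manual review rather than guess
    return tuple(int(n) for n in numbers)


def normalize_rows(rows, brand: str, source_page: int):
    """Clean raw extracted rows and map them onto some master Excel columns."""
    cleaned_rows = []
    for row in rows:
        sku = " ".join(str(row.get("sku") or "").split())
        if not sku or sku.lower() in HEADER_TOKENS:
            continue  # skip blank rows and repeated header rows
        # Collapse multi-line model names into a single line.
        model = " ".join(str(row.get("model_name") or "").split())
        dims = parse_dimensions(str(row.get("dimensions") or ""))
        width, depth, height = dims if dims else (None, None, None)
        cleaned_rows.append({
            "Brand": brand,
            "SKU / Item Code": sku,
            "Model Name": model,
            "Width (mm)": width,
            "Depth (mm)": depth,
            "Height (mm)": height,
            "List Price (€)": parse_price(row.get("list_price")),
            "Currency": "EUR",
            "Source Page": source_page,
        })
    return cleaned_rows


if __name__ == "__main__":
    sample = [
        {"sku": "Code", "model_name": "Model", "dimensions": "mm", "list_price": "€"},
        {"sku": "AB-100", "model_name": "Fridge\n700 TN",
         "dimensions": "700 x 800 x 2050 mm", "list_price": "2.450,00 €"},
    ]
    for item in normalize_rows(sample, brand="ExampleBrand", source_page=12):
        print(item)
```

Rows whose price comes back as None, or whose dimensions cannot be split into three values, would be natural candidates for the Notes column and manual QA rather than silent correction.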

  • $500.00

    Fixed-price
  • Expert
    Experience Level
  • Remote Job
  • Complex project
    Project Type

Contract-to-hire opportunity

This lets talent know that this job could become full time.
Skills and Expertise
Mandatory skills
Python
Data Extraction
Activity on this job
  • Proposals: 50+
  • Interviewing: 0
  • Invites sent: 0
  • Unanswered invites: 0
About the client
Member since Jun 22, 2026
  • United Arab Emirates
    6:42 PM
