PDF Price List Data Extraction Pipeline (AI/Python) — 30 Brands
Worldwide
I am part of a commercial kitchen equipment distribution business and manage price lists from 30+ international brands. Each brand provides an annual PDF price list (200–300 pages each) that contains a mix of product images, multilingual descriptions, technical drawings, and price tables. I need a professional Python developer with AI/LLM experience to build an automated extraction pipeline that pulls structured product and pricing data from these PDFs into a standardized Master Excel database, which will then feed into custom quotation software I am building.

THE PROBLEM
These PDFs are NOT simple text tables. They are professionally designed catalogues (exported from Adobe InDesign) with:
- Full-bleed marketing/image pages (must be skipped)
- Multilingual product description pages (4 languages: IT, EN, FR, DE)
- Technical drawing pages with dimensions
- Price table pages (the TARGET) containing: SKU code, model name, dimensions (mm), weight (kg), power specs (W/V/Hz), energy class, refrigerant gas, and list price (€)
Layouts vary significantly between brands. A traditional PDF text parser will not work reliably. This requires an AI Vision approach.

WHAT I NEED BUILT
A repeatable Python pipeline that does the following:
1. Page classification: Convert each PDF page to an image and use AI (GPT-4o Vision or Claude API) to classify each page as intro, spec, drawing, or price_table. Only price_table pages proceed.
2. Structured data extraction: Send each price_table page image to the AI Vision API with a structured prompt that returns clean JSON: SKU, model name, dimensions, weight, power, energy class, temperature range, list price.
3. Data normalization: A Python script cleans the output: standardizes units (mm, kg, W), handles multi-line model names, removes duplicate header rows, and validates numeric price fields.
4. Excel output: Exports to a Master Excel file with consistent columns across all brands.
5. Update-ready: The pipeline must be reusable. When a new annual price list arrives, I re-run it on just that PDF and the master database updates.

REQUIRED MASTER EXCEL COLUMNS
Brand | SKU / Item Code | Model Name | Product Family / Series | Product Category | Width (mm) | Depth (mm) | Height (mm) | Net Weight (kg) | Gross Weight (kg) | Volume (L) | Power Supply (V/Hz) | Power Consumption (W) | Refrigerant Gas | Energy Class | Temperature Range (°C) | List Price (€) | Currency | Price List Version | Price List Date | Source Page | Notes

DELIVERABLES
- Working Python script/pipeline (clean, commented code)
- Successfully extracted Excel output for 1 full brand PDF (POC first)
- Documentation on how to run the pipeline for each new brand
- Handover call to walk me through the process

PROJECT PHASES
- Phase 1 (POC), start here: Process 1 sample brand PDF (~280 pages). Deliver clean Excel output. I review accuracy. If it is 90%+ accurate, we proceed.
- Phase 2 (full rollout): Process the remaining 29 brand PDFs. Refine extraction prompts per brand layout. Deliver the final master database.

SKILLS REQUIRED
- Python (strong)
- OpenAI GPT-4o Vision API or Anthropic Claude API
- PDF processing (PyMuPDF, pdfplumber, pdf2image)
- JSON parsing and data normalization
- Excel/openpyxl output
- Experience with document AI / OCR pipelines

WHAT I WILL PROVIDE
- 2–3 sample brand PDFs to start
- The required Excel schema
- Clear feedback on extraction accuracy during QA
- API keys for OpenAI or Anthropic

BUDGET
- Phase 1 (POC): Fixed price, to be discussed
- Phase 2 (Full 30 brands): Discuss after POC approval
- API usage costs will be covered by me separately.
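
For reference, the data normalization step described under WHAT I NEED BUILT could look roughly like the sketch below. It is only an illustration, not a required design. The field names (sku, model_name, width_mm, list_price_eur, and so on), the HEADER_TOKENS set, and the number-format rule are all assumptions made for the example; the real keys would follow whatever JSON the extraction prompt returns. The parser assumes that a single separator followed by exactly three digits is a thousands separator (so "1.234" reads as 1234), which suits most prices but can misread values such as "0,850" and would need checking per brand.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical header labels that a vision model may return as if they were data.
HEADER_TOKENS = {"sku", "code", "codice", "model", "modello", "art."}
# Hypothetical numeric fields in the extracted JSON rows.
NUMERIC_FIELDS = ("width_mm", "depth_mm", "height_mm", "net_weight_kg", "list_price_eur")


def parse_number(raw: Optional[object]) -> Optional[float]:
    """Turn strings like '1.234,50 €', '1,234.50' or '800 mm' into a float."""
    if raw is None:
        return None
    text = re.sub(r"[^\d.,]", "", str(raw))  # drop units, currency signs, spaces
    if not any(ch.isdigit() for ch in text):
        return None
    last_comma, last_dot = text.rfind(","), text.rfind(".")
    if last_comma != -1 and last_dot != -1:
        # Both separators present: the right-most one is taken as the decimal mark.
        decimal = "," if last_comma > last_dot else "."
    else:
        sep = "," if last_comma != -1 else "."
        tail = text.rpartition(sep)[2]
        # Assumption: 'x.yyy' or repeated separators mean thousands grouping.
        decimal = None if (len(tail) == 3 or text.count(sep) > 1) else sep
    for ch in {",", "."} - {decimal}:
        text = text.replace(ch, "")
    if decimal:
        text = text.replace(decimal, ".")
    try:
        return float(text)
    except ValueError:
        return None


def normalize_rows(rows: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Return (clean_rows, rejected_rows); rejected rows are kept for manual QA."""
    clean, rejected = [], []
    for row in rows:
        sku = str(row.get("sku") or "").strip()
        if sku.lower() in HEADER_TOKENS:
            continue  # repeated table header, not a product
        out = dict(row)
        out["sku"] = sku
        # Collapse multi-line model names into a single line.
        out["model_name"] = " ".join(str(row.get("model_name") or "").split())
        for field in NUMERIC_FIELDS:
            out[field] = parse_number(row.get(field))
        price = out["list_price_eur"]
        if not sku or price is None or price <= 0:
            rejected.append(out)
        else:
            clean.append(out)
    return clean, rejected


sample = [
    {"sku": "Code", "model_name": "Model", "list_price_eur": "Price"},
    {"sku": "AB-100", "model_name": "Blast\nChiller 10", "width_mm": "800 mm",
     "list_price_eur": "1.234,50 €"},
    {"sku": "AB-200", "model_name": "Counter", "list_price_eur": "on request"},
]
clean, rejected = normalize_rows(sample)
```

Rows that fail validation are returned separately rather than dropped, so they can be reviewed during QA instead of silently disappearing from the master file.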
$500.00 Fixed-price
- Experience Level: Expert
- Remote Job
- Project Type: Complex project
Activity on this job
- Proposals: 50+
- Interviewing: 0
- Invites sent: 0
- Unanswered invites: 0
About the client
- United Arab Emirates, 6:42 PM