Parse PDFs into Structured JSON

Posted last week

Worldwide

Summary

We are looking for an experienced Python developer with strong skills in PDF parsing and data extraction to help us process a large batch of educational PDF files (exam papers) into a structured JSON format. The exams contain a mix of text, multiple-choice questions, math formulas, reading comprehension texts, and graphical elements (diagrams, tables, and images).

Your Responsibilities:

  • Data Extraction: Extract text, multiple-choice options, and correct answers from the PDF files.
  • JSON Structuring: Map the extracted data into a predefined, highly structured JSON schema.
  • Image Cropping/Extraction: Programmatically identify, crop, and save relevant images, diagrams, and graphs associated with specific questions.
  • Edge Case Handling: Handle complex layouts, including multi-column text, rotated pages, and questions that span across multiple pages.

Required Skills & Experience:

  • Proven experience working with PDF extraction libraries in Python (e.g., PyMuPDF / fitz, pdfplumber, or similar).
  • Experience with OCR tools or Vision-Language Models (e.g., OpenAI GPT-4o, Claude 3.5 Sonnet) for parsing complex graphical layouts is a huge plus.
  • Strong understanding of JSON and data structuring.
  • Attention to detail: the output JSON must be 100% accurate and ready for production use.

Project Scope:

We will provide a set of test PDFs and the desired JSON schema. You will develop a scalable script/pipeline to process these files. Once the pipeline is validated, it will be run across our entire library of PDFs.
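
For applicants, the "JSON Structuring" step might look roughly like the sketch below. It assumes the page text has already been pulled out of the PDF by some extraction library, and it uses only the Python standard library. The record fields (`number`, `text`, `options`), the regular expressions, and the sample text are all hypothetical: the posting does not include the client's schema or any sample exam, so treat this as an illustration of the idea, not the expected deliverable.

```python
import json
import re

# Hypothetical patterns: the real exam layout is not described in the posting.
QUESTION_RE = re.compile(r"^(\d+)[.)]\s+(.*)$")         # e.g. "1. What is ...?"
OPTION_RE = re.compile(r"^\(?([A-E])[.)]\s+(.*)$")      # e.g. "A) 4" or "(B) 5"


def parse_questions(text):
    """Group already-extracted page text into question records.

    The output shape is an assumed stand-in for the client's schema.
    """
    questions = []
    current = None     # question record being filled
    last_option = None  # label of the most recent option, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        q_match = QUESTION_RE.match(line)
        o_match = OPTION_RE.match(line)
        if q_match:
            current = {
                "number": int(q_match.group(1)),
                "text": q_match.group(2),
                "options": {},
            }
            last_option = None
            questions.append(current)
        elif o_match and current is not None:
            last_option = o_match.group(1)
            current["options"][last_option] = o_match.group(2)
        elif current is not None:
            # Wrapped line: continue the last option, else the question text.
            if last_option is not None:
                current["options"][last_option] += " " + line
            else:
                current["text"] += " " + line
    return questions


# Made-up sample text standing in for one extracted page.
SAMPLE = """
1. What is 2 + 2?
A) 3
B) 4
2. Which planet is known
as the Red Planet?
A) Venus
B) Mars
"""

if __name__ == "__main__":
    print(json.dumps(parse_questions(SAMPLE), indent=2))
```

A line-based pass like this only covers clean single-column text. The edge cases named in the posting (multi-column layouts, rotated pages, questions spanning pages, cropped figures) would need layout-aware extraction before this step.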

  • Less than 30 hrs/week
    Hourly
  • < 1 month
    Duration
  • Intermediate
    Experience Level
  • $8.00 - $25.00
    Hourly
  • Remote Job
  • One-time project
    Project Type

Contract-to-hire opportunity

This lets talent know that this job could become full time.
Skills and Expertise
Mandatory skills
Python
JSON
JavaScript
Data Extraction
Activity on this job
  • Proposals: 50+
  • Last viewed by client: 6 days ago
  • Hires: 1
  • Interviewing: 1
  • Invites sent: 1
  • Unanswered invites: 0
About the client
Member since Aug 29, 2013
  • Sweden
    Lulea 7:47 PM
  • $1.1M total spent
    183 hires, 61 active
  • 18,314 hours
