Expert Scraper: Bulk Image Download from Database

Posted 2 weeks ago

Worldwide

Summary

I need an experienced scraper to handle bulk retrieval of scanned document images, organize them into a structured directory, extract a small number of fields from each image, and produce a manifest linking every file to its source identifier and metadata. This is a high-volume, long-running task that requires care, attention to file integrity, and disciplined monitoring of a multi-week pipeline.

Scope of work
1. Scope is limited to four jurisdictions (California, New York City, Ohio, and Michigan) within a single collection.
2. Download the FULL-RESOLUTION images, not thumbnails. Throughput is expected to be ~5–6 sec/image, so plan for a continuous multi-week run (~2–3 weeks).
3. Maintain a persistent task database with resume support: an interruption or block must not require re-downloading completed files.
4. Store images in a directory hierarchy mirroring the source collection structure, sharded to avoid filesystem performance issues at scale.
5. For each image, record the requested variables in a manifest (CSV or Parquet).
6. Verify file integrity (non-zero size, valid format) and re-download failures.
7. Deliver images to a researcher-provided private AWS S3 bucket.

Provide weekly progress reports covering images downloaded, error rate, estimated completion date, and any issues encountered.

Deliverables (two milestones, 50% each, subject to review)
- Milestone 1: New York City + Michigan, manifest + full-res images to S3.
- Milestone 2: Ohio + California, manifest + full-res images to S3.

Required skills
1. Strong Python, including authenticated session handling (requests / Playwright or equivalent).
2. Recoverable, resumable, rate-limited bulk-download pipelines (handling network interruptions, server errors, and auth refresh without losing progress).
3. File-system organization at scale (millions of files; directory sharding).
4. In-image text / OCR extraction.
5. Logging and progress-monitoring discipline.
6. Familiarity with manifest formats (CSV/Parquet) and metadata management.
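To make scope items 3, 4, and 6 concrete, here is a minimal Python sketch of what a resumable task table, a sharded storage path, and a basic integrity check might look like. It is illustrative only, not a required design: the function names, the SQLite schema, the two-level SHA-1 shard layout, and the list of accepted image formats are all assumptions, since the posting does not specify the source's identifiers or file types.

```python
import hashlib
import sqlite3
from pathlib import Path

# Assumed set of acceptable formats, keyed by leading "magic" bytes.
MAGIC = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"II*\x00": "tiff",   # little-endian TIFF
    b"MM\x00*": "tiff",   # big-endian TIFF
}


def shard_path(root, jurisdiction, image_id, ext="jpg"):
    """Build root/jurisdiction/ab/cd/image_id.ext from a hash of the id.

    Two hex-pair levels give 65,536 leaf directories per jurisdiction,
    which keeps any single directory small even with millions of files.
    """
    digest = hashlib.sha1(image_id.encode("utf-8")).hexdigest()
    return Path(root) / jurisdiction / digest[:2] / digest[2:4] / f"{image_id}.{ext}"


def detect_format(data):
    """Return a format name if data is non-empty and starts with a known
    signature, else None (treat None as a failed download to retry)."""
    if not data:
        return None
    for magic, name in MAGIC.items():
        if data.startswith(magic):
            return name
    return None


def open_task_db(path):
    """Open (or create) the persistent task table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "image_id TEXT PRIMARY KEY, "
        "status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.commit()
    return conn


def enqueue(conn, image_ids):
    """Add ids as pending; ids already present keep their current status."""
    conn.executemany(
        "INSERT OR IGNORE INTO tasks (image_id) VALUES (?)",
        [(i,) for i in image_ids],
    )
    conn.commit()


def mark(conn, image_id, status):
    """Record the outcome of one download attempt ('done', 'failed', ...)."""
    conn.execute(
        "UPDATE tasks SET status = ? WHERE image_id = ?", (status, image_id)
    )
    conn.commit()


def pending(conn):
    """Ids still to fetch: everything not marked 'done', including failures."""
    rows = conn.execute(
        "SELECT image_id FROM tasks WHERE status != 'done' ORDER BY image_id"
    )
    return [r[0] for r in rows]
```

After a restart, a worker under this sketch would call `pending(conn)` and fetch only those ids, so completed files are never downloaded twice; a file whose bytes fail `detect_format` would be marked `failed` and picked up again on the next pass.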

  • $1,000.00

    Fixed-price
  • Expert
    Experience Level
  • Remote Job
  • One-time project
    Project Type
Skills and Expertise
Mandatory skills
Data Scraping
Python
Data Mining
Activity on this job
  • Proposals: 20 to 50
  • Last viewed by client: 2 weeks ago
  • Hires: 1
  • Interviewing: 0
  • Invites sent: 1
  • Unanswered invites: 0
About the client
Member since Aug 24, 2023
  • United States
    Berkeley, 6:14 AM
  • $9.8K total spent
    9 hires, 1 active
