Web Scraper for phpBB Forum

Posted 8 hours ago

Worldwide

Summary

Job Description: I am looking for an experienced web scraper to extract posts and download specific file attachments from an online forum. The goal is to save the forum's historical data into structured text files (.txt or .md) and archive the PDFs so I can use them as source documents in Google NotebookLM.

Target Website: A small, niche automotive community forum running on standard phpBB software. I will share the exact URL with you via Upwork private messages so you can evaluate the site structure before accepting the contract.

Project Scope & Requirements:

  • Data Extraction: Scrape all threads and posts from the designated subforums. I will provide a standard user account so your script can access the required areas.
  • PDF Downloads: Your script must maintain the authenticated session to download any .pdf files attached to the forum posts. (You can ignore image attachments like .jpg or .png, as I only need text-based documents.)
  • Data Formatting: The extracted data must be clean, with no HTML tags or website navigation junk. When a post contains a PDF attachment, you must save the PDF locally and insert a reference note in the text file. Each post needs to be formatted consistently like this:

    Forum Section: [Name of Subforum]
    Thread Title: [Title of the thread]
    Author: [Username]
    Date: [Date of post]
    Post: [The actual text of the post]
    Attachments: [Attachment downloaded: exact_filename.pdf]

---

  • File Splitting: Because Google NotebookLM has a 500,000-word limit per file, you must output the data as separate .txt or .md files divided by subforum (e.g., ForumSection_1.txt, ForumSection_2.txt).
  • Polite Scraping: To prevent an IP ban, please implement a strict crawl delay (e.g., 2-3 seconds per request) and use a standard User-Agent. This is a personal, non-commercial project, and we do not want to strain the server.

Deliverables: I do not need the scraping code itself. The final deliverable should be a ZIP file containing:

  • The clean, separated .txt or .md files.
  • A folder containing all downloaded PDF attachments, retaining their original file names so they match the references in the text files.
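
To make the formatting, file-splitting, and crawl-delay requirements above concrete, here is a minimal Python sketch of one way they could be handled. It is an illustration only, not a required implementation. The names `Post`, `strip_html`, `format_post`, `split_by_word_limit`, and `Throttle` are hypothetical, as are the `---` line used as a record separator, the `Attachments: none` line for posts without PDFs, and the 2.5-second default delay (chosen from the 2-3 second range in the posting). Login, page fetching, and PDF downloading are deliberately left out because they depend on the actual forum.

```python
import re
import time
from dataclasses import dataclass, field
from html.parser import HTMLParser

WORD_LIMIT = 500_000  # per-file word limit stated in the posting


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Keep words apart where a block-level tag or line break separated them.
        if tag in ("br", "p", "div"):
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment: str) -> str:
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


@dataclass
class Post:
    section: str
    thread: str
    author: str
    date: str
    body_html: str
    pdf_names: list = field(default_factory=list)


def format_post(post: Post) -> str:
    """Render one post in the template from the posting."""
    if post.pdf_names:
        notes = " ".join(f"[Attachment downloaded: {n}]" for n in post.pdf_names)
    else:
        notes = "none"  # assumption: the posting does not say what to write here
    lines = [
        f"Forum Section: {post.section}",
        f"Thread Title: {post.thread}",
        f"Author: {post.author}",
        f"Date: {post.date}",
        f"Post: {strip_html(post.body_html)}",
        f"Attachments: {notes}",
        "---",  # assumption: record separator between posts
    ]
    return "\n".join(lines) + "\n"


def split_by_word_limit(records, limit=WORD_LIMIT):
    """Group formatted records into chunks of at most `limit` words each.

    A single record longer than `limit` is kept whole in its own chunk.
    """
    chunks, current, count = [], [], 0
    for record in records:
        words = len(record.split())
        if current and count + words > limit:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(record)
        count += words
    if current:
        chunks.append("\n".join(current))
    return chunks


class Throttle:
    """Enforces a minimum delay between consecutive requests."""

    def __init__(self, min_delay=2.5, sleep=time.sleep, clock=time.monotonic):
        self.min_delay = min_delay
        self._sleep = sleep
        self._clock = clock
        self._last = None

    def wait(self):
        """Block until at least `min_delay` seconds have passed since the last call."""
        if self._last is not None:
            remaining = self.min_delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

In use, a scraper would call `Throttle.wait()` before every request, pass each parsed post through `format_post`, and write each chunk returned by `split_by_word_limit` for a given subforum to its own file (for example ForumSection_1.txt, and a second file for the same subforum only if it exceeds the word limit).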

  • Less than 30 hrs/week
    Hourly
  • 1-3 months
    Duration
  • Intermediate
    Experience Level
  • $8.00 - $25.00
    Hourly
  • Remote Job
  • Ongoing project
    Project Type
Skills and Expertise
Mandatory skills
Web Crawling
Data Scraping
Nice-to-have skills
Scrapy
Data Mining
Activity on this job
  • Proposals: 20 to 50
  • Interviewing: 22
  • Invites sent: 30
  • Unanswered invites: 6
About the client
Member since Mar 28, 2015
  • United Kingdom
    London 2:06 PM
  • $27K total spent
    45 hires, 8 active
  • 1,464 hours
  • Manufacturing & Construction
    Mid-sized company (10-99 people)
