Web Scraper for phpBB Forum
Worldwide
Job Description:
I am looking for an experienced web scraper to extract posts and download specific file attachments from an online forum. The goal is to save the forum's historical data into structured text files (.txt or .md) and archive the PDFs so I can use them as source documents in Google NotebookLM.

Target Website:
A small, niche automotive community forum running on standard phpBB software. I will share the exact URL with you via Upwork private messages so you can evaluate the site structure before accepting the contract.

Project Scope & Requirements:
- Data Extraction: Scrape all threads and posts from the designated subforums. I will provide a standard user account so your script can access the required areas.
- PDF Downloads: Your script must maintain the authenticated session to download any .pdf files attached to the forum posts. (You can ignore image attachments like .jpg or .png, as I only need text-based documents.)
- Data Formatting: The extracted data must be clean, with no HTML tags or website navigation junk. When a post contains a PDF attachment, you must save the PDF locally and insert a reference note in the text file. Each post needs to be formatted consistently like this:

    Forum Section: [Name of Subforum]
    Thread Title: [Title of the thread]
    Author: [Username]
    Date: [Date of post]
    Post: [The actual text of the post]
    Attachments: [Attachment downloaded: exact_filename.pdf]
    ---

- File Splitting: Because Google NotebookLM has a 500,000-word limit per file, you must output the data as separate .txt or .md files divided by subforum (e.g., ForumSection_1.txt, ForumSection_2.txt).
- Polite Scraping: To prevent an IP ban, please implement a strict crawl delay (e.g., 2-3 seconds per request) and use a standard User-Agent. This is a personal, non-commercial project, and we do not want to strain the server.

Deliverables:
I do not need the scraping code itself. The final deliverable should be a ZIP file containing:
- The clean, separated .txt or .md files.
- A folder containing all downloaded PDF attachments, retaining their original file names so they match the references in the text files.
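To illustrate the formatting and file-splitting requirements, here is a minimal Python sketch of how one scraped post could be rendered into the requested layout and grouped into one file per subforum. It is only an illustration under stated assumptions, not a required implementation: the names `Post`, `format_post`, `build_section_files`, and `exceeds_word_limit` are hypothetical, the "None" placeholder for posts without PDFs is an assumption, and the scraping and login steps are left out entirely.

```python
from dataclasses import dataclass, field

WORD_LIMIT = 500_000  # per-file word limit quoted in the posting


@dataclass
class Post:
    """One scraped post (hypothetical structure, for illustration only)."""
    section: str
    thread_title: str
    author: str
    date: str
    text: str
    pdf_names: list = field(default_factory=list)


def format_post(post):
    """Render a post in the layout requested in the job description."""
    if post.pdf_names:
        attachments = " ".join(
            f"[Attachment downloaded: {name}]" for name in post.pdf_names
        )
    else:
        attachments = "None"  # assumption: the posting does not specify this case
    return (
        f"Forum Section: {post.section}\n"
        f"Thread Title: {post.thread_title}\n"
        f"Author: {post.author}\n"
        f"Date: {post.date}\n"
        f"Post: {post.text}\n"
        f"Attachments: {attachments}\n"
        "---\n"
    )


def build_section_files(posts):
    """Group formatted posts by subforum; returns {filename: file content}."""
    sections = {}
    for post in posts:
        sections.setdefault(post.section, []).append(format_post(post))
    # Files are numbered in the order each subforum first appears.
    return {
        f"ForumSection_{index}.txt": "\n".join(blocks)
        for index, blocks in enumerate(sections.values(), start=1)
    }


def exceeds_word_limit(text, limit=WORD_LIMIT):
    """Rough whitespace-based word count check against the per-file limit."""
    return len(text.split()) > limit
```

A subforum whose file fails the `exceeds_word_limit` check would need to be split further (for example into numbered parts); how to name those parts is not specified in the posting.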
- Hourly: Less than 30 hrs/week
- Duration: 1-3 months
- Experience Level: Intermediate
- Hourly rate: $8.00-$25.00
- Remote Job
- Project Type: Ongoing project
Activity on this job
- Proposals: 20 to 50
- Interviewing: 25
- Invites sent: 30
- Unanswered invites: 2
About the client
- United Kingdom, London, 4:54 PM
- $27K total spent; 45 hires, 8 active
- 1,464 hours
- Manufacturing & Construction; mid-sized company (10-99 people)