Need to scrape an office building directory site that claims to have 50k records. The data is neatly presented on templated pages. All pages are linked from a rigid 2-tier category structure with pagination. The top tier has only 3 categories.
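To illustrate the kind of crawl I have in mind, here is a rough sketch only, not a required design. It walks outward from one top-tier category URL and visits every linked page once, so paginated listings that link back to earlier pages do not loop. The function name crawl_from and the $linksOf callback (which should return the URLs found on a page) are placeholders I made up. The real link extraction depends on the site's markup.

```php
<?php
// Hypothetical sketch: breadth-first crawl starting from one category URL.
// $linksOf is a placeholder callback that returns the URLs linked from a page.
function crawl_from(string $startUrl, callable $linksOf): array
{
    $queue = [$startUrl];
    $seen  = [$startUrl => true]; // URLs already queued, to avoid revisiting
    $order = [];                  // URLs in the order they were visited

    while (count($queue) > 0) {
        $url = array_shift($queue);
        $order[] = $url;

        foreach ($linksOf($url) as $next) {
            if (!isset($seen[$next])) {
                $seen[$next] = true;
                $queue[] = $next;
            }
        }
    }

    return $order;
}
```

In a real run, $linksOf would download the page and return only the sub-category, pagination and building profile links.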
You will need to write a PHP scraper that crawls all pages and writes the data to a UTF-8, tab-delimited text file. To handle the 50k records, you may have the bot start crawling from each top-tier category in turn, so the records are split across 3 txt files, one per top-tier category.
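As a rough sketch of the output format I expect (an illustration under my own assumptions, not a spec): one record per line, fields separated by tabs, with any tabs or line breaks inside a field collapsed to a space so the columns stay aligned. The helper names to_tsv_line and append_records are made up. The sketch assumes the field values have already been converted to UTF-8.

```php
<?php
// Hypothetical helper: turn one scraped record into a tab-delimited line.
// Assumes every field value is already UTF-8.
function to_tsv_line(array $fields): string
{
    $clean = array_map(function ($value) {
        // Collapse tabs and line breaks so they cannot break the column layout.
        return trim(preg_replace('/[\t\r\n]+/', ' ', (string) $value));
    }, $fields);

    return implode("\t", $clean) . "\n";
}

// Hypothetical helper: append records to the file for one top-tier category.
function append_records(string $path, array $records): void
{
    $handle = fopen($path, 'ab');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }
    foreach ($records as $record) {
        fwrite($handle, to_tsv_line($record));
    }
    fclose($handle);
}
```

Calling append_records once per top-tier category, each with its own file path, gives the 3 txt files.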
Sample page: primeoffice dot com dot hk slash hong_kong_office slash building_index slash Building_Profile.asp?B=10422
PS: The first image URL of each building is also needed.
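A minimal sketch of the image part, again only an illustration: it takes the first img tag found in a page's HTML and turns its src into an absolute URL. The function name first_image_url is made up, and the sketch assumes the building photo is the first image in whatever HTML it is given. On the real pages you will probably need to narrow the HTML down to the photo area first so a logo or banner is not picked up.

```php
<?php
// Hypothetical helper: return the absolute URL of the first <img> in $html,
// or null if there is none. $pageUrl is the URL the HTML was fetched from.
function first_image_url(string $html, string $pageUrl): ?string
{
    if (trim($html) === '') {
        return null;
    }

    $doc = new DOMDocument();
    // Real-world markup is often malformed, so keep libxml warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $parts  = parse_url($pageUrl);
    $origin = $parts['scheme'] . '://' . $parts['host'];

    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = trim($img->getAttribute('src'));
        if ($src === '') {
            continue;
        }
        if (preg_match('#^https?://#i', $src)) {
            return $src;                          // already absolute
        }
        if (strncmp($src, '//', 2) === 0) {
            return $parts['scheme'] . ':' . $src; // protocol-relative
        }
        if ($src[0] === '/') {
            return $origin . $src;                // relative to the site root
        }
        // Relative to the page's directory (simplified: ignores ports and ../).
        $dir = isset($parts['path'])
            ? preg_replace('#/[^/]*$#', '/', $parts['path'])
            : '/';
        return $origin . $dir . $src;
    }

    return null;
}
```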