I would like to scrape data from this USPTO website:
It has a CAPTCHA, but it only seems to need solving once per session. (I haven't verified how many queries it takes before the CAPTCHA reappears.)
After the CAPTCHA you can look up a patent application by "Publication Number." I have a list of 3,996,534 publication numbers. An example of a publication number is 20120257316. For each publication number, I want to scrape the basic bibliographic data that shows up in the "Application Data" tab. I want the scraped data in tab-separated values (TSV) format with UTF-8 character encoding.
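To illustrate the output format, here is a minimal sketch of writing scraped records as UTF-8 TSV with Python's standard `csv` module. The column names in `FIELDS` and the helper names `write_tsv` and `clean` are placeholders, not the actual fields on the "Application Data" tab; the real column list would come from whatever that tab displays.

```python
import csv

# Hypothetical column names; the real ones depend on what the
# "Application Data" tab actually shows.
FIELDS = ["publication_number", "application_number", "filing_date", "title"]


def clean(value):
    """Collapse tabs, newlines, and repeated spaces into single spaces."""
    return " ".join(str(value).split())


def write_tsv(path, records):
    """Write one row per publication number as UTF-8, tab-separated text."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=FIELDS,
            delimiter="\t",
            restval="",             # leave missing fields empty
            extrasaction="ignore",  # drop keys that are not columns
            lineterminator="\n",
        )
        writer.writeheader()
        for record in records:
            # Tabs or newlines inside a value would break the row layout.
            writer.writerow({key: clean(val) for key, val in record.items()})
```

Fields that are missing for a given publication number are left empty, so every row keeps the same number of columns.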
On the USPTO website, when you look up a publication number, there is also usually a tab called "Image File Wrapper", which contains PDF files associated with the patent application. For each publication number that has the "Image File Wrapper" tab present, I want to download the PDF files with the document codes "OATH" and "ADS". Note that even when the Image File Wrapper tab is present, one or both of these PDFs may be missing. In particular, the ADS file is often missing.
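The selection rule for the Image File Wrapper documents could look roughly like the sketch below. It assumes the rows of that tab have already been parsed into a list of dicts with a `"code"` key, which is a hypothetical shape, and it keeps only the first match per code, which is also an assumption about what should happen if a code appears more than once.

```python
WANTED_CODES = {"OATH", "ADS"}


def select_documents(documents):
    """Return a dict mapping each wanted document code to one entry.

    `documents` is assumed to be a list of dicts with a "code" key,
    one per row of the Image File Wrapper tab (hypothetical shape).
    The result has 0, 1, or 2 entries.
    """
    chosen = {}
    for doc in documents:
        code = doc.get("code", "").strip().upper()
        if code in WANTED_CODES and code not in chosen:
            chosen[code] = doc  # assumption: keep the first match only
    return chosen
```

An empty result is a normal outcome, either because the tab is absent or because neither document was filed.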
Thus the project has two deliverables:
- A TSV file with one row for each publication number and one column for each piece of bibliographic data in the "Application Data" tab.
- A folder with one directory for each publication number (the directories should be named by publication number) containing 0 to 2 PDF files corresponding to the OATH and ADS documents when present.
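The folder layout for the second deliverable can be sketched as follows. The function name `save_pdfs` and the file names `OATH.pdf` and `ADS.pdf` are assumptions for illustration; any naming that makes the document code obvious would work.

```python
from pathlib import Path


def save_pdfs(root, publication_number, pdfs):
    """Create <root>/<publication_number>/ and write the given PDFs into it.

    `pdfs` maps a document code ("OATH", "ADS") to raw PDF bytes.
    It may be empty, in which case the directory is still created.
    """
    target = Path(root) / publication_number
    target.mkdir(parents=True, exist_ok=True)
    for code, content in pdfs.items():
        # Hypothetical naming scheme: one file per document code.
        (target / f"{code}.pdf").write_bytes(content)
    return target
```

For example, `save_pdfs("output", "20120257316", {"OATH": oath_bytes})` would produce `output/20120257316/OATH.pdf`, where `oath_bytes` holds the downloaded file contents.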