I've attached the project overview in Word format (also available here: https://drive.google.com/file/d/0B8ZJVPz-oSKjZ0pNS1paR2Nrc1E/view?usp=sharing). It also includes additional SEC FTP resources that might be useful.
Download the 10-K of every publicly traded US company, dating back to 1994, in PDF format.
The SEC makes annual reports (i.e., 10-Ks) available for free via its website (Microsoft example: www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000789019&type=10-k&dateb=&owner=exclude&count=40) and also via FTP. Each annual report is now posted as an HTML file, along with a few different “exhibits” (i.e., other relevant documents). I am mostly concerned with the 10-K file itself, but would prefer to have both the 10-K and the exhibits for each year. For example, this is Microsoft’s 2014 annual report on the SEC’s website: i.imgur.com/aBeGa6F.png
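As a rough sketch, the listing URL in the Microsoft example could be built from a company's CIK as shown here. The function name `edgar_10k_listing_url` and the `https://` scheme are assumptions for illustration; the query parameters are copied from the example URL. Fetching and parsing the page are not covered.

```python
from urllib.parse import urlencode

# Path taken from the Microsoft example; the https:// scheme is an assumption.
EDGAR_BROWSE = "https://www.sec.gov/cgi-bin/browse-edgar"


def edgar_10k_listing_url(cik, count=40):
    """Build the 10-K listing URL for one company (hypothetical helper)."""
    params = {
        "action": "getcompany",
        "CIK": str(cik).zfill(10),  # zero-padded to 10 digits, as in the example
        "type": "10-k",
        "dateb": "",
        "owner": "exclude",
        "count": str(count),
    }
    return EDGAR_BROWSE + "?" + urlencode(params)


print(edgar_10k_listing_url(789019))
```

Passing the CIK as an integer or as an already padded string gives the same URL, which may be convenient if the company list comes from mixed sources.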
Each annual report, going back to 1994, should then be saved as a single PDF file named in the following format: Symbol Year Company Name.pdf. For example, Microsoft's files would look like:
MSFT 2015 Microsoft.pdf
MSFT 2014 Microsoft.pdf
MSFT 2013 Microsoft.pdf
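The naming convention above could be implemented along these lines. This is only a sketch: `report_filename` is a hypothetical helper, and stripping characters that are invalid in file names is an assumption not stated in the brief.

```python
import re


def report_filename(symbol, year, company_name):
    """Return 'Symbol Year Company Name.pdf' (hypothetical helper)."""
    # Assumption: drop characters that common file systems reject.
    safe_name = re.sub(r'[\\/:*?"<>|]', "", company_name)
    # Collapse any runs of whitespace into single spaces.
    safe_name = " ".join(safe_name.split())
    return f"{symbol} {year} {safe_name}.pdf"


for year in (2015, 2014, 2013):
    print(report_filename("MSFT", year, "Microsoft"))
```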
The links in each PDF file must behave as they do on the SEC’s website, as in this example PDF: https://drive.google.com/file/d/0B8ZJVPz-oSKjaS1TbmNOZ0dlWlU/view?usp=sharing
The script should then check for new annual reports (and new companies) and, when it finds one, download it, convert it to PDF, and name it according to the naming convention above.
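One possible way to decide what is new is to compare the reports listed by the SEC against the PDF files already saved. The sketch below assumes the list of available reports has already been gathered elsewhere; `missing_reports` and `report_filename` are hypothetical names, and the download and PDF conversion steps are left out.

```python
def report_filename(symbol, year, company_name):
    """Return 'Symbol Year Company Name.pdf' per the naming convention."""
    return f"{symbol} {year} {company_name}.pdf"


def missing_reports(available, existing_files):
    """Return the (symbol, year, name) tuples that have no PDF yet.

    `available` is assumed to come from the SEC listing (not shown here);
    `existing_files` is the list of PDF file names already saved.
    """
    existing = set(existing_files)
    return [
        report for report in available
        if report_filename(*report) not in existing
    ]


available = [
    ("MSFT", 2015, "Microsoft"),
    ("MSFT", 2014, "Microsoft"),
    ("MSFT", 2013, "Microsoft"),
]
on_disk = ["MSFT 2014 Microsoft.pdf", "MSFT 2013 Microsoft.pdf"]

print(missing_reports(available, on_disk))  # [('MSFT', 2015, 'Microsoft')]
```

In practice, `on_disk` could be filled from the output folder (for example with `os.listdir`), so a company that has never been downloaded simply shows up with all of its years missing.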