This project involves data entry/scraping and some manipulation of the resulting data. I will provide you with a file containing ~367 PDF documents that contain data tables for horse races. Each document covers the races run at a particular track on a particular day, usually 9-12 races per document. The first part of this project is to transfer the data in those documents into an Excel spreadsheet, with the exception of maiden races (those marked Md), which can be left out.
Then, additional data needs to be added to the spreadsheet from the internet. All the additional data can be had at equibase.com. You will need to look up statistics for each horse, and also pull up the results of the races in question.
Additionally, you will need to perform some basic manipulations on the data. When you look up the results of a race, you will need to delete the entries in the Excel file that correspond to horses that “scratched” from the race (i.e. did not run). You will also need to normalize the numerical scores within each individual race. So, for example, if the speed scores in a five-horse race are 75, 54, 65, 32, 88, those would need to be normalized to the range of 0 to 1. Then the class score for that race would likewise need to be normalized, and so on for each race. It is important that the normalization function be applied separately to each set of values within each specific race, and not to the values across all races. This is because the normalized number says something about the horse’s characteristic relative to the other horses he is racing against. This is extremely important.
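To illustrate the per-race normalization, here is a minimal sketch in Python. It assumes that "normalized in the range of 0 to 1" means min-max scaling (lowest value in the race becomes 0, highest becomes 1), that scratched horses are removed before normalizing, and that each row carries a race identifier. The names race_id, speed, scratched, and the helper functions are all hypothetical, chosen only for this illustration.

```python
from collections import defaultdict


def min_max(values):
    """Scale a list of numbers to 0..1 (assumed meaning of 'normalize')."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All horses have the same score; assumption: map them all to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalize_within_races(rows, field, race_key="race_id"):
    """Add '<field>_norm' to each row, scaled only against its own race."""
    # Assumption: scratched horses carry a truthy 'scratched' flag and are
    # dropped first, so they do not affect the race's minimum or maximum.
    runners = [r for r in rows if not r.get("scratched", False)]
    by_race = defaultdict(list)
    for row in runners:
        by_race[row[race_key]].append(row)
    for race_rows in by_race.values():
        scaled = min_max([r[field] for r in race_rows])
        for row, value in zip(race_rows, scaled):
            row[field + "_norm"] = value
    return runners


# The five speed scores from the example above, all in one race.
example = [{"race_id": "R1", "speed": s} for s in (75, 54, 65, 32, 88)]
result = normalize_within_races(example, "speed")
print([round(r["speed_norm"], 3) for r in result])
# prints [0.768, 0.393, 0.589, 0.0, 1.0]
```

The same function would be called once per starred field; because rows are grouped by race before scaling, values from different races never influence each other.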
Since I have no experience with data scraping, I do not know what parts of this will be accomplishable by means of scraping and what will have to be entered by hand. That is up to you to judge. I am providing a sample file and clear requirements so that you can judge what you will need to do to complete this job.
Here are the fields that I will need to have in the file. The fields marked with an asterisk are those which need to be normalized against the other corresponding values within each race:
Odds (expressed as a percentage)
Winning % 2012 with this race subtracted
Winning % whole career with this race subtracted
Place % 2012 with this race subtracted
Place % whole career with this race subtracted
Show % 2012 with this race subtracted
Show % whole career with this race subtracted
% Finish in the money 2012 with this race subtracted
% Finish in the money whole career with this race subtracted
Earnings per start 2012
Earnings per start whole career
Best equibase speed 2012*
Best equibase speed whole career*
Win (Yes/No expressed as 0=no 1=yes)
Place (Yes/No expressed as 0=no 1=yes)
Show (Yes/No expressed as 0=no 1=yes)
Win money = 0 if horse did not win
Place money = 0 if horse did not win or place
Show money = 0 if horse did not win, place or show
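Two of the fields above involve a small calculation, sketched below in Python under stated assumptions. For "Odds (expressed as a percentage)", the sketch assumes the odds are fractional odds to one (for example 5-2) and that the percentage wanted is the implied chance of winning; if a different conversion is intended, that function would change. For the "with this race subtracted" fields, it assumes the looked-up statistics already include the race in question, so one start (and one win, place or show, if the horse earned it in this race) is removed before computing the percentage. Both function names are hypothetical.

```python
def odds_to_percent(numerator, denominator=1):
    """Implied win chance, in percent, from fractional odds such as 5-2.

    Assumption: odds of 5-2 mean 5 units of profit per 2 staked, so the
    implied chance is 2 / (5 + 2).
    """
    return 100.0 * denominator / (numerator + denominator)


def pct_excluding_race(count, starts, counted_in_this_race):
    """Percentage of wins (or places, shows) with the current race removed.

    count: wins/places/shows in the period, including this race.
    starts: starts in the period, including this race.
    counted_in_this_race: True if this race contributed one to 'count'.
    Returns None when the horse has no other starts in the period.
    """
    prior_starts = starts - 1
    prior_count = count - (1 if counted_in_this_race else 0)
    if prior_starts <= 0:
        return None
    return 100.0 * prior_count / prior_starts


# A horse at 5-2, with 3 wins in 10 starts including a win in this race.
print(round(odds_to_percent(5, 2), 2))             # prints 28.57
print(round(pct_excluding_race(3, 10, True), 2))   # prints 22.22
```

Returning None for a horse with no other starts leaves the decision about how to fill that cell (blank, 0, or something else) open.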
Please let me know if you have any questions or if in any way this is not clear to you. I am happy to explain further in order to make sure that the job gets done right.