I have an invoice that is delivered to me as a PDF. I need to parse the header and line-item information and put it into a tab-delimited or CSV file that I can open as a spreadsheet. This is easy for someone who knows Perl well, and trivial for someone who lives, eats and breathes regexes.
Assume you'll receive the data as a text file, which the program reads from STDIN. The program writes either to a file with a predefined name (set as a constant in the program) or to STDOUT, so the output can be redirected or piped. An example call of the program might look like this:
perl myparser.pl <inputfile.txt >outputfile.txt
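To make the calling convention concrete, here is a minimal sketch of a filter that reads STDIN and writes STDOUT. The helper name read_lines is hypothetical, and the parsing step is only a placeholder; this shows the input/output plumbing, not the finished parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read all lines from a filehandle, strip line endings, skip blank lines.
# (read_lines is a hypothetical helper name, not part of any spec.)
sub read_lines {
    my ($fh) = @_;
    my @lines;
    while ( my $line = <$fh> ) {
        $line =~ s/\r?\n\z//;        # accept Unix and Windows line endings
        next if $line =~ /^\s*$/;    # ignore empty or whitespace-only lines
        push @lines, $line;
    }
    return @lines;
}

# Act as a filter only when run as a script, not when loaded by other code.
unless (caller) {
    my @lines = read_lines( \*STDIN );
    print "$_\n" for @lines;         # placeholder: parsed records would be printed here
}
```

Because the output goes to STDOUT, the same script works with redirection, as in the call above, or in a pipeline.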
There are multiple line items on a page (from 1 to perhaps 10), and often multiple pages on an invoice. It's easy to identify the few items we need from the header. The line items are a little less clear. Here's an example of the line item format after copying the text from the PDF:
ItemNumber MetalType Description line one of multiple words aDigit
Description line 2 (optional)
Description line 3 (optional)
StoneType2 Weight2 (optional)
StoneType3 Weight3 (optional)
And here is what the data might look like.
XY00123 STL10 Large steel tub 1
with quartz base and handles
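As a rough illustration of how the first line of a line item could be split, here is a sketch of one regex. It assumes the item number and metal code never contain spaces and that the line always ends with a single stray digit; the function name and hash keys are hypothetical.

```perl
use strict;
use warnings;

# Split "ItemNumber MetalType Description ... aDigit" into named parts.
# Assumption: the first two tokens contain no whitespace and the line
# ends with one digit that is discarded.
sub parse_item_line {
    my ($line) = @_;
    if ( $line =~ /^(\S+)\s+(\S+)\s+(.+?)\s+(\d)\s*$/ ) {
        return {
            item_number => $1,
            metal_code  => $2,
            description => $3,    # trailing digit ($4) is dropped
        };
    }
    return;                       # not the first line of an item
}

my $item = parse_item_line('XY00123 STL10 Large steel tub 1');
print join( "\t", @{$item}{qw(item_number metal_code description)} ), "\n"
    if $item;
```

A continuation line such as "with quartz base and handles" does not match, so the caller can tell it apart from the start of a new item.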
A few notes:
The description is sometimes one line, sometimes two, rarely three.
There is always one price line (with the price and extension).
There is always one metal weight line.
There can be one or more stone weight lines.
The digit at the end of the description and at the very end of the line item are both meaningless.
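Since each line item spans a variable number of lines, one possible approach is to first group the lines into one record per item and only then pick the fields apart. The sketch below assumes a new item always starts with an item number shaped like XY00123 (two capital letters, five digits); that pattern and the name group_items are guesses based on the single sample above.

```perl
use strict;
use warnings;

# Assumed shape of an item number at the start of a line (based on XY00123).
my $ITEM_START = qr/^[A-Z]{2}\d{5}\s/;

# Group lines so that each item-number line starts a new record and all
# following lines (description, price, weights) are attached to it.
sub group_items {
    my @lines = @_;
    my @items;
    for my $line (@lines) {
        if ( $line =~ $ITEM_START ) {
            push @items, [$line];            # start a new line item
        }
        elsif (@items) {
            push @{ $items[-1] }, $line;     # continuation of the current item
        }
        # Lines before the first item (header text) are skipped here.
    }
    return @items;
}

my @items = group_items(
    'XY00123 STL10 Large steel tub 1',
    'with quartz base and handles',
);
printf "%d item(s), first item has %d line(s)\n",
    scalar @items, scalar @{ $items[0] };
```

Once the lines are grouped, the optional description and stone lines can be handled per record without affecting neighbouring items.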
I want this data parsed (preferably using regexes) into a tab- or comma-delimited file. Some fields will be converted by rules (the metal, for example), so the output looks like the line below. I'll explain the rules in detail once you are hired for the gig. The first two fields listed here, invoice number and invoice date, come from the header. Here's the sample output:
989898,8/7/2013,XY00123,10 gauge Steel, Large metal tub with quartz base and handles,"1,604.00","481.20",7.6,Quartz,0.25 oz,Cubic Zirconia,0.94 oz
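Because prices such as 1,604.00 contain a comma, they have to be quoted in CSV output, as in the sample. Here is a minimal sketch using only core Perl; the names csv_field and csv_line are hypothetical. Unlike the sample, it quotes a field only when needed, so 481.20 would be written without quotes. The Text::CSV module from CPAN is a more robust alternative if installing modules is acceptable.

```perl
use strict;
use warnings;

# Quote a single field if it contains a comma, a double quote, or a line break.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    if ( $value =~ /[",\r\n]/ ) {
        $value =~ s/"/""/g;          # embedded quotes are doubled
        return qq{"$value"};
    }
    return $value;
}

# Join a list of fields into one CSV line (without the trailing newline).
sub csv_line {
    return join ',', map { csv_field($_) } @_;
}

print csv_line( '989898', '8/7/2013', 'XY00123', '1,604.00' ), "\n";
```

For tab-delimited output the quoting step is usually unnecessary, provided no field contains a tab.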
Note that there will be many of these lines on each invoice, spread out over several pages. We will likely need a couple more fields from the header, so the code should be well documented, easy to read, and easy enough for me to modify on my own if I want to add or remove a field.
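One way to keep fields easy to add or remove is to list the output columns once, near the top of the program, and build every row from that list. The sketch below assumes each parsed line item is stored as a hash; the field names and the function record_to_row are hypothetical.

```perl
use strict;
use warnings;

# Output columns, in order. Adding or removing a field is a one-line change here.
# (These field names are illustrative assumptions.)
my @OUTPUT_FIELDS = qw(invoice_number invoice_date item_number metal description);

# Turn one parsed record (a hash reference) into an ordered list of values.
# Missing fields become empty strings so the column count stays constant.
sub record_to_row {
    my ($record) = @_;
    return map { defined $record->{$_} ? $record->{$_} : '' } @OUTPUT_FIELDS;
}

my %record = (
    invoice_number => '989898',
    invoice_date   => '8/7/2013',
    item_number    => 'XY00123',
);
print join( "\t", record_to_row( \%record ) ), "\n";
```

With this layout, the parsing code can collect more header fields later without any change to the code that writes the output.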
This is a VERY easy program for a Perl/regex expert. Apply for this gig and let me know whether you think you can do it, for what price, and in what amount of time.