If you're knowledgeable in both Matlab and Python, then read on!
I have a Matlab script comprising various functions (currently running locally on my laptop) that I'd like to scale up, move to a remote server and automate.
The script communicates via API with a Dropbox account storing both zipped HTML files and data in JSON format, and via JDBC with a MySQL database instance (RDS) containing indexes, HTML and domain information, and filename/location information for those JSON files.
My idea is to convert the Matlab code to Python, as I'll be doing further processing on the stored feature data using NumPy and SciPy over the next few months. It would also be easier for me to dip into the code and add features down the line (I'm currently learning Python, and I'm not very Java/C++ savvy).
What the Matlab script does
It extracts features from each HTML file of a chosen, previously crawled domain by parsing the HTML contents for DOM elements (it currently uses jsoup, a Java library called from Matlab).
Example features (of about 113) are:
'Maximum length of h3',
'Average length of h3',
'Number of text characters in html',
'Title total number of rare words',
'Title total number of syllables'
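To give a rough idea of what two of these features might look like in Python, here is a minimal sketch. It uses the standard library's `html.parser` as a stand-in so that it runs without extra installs; the real port would more likely use an HTML parsing library. The names `H3Collector` and `h3_features` are hypothetical, and "length" is assumed to mean the character count of the stripped heading text, which may differ from what the Matlab code measures.

```python
from html.parser import HTMLParser


class H3Collector(HTMLParser):
    """Collects the text content of every <h3> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_h3 = False
        self._buffer = []
        self.h3_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_h3:
            self.h3_texts.append("".join(self._buffer).strip())
            self._in_h3 = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. <h3><b>..</b></h3>) is also collected.
        if self._in_h3:
            self._buffer.append(data)


def h3_features(html):
    """Return two of the example features for one HTML document.

    Assumption: length = number of characters in the stripped h3 text.
    """
    parser = H3Collector()
    parser.feed(html)
    parser.close()
    lengths = [len(text) for text in parser.h3_texts]
    if not lengths:
        return {"Maximum length of h3": 0, "Average length of h3": 0.0}
    return {
        "Maximum length of h3": max(lengths),
        "Average length of h3": sum(lengths) / len(lengths),
    }
```

Calling `h3_features(page_source)` on the text of one HTML file returns a small dict keyed by feature name, which could then be merged with the other feature values for that file.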
The HTML files are extracted from zip files on the remote Dropbox and 'scanned' by jsoup. The parsed feature values are then arranged into a Matlab cell array (HTML 'Id' on one axis, features on the other, and a grid of integers in between), which is converted to JSON format and stored back onto the same remote Dropbox, with the MySQL index updated.
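The unzip, build-grid and convert-to-JSON steps could be sketched in Python roughly as follows, using only the standard library. The Dropbox download/upload and the MySQL update are left out. The function names, the `.html` filename filter, the use of the filename as the HTML 'Id', and the JSON layout are all assumptions for illustration, not the format the Matlab script currently writes.

```python
import io
import json
import zipfile


def extract_html_from_zip(zip_bytes):
    """Return {member_name: html_text} for every .html file in a zip archive."""
    pages = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".html"):
                raw = archive.read(name)
                pages[name] = raw.decode("utf-8", errors="replace")
    return pages


def build_feature_json(pages, extractors):
    """Build the Id-by-feature grid and serialise it to JSON.

    `extractors` maps a feature name to a function taking the HTML text
    and returning a number. Both the mapping and the output layout are
    hypothetical placeholders.
    """
    feature_names = list(extractors)
    grid = {
        html_id: [extractors[name](html) for name in feature_names]
        for html_id, html in pages.items()
    }
    return json.dumps({"features": feature_names, "values": grid}, sort_keys=True)


# Placeholder extractors, standing in for the ~113 real features.
EXAMPLE_EXTRACTORS = {
    "Number of characters in file": len,
    "Number of h3 opening tags": lambda html: html.lower().count("<h3"),
}
```

In use, the bytes of one downloaded zip file would be passed to `extract_html_from_zip`, and the resulting dict to `build_feature_json(pages, EXAMPLE_EXTRACTORS)`, giving a JSON string ready to be uploaded.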
I'd like the converted program to run on an EC2 server, in an asynchronous / non-blocking / parallel fashion if possible. It should periodically check the MySQL database to see whether new domains have been added or existing ones re-crawled recently, and then run the jobs required for those domains.
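As a rough illustration of the periodic check, here is a minimal sketch using Python's built-in `asyncio` (swapped in here purely because it needs no extra installs; another async framework could fill the same role). `fetch_pending` and `process_domain` are hypothetical coroutines supplied by the caller: the first would query MySQL for new or re-crawled domains, the second would run the feature extraction for one domain. `max_cycles` exists only so the loop can be stopped in testing.

```python
import asyncio


async def poll_for_jobs(fetch_pending, process_domain, interval_seconds, max_cycles=None):
    """Repeatedly look for pending domains and process them concurrently.

    fetch_pending:  async function returning a list of domains to process
                    (hypothetical; would wrap the MySQL query).
    process_domain: async function handling one domain (hypothetical).
    max_cycles:     stop after this many checks; None means run forever.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        pending = await fetch_pending()
        if pending:
            # Run all pending domains concurrently within this cycle.
            await asyncio.gather(*(process_domain(d) for d in pending))
        cycles += 1
        if max_cycles is None or cycles < max_cycles:
            await asyncio.sleep(interval_seconds)
    return cycles
```

A service entry point might call `asyncio.run(poll_for_jobs(fetch_pending, process_domain, interval_seconds=300))`. Note that `asyncio` gives concurrency for I/O-bound work such as downloads and database calls; CPU-heavy parsing would likely need separate worker processes to run in parallel.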
I will supply the full Matlab code and provide guidance throughout. I prefer to work through GitHub on a task-by-task basis, with an agreed time estimate before each task and a review period after each task to fix bugs.
This may help: I've spotted Twisted - an event-driven, asynchronous networking framework for Python.
Also - Beautiful Soup, a Python HTML parsing library.
It'd be a plus if you're familiar with either of these or similar.