We are a fast-growing predictive analytics firm in New York City, looking for data mining / machine learning geniuses to help us with our day-to-day projects.
Our clients include small business lenders, hedge funds, economists, retailers and real estate agencies.
IMPORTANT: We use Weka extensively, so please don't apply if you're not an expert with the software.
To pick the best candidates for the position, we've created a small test to gauge your skill level.
How to Apply:
1. Create a Weka Serialized Model file to solve the classification problem discussed below
2. Send us the serialized model file
3. We'll test your serialized model file against an out-of-sample set
4. If your classification model performs well, we'll hire you
We have built a system that automatically creates standardized machine learning algorithms using a wide variety of Weka classifiers. This system has been built to solve binary classification problems only.
The system automatically splits labeled instances into two sets: a training set (tr) and a testing set (te).
The system also creates two additional files with balanced class distributions: one derived from the training set (tr-x1) and one derived from the testing set (te-x1).
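The split and the balanced files described above can be sketched roughly as follows. This is only an illustration, not the system's actual code: the 30% test fraction, the use of random undersampling to balance the classes, and the function names `split_train_test` and `balance_by_undersampling` are all assumptions, since the method is not specified here.

```python
import random
from collections import defaultdict


def split_train_test(instances, test_fraction=0.3, seed=0):
    """Shuffle labeled instances and split them into (tr, te).

    test_fraction is an assumed value, chosen only for illustration.
    """
    rng = random.Random(seed)
    shuffled = instances[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def balance_by_undersampling(instances, seed=0):
    """Return a subset in which every class present has as many
    instances as the rarest class (assumed balancing method)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in instances:
        by_class[label].append((features, label))
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Toy binary data set: 80 "neg" instances and 20 "pos" instances.
data = [([i], "neg") for i in range(80)] + [([i], "pos") for i in range(20)]

tr, te = split_train_test(data)
tr_x1 = balance_by_undersampling(tr)  # balanced file derived from tr
te_x1 = balance_by_undersampling(te)  # balanced file derived from te
```

Undersampling keeps only real instances; oversampling the minority class would be another way to reach the same balanced distribution.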
All algorithms are built on the training set (tr), and each model's accuracy is evaluated on te, te-x1, tr and tr-x1 (using Weka's Experimenter).
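To see why accuracy on a balanced file can differ sharply from accuracy on the original file, consider a trivial model that always predicts the majority class. The sketch below uses made-up label counts (90/10 for te, 10/10 for te-x1) and a hypothetical `always_neg` predictor; it is not output from Weka's Experimenter.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def always_neg(labels):
    """Hypothetical degenerate model: predicts "neg" for every instance."""
    return ["neg"] * len(labels)


# Assumed label counts, for illustration only.
te_labels = ["neg"] * 90 + ["pos"] * 10      # imbalanced testing set
te_x1_labels = ["neg"] * 10 + ["pos"] * 10   # balanced version

acc_te = accuracy(always_neg(te_labels), te_labels)
acc_te_x1 = accuracy(always_neg(te_x1_labels), te_x1_labels)

print(f"te accuracy:    {acc_te:.2f}")
print(f"te-x1 accuracy: {acc_te_x1:.2f}")
```

The degenerate model looks strong on the imbalanced set and no better than a coin flip on the balanced one, which is one reason the four accuracy figures can carry different information about the same algorithm.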
Your task is to build a classifier (i.e. a Weka serialized model file) that identifies which algorithms are expected to do well in an out-of-sample test, based on how they perform in these in-sample experiments on tr, tr-x1, te and te-x1.
The training and testing sets listed below contain instances of experiment performance for a variety of algorithms, each with a label that reflects how well the algorithm performed on an out-of-sample test.
You'll be expected to send us a serialized model file, which we'll test on an out-of-sample set.
We'll hire the candidates who produce the most accurate classification models in the out-of-sample test.
Training and Test Sets:
You can download the training set here:
If you'd like, you can test your models against this test set:
Each line represents a unique instance, i.e. experiment performance data on a unique algorithm.
Labels (classes) are found in the last column (column CS). In other words, each algorithm is already labeled as "Good" or "Bad", based on how it performed on an out-of-sample test.
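If the files are opened in a spreadsheet, column CS is the 97th column (C = 3, S = 19, so 3 × 26 + 19 = 97). Assuming every other column is an attribute, that leaves 96 attributes per instance. The sketch below shows the letter-to-index conversion and how a parsed row might be separated into attributes and label; the helper names and the toy row are hypothetical, and the actual file format is not stated here.

```python
def column_letters_to_index(letters):
    """Convert spreadsheet column letters (A, B, ..., Z, AA, ...) to a 1-based index."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def split_features_and_label(row):
    """Treat the last value in a row as the class label and the rest as attributes."""
    return row[:-1], row[-1]


label_column = column_letters_to_index("CS")
attribute_count = label_column - 1  # assumes all other columns are attributes

# Toy row with made-up values; real rows would have far more attributes.
toy_row = ["0.91", "0.88", "Good"]
features, label = split_features_and_label(toy_row)
```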
You'll notice that most of the attributes are derived from Weka's experiment outputs.
Your task: Use these training and testing sets to build a serialized model to classify new instances as "Good" or "Bad".
Good luck! We look forward to hearing from you.