Hi, I’m looking for an expert consultant in databases and Python.
I have a text table of 5M rows * 10 columns. (Please look at the sample data file. It shows just 1 row, but I have 5 million rows like it.) The text isn't too long: each text field has at most about 300 characters. 2 columns contain a numpy array of 50 floats.
What I need to do is search for a specific substring (or multiple substrings combined with OR) in a specific column, and return the matching rows.
For example, in the sample data, I'd like to return the row if the string in the 'DO' column contains the substring 'revenue'.
Your job is:
1. Pick the best (fastest) database for this type of job, if that isn't MySQL.
2. Write code that shows how to install the database on the EC2 instance.
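3. Write (wrap) the DB search queries in Python. For example, if it’s SQL:
import MySQLdb
db = MySQLdb.connect()  # connection arguments omitted
cursor = db.cursor()
# Pass the id as a query parameter instead of formatting it into the string
sql = "SELECT tvct_raw.estimated_users FROM tvct_raw WHERE id = %s"
cursor.execute(sql, (the_id,))
results = cursor.fetchall()
for row in results:
    print(row)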
4. The data is currently stored as pandas DataFrames in hundreds of pickle files (<100 MB).
Write the code to migrate them (store them for the first time) into the database.
5. Our server (EC2) has only 4 GB of RAM. Considering the size of the data, recommend an EC2 upgrade option (if necessary).
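For item 3, the kind of search wrapper I have in mind looks roughly like the sketch below. It uses Python's built-in sqlite3 only as a stand-in so the example runs anywhere. It is not the database I expect you to pick. The table name `docs`, the function name `search_substrings` and the column whitelist are made up for illustration.

```python
import sqlite3

# Hypothetical whitelist: column names cannot be bound as SQL parameters,
# so only names listed here are allowed into the query text.
ALLOWED_COLUMNS = {"DO"}

def search_substrings(conn, column, substrings):
    """Return rows of the (hypothetical) `docs` table where `column`
    contains at least one of the given substrings."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown column: {column}")
    if not substrings:
        return []
    # One instr(...) > 0 test per substring, joined with OR.
    # instr() does a plain, case-sensitive substring match in SQLite.
    where = " OR ".join(f'instr("{column}", ?) > 0' for _ in substrings)
    cursor = conn.execute(f"SELECT * FROM docs WHERE {where}", list(substrings))
    return cursor.fetchall()

# Tiny in-memory demo with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE docs (id INTEGER PRIMARY KEY, "DO" TEXT)')
conn.executemany(
    'INSERT INTO docs ("DO") VALUES (?)',
    [("increase revenue",), ("cut costs",), ("say hello",)],
)
rows = search_substrings(conn, "DO", ["revenue", "hello"])
for row in rows:
    print(row)
```

The real solution may need a very different approach (for example a full-text or trigram index) to be fast on 5 million rows. This only shows the interface I'd like to call from Python.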
Until now I was pretty satisfied with using the pandas method df[df['A'].str.contains("hello|jello")] to match substrings, because it's pretty fast. But the reason I need to move to a DB is that after loading 10 million rows into a pandas DataFrame, my MacBook says "out of space".
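To be concrete about the behavior I want to keep, here is a small sketch of the pandas approach on made-up rows. The column name 'DO' comes from the sample data. The row contents and the `substrings` list are invented for the example.

```python
import re
import pandas as pd

# Made-up rows standing in for the real table.
df = pd.DataFrame({"DO": ["increase revenue by 10%", "reduce costs", "hello world"]})

# str.contains treats the pattern as a regex by default, so "|" means OR.
# re.escape keeps each substring literal in case it contains regex characters.
substrings = ["revenue", "hello"]
pattern = "|".join(re.escape(s) for s in substrings)

# na=False treats missing values as "no match" instead of propagating NaN.
matches = df[df["DO"].str.contains(pattern, na=False)]
print(matches)
```

The database search should return the same rows as this for the same substrings.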
I'm looking for someone who can nicely comment on the code in each step.
If you are too cool for commenting on the code, please don't apply for this job.