Data Processing and Formatting with Python

Closed - This job posting has been filled and work has been completed.
Web, Mobile & Software Dev Scripts & Utilities Posted 1 year ago


Hours to be determined
Less than 1 week

Intermediate Level

Start Date

August 1, 2014


I have a data file that looks like this:

[S and/CC]
[VP will/FP collapse/VBP_MS3]
[NP #/PUNC brotherhood/NNP]
[PP with/IN others/NN]
[NP getting/NN_PRP_MS3]
[PP in/IN]
[WHNP that/WP]
[VP writes/VBD_MS3]
[NP unknown/NNS_MP]
[NP not/NN feel/NNP that/NNP]
[VP be/FP done/VBP_MS3]
[NP #/PUNC all/NNP]

And I want to convert it into this:

and S-B CC
will VP-B FP
collapse VP-I VBP_MS3
brotherhood NP-I NNP
with PP-B IN
others PP-I NN
getting NP-B NN_PRP_MS3
in PP-B IN
that WHNP-B WP
writes VP-B VBD_MS3
unknown NP-B NNS_MP
and O CC
be VP-B FP
done VP-I VBP_MS3
all NP-I NNP


Every sequence of words within square brackets is a single unit, and the tag for that unit is NP, VP, WHNP, etc.

Every word in the sequence has one of two positions: either B (beginning) or I (inside). So, for example, NP-B means that this is the first word in the NP sequence, and NP-I means it is the 2nd, 3rd, 4th, ... nth word in the sequence.
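To illustrate the B/I rule, here is a minimal sketch in Python. The function name label_positions is hypothetical and chosen only for this illustration; it assumes the words of one unit have already been extracted into a list.

```python
def label_positions(unit_tag, words):
    """Pair each word with TAG-B (first word) or TAG-I (every later word)."""
    return [
        (word, f"{unit_tag}-{'B' if index == 0 else 'I'}")
        for index, word in enumerate(words)
    ]


# The two-word VP unit "be done" from the sample data
print(label_positions("VP", ["be", "done"]))
# prints [('be', 'VP-B'), ('done', 'VP-I')]
```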

Every word in the sequence also has its own tag, which follows the / . We need to keep that information too, so the output has three columns:

the word, its position in the sequence (the unit tag plus either B or I), and its own tag

Some words are not part of any sequence (i.e. they are not inside square brackets). These words get O in the second column, followed by their own tag (the one that starts with /), as in the and/CC example.
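The conversion described above could be sketched roughly as follows. This is only an outline under several assumptions that the sample does not confirm: each bracketed unit sits on its own line, unbracketed words appear on their own line as word/tag tokens, and every token (including punctuation such as #/PUNC) is kept in the output. The names CHUNK_RE, split_token and convert_line are hypothetical.

```python
import re

# Assumed shape of a bracketed unit: "[TAG word/tag word/tag ...]"
CHUNK_RE = re.compile(r"^\[(\S+)\s+(.*)\]$")


def split_token(token):
    """Split 'word/tag' on the last slash; a token without a slash gets an empty tag."""
    word, sep, tag = token.rpartition("/")
    if not sep:
        return token, ""
    return word, tag


def convert_line(line):
    """Convert one input line into a list of 'word position tag' output rows."""
    line = line.strip()
    if not line:
        return []
    match = CHUNK_RE.match(line)
    if match:
        unit_tag, body = match.groups()
        rows = []
        for index, token in enumerate(body.split()):
            word, tag = split_token(token)
            position = "B" if index == 0 else "I"
            rows.append(f"{word} {unit_tag}-{position} {tag}")
        return rows
    # No square brackets: every word on the line is outside any unit
    return [f"{word} O {tag}" for word, tag in map(split_token, line.split())]


for sample in ["[VP be/FP done/VBP_MS3]", "and/CC"]:
    print("\n".join(convert_line(sample)))
```

Running convert_line over every line of the data file and writing the returned rows out one per line would give the three-column format. If punctuation tokens should be dropped instead of kept, a filter on the PUNC tag could be added inside the loop.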
