I am looking for someone out there to help with a natural language processing / machine learning problem. I have a series of similar but inconsistent text files that have numbered hierarchical paragraphs of unstructured text with defined terms scattered throughout the blocks of unstructured text. The attached document has more information as well as a sample of the format. The exact structure therein cannot be counted on to appear exactly as shown, but there will be some sort of hierarchy following this format. There may or may not be indenting. You can’t do the parsing based on formatting because it will not be consistent across files. Somehow, however, you will have separate, hierarchical paragraphs with some sort of numbering and / or lettering scheme similar to the one shown below.
Ideally, I'd like to parse this document into an xml file. I want to preserve the hierarchy of the document in the xml file. Within the xml file, I’d like to achieve two things in particular. First, all defined terms should be tagged with the xml tags for <definition></definition>. Second, if there are internal references to other parts of the file, I’d like those references to be tagged and refer to the corresponding part of the xml document. For example, see some examples in 2(a)(j) of the attachment. As you can see, you might see a reference set off by Section #. You also might see a reference to another section that is a child of the same parent section – e.g. Subsection b of this subsection.
I’d prefer a solution in python or Java. You are free to use whatever open source software or libraries you choose.
A sample format with more information is attached as a word doc