
In the previous posts, Automated Ontology Generation (Part 1) and Automated Ontology Generation (Part 2: Results), I advocated using natural language processing (NLP) on Upwork’s user profiles and job posts.
Although Java is not typically used for NLP implementations, we decided to implement the software in it. For automated ontology generation, Java has some key benefits:
Upwork uses the open-source Java CoreNLP toolkit, https://stanfordnlp.github.io/CoreNLP, to simplify working with NLP. As great as CoreNLP is, there were some painful issues:
Part of speech (POS) tagging was accomplished using the MaxentTagger class as a baseline. This class is thread-safe, fast (60K documents every 10 minutes using multiple threads), and could provide a solid foundation for subsequent improvements.
Scaling applications for parallel processing tasks requires fine-tuning parser threads on server hardware. Parsing consumes a lot of CPU cycles. Initially, we set the number of parsing threads to 1, which matched the number of cores on the test hardware.
Dependency parsing was added using CoreNLP’s Simple API, https://stanfordnlp.github.io/CoreNLP/simple.html, and we ran into some concerns:
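As a rough illustration of matching parser threads to cores, the sketch below sizes a fixed thread pool from the number of processors the JVM reports. `ParserPool`, `parseAll`, and the `parser` function are hypothetical names standing in for the app's real parsing code, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of CoreNLP or the original app.
final class ParserPool {

    /** One parsing thread per core reported by the JVM (at least one). */
    static int defaultThreads() {
        return Math.max(1, Runtime.getRuntime().availableProcessors());
    }

    /** Parses every document on a fixed-size pool and returns results in input order. */
    static <R> List<R> parseAll(List<String> docs, Function<String, R> parser, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String doc : docs) {
                Callable<R> task = () -> parser.apply(doc);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // waits for that document to finish
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while parsing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Parsing failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as `ParserPool.parseAll(docs, myParser, ParserPool.defaultThreads())` would then keep every core busy without oversubscribing the CPU. The right thread count on real server hardware still has to be found by measurement.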
Next, we tried the standard API, https://stanfordnlp.github.io/CoreNLP/api.html, with these results:
We followed the guidelines published at https://stanfordnlp.github.io/CoreNLP/memory-time.html. We discovered that named entity recognition (NER) was not viable at Upwork’s scale. With NER configured as a parsing pipeline option, only 5K documents could be processed in 10 minutes.
To discover compound terms, we experimented with the SemanticGraph class, https://nlp.stanford.edu/nlp/javadoc/javanlp-3.5.0/edu/stanford/nlp/semgraph/SemanticGraph.html. Although dependency checking with the Simple API was easy and straightforward, SemanticGraph requires using the standard API.
There weren’t any examples explaining the returned collections and the links among class methods. For example, the method getLeafVertices() returns a set of IndexedWords with no explanation. After a lot of trial and error, we arrived at the following Java code snippet from our app, which finds the parts of a compound term:

The code example above finds the governor, https://en.wikipedia.org/wiki/Government_(linguistics), of compound relations. In a dependency relation, the governor is the word on which the other words (its dependents) rely. The code implements a “belt and suspenders” strategy by checking both incoming and outgoing edges. Without documentation, we weren’t sure if our code should check one edge group or if both groups needed to be checked.
For intermediate and final storage of the results of NLP work, you’ll need a relational database. Since we wanted to embed the relational database in our app, we tried HyperSQL (HSQLDB), https://hsqldb.org. HSQLDB can switch from in-memory tables to on-disk tables with the setting of a single keyword. HSQLDB is fast, reliable, and well-documented.
To our pleasant surprise, the time the app spends writing to the database and performing database-related tasks is negligible compared to parsing documents. Specifically, database time was between 10 and 20 percent of parse time. Finally, we kept database code generic so that we could easily switch to another relational database in the future.
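As a dependency-free illustration of the incoming/outgoing check described here, the following minimal sketch models dependency edges with a stand-in `Edge` class. It is not the app's code and does not use CoreNLP; the class and method names are hypothetical. With CoreNLP, the two checks would presumably map to `SemanticGraph.incomingEdgeIterable(...)` and `SemanticGraph.outgoingEdgeIterable(...)`, with each `SemanticGraphEdge` exposing its governor, dependent, and relation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Dependency-free stand-in; Edge is NOT CoreNLP's SemanticGraphEdge.
final class CompoundTerms {

    /** One dependency edge: governor --relation--> dependent. */
    static final class Edge {
        final String governor;
        final String dependent;
        final String relation;

        Edge(String governor, String dependent, String relation) {
            this.governor = governor;
            this.dependent = dependent;
            this.relation = relation;
        }
    }

    /** Finds the governor of the compound that the word belongs to, if any. */
    static Optional<String> governorOf(String word, List<Edge> edges) {
        // Incoming edges: the word is a dependent, so the edge names its governor.
        for (Edge e : edges) {
            if ("compound".equals(e.relation) && e.dependent.equals(word)) {
                return Optional.of(e.governor);
            }
        }
        // Outgoing edges: the word governs another part, so it is the governor itself.
        for (Edge e : edges) {
            if ("compound".equals(e.relation) && e.governor.equals(word)) {
                return Optional.of(word);
            }
        }
        return Optional.empty();
    }

    /** Lists the compound's dependents in edge order, followed by the governor. */
    static List<String> partsOf(String governor, List<Edge> edges) {
        List<String> parts = new ArrayList<>();
        for (Edge e : edges) {
            if ("compound".equals(e.relation) && e.governor.equals(governor)) {
                parts.add(e.dependent);
            }
        }
        parts.add(governor);
        return parts;
    }
}
```

Checking both directions means a lookup succeeds whether it starts from a modifier ("machine") or from the head word ("engineer"), which is the point of the belt-and-suspenders approach.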
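To make the "single keyword" point concrete, here is a minimal sketch that assumes the keyword in question is HSQLDB's table type in `CREATE MEMORY TABLE` versus `CREATE CACHED TABLE`. The `terms` table, its columns, and the `TermStore` class are hypothetical and only serve the illustration.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper; table and column names are made up for illustration.
final class TermStore {

    /** HSQLDB table types: MEMORY keeps rows in RAM, CACHED keeps them on disk. */
    enum TableMode { MEMORY, CACHED }

    /** Builds the DDL; only the keyword after CREATE differs between modes. */
    static String createTermsTableSql(TableMode mode) {
        return "CREATE " + mode.name() + " TABLE terms ("
                + "term VARCHAR(255) PRIMARY KEY, "
                + "doc_count INTEGER NOT NULL)";
    }

    /** Runs the DDL on an open JDBC connection (requires the HSQLDB driver at runtime). */
    static void createTermsTable(Connection conn, TableMode mode) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.executeUpdate(createTermsTableSql(mode));
        }
    }
}
```

With the HSQLDB jar on the classpath, a connection could be opened with something like `DriverManager.getConnection("jdbc:hsqldb:file:data/terms", "SA", "")`; the file path here is only an example. Because everything else goes through plain JDBC, the rest of the code stays portable to other relational databases.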
All Upwork documents (like user profiles, job postings, and catalog projects) are expected to be complete and written in English. However, about 1 percent of documents are either written in other languages, too short, or not usable.
A Java-based language detector, https://github.com/optimaize/language-detector, can be used to filter outlier documents before any NLP work is performed. With over 70 built-in language profiles included, the language detector appears to work well.
The NLP app has been designed as an organized set of “share nothing executions.” Source data is partitioned per execution: each execution retrieves only the subset of source documents allocated to it. For example:
The result is as follows:
Each application has one thread reading the source documents, filtering out non-English and junk documents, and putting the remaining documents into a limited-capacity queue for threads to parse. A parsing thread stores the terms in the HSQLDB database as it increments running aggregates. The process is shown below in this high-level view of the app design:

High-level view of application design
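The design above can be sketched in a few dozen lines of plain Java. This is a simplified illustration, not the app's code: the partition rule (document ID modulo the number of executions) is an assumption, whitespace splitting stands in for NLP parsing, and a `ConcurrentHashMap` stands in for the running aggregates that the app keeps in HSQLDB.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch: one reader thread, a bounded queue, several parser threads.
final class SharedNothingPipeline {
    // Sentinel compared by identity, so no real document can be mistaken for it.
    private static final String POISON = new String("<end-of-input>");

    /** Assumed partition rule: execution i owns documents whose id mod total equals i. */
    static boolean ownedBy(long docId, int executionIndex, int totalExecutions) {
        return Math.floorMod(docId, (long) totalExecutions) == executionIndex;
    }

    static Map<String, Integer> run(Map<Long, String> docs, int executionIndex,
                                    int totalExecutions, int parserThreads) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100); // limited capacity
        Map<String, Integer> termCounts = new ConcurrentHashMap<>(); // stand-in for the database

        Thread reader = new Thread(() -> {
            try {
                for (Map.Entry<Long, String> doc : docs.entrySet()) {
                    if (ownedBy(doc.getKey(), executionIndex, totalExecutions)) {
                        queue.put(doc.getValue()); // blocks while the queue is full
                    }
                }
                for (int i = 0; i < parserThreads; i++) {
                    queue.put(POISON); // one stop signal per parser thread
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        List<Thread> parsers = new ArrayList<>();
        for (int i = 0; i < parserThreads; i++) {
            parsers.add(new Thread(() -> {
                try {
                    for (String text = queue.take(); text != POISON; text = queue.take()) {
                        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
                            if (!term.isEmpty()) {
                                termCounts.merge(term, 1, Integer::sum); // atomic increment
                            }
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
        }

        reader.start();
        parsers.forEach(Thread::start);
        try {
            reader.join();
            for (Thread parser : parsers) {
                parser.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the pipeline", e);
        }
        return termCounts;
    }
}
```

Because each execution only ever touches its own partition, several instances can run side by side without coordinating with each other. The bounded queue keeps the reader from racing ahead of the much slower parsing threads.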
SQL was used to store the terms and perform all calculations to avoid deadlocks between threads and application instances.
We started this project to determine if we could grow our ontology automatically to do the following:
The last item in the list deserves more explanation. If a new JavaScript framework appears in the job market, Upwork needs to ensure that those with that new skill can advertise it without delay.
The evolving Upwork framework I have presented in this three-part Ontology series is a good start. Ontologists still need to review the results before updating the ontology, but they can do so in less time and with significantly less effort. That’s where automation can make the difference.