Automated Ontology Generation (Part 3: Toolkits and SDKs)

June 6, 2022

Selecting a programming language

In the previous posts, Automated Ontology Generation (Part 1) and Automated Ontology Generation (Part 2: Results), I advocated using natural language processing (NLP) on Upwork’s user profiles and job posts.

Although Java is not typically used for NLP implementations, we decided to implement the software in it. For automated ontology generation, Java has some key benefits:

The work does not require complex matrix operations
Most software developers at Upwork can write Java code
Java lends itself to multithreading and parallel processing

Simplify the work with NLP toolkits

Upwork uses the open-source Java CoreNLP toolkit, https://stanfordnlp.github.io/CoreNLP, to simplify working with NLP. As great as CoreNLP is, there were some painful issues:

Lack of technical documentation
Lack of testing (especially when software scales)

Performing POS tagging

Part-of-speech (POS) tagging was accomplished using the MaxentTagger class as a baseline. This class is thread-safe and fast (60K documents every 10 minutes using multiple threads), and it gave us a solid foundation for subsequent improvements.

Using Simple API

Scaling an application for parallel processing requires fine-tuning the number of parser threads for the server hardware, because parsing consumes a lot of CPU cycles. Initially, we set the number of parsing threads to 1, which matched the number of cores on the test hardware.
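As a rough illustration of tying parser threads to the hardware, the sketch below sizes a fixed thread pool from a core count. The class name `ParserPool`, its methods, and the stand-in `parse` task are assumptions made for this example, not the app's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParserPool {
    /** Hypothetical policy: one parsing thread per available core, never fewer than one. */
    static int parserThreads(int availableCores) {
        return Math.max(1, availableCores);
    }

    /** Stand-in for a CPU-heavy parse; it only counts whitespace-separated tokens. */
    static int parse(String document) {
        String trimmed = document.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Parses all documents on a fixed-size pool and returns the total token count. */
    static int parseAll(List<String> documents, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (String doc : documents) {
                Callable<Integer> task = () -> parse(doc);
                results.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get(); // waits for each task to finish
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("parsing interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("parsing failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real run, the thread count would typically come from `Runtime.getRuntime().availableProcessors()` and then be tuned by measurement.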

Dependency parsing was added using CoreNLP’s Simple API, https://stanfordnlp.github.io/CoreNLP/simple.html, and we ran into some concerns:

Dependency parsing with Simple API is about three times slower (20K documents over 10 minutes) than just POS tagging with MaxentTagger.
Simple API slows down sharply as it processes more documents. By the end of the first hour, it was spending more than 10 seconds on a single document.

Using the standard API

Next, we tried the standard API, https://stanfordnlp.github.io/CoreNLP/api.html, with these results:

The standard CoreNLP API initially performed better than Simple API, processing 30K documents in 10 minutes.
Over time, performance degraded to about 15K documents in 10 minutes.

We followed the guidelines published at https://stanfordnlp.github.io/CoreNLP/memory-time.html. We discovered that named entity recognition (NER) was not a viable option at Upwork’s scale. With NER configured as a parsing pipeline option, only 5K documents could be processed in 10 minutes.
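To compare the configurations on one scale, the reported figures can be converted to documents per second. The sketch below only restates numbers quoted in this post. The class and method names are illustrative, and because the thread counts behind each figure are not stated here, the rates are totals rather than per-thread values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class Throughput {
    /** Converts "N documents in M minutes" into documents per second. */
    static double docsPerSecond(long documents, long minutes) {
        return documents / (minutes * 60.0);
    }

    /** Rates derived from the figures reported in this post, in the order they appear. */
    static Map<String, Double> reportedRates() {
        Map<String, Double> rates = new LinkedHashMap<>();
        rates.put("MaxentTagger POS tagging", docsPerSecond(60_000, 10));
        rates.put("Simple API dependency parsing", docsPerSecond(20_000, 10));
        rates.put("Standard API, initial", docsPerSecond(30_000, 10));
        rates.put("Standard API, over time", docsPerSecond(15_000, 10));
        rates.put("Standard API with NER", docsPerSecond(5_000, 10));
        return rates;
    }

    public static void main(String[] args) {
        reportedRates().forEach((name, rate) ->
                System.out.printf("%-32s %6.1f docs/s%n", name, rate));
    }
}
```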

Analyzing dependencies with SemanticGraph class

To discover compound terms, we experimented with the SemanticGraph class, https://nlp.stanford.edu/nlp/javadoc/javanlp-3.5.0/edu/stanford/nlp/semgraph/SemanticGraph.html. Although dependency checking with Simple API was easy and straightforward, SemanticGraph requires using the standard API.

There weren’t any examples explaining the returned collections or the links among class methods. For example, the method getLeafVertices() returns a set of IndexedWords with no explanation. After a lot of trial and error, we arrived at a Java code snippet in our app that finds the parts of a compound term:

The code example (above) finds the governor, https://en.wikipedia.org/wiki/Government_(linguistics), of compound relations. In a dependency relation, the governor is the head word on which its dependents depend. The code implements a “belt and suspenders” strategy by checking both incoming and outgoing edges. Without documentation, we weren’t sure whether our code should check one edge group or both.
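The both-directions check can be sketched without CoreNLP on the classpath. In the example below, `Edge` and `Graph` are hypothetical stand-ins for CoreNLP's edge and SemanticGraph classes, not their real APIs, and the code is an illustration rather than the app's actual snippet. It also assumes each word occurs only once in the sentence.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class CompoundFinder {
    /** Stand-in for a typed dependency edge: governor --relation--> dependent. */
    static final class Edge {
        final String governor;
        final String relation;
        final String dependent;

        Edge(String governor, String relation, String dependent) {
            this.governor = governor;
            this.relation = relation;
            this.dependent = dependent;
        }
    }

    /** Minimal stand-in graph that exposes incoming and outgoing edges per word. */
    static final class Graph {
        private final List<Edge> edges = new ArrayList<>();

        void add(String governor, String relation, String dependent) {
            edges.add(new Edge(governor, relation, dependent));
        }

        Set<String> words() {
            Set<String> words = new LinkedHashSet<>();
            for (Edge e : edges) {
                words.add(e.governor);
                words.add(e.dependent);
            }
            return words;
        }

        List<Edge> outgoing(String word) {
            List<Edge> result = new ArrayList<>();
            for (Edge e : edges) if (e.governor.equals(word)) result.add(e);
            return result;
        }

        List<Edge> incoming(String word) {
            List<Edge> result = new ArrayList<>();
            for (Edge e : edges) if (e.dependent.equals(word)) result.add(e);
            return result;
        }
    }

    /** Collects governors of "compound" relations, checking both edge groups for every word. */
    static Set<String> compoundGovernors(Graph graph) {
        Set<String> governors = new LinkedHashSet<>();
        for (String word : graph.words()) {
            for (Edge e : graph.outgoing(word)) {
                if ("compound".equals(e.relation)) governors.add(e.governor);
            }
            // Redundant with the outgoing pass on a complete graph; kept as the "suspenders".
            for (Edge e : graph.incoming(word)) {
                if ("compound".equals(e.relation)) governors.add(e.governor);
            }
        }
        return governors;
    }
}
```

Because the result is a set, checking both edge groups costs a second pass but never produces duplicate governors.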

Using a database

For intermediate and final storage of the results of NLP work, you’ll need a relational database. Since we wanted to embed the database in our app, we tried HyperSQL (HSQLDB), https://hsqldb.org. HSQLDB can switch from in-memory tables to on-disk tables by changing a single keyword. It is fast, reliable, and well-documented.
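As a sketch of how small that switch can be: HSQLDB's CREATE TABLE statement accepts a table-type keyword, MEMORY or CACHED, where cached tables keep their data on disk. The snippet below assumes this is the keyword in question. The `terms` table, its columns, and the class name are hypothetical. Calling `createTermsTable` requires the HSQLDB driver on the classpath and an open connection to a file-based database.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

final class TermStoreSchema {
    /** Builds the DDL for a hypothetical terms table; only the table-type keyword changes. */
    static String createTermsTableSql(boolean onDisk) {
        String tableType = onDisk ? "CACHED" : "MEMORY";
        return "CREATE " + tableType + " TABLE terms ("
                + "term VARCHAR(255) PRIMARY KEY, "
                + "doc_count BIGINT NOT NULL)";
    }

    /** Executes the DDL on an already opened JDBC connection. */
    static void createTermsTable(Connection connection, boolean onDisk) throws SQLException {
        try (Statement statement = connection.createStatement()) {
            statement.execute(createTermsTableSql(onDisk));
        }
    }
}
```

Everything here except the keyword is plain JDBC, which fits the goal of keeping the database code portable.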

To our pleasant surprise, the time the app spends writing to the database and performing database-related tasks is small compared to the time spent parsing documents. Specifically, database time was between 10 and 20 percent of parse time. We also kept the database code generic so that we could easily switch to another relational database in the future.

Detecting a language

All Upwork documents (like user profiles, job postings, and catalog projects) are expected to be complete and written in English. However, about 1 percent of documents are written in other languages, are too short, or are otherwise unusable.

A Java-based language detector, https://github.com/optimaize/language-detector, can be used to filter out outlier documents before any NLP work is performed. With over 70 built-in language profiles, the language detector appears to work well.
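The filtering step can be sketched independently of any particular detector. In the example below, the `Function<String, Optional<String>>` parameter is a hypothetical stand-in for a call into the language-detector library (it is not that library's API), and the minimum length is an arbitrary illustrative threshold.

```java
import java.util.Optional;
import java.util.function.Function;

final class DocumentFilter {
    private final Function<String, Optional<String>> detector; // returns a language code, if any
    private final int minLength;

    DocumentFilter(Function<String, Optional<String>> detector, int minLength) {
        this.detector = detector;
        this.minLength = minLength;
    }

    /** Accepts only documents that are long enough and detected as English ("en"). */
    boolean accept(String text) {
        if (text == null) {
            return false;
        }
        String trimmed = text.trim();
        if (trimmed.length() < minLength) {
            return false; // too short to parse usefully or to detect reliably
        }
        return detector.apply(trimmed).map("en"::equals).orElse(false);
    }
}
```

Keeping the detector behind a plain function makes it easy to swap the library or stub it in tests.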

What the application does

The NLP app has been designed as an organized set of “share nothing executions.” Source data is partitioned so that each execution retrieves only the subset of source documents allocated to it. For example:

If each source document has a unique identifier in range R and ...
We want to execute ten instances simultaneously, then ...
Each instance is given the documents in one of ten ranges of size R/10, assuming an even distribution of identifiers across the ranges.

The result is as follows:

The first partition gets the range 0 to R/10
The second partition gets the range R/10 to 2*R/10
And so on ...
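The range assignment above can be written as a small helper. This is a minimal sketch: the class and method names are made up for illustration, and treating each range as half-open (start included, end excluded) is an assumption made here so that neighbouring partitions do not both claim a boundary identifier.

```java
final class Partitioner {
    /**
     * Returns the half-open identifier range {start, end} for one partition,
     * where identifiers run from 0 (inclusive) to r (exclusive).
     * Assumes r is small enough that partitions * r does not overflow a long.
     */
    static long[] rangeFor(int partition, int partitions, long r) {
        if (partitions <= 0 || partition < 0 || partition >= partitions) {
            throw new IllegalArgumentException("partition must be in [0, partitions)");
        }
        long start = partition * r / partitions;
        long end = (partition + 1) * r / partitions;
        return new long[] {start, end};
    }

    public static void main(String[] args) {
        long r = 1_000;
        for (int i = 0; i < 10; i++) {
            long[] range = rangeFor(i, 10, r);
            System.out.println("partition " + i + ": [" + range[0] + ", " + range[1] + ")");
        }
    }
}
```

Computing both bounds from the same formula means the ranges tile the identifier space exactly, even when R is not divisible by the number of instances.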

Each application instance has one thread that reads the source documents, filters out non-English and junk documents, and puts the remaining documents into a limited-capacity queue for the parsing threads. A parsing thread stores the terms in the HSQLDB database as it increments running aggregates. The process is shown below in this high-level view of the app design:

Figure: High-level view of the application design
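The reader-and-parsers arrangement can be sketched with a bounded queue from the Java standard library. Several things here are assumptions for illustration: the calling thread plays the reader, an in-memory counter stands in for the aggregates the app keeps in the database, token counting stands in for parsing, and an end-of-input marker is used to stop the parsing threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

final class ParsePipeline {
    // Unique sentinel instance, compared by identity, that tells a parser to stop.
    private static final String END_OF_INPUT = new String("<end-of-input>");

    /** Feeds documents through a bounded queue to parser threads; returns the total token count. */
    static long run(List<String> documents, int parserThreads, int queueCapacity) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(queueCapacity);
        AtomicLong tokenCount = new AtomicLong(); // stand-in for running aggregates

        List<Thread> parsers = new ArrayList<>();
        for (int i = 0; i < parserThreads; i++) {
            Thread parser = new Thread(() -> {
                try {
                    while (true) {
                        String doc = queue.take(); // blocks while the queue is empty
                        if (doc == END_OF_INPUT) {
                            break;
                        }
                        tokenCount.addAndGet(doc.trim().split("\\s+").length);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            parser.start();
            parsers.add(parser);
        }

        try {
            for (String doc : documents) {
                // Stand-in filter: skip empty documents before they reach the parsers.
                if (doc != null && !doc.trim().isEmpty()) {
                    queue.put(doc); // blocks while the queue is full
                }
            }
            for (int i = 0; i < parserThreads; i++) {
                queue.put(END_OF_INPUT); // one marker per parser thread
            }
            for (Thread parser : parsers) {
                parser.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return tokenCount.get();
    }
}
```

The limited capacity is what keeps the reader from racing ahead of the parsers: when the queue is full, `put` blocks until a parsing thread takes a document.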

SQL was used to store the terms and perform all calculations to avoid deadlocks between threads and application instances.

Summary

We started this project to determine if we could grow our ontology automatically to do the following:

Keep up with the progress in human knowledge
Discover new specialties that freelancers have and clients need on Upwork
Produce ontology tags from text in such a way that tags keep up with market trends

The last item in the list deserves more explanation. If a new JavaScript framework appears on the job market, Upwork needs to ensure that freelancers with that new skill can advertise it without delay.

The evolving Upwork framework I have presented in this three-part Ontology series is a good start. Ontologists still need to review the results before updating the ontology, but they can do so in less time and with significantly less effort. That’s where automation can make the difference.