Automatic Ontology Generation (Part 2: Results)

May 11, 2022

In the previous post, Automatic Ontology Generation (Part 1: Basics), I promised to report how the approach applies to freelancer profiles at Upwork. Those results are the focus of this second blog.

Generating results

Deriving linguistic annotations for text can be difficult to understand and control, so our ontology generator relies on NLP software such as CoreNLP (a Java toolkit for text analysis). The generator has two modes of operation:

Part-of-speech (POS) tagger alone
Dependency parser with the POS tagger

With the POS tagger alone, the text “Java developer” is recognized as two independent nouns: “Java” and “developer.” The dependency parser also recognizes compound nouns, so it finds a relation between “Java” and “developer.”

The results of both approaches are summarized for POS tagger alone (mode 1) and dependency parser with POS tagger (mode 2) in the table below:

Note: The summary filter depends on the DP (domain pertinence) value, the DC (domain consensus) value, and a small coefficient.
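For readers who want a feel for the quantities in the note, here is a minimal Python sketch. It assumes one common formulation from the term-extraction literature: domain pertinence as the term's frequency in the target domain divided by its highest frequency in any domain, and domain consensus as the entropy of the term's distribution across the documents of a domain. The function names, the weighted sum, and the `alpha` weight are hypothetical. How the small coefficient actually enters the filter is not stated here, so this is not necessarily the formula behind the table.

```python
import math
from typing import Dict, List


def domain_pertinence(term: str, domain: str,
                      freq_by_domain: Dict[str, Dict[str, int]]) -> float:
    """Term frequency in the target domain over its highest frequency in any domain."""
    best = max(counts.get(term, 0) for counts in freq_by_domain.values())
    if best == 0:
        return 0.0
    return freq_by_domain[domain].get(term, 0) / best


def domain_consensus(counts_per_document: List[int]) -> float:
    """Entropy (base 2) of the term's distribution across one domain's documents."""
    total = sum(counts_per_document)
    if total == 0:
        return 0.0
    probabilities = [c / total for c in counts_per_document if c > 0]
    return -sum(p * math.log2(p) for p in probabilities)


def summary_score(dp: float, dc: float, alpha: float) -> float:
    """Hypothetical combination: a weighted sum of DP and DC (an assumption)."""
    return alpha * dp + (1 - alpha) * dc


# Made-up counts, for illustration only.
freq_by_domain = {"voice talent": {"narration": 8}, "web development": {"narration": 2}}
dp = domain_pertinence("narration", "web development", freq_by_domain)
dc = domain_consensus([2, 2])  # term spread evenly over two documents
```

A term that appears evenly across many documents of a domain gets a higher consensus value than one concentrated in a single document, which is why the two measures are usually combined rather than used alone.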

The dependency parser (mode 2 in the table) produced significantly more terms after parsing than the POS tagger (mode 1). The dependency parser adds each compound term alongside the individual terms that make up the compound. This implies that the parts also occur independently in the text corpus.

Both modes find “Java” and “developer,” but the dependency parser also finds “Java developer” as a compound, resulting in three terms (instead of two).
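To make the difference in term counts concrete, here is a minimal Python sketch. It does not call CoreNLP: the POS tags and the single `compound` dependency are written by hand as assumptions about what a parser might return for “Java developer,” and the function names are hypothetical.

```python
from typing import List, Set, Tuple

Token = Tuple[int, str, str]        # (index, word, POS tag)
Dependency = Tuple[int, int, str]   # (head index, dependent index, relation)

# Hand-written annotations (assumed, not real parser output).
# Penn Treebank tags: NNP = proper noun, NN = common noun.
TOKENS: List[Token] = [(0, "Java", "NNP"), (1, "developer", "NN")]
DEPENDENCIES: List[Dependency] = [(1, 0, "compound")]


def terms_pos_only(tokens: List[Token]) -> Set[str]:
    """Mode 1: every noun token becomes an independent term."""
    return {word for _, word, tag in tokens if tag.startswith("NN")}


def terms_with_dependencies(tokens: List[Token],
                            dependencies: List[Dependency]) -> Set[str]:
    """Mode 2: the mode 1 terms plus one compound term per 'compound' edge."""
    words = {index: word for index, word, _ in tokens}
    terms = terms_pos_only(tokens)
    for head, dependent, relation in dependencies:
        if relation == "compound":
            # Restore surface order; this sketch handles two-word compounds only.
            first, second = sorted((head, dependent))
            terms.add(f"{words[first]} {words[second]}")
    return terms


mode_1 = terms_pos_only(TOKENS)
mode_2 = terms_with_dependencies(TOKENS, DEPENDENCIES)
```

Here `mode_1` holds the two nouns, while `mode_2` holds the same two nouns plus the compound, which mirrors why mode 2 reports more terms overall.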

Analyzing the results

The dependency parser produced more useful results, discovering approximately twice as many relations in about half as many domains. Recall among the relations and terms is significant. As you might expect, some domains had much better coverage than others in our existing ontology.

We can confidently bootstrap a significant number of domains that haven’t had any updates for some time.

For the voice talent domain, the software discovered approximately 400 terms worth including in the ontology. Among these terms, it also found some new relations. However, we determined that the software missed quite a few relations (shown in the figure below):

There is a relation from “voice” to “voice actress” and from “voice actor” to “audiobook narration.” But there isn’t a relation from “voice” to “voice actor” or from “voice actress” to “audiobook narration.” Gaps like these could easily cause us to miss useful relations.
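The gaps described above can be written down as a simple set difference. The sketch below is only an illustration: the relations come from the voice talent example in the text, while the variable and function names are hypothetical and the "expected" set stands in for whatever an ontologist would consider correct.

```python
from typing import List, Set, Tuple

Relation = Tuple[str, str]  # (source term, target term)

# Relations the software found in the voice talent example.
DISCOVERED: Set[Relation] = {
    ("voice", "voice actress"),
    ("voice actor", "audiobook narration"),
}

# Relations an ontologist would expect (assumed here for illustration).
EXPECTED: Set[Relation] = DISCOVERED | {
    ("voice", "voice actor"),
    ("voice actress", "audiobook narration"),
}


def missing_relations(discovered: Set[Relation],
                      expected: Set[Relation]) -> List[Relation]:
    """Return the expected relations that were not discovered, sorted for stable output."""
    return sorted(expected - discovered)


gaps = missing_relations(DISCOVERED, EXPECTED)
```

With these inputs, `gaps` contains the two relations the text calls out as missing.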

Improving the results

At the start of our ontology work, we focused on discovering the highest quality data. This has two benefits in determining the best approach:

Analysis should take much less time
Resources can be allocated to critical projects with a better chance of success

A critical, near-term goal is to experiment with the threshold that defines a relation. Because some relations are getting lost, the current threshold of 0.4 appears to be too high. We can always lower the threshold after we get a baseline. At Upwork, our ontologists are working on adding the discovered terms to our ontology. Now that we are confident our approach is worth pursuing, the third blog in this series will explore toolkits and SDKs.
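A threshold experiment of the kind mentioned above could start from a sweep like the following Python sketch. Everything in it is assumed for illustration: the candidate scores are made up, the function names are hypothetical, and a relation is treated as kept when its score is greater than or equal to the threshold, which may not match how the generator applies its 0.4 cutoff.

```python
from typing import Dict, Iterable, Set, Tuple

Relation = Tuple[str, str]

# Made-up relation scores, for illustration only.
CANDIDATES: Dict[Relation, float] = {
    ("voice", "voice actress"): 0.55,
    ("voice actor", "audiobook narration"): 0.47,
    ("voice", "voice actor"): 0.36,
    ("voice actress", "audiobook narration"): 0.31,
    ("voice", "developer"): 0.08,
}


def keep(candidates: Dict[Relation, float], threshold: float) -> Set[Relation]:
    """Relations whose score reaches the threshold (>= is an assumption)."""
    return {pair for pair, score in candidates.items() if score >= threshold}


def sweep(candidates: Dict[Relation, float],
          thresholds: Iterable[float]) -> Dict[float, int]:
    """Count how many relations survive at each threshold."""
    return {t: len(keep(candidates, t)) for t in thresholds}


counts = sweep(CANDIDATES, [0.4, 0.3, 0.2])
```

Comparing the counts at each threshold against a baseline reviewed by ontologists would show where lowering the cutoff starts to admit noise rather than recover lost relations.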