Automatic Ontology Generation (Part 1: Basics)

May 3, 2022

Upwork uses ontologies across our website, including in semantic search, browse paths, and SEO. For some time, the ontology graphs were created manually: a team of ontologists curated taxonomies for different categories, building occupation-specific hierarchies. While that approach had many benefits, it also created some complex issues.

Coverage was an issue since we were never sure if current taxonomies covered all profiles and jobs. There could've been entire unrepresented occupations in our marketplace. Although ontologies should be designed with flexibility in mind, terminology in some occupations changes quickly. Our ontologists couldn’t always keep up.

The need for automation

To overcome these limitations, the approach needed to change. We had two goals:

Candidate ontology updates need to be automatically produced on a regular basis.
Ontologists should be able to quickly curate and update production ontologies with minimal process and human involvement.

I decided to implement automated ontology discovery and building based on real data. I chose the semantic approach for extracting domain taxonomies from text described in [1] because it offered two key benefits:

Quickly produce updates: The authors of [1] used unsupervised methods to build the ontology, which is precisely what is required at Upwork. We need to quickly produce updates and bootstrap empty occupations for later improvement by the ontology team.
Code written in Java: The original software was written in Java, which gave us a high degree of assurance that the process can be scaled through parallelization. Since most software at Upwork is written in Java, any of our developers should be able to extend and maintain the code.

In part 1 of this three-part series, I’ll describe the functionality we implemented in the project’s first stage.

I decided to implement a simplified version of the process as the first iteration. After a preparatory step that parses the text into nouns, the process consists of three simple steps:

Step 1: Filtering

When a user creates a profile or posts a job, Upwork’s UI prompts the user to select the main category it belongs to. We treat these categories as the business domains. According to the semantic approach for extracting domain taxonomies [1], the first filter is domain pertinence (DP). This filter determines how specific the term is to the given business domain. The formula is DP(Di, t) = freq(t/Di) / max_j(freq(t/Dj)), where t is the term we are filtering, Di is the current domain, and Dj is any other domain (j ≠ i). In other words, the frequency of a term in a given domain is divided by the maximum frequency of that term across all domains except this one.

If the term occurs only in the given domain and has no presence in any other domain, the divisor is set to 1. All terms with a DP value below the 30 percent mark are eliminated from further evaluation.
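
The DP filter can be sketched in a few lines of Python (used here only for brevity; it is not the production code). The function names, the Counter-per-domain layout, and the toy frequencies are all made up for illustration, and the sketch assumes the 30 percent mark is applied to the DP value directly.

```python
from collections import Counter


def domain_pertinence(term, domain, freq_by_domain):
    """DP = freq(t/Di) / max_j(freq(t/Dj)), with j ranging over the other domains.

    freq_by_domain maps a domain name to a Counter of term frequencies
    (a hypothetical in-memory layout chosen for this sketch).
    """
    own = freq_by_domain[domain][term]
    # Highest frequency of the term in any domain except the current one.
    other_max = max(
        (counts[term] for name, counts in freq_by_domain.items() if name != domain),
        default=0,
    )
    # A term that appears in no other domain gets a divisor of 1.
    divisor = other_max if other_max > 0 else 1
    return own / divisor


def filter_by_dp(domain, freq_by_domain, threshold=0.3):
    """Return {term: DP} for the terms that survive the cutoff.

    Assumption: the 30 percent mark is applied to the DP value directly.
    """
    scores = {
        term: domain_pertinence(term, domain, freq_by_domain)
        for term in freq_by_domain[domain]
    }
    return {term: dp for term, dp in scores.items() if dp >= threshold}


# Toy frequencies, made up for illustration.
freq_by_domain = {
    "Art and Illustration": Counter({"anime": 40, "chibi": 7, "logo": 10}),
    "Design": Counter({"logo": 50, "anime": 4}),
}
print(filter_by_dp("Art and Illustration", freq_by_domain))
# {'anime': 10.0, 'chibi': 7.0}
```

In this toy data, "logo" is far more frequent in "Design" (10/50 = 0.2), so it is dropped, while "chibi" appears nowhere else and keeps its raw frequency as its DP value.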

Next, the domain consensus (DC) filter is calculated. This filter measures the popularity of term t across the documents dk of domain Di. Here, nfreq is the normalized frequency of term t in the document dk, calculated as the term's frequency in that document divided by the maximal frequency of that term in any document of any domain. The filter penalizes terms with higher frequencies per document while rewarding terms that occur in more documents of a domain.
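
A rough Python sketch of the DC filter follows. It assumes the entropy-style form DC(Di, t) = -sum over dk of nfreq(t, dk) * log(nfreq(t, dk)), which is my reading of the definition in [1]; treat that form, the function names, and the sample documents as assumptions rather than the exact production logic.

```python
import math


def max_document_frequency(term, docs_by_domain):
    """Largest count of the term in any single document of any domain."""
    return max(
        (doc.get(term, 0) for docs in docs_by_domain.values() for doc in docs),
        default=0,
    )


def domain_consensus(term, domain, docs_by_domain):
    """Entropy-style DC (assumed form): -sum(nfreq * log(nfreq)) over documents dk."""
    max_freq = max_document_frequency(term, docs_by_domain)
    if max_freq == 0:
        return 0.0  # the term never occurs anywhere
    total = 0.0
    for doc in docs_by_domain[domain]:
        count = doc.get(term, 0)
        if count == 0:
            continue  # documents without the term contribute nothing
        nfreq = count / max_freq
        total -= nfreq * math.log(nfreq)
    return total


# Each document is a {term: count} dict; the data is made up.
docs_by_domain = {
    "Art and Illustration": [
        {"anime": 1, "chibi": 4},
        {"anime": 1},
        {"anime": 2},
    ],
    "Design": [{"logo": 3}],
}
print(domain_consensus("anime", "Art and Illustration", docs_by_domain))
print(domain_consensus("chibi", "Art and Illustration", docs_by_domain))
```

Under this assumed form, "anime" (spread thinly over three documents) scores higher than "chibi" (concentrated in a single document), which matches the behavior described above.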

Finally, the summary filter (SF) is calculated as a linear combination of normalized DP, normalized DC, and a constant k. Assuming that terms found in a title are more important, k is set to 0.02 if the term appears in the title of any document in that domain.

For terms not present in a title, k is 0. The value of k was chosen based on meta-parameter optimization done by the authors of the original publication [1]. I suspect that k is primarily text corpus-dependent; however, parameter optimization is something that can be performed at a later stage, once we verify the product provides good updates for our ontology. The current value of k should work well for a wide range of text data.

The formula is SF = 0.4*norm(DP(Di,t)) + 0.6*norm(DC(Di,t)) + k. For normalization, the filter’s value for term t in domain Di is divided by the maximal value of that filter in that domain. All terms with an SF value below the 40 percent mark are eliminated.
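
As a small illustration, the SF formula can be written in Python like this. The helper names and the input scores are hypothetical, and the sketch assumes the 40 percent mark is applied to the SF value directly.

```python
def normalize(scores):
    """Divide every value by the largest value of that filter in the domain."""
    top = max(scores.values(), default=0)
    if top == 0:
        return {term: 0.0 for term in scores}
    return {term: value / top for term, value in scores.items()}


def summary_filter(dp_scores, dc_scores, title_terms, threshold=0.4, title_bonus=0.02):
    """SF = 0.4*norm(DP) + 0.6*norm(DC) + k; returns the terms at or above the threshold.

    Assumption: the 40 percent mark is applied to the SF value directly.
    """
    norm_dp = normalize(dp_scores)
    norm_dc = normalize(dc_scores)
    kept = {}
    for term in dp_scores:
        # k is 0.02 for terms seen in a title, 0 otherwise.
        k = title_bonus if term in title_terms else 0.0
        sf = 0.4 * norm_dp[term] + 0.6 * norm_dc.get(term, 0.0) + k
        if sf >= threshold:
            kept[term] = sf
    return kept


# Made-up DP and DC values for one domain.
dp_scores = {"anime": 10.0, "chibi": 7.0, "sketch": 1.0}
dc_scores = {"anime": 0.5, "chibi": 1.0, "sketch": 0.1}
title_terms = {"anime"}
print(summary_filter(dp_scores, dc_scores, title_terms))
```

With these numbers, "anime" scores about 0.72 (including the title bonus), "chibi" about 0.88, and "sketch" about 0.10, so only "sketch" is eliminated.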

Step 2: Building relations

For the remaining terms, relations are built using the subsumption method, which is based on the co-occurrence of terms. Suppose two terms are specific to a domain, and the first shows up only (or mostly) in documents that also contain the second, while the second occurs in more documents than the first. In that case, the second term subsumes the first: the second is “wider,” and the first is “narrower.”

For example, in the “Art and Illustration” domain, “anime” can be a “wider” concept for terms like “chibi” and “fanart.” The formula we used is: x subsumes y if P(x | y) >= t and P(y | x) < t, where P(x | y) is the probability that a document containing y also contains x, and t is a threshold value set to 0.4. Unlike the authors of the original article, we also established a minimal number of documents in which a term needs to appear.

Upwork works with a large number of profiles and job posts compared to the set of documents used in the original article. Because the ontology influences “search and match,” terms specific to a single profile or job post should be avoided for a variety of reasons (including bloating).
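
Here is a minimal Python sketch of the subsumption check, estimating the conditional probabilities from document co-occurrence counts. The function name, the sample documents, and the min_docs value of 2 are hypothetical; the actual minimal document count is not stated here.

```python
from itertools import permutations


def find_subsumptions(docs, threshold=0.4, min_docs=2):
    """Return (wider, narrower) pairs where P(x|y) >= t and P(y|x) < t.

    docs is a list of term lists, one per document.
    min_docs is a hypothetical value for the minimal document count.
    """
    # Map each term to the set of documents it appears in.
    doc_ids = {}
    for i, terms in enumerate(docs):
        for term in set(terms):
            doc_ids.setdefault(term, set()).add(i)
    # Drop terms that appear in too few documents.
    frequent = {t: ids for t, ids in doc_ids.items() if len(ids) >= min_docs}

    relations = []
    for x, y in permutations(sorted(frequent), 2):
        both = len(frequent[x] & frequent[y])
        p_x_given_y = both / len(frequent[y])
        p_y_given_x = both / len(frequent[x])
        if p_x_given_y >= threshold and p_y_given_x < threshold:
            relations.append((x, y))  # x is "wider", y is "narrower"
    return relations


# Made-up documents from a single domain.
docs = [
    ["anime", "chibi"],
    ["anime", "chibi"],
    ["anime", "fanart"],
    ["anime", "fanart"],
    ["anime"],
    ["anime", "oneoff"],
]
print(find_subsumptions(docs))
# [('anime', 'chibi'), ('anime', 'fanart')]
```

Here "oneoff" never enters the comparison because it appears in a single document, which is the kind of term the minimal document count is meant to keep out.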

Step 3: Checking the ontology

As the final verification, we check for the appearance of terms and relations found in the existing ontology. We are trying to achieve multiple goals with this check:

The coverage of the ontology is calculated and compared against the terms from the new crop of profiles or other documents.
The recall value is calculated to see if it’s time to recalibrate the meta parameters.
Based on various similarity criteria, places to plug in newly found terms are identified.
Manual labor is removed wherever possible to simplify the job of our ontology team.
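
The first two checks can be illustrated with simple set arithmetic in Python. The definitions below are assumptions made for this sketch: coverage is taken as the share of document terms already present in the ontology, and recall as the share of existing ontology terms that the automated process rediscovered. The function names and sample terms are hypothetical.

```python
def coverage(ontology_terms, document_terms):
    """Assumed definition: share of document terms the ontology already contains."""
    document_terms = set(document_terms)
    if not document_terms:
        return 0.0
    return len(document_terms & set(ontology_terms)) / len(document_terms)


def recall(ontology_terms, discovered_terms):
    """Assumed definition: share of ontology terms that the process rediscovered."""
    ontology_terms = set(ontology_terms)
    if not ontology_terms:
        return 0.0
    return len(ontology_terms & set(discovered_terms)) / len(ontology_terms)


def new_candidates(ontology_terms, discovered_terms):
    """Discovered terms that are missing from the ontology, for curation."""
    return sorted(set(discovered_terms) - set(ontology_terms))


# Made-up term sets.
ontology_terms = {"anime", "manga", "logo", "sketch"}
discovered_terms = {"anime", "chibi", "fanart", "logo"}
print(coverage(ontology_terms, discovered_terms))        # 0.5
print(recall(ontology_terms, discovered_terms))          # 0.5
print(new_candidates(ontology_terms, discovered_terms))  # ['chibi', 'fanart']
```

A falling recall under these definitions would be one possible signal that the thresholds need recalibrating, while the candidate list is what the ontology team would review.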

Finally, we rejoice! :)

In part 2 of this three-part series on ontology, I’ll demonstrate how the results of this process can be applied to freelancer profiles.

—————

1. Meijer, Kevin, Flavius Frasincar, and Frederik Hogenboom. “A Semantic Approach for Extracting Domain Taxonomies from Text.” ResearchGate. June 6, 2014.