Our task is feature extraction from web services for the purpose of segmentation. The attached data set contains WSDL files that represent web services. The folder “services by domain” contains the services categorized by domain. After segmentation, we need to check how well the result matches this predefined categorization using two measures: precision and recall. Our aim is to reach average precision and recall above 90%.
The following describes the proposed clustering method and text-processing steps for achieving this goal. They are not obligatory: you may use alternative methods if they achieve the goal, but the processing time must be acceptable and comparable to k-means.
We propose to perform the segmentation with unsupervised k-means clustering, where the similarity between services is calculated from the similarity of four extracted features (service name, content words, complex types, and referenced ontologies) using the Normalized Google Distance (NGD).
Milestone 1: $75.00
Deliver a CSV sheet that represents the example set for the attached WSDL files in this structure
Milestone 2: $125.00
Calculate the similarity between services for each extracted feature (service name, content words, complex types, and referenced ontologies), then cluster the services using the integrated similarities so that the targeted precision and recall values are reached.
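The brief does not fix how precision and recall are computed per cluster, so the following is only one possible reading, sketched in Python. It assumes each cluster is mapped to its majority folder label; precision is then the share of the cluster carrying that label, and recall is the share of all services with that label that landed in the cluster. The function name and the sample labels are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_precision_recall(labels, clusters):
    """Average precision and recall, mapping each cluster to its majority label.

    labels   -- folder (domain) label of each service
    clusters -- cluster id assigned to each service, in the same order
    """
    label_totals = Counter(labels)
    members = defaultdict(list)
    for label, cluster in zip(labels, clusters):
        members[cluster].append(label)

    precisions, recalls = [], []
    for cluster_labels in members.values():
        # Assumption: the most frequent label defines the cluster's domain.
        majority_label, hits = Counter(cluster_labels).most_common(1)[0]
        precisions.append(hits / len(cluster_labels))
        recalls.append(hits / label_totals[majority_label])
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

# Hypothetical example: six services, two domains, two clusters.
labels = ["travel", "travel", "travel", "food", "food", "food"]
clusters = [0, 0, 1, 1, 1, 1]
avg_precision, avg_recall = cluster_precision_recall(labels, clusters)
```

With this definition, both averages would have to exceed 0.9 to meet the stated target. Other mappings (for example, pairwise precision and recall) would give different numbers, so the definition should be agreed on before results are compared.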
Text Pre-processing and content feature extraction
The aim of this step is to create an example set from the attached WSDL files with four extracted features (service name, content words, complex types, and referenced ontologies), plus a label attribute, which is the name of the folder containing the WSDL file. The service label will be used to evaluate the precision and recall of the clustering result.
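As a rough illustration of this step, the sketch below parses one WSDL document with Python's standard library and writes a CSV row. Several points are assumptions, not part of the brief: the column names, the use of remaining name attributes as raw content words, and the idea that ontology references appear in attributes named modelReference (as in SAWSDL). The sample document and its label are made up.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed column names; the brief's exact CSV structure is not reproduced here.
FIELDS = ["service_name", "content_words", "complex_types",
          "referenced_ontologies", "label"]

def local_name(qualified):
    """Strip the '{namespace}' prefix ElementTree puts on tags and attributes."""
    return qualified.rsplit("}", 1)[-1]

def extract_features(wsdl_text, label):
    root = ET.fromstring(wsdl_text)
    service_name, content, types, ontologies = "", [], [], []
    for elem in root.iter():
        tag, name = local_name(elem.tag), elem.get("name")
        if tag == "service":
            service_name = service_name or (name or "")
        elif tag == "complexType":
            if name:
                types.append(name)
        elif name:
            content.append(name)  # assumption: other names act as raw content words
        for attr, value in elem.attrib.items():
            if local_name(attr) == "modelReference":  # assumption: SAWSDL-style annotation
                ontologies.extend(value.split())
    return {"service_name": service_name,
            "content_words": " ".join(content),
            "complex_types": " ".join(types),
            "referenced_ontologies": " ".join(ontologies),
            "label": label}

def to_csv(rows):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up WSDL fragment; the label would be the containing folder name.
SAMPLE = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:sawsdl="http://www.w3.org/ns/sawsdl">
  <types>
    <xsd:schema>
      <xsd:complexType name="BookType"
          sawsdl:modelReference="http://example.org/books.owl#Book"/>
    </xsd:schema>
  </types>
  <message name="getBookPriceRequest"/>
  <service name="BookPriceService"/>
</definitions>"""

row = extract_features(SAMPLE, "education")
sheet = to_csv([row])
```

In a full run, the same function would be applied to every file under “services by domain”, passing the parent folder name as the label.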
For compound words that don’t follow Camel case, you can use this solution: https://squarecog.wordpress.com/2008/10/19/splitting-words-joined-into-a-single-string/
Or the proposed algorithm in section “Implementation Guides”
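For illustration only, the sketch below first splits identifiers on camel-case boundaries and then tries to break any remaining joined word using a small dictionary. The dictionary step is a simple fewest-words dynamic program, not the method from the linked post or from “Implementation Guides”; the vocabulary and function names are hypothetical.

```python
import re

# Hypothetical vocabulary; a real run would load a proper word list.
VOCAB = {"get", "book", "price", "request", "hotel", "reservation"}

def split_camel(identifier):
    """Split on camel-case boundaries, keeping acronyms and digit runs together."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]

def split_joined(word, vocab=VOCAB):
    """Segment a joined word into the fewest vocabulary words.

    Returns [word] unchanged when no complete segmentation exists.
    """
    n = len(word)
    best = [None] * (n + 1)  # best[i] = best segmentation of word[:i]
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n] if best[n] is not None else [word]

def tokenize(identifier):
    tokens = []
    for token in split_camel(identifier):
        tokens.extend(split_joined(token))
    return tokens

camel_tokens = tokenize("getBookPrice")
joined_tokens = tokenize("hotelreservation")
```

A frequency-weighted segmentation would usually resolve ambiguous splits better than the fewest-words rule used here.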
The average NGD is calculated between each of the two clusters and a predefined vector of general computing words such as {runtime, bind, web, service, module, data, post, developer}. The cluster closest to this oracle is determined to be the non-Web-service-specific cluster, and its words are removed from the word vector T_i.
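The comparison against the oracle can be sketched as follows, using the standard NGD formula NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f gives hit counts and N is the index size. All counts, words, and helper names below are invented for the example; how the real hit counts are obtained is left open.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts fx, fy, joint count fxy, index size n."""
    if min(fx, fy, fxy) <= 0:
        return float("inf")  # the terms never occur together
    log_fx, log_fy = math.log(fx), math.log(fy)
    return (max(log_fx, log_fy) - math.log(fxy)) / (math.log(n) - min(log_fx, log_fy))

def average_ngd(cluster, oracle, hits, pair_hits, n):
    """Mean NGD over all (cluster word, oracle word) pairs."""
    distances = []
    for word in cluster:
        for general in oracle:
            if word == general:
                distances.append(0.0)
            else:
                joint = pair_hits.get(frozenset((word, general)), 0)
                distances.append(ngd(hits[word], hits[general], joint, n))
    return sum(distances) / len(distances)

# Invented data: two word clusters and a shortened oracle vector.
ORACLE = ["web", "service"]
CLUSTERS = {"a": ["data", "module"], "b": ["hotel", "flight"]}
N = 10_000
HITS = {word: 100 for word in ORACLE + CLUSTERS["a"] + CLUSTERS["b"]}
PAIR_HITS = {}
for cluster_id, joint in (("a", 50), ("b", 10)):
    for word in CLUSTERS[cluster_id]:
        for general in ORACLE:
            PAIR_HITS[frozenset((word, general))] = joint

scores = {cid: average_ngd(words, ORACLE, HITS, PAIR_HITS, N)
          for cid, words in CLUSTERS.items()}
general_cluster = min(scores, key=scores.get)  # closest to the oracle
```

The words of the cluster selected as closest to the oracle would then be dropped from the service's word vector.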
Features Similarity Calculation
Feature 3: Service Type (ComplexType)
Feature 4: Referenced Ontologies
The data set files carry semantic annotations that refer to parts of external ontologies defining the contained concepts, as shown in this image. For each service, a list of the referenced ontology parts will be extracted.
The similarity for this feature will be calculated in the same way as for types in equation (6).
Similarity integration for Clustering
We measure the service similarity between web services s_i and s_j as follows:
$$\mathrm{Sim}_{service}(s_i, s_j) = w_1\,\mathrm{Sim}_{content}(s_i, s_j) + w_2\,\mathrm{Sim}_{name}(s_i, s_j) + w_3\,\mathrm{Sim}_{types}(s_i, s_j) + w_4\,\mathrm{Sim}_{RefOnt}(s_i, s_j)$$
where w1, w2, w3, and w4 are the user-defined weights of content, name, types, and referenced ontologies respectively, with their sum being unity. Our proposed values are as follows: w1 = 0.5, w2 = 0.1, w3 = 0.2, and w4 = 0.2.
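A minimal sketch of the weighted combination, assuming the four per-feature similarities are already available as numbers in [0, 1] and that the weight for types is 0.2. The dictionary keys, the function name, and the sample similarity values are illustrative only.

```python
# Proposed weights: content, name, types, referenced ontologies.
WEIGHTS = {"content": 0.5, "name": 0.1, "types": 0.2, "ref_ont": 0.2}

def service_similarity(feature_sims, weights=WEIGHTS):
    """Weighted sum of per-feature similarities between two services."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[key] * feature_sims[key] for key in weights)

# Made-up per-feature similarities for one pair of services.
example = {"content": 0.8, "name": 0.5, "types": 0.4, "ref_ont": 1.0}
combined = service_similarity(example)
# 0.5*0.8 + 0.1*0.5 + 0.2*0.4 + 0.2*1.0 = 0.73
```

Keeping the weights in one place makes it easy to try other values, since they are user-defined and only need to sum to one.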