Vocabulary Development By Convention

Irlán Grangel-González, Lavdim Halilaj, Gökhan Coskun and Sören Auer

Enterprise Information Systems, University of Bonn, Germany

{grangel, halilaj, coskun, auer}@cs.uni-bonn.de

Index


Analysis of Widely Used Vocabularies
Vocabulary Development By Convention


Analysis of Widely Used Vocabularies

To compile the list of the 20 most widely used vocabularies for our analysis, we proceeded in two steps. First, we looked for recognized ontologies that are well documented, dereferenceable, and used by independent data providers 3 . This led to six very popular ontologies. Second, we used the Linked Open Vocabularies (LOV) 4 web page, which provides details about the frequency of reuse for each vocabulary, and extended the list with the 14 most reused vocabularies that contain classes. We consider these the most successful vocabularies. They form the basis of our analysis, and we call them authoritative vocabularies.

The main purpose was to identify their common features and extract best practices of vocabulary development. In this regard, we wanted to understand important aspects of vocabulary creation such as reuse, internationalization, documentation and naming as well as the implicit structure of these vocabularies (e.g. use of logical axioms, property domain/range definitions).

With respect to Reuse, 80% of the vocabularies make use of vocabulary elements defined elsewhere and 57% reuse elements from at least two external ontologies. This shows a considerable presence of the reuse aspect in the studied cases.

One of the most important aspects of Internationalization (I18n) is support for multilinguality. In vocabularies, this can be implemented by providing textual values for properties such as rdfs:label and rdfs:comment in different languages (using different language tags for RDF string literals). In 71% of the vocabularies we encountered explicit English literals (@en), but in only two (i.e. 9.5%) did we find a translation of the terms into other languages. In the remaining 29% of the cases, no explicit language tags were used at all. Consequently, although I18n is important for existing ontologies, we found that the common practice currently is to support only English.

Documentation refers to the addition of human-readable labels and descriptions (using the properties rdfs:label and rdfs:comment) to the vocabulary elements (i.e. classes, properties and individuals). We found that rdfs:label or rdfs:comment is present in 86% of the cases. It is worth noting that the combination of these two properties with rdfs:isDefinedBy occurs with a frequency of 57%. In only one case (i.e. 5%) did we not find any form of documentation. This shows that documentation (i.e. rdfs:label and rdfs:comment for describing elements, and rdfs:isDefinedBy for linking to definitions) is widely used by existing vocabularies.

Another important practice in vocabulary creation is the convention for Naming elements. CamelCase notation was the most used convention, found in 62% of the cases. In all other cases (i.e. 38%), no homogeneous naming convention could be identified. Instead, a combination of CamelCase notation, underscores and dashes was used.


Table: Authoritative Vocabularies

| Name | URI | Prefix | Domain |
|---|---|---|---|
| Friend Of A Friend | http://xmlns.com/foaf/0.1/ | foaf | Terms related to persons (e.g. Agent, Document, Organization, etc.). |
| Dublin Core ontology Terms | http://purl.org/dc/terms/ | dcterms | General metadata terms (e.g. Title, Creator, Date, Subject, etc.). |
| WGS84 Geo Positioning | http://www.w3.org/2003/01/geo/wgs84_pos# | geo | Represents latitude, longitude and altitude information in the WGS84 geodetic reference datum. |
| Socially Interconnected Online Communities ontology | http://rdfs.org/sioc/ns# | sioc | Aspects of online community sites (e.g. Users, Posts, Forums, etc.). |
| Simple Knowledge Organization System Namespace | http://www.w3.org/2004/02/skos/core# | skos | Common data model for sharing and linking knowledge organization systems. |
| Vocabulary of Interlinked Datasets | http://rdfs.org/ns/void# | void | Metadata about RDF datasets (e.g. Dataset, Linkset, etc.). |
| Vocabulary for biographical information | http://vocab.org/bio/0.1/.html | bio | Describes biographical information about people, both living and dead. |
| Data Cube Vocabulary | http://purl.org/linked-data/cube# | qb | Statistical data (e.g. Dimensions, Attributes, Measures, etc.). |
| Vocabulary for Rich Site Summary | http://purl.org/rss/1.0/ | rss | Models the declarations for Rich Site Summary (RSS) 1.0. |
| Vocabulary for modeling abstract things for people | http://www.w3.org/2000/10/swap/pim/contact# | w3con | Describes general concepts of people's everyday life (e.g. Address, Phone, Mail, etc.). |
| Description of a Project | http://usefulinc.com/ns/doap# | doap | Terms for software projects, specifically open source projects (e.g. Version, Specification, Repository, etc.). |
| Bibliographic Ontology | http://purl.org/ontology/bibo/ | bibo | Citations and bibliographic references (e.g. quotes, books, articles, etc.). |
| Data Catalog Vocabulary | http://www.w3.org/ns/dcat# | dcat | Vocabulary designed to facilitate interoperability between data catalogs published on the Web (e.g. DataCatalog, Distribution, etc.). |
| Schema.org | http://schema.org | schema | Broad schema of concepts (e.g. Creative Works, Media Objects, Events, Organization, Person, etc.). |
| GoodRelations | http://purl.org/goodrelations/v1 | gr | E-commerce related terms (e.g. Products, Services, Locations, etc.). |
| Music Ontology | http://purl.org/ontology/mo/ | mo | Terms related to music (e.g. Artists, Albums, Tracks, etc.). |
| Creative Commons schema | http://creativecommons.org/ns | cc | Terms for describing copyright licenses (e.g. License Properties, Work Properties, etc.). |
| GeoNames | http://www.geonames.org/ontology | gn | Geospatial semantic information (e.g. Population, PostalCode, etc.). |
| MarineTLO ontology | http://www.ics.forth.gr/isl/ontology/MarineTLO/ | marinetlo | Marine domain (e.g. Species, Marine Animal, etc.). |
| Event Ontology | http://purl.org/NET/c4dm/event.owl | event | Describes reified events (e.g. Event, location, time, active agents, etc.). |


Vocabulary Development By Convention

In this section, we provide a comprehensive list of practices for collaborative vocabulary development. We derived this list from our own experience in creating vocabularies like SCORVoc 5 and MobiVoc 6, in combination with the results of the aforementioned analysis. The list serves as a set of guidelines that help to focus on the most important aspects of the vocabulary creation process. It is therefore expected to increase the efficiency of the collaboration and to improve the overall quality of the vocabulary. For the sake of clarity, note that we consider vocabularies to be lightweight ontologies. The figure "Main aspects of Vocabulary Authoring" depicts the main aspects of our approach, which are described in detail in the remainder of this section.

These guidelines are independent of the concrete development environment and can be applied in various circumstances. Nevertheless, we encourage the adoption of a well-known distributed version control system as the basic collaboration support environment. We have chosen Git for two reasons. First, Git is a mature version control system supported by very sophisticated tools and is therefore broadly used in software development projects: more than 10 million repositories 7 are hosted on GitHub for open source and commercial projects [Kalliamvakou et al., 2015]. Second, existing popular vocabularies such as schema.org 8, Description of a Project (DOAP) 9 and the Music Ontology 10 also use GitHub to facilitate collaboration within their respective communities. This indicates that the vocabulary development community is already familiar with Git.

Figure: Main aspects of Vocabulary Authoring.

Reuse

In the current context of vocabulary construction, the reuse of existing terms is of vital importance [Poveda-Villalón, 2012, Pedrinaci et al., 2014]. The main idea is to avoid redundant work by using terms that already exist in other vocabularies instead of creating new ones. Apart from saving time and investment costs, ontology reuse is expected to ensure a certain level of quality, because the longer an ontology exists and is reused, the more review processes it has gone through. Furthermore, in the context of the Semantic Web, ontologies are considered the shared conceptualizations of distributed information systems. In this regard, ontology reuse is also expected to support interoperability and system integration. Additionally, according to [Heath and Bizer, 2011], reuse is considered a best practice in vocabulary construction. In the following, we therefore discuss important practices regarding reuse.

P-R1 Reuse of authoritative vocabularies

We define authoritative vocabularies as vocabularies which are (1) published by renowned standardization organizations; (2) used widely in a large number of other vocabularies; and (3) defined in a domain-independent way, addressing more general concerns. Hence, these most widely used vocabularies should be considered as the first option for reuse (cf. our list in the table of authoritative vocabularies).

P-R2 Reuse of non-authoritative vocabularies

Search online resources, such as vocabulary registries like LOV 11 and LODStats 12 or ontology search engines like Swoogle 13 and Watson 14, to find terms to reuse. The output of this process is a set of candidate terms. These terms should be ranked by the number of datasets that use them, the number of instances, and the frequency with which the vocabulary or the term is reused [Pedrinaci et al., 2014]. In addition, the semantic description and definition of each term should be checked to verify that it fits the intended use.

P-R3 Avoid semantic clashes

If a term has a strong semantic meaning in the target domain that differs from the meaning of the existing terms, a new element should be created instead of reusing an existing one.

P-R4 Individual resource reuse

Elements from authoritative vocabularies in particular should be reused as individual vocabulary elements. For non-authoritative vocabularies, reusing individual identifiers is less advisable. Instead, consider creating your own vocabulary elements with a possible alignment (cf. P-R6) or reusing larger modules (cf. P-R5).

P-R5 Vocabulary module reuse

(Opposite of P-R4) Vocabularies often require certain basic structures, such as addresses, persons or organizations, which are already defined in non-authoritative vocabularies. Such structures usually comprise the definition of one or several classes and a number of properties. If the conceptualizations match, the complete reuse of a whole module should be considered.

P-R6 Establishing alignments with existing vocabularies

Instead of the strong semantic commitment of reusing identifiers from non-authoritative vocabularies, alignments using owl:sameAs, owl:equivalentClass, owl:equivalentProperty, rdfs:subClassOf, rdfs:subPropertyOf can be established.
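
The alignment constructs listed above could be used as in the following minimal Turtle sketch. Both namespaces and all terms in them (ex: for the vocabulary under development, other: for a non-authoritative vocabulary) are hypothetical and only serve as an illustration.

```turtle
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix ex:    <http://example.org/vocab#> .     # hypothetical: vocabulary under development
@prefix other: <http://example.com/other#> .     # hypothetical: non-authoritative vocabulary

# Class-level alignments: own classes are declared and linked to external ones.
ex:Person a owl:Class ;
    owl:equivalentClass other:Person .

ex:Driver a owl:Class ;
    rdfs:subClassOf other:VehicleOperator .

# Property-level alignments.
ex:hasDriver a owl:ObjectProperty ;
    owl:equivalentProperty other:drivenBy .

ex:hasOwner a owl:ObjectProperty ;
    rdfs:subPropertyOf other:hasAgent .

# Individual-level alignment.
ex:Germany a owl:NamedIndividual ;
    owl:sameAs other:DE .
```

In this sketch the vocabulary keeps its own identifiers, so the alignment statements can later be revised or removed without renaming any term.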

To clarify some of the proposed practices, consider the following example. We want to reuse the term Person. We start with the statistics that LOV provides for this term.

Table: Reuse table for term Person

| Term | Occurrences | Datasets | Reused by | Documented | Naming Conventions | Multilinguality |
|---|---|---|---|---|---|---|
| foaf:Person | 2,320,027 | 72 | 307 | Yes | Yes | No |
| npg:Person | 0 | 0 | 0 | Yes | No | Yes |
| bbccore:Person | 0 | 0 | 0 | No | Yes | Yes |
| schema:Person | 980,153 | 2 | 44 | Yes | Yes | No |
| akt:Person | 3,183,315 | 23 | 5 | No | Yes | No |

The reuse table above shows that the two best candidates for reuse are foaf:Person and schema:Person. akt:Person has a significant number of occurrences, but it is reused by only 5 other vocabularies and does not belong to the authoritative vocabularies. We therefore have to choose between the two best-ranked candidates. To do so, we should check in the ontology to be reused not only the ranking values but also whether (1) the term is minimally documented (i.e. it uses rdfs:label and rdfs:comment, or skos:prefLabel and skos:definition); (2) the term follows a naming convention (e.g. CamelCase or another notation used consistently); and (3) the term is annotated at least in English (e.g. rdfs:label "Person"@en). In this example, the best choice for reuse is the class foaf:Person. Note that an alignment exists between the classes of the two best-ranked candidates: both ontologies, FOAF and Schema.org, contain owl:equivalentClass axioms that align their respective Person terms. This means that, as far as the semantic purpose of the design is concerned, either class can be reused. Nevertheless, we encourage checking all of the mentioned reuse criteria in order to obtain a clean and documented ontology design.

Vocabulary Structure

The larger and more complex a vocabulary is, the more difficult its development and maintenance become. In this regard, modularization is an important aspect of collaborative vocabulary development [Suarez-Figueroa et al., 2012]. [Poveda-Villalón, 2012] describes an ontology module as a loosely coupled and self-contained component of an ontology that keeps relationships with other ontology modules. Even though ontology modules are in some cases considered to be independent ontologies [d’Aquin et al., 2008], from the development perspective components are not treated as independent elements.

Organizing a vocabulary in files, where each file represents a module, is a way of managing modularity within the development process. Furthermore, some reports show that a module in a mid-sized vocabulary should contain between 200 and 300 lines of code [Schlicht and Stuckenschmidt, 2006]. Since modularity depends on the overall size of the vocabulary, we see the following three possibilities for organizing the file structure with respect to modularity.

P-S1 One file for the whole vocabulary

When the vocabulary is small (e.g. contains fewer than 300 lines of code) and represents a domain which cannot be divided into subdomains, it should be saved in one single file. If the number of contributors is relatively small and the domain of the vocabulary is very focused, organizing it in one single file might be possible even if it exceeds 300 lines of code. However, if comprehensibility suffers, splitting it into different files should be considered (P-S2).

P-S2 Multiple files

If the vocabulary contains more than 300 lines of code or covers a more complex domain, it should be organized into different subdomains. When the subdomains themselves are small enough, they should be represented by different files within the parent folder. In this case, domain experts can contribute independently by modifying the modules that match their field of expertise.

P-S3 Multiple files and folders

In the case of very large vocabularies comprising complex domains, splitting the whole vocabulary into files is not sufficient, as it would lead to a large number of files within a single folder. Therefore, subdomains should be represented by folders if they are large enough to be split into different components, each represented by its own file. In this case, the folder and file structure should reflect the complex hierarchy of the overall domain.
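
As an illustration of a multi-file layout, the following sketch shows a root file that ties the module files together. The text does not prescribe how modules are linked; using owl:imports here is an assumption, and the file names and example.org IRIs are hypothetical.

```turtle
# Hypothetical layout (P-S3):
#   vocabulary.ttl              root file, shown below
#   vehicles.ttl                small subdomain kept as a single file
#   parking/facilities.ttl      larger subdomain split into a folder
#   parking/fees.ttl

@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://example.org/vocab> a owl:Ontology ;
    rdfs:label "Example vocabulary"@en ;
    # Each imported IRI is assumed to identify one module file.
    owl:imports <http://example.org/vocab/vehicles> ,
                <http://example.org/vocab/parking/facilities> ,
                <http://example.org/vocab/parking/fees> .
```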

P-S4 Separation of TBox and ABox

Ontologies comprise a terminological part (TBox) and an assertional part containing the individuals (ABox). Even though vocabularies are mostly about the terminological part, they can also contain individuals. The authoritative vocabularies reveal, however, that in most cases the number of individuals is quite small. For the sake of completeness, it is worth mentioning that, due to the ambiguity of the term vocabulary, the ABox can in some cases grow and even exceed the size of the TBox significantly. In this case, splitting the vocabulary into separate files for the TBox and the ABox should be considered. This will increase the performance of the reasoning process [Sirin et al., 2006].

Naming Conventions

Following naming conventions has a high impact on vocabulary development [Schober et al., 2012]. Naming conventions provide guidance to the collaborators about the use of specific styles for naming elements. They help to avoid lexical inaccuracies and increase robustness and exportability, specifically when vocabularies should be interlinked and aligned with each other [Schober et al., 2009]. The use of meaningful names increases the robustness of context-based text mining for automatic term recognition and eases the manual and automated integration of terminological artifacts (i.e. comparison, checking, alignment and mapping) [Svátek and Šváb-Zamazal, 2010, Schober et al., 2012].

There exist naming patterns for ontologies 15 that aim to establish best practices for naming ontology elements. Considering the literature on this topic [Schober et al., 2007, Schober et al., 2009, Montiel-Ponsoda et al., 2011] and the results of our analysis, we propose some practices to be observed regarding the naming of vocabulary terms. For vocabulary construction, the use of CamelCase notation is a best practice [Svátek et al., 2009]. Our study also found this notation in 62% of the cases. Therefore, we propose using this notation in vocabulary construction.

P-N1 Concepts as single nouns

Name all concepts as single nouns using CamelCase notation (e.g. PlanReturn).

P-N2 Properties as verb senses

Name all properties as verb senses, also following the CamelCase approach. The name of a property should normally not be a plain noun phrase, so that it is clearly distinguishable from class names (e.g. hasProperty or isPropertyOf).

P-N3 Short names

Provide short names. For natural names with more than three nouns, use a short name as the identifier and give the long name in the rdfs:label property. For instance, for ManageSupplyChainBusinessRules use BusinessRules and set the full name in the label. In order to explain the context (i.e. Supply Chain), complement this label with skos:altLabel (cf. P-A5).
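
The BusinessRules example above could be written as in the following sketch. The scor: namespace is taken from the SCORVoc URIs used elsewhere in this paper. Whether SCORVoc defines the class exactly like this, and the wording of the alternative label, are assumptions made for illustration.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix scor: <http://purl.org/eis/vocab/scor#> .

# Short identifier, full natural name in the label.
scor:BusinessRules a owl:Class ;
    rdfs:label "Manage Supply Chain Business Rules"@en ;
    # Assumed wording: the alternative label carries the Supply Chain context.
    skos:altLabel "Supply Chain Business Rules"@en .
```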

P-N4 Logical and short names for namespaces

Assign logical and short names to namespaces, preferably with no more than five letters (e.g. foaf:XXX, schema:XXX).

P-N5 Regular space as word delimiters for labeling elements

For example, rdfs:label "A Process that contains..".

P-N6 Avoid the use of conjunctions and words with ambiguous meanings

Avoid names containing "And", "Or", "Other", "Part", "Type", "Category" or "Entity", as well as names related to datatypes such as "Date" or "String".

P-N7 Use positive names

Avoid the use of negations. For instance, instead of NoParkingAllowed use ParkingForbidden.

P-N8 Respect the names for registered products and company names

In those cases, CamelCase notation should not be applied, and the name of the company or product should be used as is (e.g. SAP, Daimler AG).

Dereferenceability

One of the four Linked Data rules that should be followed during vocabulary development is to name things with HTTP URIs 16. Adopting HTTP URIs for identifying things is appropriate for the following reasons: (1) it is simple to create globally unique keys in a decentralized fashion and (2) the generated key serves not just as a name but also as a means of looking up information about the identified thing.

By combining dereferenceability with content negotiation 17, the server provides adequate content for a resource based on the type of request. There are three different strategies for making the URIs of resources dereferenceable: (1) slash URIs; (2) hash URIs; and (3) a combination of both 18.

P-D1 Use slash URIs

When a client requests a resource from the server by providing its URI, the server responds with 303 See Other. Slash URIs should be used when dealing with large datasets, because the server then responds only with the requested resource. An example of using slash URIs is as follows:

In the MobiVoc 19 vocabulary, the URI that identifies the ChargingPoint resource is:

http://purl.org/net/mobivoc/ChargingPoint

The URI of the Turtle representation of this resource should be:

http://purl.org/net/mobivoc/ChargingPoint.ttl

and the URI of the HTML representation should be:

http://purl.org/net/mobivoc/ChargingPoint.html

To get information about ChargingPoint, the client provides the URI and specifies the requested content type. The server in turn responds with 303 See Other, redirecting to the appropriate representation.

P-D2 Use hash URIs

A hash URI is formed by appending a fragment to the URI. The fragment is separated from the rest of the URI by the hash symbol (#). Use hash URIs when dealing with small datasets, as this reduces the number of HTTP round trips. An example of using hash URIs is:

The URI of the SCORVoc 20 vocabulary is:

http://purl.org/eis/vocab/scor

The URI of the Process resource is:

http://purl.org/eis/vocab/scor#Process
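
In a Turtle file, the two URI styles shown so far correspond to the way the namespace prefix ends. The sketch below uses the MobiVoc and SCORVoc URIs from the examples above. Typing both resources as owl:Class is an assumption made for illustration.

```turtle
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix mobivoc: <http://purl.org/net/mobivoc/> .       # slash namespace: ends with "/"
@prefix scor:    <http://purl.org/eis/vocab/scor#> .    # hash namespace: ends with "#"

# Expands to http://purl.org/net/mobivoc/ChargingPoint
mobivoc:ChargingPoint a owl:Class .

# Expands to http://purl.org/eis/vocab/scor#Process
scor:Process a owl:Class .
```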

P-D3 Use combination between slash and hash URIs

This allows a large dataset to be split into multiple parts. Use this solution when a dataset may grow to the point where it is no longer practical to serve all resources in a single document. An example of such a combination is:

http://purl.org/eis/vocab/scor/Process#this

P-D4 Configure server to provide content negotiation

To deliver content based on the request type, the server should be configured in accordance with the best practices for publishing RDF vocabularies 21.

Multilinguality

Providing a multilingual design for ontologies is desirable, but it is not a straightforward issue [Gracia et al., 2012]. Based on our empirical analysis, and aiming at keeping things simple, we propose the following best practices.

P-M1 Use English as the main language

Use English for every element and set the language explicitly with the @en tag.

P-M2 Multilinguality for other languages

In order to add another language, use another line adding the same format for every element. The following example illustrates this practice with translations for the class SupplyChain.

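@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix scor: <http://purl.org/eis/vocab/scor#> .

scor:SupplyChain rdf:type owl:Class ;
    rdfs:label "Supply Chain"@en ;
    rdfs:comment "A Supply Chain is a ..."@en ;
    rdfs:label "Lieferkette"@de ;
    rdfs:comment "Eine Lieferkette ist ..."@de .

This approach should be followed for all elements, starting with the basic ones like rdfs:label and rdfs:comment but also including external annotation properties (e.g. skos:prefLabel).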

Documentation

Providing a user-friendly view of vocabularies for non-experts is crucial for integrating the Semantic Web with the everyday Web [Peroni et al., 2013]. It facilitates the contribution of domain experts during the development process and helps other interested parties to use the vocabulary easily in later phases. Different tools exist for generating documentation. In essence, these tools require the following information to be present for each resource.

P-Do1 Use of rdfs:label and rdfs:comment

Add an rdfs:label to every element, giving the main name of the concept being represented, and an rdfs:comment describing the context for which the element was created.

P-Do2 Generate human-readable documentation

Easy-to-use documentation is critical for the wide adoption of a vocabulary. As described in the subsection on Dereferenceability, there are different types of URIs. If slash URIs are used for identifying resources during vocabulary creation, tools like the Schema.org documentation generation 22 should be used to generate the documentation. Tools like Parrot 23 are appropriate if hash URIs or a combination of slash and hash URIs are used for identifying resources.

Validation

Validation is an important aspect of the ontology development process [Yu et al., 2009, Poveda-Villalón et al., 2012]. It analyzes whether the ontology correctly represents the knowledge domain in accordance with user requirements and best practices [Gómez-Pérez et al., 2006, Yu et al., 2009, Kezadri and Pantel, 2010]. The criteria used for the validation activity are (1) correctness; (2) completeness; and (3) consistency [Suárez-Figueroa, 2010]. To address these criteria, we propose the following practices.

P-V1 Syntax validation

When collaborating directly on the vocabulary source code, syntax checking is of paramount importance. Ideally, syntax checking is integrated directly into the editor, and committing code that violates the syntax is not possible. For example, tools such as Parrot 24 or Web-based tools such as the RDF Validation Service 25 or the OWL2 Validator 26 can be used to find common typos and syntax errors.

P-V2 Code-Smell checking

Code smells are symptoms in software source code that possibly indicate deeper problems. Similarly, tools such as OOPS 27 can be used for checking vocabulary smells. OOPS is a Web-based tool for detecting common ontology pitfalls such as (1) missing relationships; (2) incorrect use of ontology elements; and (3) missing domain and range definitions of properties. The complete list of pitfalls detected by OOPS is presented in [Poveda-Villalón et al., 2012].

P-V3 Consistency checking

Since vocabularies are lightweight ontologies, they are not very likely to contain axioms that produce semantic inconsistencies. Nevertheless, our analysis showed that authoritative vocabularies contain constructs that can lead to semantic inconsistencies (e.g. class disjointness). Handling inconsistencies impacts the quality of ontologies [Abburu, 2012]. Tools like Pellet 28, Fact++ 29, Racer 30, HermiT 31 or the Web-based tool ConsVISor 32 should be used for consistency checking.

P-V4 Linked Data validation

Tools such as Vapour 33 verify whether data are correctly published according to the Linked Data principles and the best publishing practices 34.

Authoring

In the first section, we analysed common practices followed by vocabulary engineers (e.g. the creation of object properties and their associated domain and range axioms). These practices are always domain dependent, but they can still serve as general guidelines for designing vocabularies.

P-A1 Domain and range definitions for properties

When creating a property, consider providing the associated domain and range definitions. For object properties, this also means that the corresponding classes should be defined. For datatype properties, the range should be a suitable datatype.
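
A minimal sketch of this practice is given below, with one object property and one datatype property. The ex: namespace and all terms in it are hypothetical.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/vocab#> .    # hypothetical namespace

# Classes used as domain and range are defined explicitly.
ex:Vehicle a owl:Class ;
    rdfs:label "Vehicle"@en .

ex:ParkingFacility a owl:Class ;
    rdfs:label "Parking Facility"@en .

# Object property: domain and range are classes.
ex:isParkedAt a owl:ObjectProperty ;
    rdfs:label "is parked at"@en ;
    rdfs:domain ex:Vehicle ;
    rdfs:range ex:ParkingFacility .

# Datatype property: the range is a suitable datatype.
ex:hasCapacity a owl:DatatypeProperty ;
    rdfs:label "has capacity"@en ;
    rdfs:domain ex:ParkingFacility ;
    rdfs:range xsd:nonNegativeInteger .
```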

P-A2 Avoid inverse properties

Create inverse properties only if it is strictly necessary to have a relation in both directions (e.g. invalidated and wasInvalidatedBy). Inverse properties increase both the size and the complexity of the vocabulary.

P-A3 Use of class disjointness

Use class disjointness to logically avoid overlapping classes. Even though disjointness has been used in authoritative vocabularies, it should be carefully examined because it can easily lead to semantic inconsistencies.
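
The following sketch illustrates P-A2 and P-A3 together. The property names are borrowed from the example in P-A2 but placed in a hypothetical ex: namespace, and the classes are invented for illustration. The commented-out statement at the end shows the kind of assertion that makes a vocabulary with disjoint classes inconsistent.

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex:  <http://example.org/vocab#> .    # hypothetical namespace

# P-A2: an inverse pair, declared only because both directions are needed.
ex:invalidated a owl:ObjectProperty ;
    owl:inverseOf ex:wasInvalidatedBy .

ex:wasInvalidatedBy a owl:ObjectProperty .

# P-A3: two classes that must not share individuals.
ex:Car a owl:Class ;
    owl:disjointWith ex:Bicycle .

ex:Bicycle a owl:Class .

# Uncommenting the next line would make the vocabulary inconsistent,
# because one individual would belong to two disjoint classes.
# ex:myVehicle a ex:Car , ex:Bicycle .
```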

Utilization of SKOS Vocabulary

The Simple Knowledge Organization System (SKOS) 35 is a W3C Recommendation for modeling vocabularies on the Web. SKOS is currently used by at least 478 vocabularies [Haslhofer et al., 2013]. The use of some SKOS constructs is considered a best practice for declaring and documenting indexing terms (with skos:prefLabel) and alternative terms (with skos:altLabel) [Manaf et al., 2012, Baker et al., 2013]. Both properties are subproperties of rdfs:label. SKOS provides a more fine-grained notion of labeling, which can be useful for describing terms more precisely.

P-A4 Provide skos:prefLabel to complement the labeling of concepts

The content of the two labels can differ, reflecting the difference in the semantics of each property. For instance, skos:prefLabel might give a shorter designation of the concept than rdfs:label.


P-A5 Use skos:altLabel to describe variations of the concept

Alternative labels can complement the label with the acronyms, abbreviations, spelling variants, and irregular plural/singular forms for a concept.
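
P-A4 and P-A5 could be combined as in the following sketch. The ex: namespace, the class and all label values are hypothetical and chosen only to show a shorter preferred label next to an acronym and a spelling variant.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/vocab#> .    # hypothetical namespace

ex:ChargingStation a owl:Class ;
    rdfs:label "Charging Station for Electric Vehicles"@en ;
    # P-A4: a shorter preferred label, at most one per language.
    skos:prefLabel "Charging Station"@en ;
    # P-A5: acronym and variant form as alternative labels.
    skos:altLabel "EVSE"@en , "Charge Point"@en .
```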