Irlán Grangel-González, Lavdim Halilaj, Gökhan Coskun and Sören Auer
Enterprise Information Systems, University of Bonn, Germany
{grangel, halilaj, coskun, auer}@cs.uni-bonn.de
To compile a list of the 20 most widely used vocabularies for analysis, we proceeded in two steps. First, we looked for recognized ontologies that are well documented, dereferenceable, and used by independent data providers 3. This led to six very popular ontologies. Second, we used the Linked Open Vocabularies (LOV) 4 web page, which provides details about the frequency of reuse for each vocabulary. We extended the list with the 14 most reused vocabularies that contain classes. We consider these the most successful vocabularies and call them authoritative vocabularies; they form the basis of our analysis.
The main purpose was to identify their common features and extract best practices of vocabulary development. In this regard, we wanted to understand important aspects of vocabulary creation such as reuse, internationalization, documentation and naming as well as the implicit structure of these vocabularies (e.g. use of logical axioms, property domain/range definitions).
With respect to Reuse, 80% of the vocabularies make use of vocabulary elements defined elsewhere and 57% reuse elements from at least two external ontologies. This shows a considerable presence of the reuse aspect in the studied cases.
One of the most important aspects of Internationalization (I18n) is support for multilinguality. In vocabularies this can be implemented by providing textual values for properties such as rdfs:label and rdfs:comment in different languages (using different language tags for RDF string literals). In 71% of the vocabularies we encountered explicit English literals (@en), but in only two (i.e. 9.5%) did we find a translation of the terms into other languages. In the remaining 29% of the cases, no explicit language tags were used at all. Consequently, although I18n is important for existing ontologies, we found that the common practice currently is to support only English.
Documentation refers to the addition of human-readable labels and descriptions (using the properties rdfs:label and rdfs:comment) to the vocabulary elements (i.e. classes, properties and individuals). We found that rdfs:label or rdfs:comment is present in 86% of the cases. It is worth noting that the combination of these two properties with rdfs:isDefinedBy is used in 57% of the cases. In only one case (i.e. 5%) did we find no form of documentation. This shows that documentation (i.e. rdfs:label and rdfs:comment for commenting, and rdfs:isDefinedBy for linking definitions) is widely used by existing vocabularies.
Another important practice in vocabulary creation is the convention for Naming elements. CamelCase was the most used notation, found in 62% of the cases. In all other cases (i.e. 38%), no homogeneous naming convention could be identified; a combination of CamelCase, underscores, and dashes was used instead.
Name | Prefix | Domain |
---|---|---|
Friend Of A Friend - http://xmlns.com/foaf/0.1/ | foaf | Terms related to Persons (i.e. Agent, Document, Organization, etc). |
Dublin Core ontology Terms - http://purl.org/dc/terms/ | dcterms | General metadata terms (i.e. Title, Creator, Date, Subject, etc). |
WGS84 Geo Positioning - http://www.w3.org/2003/01/geo/wgs84_pos# | geo | Represents latitude, longitude and altitude information in the WGS84 geodetic reference datum. |
Socially Interconnected Online Communities ontology - http://rdfs.org/sioc/ns# | sioc | Aspects of online community sites (i.e. Users, Posts, Forums, etc). |
Simple Knowledge Organization System Namespace - http://www.w3.org/2004/02/skos/core# | skos | Common data model for sharing and linking knowledge organization systems. |
Vocabulary of Interlinked Datasets - http://rdfs.org/ns/void# | void | Metadata about RDF datasets (i.e. Dataset, Linkset, etc). |
Vocabulary for biographical information - http://vocab.org/bio/0.1/.html | bio | Describes biographical information about people, both living and dead (i.e. Birth, Death, Marriage, etc). |
Data Cube Vocabulary - http://purl.org/linked-data/cube# | qb | Statistic data (i.e. Dimensions, Attributes, Measures, etc). |
Vocabulary for Rich Site Summary - http://purl.org/rss/1.0/ | rss | Models the declaration for Rich Site Summary (RSS) 1.0. |
Vocabulary for modeling abstract things for people - http://www.w3.org/2000/10/swap/pim/contact# | w3con | Describes general concepts about people's everyday life (i.e. Address, Phone, Mail, etc). |
Description of a Project - http://usefulinc.com/ns/doap# | doap | Terms for Software Projects, specifically for Open Source Projects (i.e. Version, Specification, Repository, etc). |
Bibliographic Ontology - http://purl.org/ontology/bibo/ | bibo | Citations and bibliographic references (i.e. quotes, books, articles, etc). |
Data Catalog Vocabulary - http://www.w3.org/ns/dcat# | dcat | Vocabulary designed to facilitate interoperability between data catalogs published on the Web (i.e. DataCatalog, Distribution, etc). |
Schema.org - http://schema.org | schema | Broad schema of concepts (i.e. Creative Works, Media Objects, Events, Organization, Person, etc). |
GoodRelations - http://purl.org/goodrelations/v1 | gr | E-Commerce related terms (i.e. Products, Services, Locations, etc). |
Music Ontology - http://purl.org/ontology/mo/ | mo | Terms related to music (i.e. Artists, Albums, Tracks, etc). |
Creative Commons schema - http://creativecommons.org/ns | cc | Terms for describing copyright licenses (i.e. License Properties, Work Properties, etc). |
GeoNames - http://www.geonames.org/ontology | gn | Geospatial semantic information (i.e. Population, PostalCode, etc). |
MarineTLO ontology - http://www.ics.forth.gr/isl/ontology/MarineTLO/ | marinetlo | Marine domain (i.e. Species, Marine Animal, etc). |
Event Ontology - http://purl.org/NET/c4dm/event.owl | event | Describes reified events (i.e. Event, location, time, active agents, etc). |
In this section, we provide a comprehensive list of practices for collaborative vocabulary development. We derived this list from our own experience in creating vocabularies such as SCORVoc 5 and MobiVoc 6, in combination with the results of the aforementioned analysis. The practices serve as guidelines that help to focus on the most important aspects of the vocabulary creation process. They are therefore expected to increase the efficiency of the collaboration and to improve the overall quality of the vocabulary. For the sake of clarity, note that we consider vocabularies to be lightweight ontologies. Figure fig:phases depicts the main aspects of our approach, which are described in detail in the remainder of this section.
These guidelines are independent of the concrete development environment and can be applied in various circumstances. Nevertheless, we encourage the adoption of a well-known distributed version control system as the basic collaboration support environment. We have chosen Git for two reasons. On the one hand, Git is a mature version control system supported by very sophisticated tools and is therefore broadly used in software development projects. More than 10 million repositories 7 are hosted on GitHub for open source and commercial projects [Kalliamvakou et al., 2015]. On the other hand, existing popular vocabularies such as schema.org 8, Description of a Project (DOAP) 9, and the Music Ontology 10 also use GitHub to facilitate the collaboration of their respective communities. This indicates that the vocabulary development community is already familiar with Git.
To clarify some of the proposed practices, consider the following example: we want to reuse the term Person. We start with the statistics that LOV provides for this term.
Table tab:reuse shows that the two best candidates for reuse are foaf:Person and schema:Person. akt:Person has a significant number of occurrences, but it is reused by only five other vocabularies and does not belong to the authoritative vocabularies. We should therefore choose between our two best-ranked candidates. To do this, we should check in the ontology to be reused not only the ranked values but also whether (1) the term is minimally documented (i.e. uses rdfs:label and rdfs:comment, or skos:prefLabel and skos:definition); (2) the term follows a naming convention (i.e. CamelCase or another notation used consistently); and (3) the term is annotated at least in English (i.e. rdfs:label "Person"@en). In this example, the best choice for reuse is the class foaf:Person. Note that an alignment exists between the classes of our two best-ranked candidates: both ontologies, FOAF and Schema.org, contain owl:equivalentClass axioms that align their respective Person terms. Semantically, either class can therefore be reused. Nevertheless, we encourage observing all of the mentioned criteria for reuse in order to obtain a clean and documented ontology design.
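The three reuse criteria can be expressed as simple checks. The following is a minimal sketch under stated assumptions: the `Term` structure, the `ex:` terms, and their metadata are hypothetical and filled in by hand; they are not taken from LOV, FOAF, or Schema.org, and the naming check is only a rough approximation of CamelCase.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Term:
    """Hypothetical, hand-filled summary of a candidate term's metadata."""
    curie: str                                     # e.g. "ex:Person"
    labels: dict = field(default_factory=dict)     # language tag -> label text
    comments: dict = field(default_factory=dict)   # language tag -> description


def is_documented(term):
    # Criterion (1): at least one label and one description are present.
    return bool(term.labels) and bool(term.comments)


def follows_naming_convention(term):
    # Criterion (2), approximated: the local name starts with a letter and
    # contains only letters and digits (no underscores or dashes).
    local_name = term.curie.split(":", 1)[1]
    return re.fullmatch(r"[A-Za-z][A-Za-z0-9]*", local_name) is not None


def has_english_label(term):
    # Criterion (3): a label tagged with @en exists.
    return "en" in term.labels


def reuse_checks(term):
    """Return the outcome of each criterion for one candidate term."""
    return {
        "documented": is_documented(term),
        "naming": follows_naming_convention(term),
        "english": has_english_label(term),
    }


# Illustrative candidates only; "ex:" is a made-up namespace prefix.
good = Term("ex:Person", {"en": "Person"}, {"en": "A person."})
poor = Term("ex:some_person", {"": "person"})

print(reuse_checks(good))
print(reuse_checks(poor))
```

A candidate that fails one of the checks is not necessarily unusable, but the result makes the trade-off between the ranked candidates explicit.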
The larger and more complex a vocabulary is, the more difficult its development and maintenance become. In this regard, modularization is an important aspect of collaborative vocabulary development [Suarez-Figueroa et al., 2012]. [Poveda-Villalón, 2012] describes an ontology module as a loosely coupled and self-contained component of an ontology that keeps relationships with other ontology modules. Even though ontology modules are in some cases considered to be independent ontologies [d’Aquin et al., 2008], from the development perspective components are not treated as independent elements.
Organizing a vocabulary in files, where each file represents a module, is a way of managing modularity within the development process. Furthermore, some reports suggest that a module in a mid-sized vocabulary should contain between 200 and 300 lines of code [Schlicht and Stuckenschmidt, 2006]. Since modularity depends on the overall size of the vocabulary, we see the following three possibilities for organizing the file structure with respect to modularity.
There exist some naming patterns for ontologies 15 that aim to determine best practices for naming ontology elements. Considering the literature on this topic [Schober et al., 2007, Schober et al., 2009, Montiel-Ponsoda et al., 2011] and the results of our analysis, we propose some practices to be observed when naming vocabulary terms. For vocabulary construction, the use of the CamelCase notation is a best practice [Svátek et al., 2009]. Our study also found this notation in 62% of the cases. Therefore, we propose using this notation in vocabulary construction.
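Whether a vocabulary follows one notation consistently can be checked mechanically. The sketch below is an illustration, not the procedure used in our study: the regular expression is one possible approximation of CamelCase, and the style names and sample local names are assumptions chosen for the example.

```python
import re
from collections import Counter

# Approximation of CamelCase: letters and digits only, words separated by
# an uppercase letter (covers both "ChargingPoint" and "hasPart").
CAMEL_CASE = re.compile(r"^[A-Za-z][a-z0-9]*(?:[A-Z][a-z0-9]*)*$")


def naming_style(local_name):
    """Classify the local name of a vocabulary term by its notation."""
    if "_" in local_name:
        return "underscore"
    if "-" in local_name:
        return "dash"
    if CAMEL_CASE.match(local_name):
        return "CamelCase"
    return "other"


def dominant_style(local_names):
    """Return the most frequent style and the share of names that use it."""
    counts = Counter(naming_style(name) for name in local_names)
    style, count = counts.most_common(1)[0]
    return style, count / len(local_names)


# Hypothetical list of local names from one vocabulary.
names = ["ChargingPoint", "SupplyChain", "charging_point", "Process"]
print(dominant_style(names))
```

A share below 1.0 indicates that the vocabulary mixes notations and that some terms should be renamed before publication.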
One of the four rules that should be followed during vocabulary development is naming things with HTTP URIs 16. Adopting HTTP URIs for identifying things is appropriate for two reasons: (1) they make it simple to create globally unique keys in a decentralized fashion, and (2) the generated key serves not just as a name but also as a means of accessing information about the identified thing.
By combining dereferenceability with content negotiation 17, the server provides adequate content for a resource based on the type of request. There are three different strategies for making the URIs of resources dereferenceable: (1) slash URIs; (2) hash URIs; and (3) a combination of both 18.
In the MobiVoc 19 vocabulary, the URI that identifies the ChargingPoint resource is:
http://purl.org/net/mobivoc/ChargingPoint
The URI of the Turtle representation of this resource should be:
http://purl.org/net/mobivoc/ChargingPoint.ttl
and the URI of the HTML representation should be:
http://purl.org/net/mobivoc/ChargingPoint.html
To get information about ChargingPoint, the client provides the URI and specifies the requested content type. The server then responds with 303 See Other, redirecting to the appropriate representation.
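The redirect decision can be sketched as a small function. This is a minimal illustration, assuming that Turtle is requested as `text/turtle` and HTML as `text/html`, and that HTML is the fallback; the function names and the fallback rule are hypothetical and do not describe the actual PURL server configuration.

```python
# Assumed mapping from requested media type to representation suffix.
SUFFIX_BY_MEDIA_TYPE = {"text/turtle": ".ttl", "text/html": ".html"}


def parse_accept(header):
    """Return the media types of an Accept header, highest q-value first."""
    ranked = []
    for position, part in enumerate(header.split(",")):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality value when none is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if q > 0:  # q=0 means "not acceptable"
            ranked.append((-q, position, media_type))
    return [media_type for _, _, media_type in sorted(ranked)]


def redirect_for(resource_uri, accept_header, default_suffix=".html"):
    """Return (status, location) of the 303 redirect for a resource URI."""
    for media_type in parse_accept(accept_header):
        suffix = SUFFIX_BY_MEDIA_TYPE.get(media_type)
        if suffix:
            return 303, resource_uri + suffix
    # No supported type requested (e.g. "*/*"): fall back to the default.
    return 303, resource_uri + default_suffix


uri = "http://purl.org/net/mobivoc/ChargingPoint"
print(redirect_for(uri, "text/turtle"))
print(redirect_for(uri, "text/html;q=0.5, text/turtle"))
```

A client that prefers Turtle is thus redirected to the `.ttl` URI, while a browser asking for HTML is redirected to the `.html` URI.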
The URI of the SCORVoc 20 vocabulary is:
http://purl.org/eis/vocab/scor
The URI of the Process resource is:
http://purl.org/eis/vocab/scor#Process
http://purl.org/eis/vocab/scor/Process#this
scor:SupplyChain rdf:type owl:Class ;
    rdfs:label "SupplyChain"@en ;
    rdfs:comment "A Supply Chain is a ..."@en ;
    rdfs:label "Lieferkette"@de ;
    rdfs:comment "Eine Lieferkette ist ..."@de .

This approach should be followed for all elements, starting with the basic ones such as rdfs:label and rdfs:comment, but also for external annotation properties (i.e. skos:prefLabel).
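Missing translations can be detected with a simple scan of the vocabulary file. The following sketch uses a regular expression rather than a full Turtle parser, so it only handles single-line, double-quoted literals; the function names and the set of required languages are assumptions made for illustration.

```python
import re

# Matches rdfs:label / rdfs:comment followed by a quoted literal and a
# language tag, e.g. rdfs:label "Lieferkette"@de
LITERAL = re.compile(
    r'(rdfs:label|rdfs:comment)\s+"(?:[^"\\]|\\.)*"@([A-Za-z]+(?:-[A-Za-z0-9]+)*)'
)


def language_coverage(turtle_text):
    """Map each annotation property to the set of language tags it uses."""
    coverage = {}
    for prop, lang in LITERAL.findall(turtle_text):
        coverage.setdefault(prop, set()).add(lang.lower())
    return coverage


def missing_languages(turtle_text, required):
    """Return, per property, the required languages that have no literal."""
    coverage = language_coverage(turtle_text)
    return {
        prop: set(required) - coverage.get(prop, set())
        for prop in ("rdfs:label", "rdfs:comment")
    }


SNIPPET = '''
scor:SupplyChain rdf:type owl:Class ;
    rdfs:label "SupplyChain"@en ;
    rdfs:comment "A Supply Chain is a ..."@en ;
    rdfs:label "Lieferkette"@de ;
    rdfs:comment "Eine Lieferkette ist ..."@de .
'''

print(language_coverage(SNIPPET))
print(missing_languages(SNIPPET, {"en", "de", "es"}))
```

For the SupplyChain example, English and German are covered for both properties, while a required third language such as Spanish would be reported as missing.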
Providing a user-friendly view of vocabularies for non-experts is crucial for integrating the Semantic Web with the everyday Web [Peroni et al., 2013]. It facilitates the contribution of domain experts during the development process. In addition, it makes the vocabulary easier to use for other interested parties in later phases. There exist different tools for documentation generation. Basically, these tools require the following information to be present for each resource to enable the generation process.
Validation is an important aspect of the ontology development process [Yu et al., 2009, Poveda-Villalón et al., 2012]. It analyzes whether the ontology correctly represents the knowledge domain in accordance with user requirements and best practices [Gómez-Pérez et al., 2006, Yu et al., 2009, Kezadri and Pantel, 2010]. The criteria used for validation are: (1) correctness; (2) completeness; and (3) consistency [Suárez-Figueroa, 2010]. To address these criteria, we propose the following practices.