ABSTAT is an ontology-driven linked data profiling framework that helps users understand the content of a data set with little effort.
Given an RDF data set and, optionally, an ontology used in the data set, ABSTAT computes a semantic profile that consists of a summary and statistics. ABSTAT’s summary is a collection of patterns known as Abstract Knowledge Patterns (AKPs) of the form <subjectType, pred, objectType>. Each AKP represents the occurrence of triples <sub, pred, obj> in the data such that subjectType is a minimal type of the subject and objectType is a minimal type of the object. By type, we mean either an ontology class (e.g., foaf:Person) or a datatype (e.g., xsd:dateTime). Profiles are published and made accessible via web interfaces, so that the information they contain can be consumed by both users and machines (via APIs).
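As an illustration of how AKPs relate to minimal types, the following is a minimal sketch and not ABSTAT's actual implementation. The toy subtype graph, the `ex:` identifiers, and the function names are hypothetical, and two simplifications are assumed: only entity objects are handled (no literals or datatypes), and untyped resources fall back to owl:Thing.

```python
from collections import Counter

# Hypothetical toy subtype graph: class -> set of direct superclasses.
SUBCLASS_OF = {
    "ex:Actor": {"ex:Person"},
    "ex:Person": {"owl:Thing"},
    "ex:Film": {"owl:Thing"},
}

def ancestors(cls):
    """Return all strict superclasses of cls in the toy subtype graph."""
    seen, stack = set(), list(SUBCLASS_OF.get(cls, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(SUBCLASS_OF.get(parent, ()))
    return seen

def minimal_types(types):
    """Keep only the types that are not a superclass of another given type."""
    redundant = set()
    for t in types:
        redundant |= ancestors(t)
    return {t for t in types if t not in redundant}

def abstract_patterns(triples, types_of):
    """Count <subjectType, pred, objectType> patterns over minimal types."""
    counts = Counter()
    for s, p, o in triples:
        # Assumption: resources without an asserted type default to owl:Thing.
        for st in minimal_types(types_of.get(s, {"owl:Thing"})):
            for ot in minimal_types(types_of.get(o, {"owl:Thing"})):
                counts[(st, p, ot)] += 1
    return counts

triples = [
    ("ex:alice", "ex:knownFor", "ex:film1"),
    ("ex:bob", "ex:knownFor", "ex:film1"),
]
types_of = {
    "ex:alice": {"ex:Actor", "ex:Person"},  # ex:Person is redundant here
    "ex:bob": {"ex:Person"},
    "ex:film1": {"ex:Film"},
}
patterns = abstract_patterns(triples, types_of)
for akp, frequency in sorted(patterns.items()):
    print(akp, frequency)
```

In this toy data, the triple about ex:alice contributes only to the pattern with ex:Actor as subject type, because ex:Person is a superclass of ex:Actor and is therefore not minimal for that resource.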
The main features of ABSTAT are:
- It adopts a minimalization approach that produces profiles that are both compact and complete. By considering only the minimal types of resources, computed with the help of the data ontology, ABSTAT excludes many redundant AKPs from the summary. Compact means that the number of patterns a user has to explore to understand the data is very low compared with the number of triples in the data set. Complete means that every pattern not present in the profile can be inferred from the subtype graph computed from the ontology.
- For each pattern, several statistics are returned. Consider the pattern highlighted in the black box, which relates the types Person and Film through the predicate knownFor. The frequency of the pattern shows how many times the pattern occurs in the data set. The number of instances shows how many instances have this pattern, including those for which the types Person and Film and the predicate knownFor can be inferred. Max (Min, Avg) subjs-obj cardinality is the maximal (minimal, average) number of distinct entities of type Person linked to a single entity of type Film through the predicate knownFor. Max (Min, Avg) subj-objs cardinality is the maximal (minimal, average) number of distinct entities of type Film linked to a single entity of type Person through the predicate knownFor. Frequency is also given for types and predicates.
- SHACL validation: For each pattern, a SHACL (Shapes Constraint Language) profile is generated. Because SHACL is used to validate constraints on the data, the cardinality statistics computed by ABSTAT serve as constraints. Several heuristics can be applied to generate such constraints. For a first skim of the data, we apply the heuristic that sets the maximum cardinality to twice the average.
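The cardinality statistics and the "twice the average" heuristic described above can be sketched as follows. This is an illustration under stated assumptions and not ABSTAT's code: the function names are hypothetical, the input is assumed to be the (subject, object) pairs of the triples matching one pattern, rounding the bound up to an integer is an assumption, and the Turtle snippet (prefix declarations omitted) is only one possible way to express such a bound with standard SHACL terms.

```python
from collections import defaultdict
from math import ceil

def cardinality_stats(pairs):
    """Compute frequency and cardinality statistics for one pattern.

    pairs: (subject, object) pairs of the triples matching the pattern.
    """
    pairs = set(pairs)  # an RDF graph is a set, so duplicates are ignored
    objs_per_subj = defaultdict(set)
    subjs_per_obj = defaultdict(set)
    for s, o in pairs:
        objs_per_subj[s].add(o)
        subjs_per_obj[o].add(s)

    def summarize(groups):
        sizes = [len(members) for members in groups.values()]
        return {"max": max(sizes), "min": min(sizes), "avg": sum(sizes) / len(sizes)}

    return {
        "frequency": len(pairs),
        "subj-objs": summarize(objs_per_subj),  # distinct objects per subject
        "subjs-obj": summarize(subjs_per_obj),  # distinct subjects per object
    }

def max_count_heuristic(avg):
    """Maximum cardinality as twice the average (rounding up is an assumption)."""
    return ceil(2 * avg)

def render_shape(subject_type, pred, object_type, max_count):
    """Render one possible SHACL shape for the subj-objs bound as Turtle text."""
    return (
        f"[] a sh:NodeShape ;\n"
        f"    sh:targetClass {subject_type} ;\n"
        f"    sh:property [\n"
        f"        sh:path {pred} ;\n"
        f"        sh:qualifiedValueShape [ sh:class {object_type} ] ;\n"
        f"        sh:qualifiedMaxCount {max_count} ;\n"
        f"    ] .\n"
    )

pairs = [("alice", "f1"), ("alice", "f2"), ("alice", "f3"), ("bob", "f1")]
stats = cardinality_stats(pairs)
bound = max_count_heuristic(stats["subj-objs"]["avg"])
print(stats)
print(render_shape("ex:Person", "ex:knownFor", "ex:Film", bound))
```

A qualified count is used in the sketch because the statistic is defined per pattern, that is, only objects of the pattern's object type are counted. A plain sh:maxCount would instead constrain all values of the predicate.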
Areas of Application
- Data understanding. ABSTAT summaries provide a complete overview of the content of a data set; this feature has proven useful for supporting, for example, SPARQL query formulation.
- Vocabulary matching for table annotation. Summaries record rich statistics about the usage of vocabularies/ontologies in data sets. We can therefore use summaries to suggest types and properties that match a string, for example, the header of a column in a table that we want to publish in RDF by reusing existing vocabularies. In addition, the statistics provide valuable information to algorithms that suggest the best properties and types to use when transforming tabular data into RDF.
- Data quality: Summaries and cardinality profiles can help detect data quality issues in RDF knowledge graphs. For example, to find quality issues in company-related data in DBpedia, one can look at ABSTAT’s profiles. One would find that a single instance of the concept owl:Thing is specified as the object of the dbo:keyPerson property for as many as 5263 different companies (dbo:Company). If we combine this with the evidence that the average number of distinct objects associated via the dbo:keyPerson property with instances of dbo:Company is three, we can conclude that there is an outlier in the data set. A query to the knowledge graph would reveal that the instance of owl:Thing specified as key person of 5263 companies is a generic instance named dbr:ceo, which DBpedia associates with companies that have no other key person. To sum up, ABSTAT profiles can support the identification of anomalous schema patterns.
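The kind of outlier check described in the data quality example can be sketched as follows. This is a hedged illustration only: the function name, the threshold rule (an object is flagged when its number of distinct subjects exceeds a chosen multiple of the average), the default factor, and the toy data are assumptions and are not taken from ABSTAT or DBpedia.

```python
from collections import defaultdict

def flag_hub_objects(pairs, factor=10.0):
    """Flag objects linked to far more distinct subjects than the average.

    pairs: (subject, object) pairs for one predicate, e.g. dbo:keyPerson.
    factor: hypothetical multiple of the average used as the threshold.
    """
    subjs_per_obj = defaultdict(set)
    for s, o in pairs:
        subjs_per_obj[o].add(s)
    avg = sum(len(subjs) for subjs in subjs_per_obj.values()) / len(subjs_per_obj)
    return {
        obj: len(subjs)
        for obj, subjs in subjs_per_obj.items()
        if len(subjs) > factor * avg
    }

# Toy data: 60 companies point to a generic placeholder object,
# 40 companies each point to their own distinct key person.
pairs = [(f"company{i}", "dbr:ceo") for i in range(60)]
pairs += [(f"company{i}", f"person{i}") for i in range(60, 100)]

flagged = flag_hub_objects(pairs)
print(flagged)
```

A flagged object is only a candidate for inspection. As in the dbr:ceo case, a follow-up query against the knowledge graph is still needed to explain what the object is.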