fix: typo and some link to the algorithm
GabrieleT0 committed Dec 6, 2023
1 parent d305147 commit 733c05a
Showing 8 changed files with 34 additions and 56 deletions.
34 changes: 19 additions & 15 deletions docs/index.md
@@ -30,7 +30,7 @@ layout: home
<tr>
<td rowspan="1">Algorithm</td>
<td colspan="3">
<a href="https://isislab-unisa.github.io/KGHeartbeat/quality_dimensions/availability#sparql-endpoint">Availability/SPARQL-endpoint</a>
<a href="https://isislab-unisa.github.io/KGHeartbeat/quality_dimensions/availability#accessibility-of-the-sparql-endpoint">Availability/SPARQL-endpoint</a>
</td>
</tr>
<tr>
@@ -59,7 +59,7 @@ layout: home
<tr>
<td rowspan="1">Algorithm</td>
<td colspan="3">
<a href="https://isislab-unisa.github.io/KGHeartbeat/quality_dimensions/availability#rdf-dump">Availability/RDF-Dump</a>
<a href="https://isislab-unisa.github.io/KGHeartbeat/quality_dimensions/availability#accessibility-of-the-rdf-dump">Availability/RDF-Dump</a>
</td>
</tr>
<tr>
@@ -88,7 +88,7 @@ layout: home
<tr>
<td rowspan="1">Algorithm</td>
<td colspan="3">
<a href="https://isislab-unisa.github.io/KGHeartbeat/quality_dimensions/availability#uris-dereferenciability">Availability/URIs-dereferenciability</a>
<a href="https://isislab-unisa.github.io/KGHeartbeat/quality_dimensions/availability#derefereaceability-of-the-uri">Availability/URIs-dereferenciability</a>
</td>
</tr>
<tr>
@@ -233,7 +233,7 @@ richness through sameAs by using network measures</i></td>
<tr>
<td rowspan="1">Algorithm</td>
<td colspan="3">
<a href="https://isislab-unisa.github.io/KGHeartbeat/quality_dimensions/interlinking#number-of-same-as-chains">Interlinking/sameAs</a>
<a href="https://isislab-unisa.github.io/KGHeartbeat/quality_dimensions/interlinking#sameas-chains">Interlinking/sameAs</a>
</td>
</tr>
<tr>
@@ -310,7 +310,7 @@ richness through sameAs by using network measures</i></td>
<tr>
<td rowspan="1">Algorithm</td>
<td colspan="3">
<a href="https://isislab-unisa.github.io/KGHeartbeat/quality_dimensions/performance">Performance/Latency</a>
<a href="https://isislab-unisa.github.io/KGHeartbeat/quality_dimensions/performance#low-latency">Performance/Latency</a>
</td>
</tr>
<tr>
@@ -331,7 +331,7 @@ richness through sameAs by using network measures</i></td>
<tr>
<td rowspan="1">Algorithm</td>
<td colspan="3">
<a href="https://isislab-unisa.github.io/KGHeartbeat/quality_dimensions/performance#throughput">Performance/Throughput</a>
<a href="https://isislab-unisa.github.io/KGHeartbeat/quality_dimensions/performance#high-throughput">Performance/Throughput</a>
</td>
</tr>
<tr>
@@ -622,7 +622,7 @@ richness through sameAs by using network measures</i></td>
<tr>
<td rowspan="1">Algorithm</td>
<td colspan="3">
<a href="https://isislab-unisa.github.io/KGHeartbeat/quality_dimensions/reputation#pagerank">Reputation/PageRank</a>
<a href="https://isislab-unisa.github.io/KGHeartbeat/quality_dimensions/reputation#reputation-of-the-dataset">Reputation/PageRank</a>
</td>
</tr>
<tr>
@@ -650,9 +650,13 @@ richness through sameAs by using network measures</i></td>
</td>
</tr>
<tr>
<td>Output</td>
<td>[0,1]</td>
<td>Best value: 1</td>
<td rowspan="2">Output</td>
<td>0</td>
<td>if the provider isn't in the list of trusted providers</td>
</tr>
<tr>
<td>1</td>
<td>if the provider is in the list of trusted providers</td>
</tr>
<tr><tr><tr><tr></tr></tr></tr></tr>
<tr>
@@ -1027,7 +1031,7 @@ richness through sameAs by using network measures</i></td>
<th colspan="5" style="text-align: center;">Understandability</th>
</tr>
<tr>
<td rowspan="8">human-readable labelling of classes, properties and entities by providing rdfs:label</td>
<td rowspan="8">Human-readable labelling of classes, properties and entities by providing rdfs:label</td>
<td rowspan="8"><a href="https://bit.ly/3RtIeWV">bit.ly/3RtIeWV</a></td>
<td colspan="4"><i>no. of entities described by stating an rdfs:label or rdfs:comment in the dataset / total no. of entities described in the data</i></td>
</tr>
@@ -1098,7 +1102,7 @@ richness through sameAs by using network measures</i></td>
</tr>
<tr><tr><tr><tr></tr></tr></tr></tr>
<tr>
<td rowspan="8">indication of a regular expression that matches the URIs of a dataset</td>
<td rowspan="8">Indication of a regular expression that matches the URIs of a dataset</td>
<td rowspan="8"><a href="https://bit.ly/3RtIeWV">bit.ly/3RtIeWV</a></td>
<td colspan="4"><i>detecting whether a regular expression that matches the
URIs is present </i></td>
@@ -1124,7 +1128,7 @@ URIs is present </i></td>
</tr>
<tr><tr><tr><tr></tr></tr></tr></tr>
<tr>
<td rowspan="8">indication of the vocabularies used in the dataset</td>
<td rowspan="8">Indication of the vocabularies used in the dataset</td>
<td rowspan="8"><a href="https://bit.ly/3RtIeWV">bit.ly/3RtIeWV</a></td>
<td colspan="4"><i>checking whether a list of vocabularies used in the dataset is provided</i></td>
</tr>
@@ -1148,7 +1152,7 @@ URIs is present </i></td>
<th colspan="5" style="text-align: center;">Interpretability</th>
</tr>
<tr>
<td rowspan="8">no misinterpretation of missing values</td>
<td rowspan="8">No misinterpretation of missing values</td>
<td rowspan="8"><a href="https://bit.ly/3RtIeWV">bit.ly/3RtIeWV</a></td>
<td colspan="4"><i>detecting the use of blank nodes</i></td>
</tr>
@@ -1169,7 +1173,7 @@ URIs is present </i></td>
</tr>
<tr><tr><tr><tr></tr></tr></tr></tr>
<tr>
<td rowspan="8">atypical use of collections containers and reification</td>
<td rowspan="8">Atypical use of collections containers and reification</td>
<td rowspan="8"><a href="https://bit.ly/3RtIeWV">bit.ly/3RtIeWV</a></td>
<td colspan="4"><i>detection of the non-standard usage of collections, containers and reification</i></td>
</tr>
16 changes: 6 additions & 10 deletions docs/quality_dimensions/availability.md
@@ -3,12 +3,11 @@ title: Accessibility category
---

## Availability
1. [SPARQL endpoint](#sparql-endpoint)
2. [RDF Dump](#rdf-dump)
3. [URIs dereferenciability](#uris-dereferenciability)
4. [Inactive links](#inactive-links)
1. [Accessibility of the SPARQL endpoint](#accessibility-of-the-sparql-endpoint)
2. [Accessibility of the RDF dump](#accessibility-of-the-rdf-dump)
3. [Derefereaceability of the URI](#derefereaceability-of-the-uri)

#### **SPARQL endpoint**
#### **Accessibility of the SPARQL endpoint**
First of all, we need to check that a SPARQL endpoint is present
for the KG we are considering. The SPARQL endpoint link can be recovered in three different ways:
1. The first (and easiest) is to analyze the metadata and search the resources field for the resource tagged api/sparql or whose key is sparql.
@@ -32,7 +31,7 @@ offline and given value 0.

---

#### **RDF dump**
#### **Accessibility of the RDF dump**
To check for the presence of the RDF dump we have three possible approaches:
1. We can analyze the metadata and check if in the resources field there are one or more resources with one of the following tags: ```application/rdf+xml```, ```text/turtle```, ```application/x-ntriples```, ```application/x-nquads```, ```text/n3```, ```rdf```,```text/rdf+n3```, ```rdf/turtle```.
2. Another method is to check inside the VoID file (if available). In this case we search for the triple having ```void:dataDump``` as its predicate.
@@ -49,7 +48,7 @@ Once the dump link has been retrieved, a simple HEAD request is made on the URL,

---

#### **URIs dereferenciability**
#### **Derefereaceability of the URI**
5000 triples (which contain URIs) are randomly retrieved with this query:

```sql
@@ -74,6 +73,3 @@ m_{def} = \frac{|Dereferencable(U_g)|}{|U_g|}
$$
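
The ratio above can be sketched in a few lines of Python. This is only an illustration and not the KGHeartbeat implementation: the function names `is_dereferenceable` and `dereferenceability_score` are hypothetical, and treating a successful HEAD request as "dereferenceable" is an assumption made for the example.

```python
import urllib.error
import urllib.request


def is_dereferenceable(uri, timeout=10):
    """Assumed check: a URI counts as dereferenceable if a HEAD request succeeds."""
    request = urllib.request.Request(uri, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, ValueError, OSError):
        # Network errors, HTTP errors and malformed URIs all count as failures.
        return False


def dereferenceability_score(uris, check=is_dereferenceable):
    """Return |Dereferencable(U_g)| / |U_g| for the sampled URIs."""
    if not uris:
        return 0.0
    dereferenceable = sum(1 for uri in uris if check(uri))
    return dereferenceable / len(uris)
```

The `check` parameter is injectable so that the ratio can be tried out without network access, for example with a set of URIs known to resolve.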

---

#### **Inactive links**
All links present in the "resources" field of the metadata are recovered for the selected KG and a HEAD request is performed on each of these links. If any link is not active, the metric is given a value of 0, otherwise 1.
22 changes: 1 addition & 21 deletions docs/quality_dimensions/believability.md
@@ -8,23 +8,6 @@ title: Trust category

### Meta-information about the identity of information provider

#### **Title**
To recover the title, we simply analyze the KG metadata,
in particular the “title” field.

---

#### **Description**
The description, like the title, can be recovered
from the metadata, where it is present in the “Description” field.

---

#### **Sources**
By KG source we mean all relevant information from the provider. It is a field present within the metadata and is structured as a list of values containing: the web address, name and provider email. The field has the key “sources”.

---

#### Reliable provider
The presence in a list of reliable providers is verified by recovering the keywords associated with the KG. These are present in the metadata in the "keyword" field. Among the many values it contains,
there is also the one relating to the provider. Then, the list is traversed and each value is compared with a list of providers deemed reliable. Since there is not yet a list of reliable providers in the LOD panorama, we build this list by including the most well-known LOD providers. The list can be seen in the following table and is not to be considered exhaustive or definitive.
@@ -48,9 +31,6 @@ there is also the one relating to the provider. Then, the list is traversed and
<td>Academic</td>
</tr>
</table>

The value assigned to this metric will be 1 if the provider is in the list of trusted providers, 0 otherwise
---

#### **Trust value**
The trust value is a score between 0 and 1 that helps the KG user understand how much information about believability is available.
It is calculated as an equal-weight average over the following four values: name, description, URL and presence in the reliable provider list. For each of these values, 1 is assigned if the KG has it, otherwise 0. The values are summed and the sum is divided by 4; the result is the trust value of the dataset. We also use this value to quantize the entire believability dimension, because it summarizes all the metrics.
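
The calculation described above can be sketched as follows. The function name `trust_value` and its parameters are hypothetical and chosen for illustration only; a value counts as present when it is non-empty.

```python
def trust_value(name, description, url, in_reliable_list):
    """Average of four 0/1 flags: name, description, URL, reliable provider."""
    flags = [
        1 if name else 0,
        1 if description else 0,
        1 if url else 0,
        1 if in_reliable_list else 0,
    ]
    return sum(flags) / 4


# Three of the four values are present, so the trust value is 0.75.
score = trust_value("Example KG", "", "https://example.org", True)
```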
1 change: 0 additions & 1 deletion docs/quality_dimensions/currency.md
@@ -6,7 +6,6 @@ title: Dataset dynamicity category
1. [Age of data](#age-of-data)
2. [Specification of the modification date of statements](#specification-of-the-modification-date-of-statements)
3. [Time since last modification](#time-since-last-modification)
4. [History of changes made](#history-of-changes-made)

#### **Age of data**
The value regarding the KG creation date can be obtained from the VoID file or by executing a query on the SPARQL endpoint. In the VoID file we look for a triple having $dcterms:created$ as predicate. Instead the query for the endpoint should be of the type:
4 changes: 2 additions & 2 deletions docs/quality_dimensions/interlinking.md
@@ -6,7 +6,7 @@ title: Accessibility category
1. [Degree of connection](#degree-of-connection)
2. [Clustering coefficient](#clustering-coefficient)
3. [Centrality](#centrality)
4. [Number of *same as* chains](#number-of-same-as-chains)
4. [sameAs chains](#sameas-chains)

For the calculation of the degree of connection, clustering coefficient and centrality, we use a tool for network measurement: the Python library ```networkx```. In KGHeartbeat, the module called [```Graph.py```](https://github.com/isislab-unisa/KGHeartbeat/blob/main/Graph.py) is responsible for the calculation of these three values. In particular, it is responsible for creating the graph that contains all the KGs that can be retrieved automatically from the Internet. The external connections of every KG are analyzed (a field
present in the metadata under the "external links" key) and, for each connection found, we insert the node into the graph, labeled with the id of the KG, and insert the edge with a weight equal to the number of triples with which it is connected to the other KG. The process is then iterated for every KG recovered. At the end of this process, we calculate on this graph the *Degree of connection*, *Clustering coefficient* and *Centrality*.
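
The edge-building step can be sketched in plain Python. This is not the code of ```Graph.py```: the function name `build_edges` and the shape of the metadata records (a dict with an "id" key and an "external links" mapping from target KG id to triple count) are assumptions made for the example.

```python
def build_edges(kg_records):
    """Turn KG metadata records into (source, target, weight) edge tuples."""
    edges = []
    for record in kg_records:
        source = record["id"]
        # Assumed shape: {"target-kg-id": number_of_linking_triples, ...}
        links = record.get("external links", {})
        for target, triple_count in links.items():
            edges.append((source, target, triple_count))
    return edges


records = [
    {"id": "kg-a", "external links": {"kg-b": 10, "kg-c": 3}},
    {"id": "kg-b"},
]
edges = build_edges(records)
```

Tuples of this form can be passed to the `add_weighted_edges_from` method of a ```networkx``` graph, which also creates the nodes they mention.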
@@ -24,7 +24,7 @@ The clustering coefficient (specifically here we calculate the local clustering
Centrality allows us to understand how important the KG is inside the graph and is also a value in the range [0,1]. A higher centrality means a higher importance of the node, that is, it is involved in many connections. Conversely, the lower the centrality, the more peripheral the node is in the graph.

---
#### **Number of *same as* chains**
#### **sameAs chains**
In this case we use the following query which counts the number of triples that have the ```owl:sameAs``` predicate.

```sql
1 change: 0 additions & 1 deletion docs/quality_dimensions/licensing.md
@@ -5,7 +5,6 @@ title: Accessibility category
## Licensing
1. [Machine-readable license](#machine-readable-license)
2. [Human-readable license](#human-readable-license)
3. [License in the metadata](#license-in-the-metadata)


#### **Machine-readable license**
8 changes: 4 additions & 4 deletions docs/quality_dimensions/performance.md
@@ -3,12 +3,12 @@ title: Accessibility category
---

## Performance
1. [Latency](#latency)
2. [Throughput](#throughput)
1. [Low latency](#low-latency)
2. [High Throughput](#high-throughput)

The values calculated in this case are latency and throughput. Since they are highly variable tests, they are repeated several times and the mean, standard deviation, maximum and minimum are calculated. In fact, the values could vary due to the difference in performance of our network over time or the load of the server where the SPARQL endpoint is located (as well as the performance of the server network itself).
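
The repeated measurement and its summary statistics can be sketched as below. The function name `measure_latency` and the `run_query` callable are hypothetical: `run_query` stands in for whatever sends the query to the SPARQL endpoint and waits for the response.

```python
import statistics
import time


def measure_latency(run_query, repetitions=5):
    """Time run_query several times and summarize the samples (in seconds)."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),  # needs at least two samples
        "min": min(samples),
        "max": max(samples),
    }
```

Passing the query as a callable keeps the timing logic independent of the HTTP client in use.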

#### **Latency**
#### **Low latency**
The test is repeated 5 times and involves the execution of one
simple query that retrieves a generic triple of the dataset; we
measure the time between the request for the triple and when
@@ -22,5 +22,5 @@ LIMIT 1
To quantize the latency, if the latency is less than 1 second, then 1 is assigned to this metric. Otherwise we average the five latency measurements and divide by 1000.

---
#### **Throughput**
#### **High Throughput**
In this case too the test is repeated 5 times, using the same query as before, but here we measure how many requests we can complete in one second. The query executes in a while loop that stops after one second, and a counter is incremented each time the query returns the result. At the end of each test, this counter contains the number of requests completed with a response. To quantize the metric, if the throughput is greater than 5, we assign 1 to this metric. Otherwise we divide the throughput obtained by 200 and the result is the value for the metric.
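
The counting loop described above can be sketched as follows. The function name `measure_throughput` and the `run_query` callable are hypothetical; `run_query` stands in for one request to the SPARQL endpoint that returns once the result has arrived.

```python
import time


def measure_throughput(run_query, window_seconds=1.0):
    """Count how many times run_query completes within the time window."""
    count = 0
    deadline = time.perf_counter() + window_seconds
    while time.perf_counter() < deadline:
        run_query()
        count += 1  # incremented only after the query has returned
    return count
```

A request that starts just before the deadline is still counted, so the measured window can slightly exceed one second when a single query is slow.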
4 changes: 2 additions & 2 deletions docs/quality_dimensions/reputation.md
@@ -3,7 +3,7 @@ title: Trust category
---

## Reputation
1. [PageRank](#pagerank)
1. [Reputation of the dataset](#reputation-of-the-dataset)

#### **PageRank**
#### **Reputation of the dataset**
Since the graph containing all the KGs and their external links has already been built for the interlinking calculation (see [here](./interlinking)), PageRank is computed on that graph for the node of interest, i.e. the node corresponding to the KG we are analyzing. We use the function made available by networkx and pass it the ID of the KG whose PageRank we want to calculate. The function will return a value between 1 and 10. The closer the value is to 10, the higher the reputation of the KG and therefore its quality. To quantize the metric and get a value between 0 and 1, we divide the PageRank by 10.0.
