Lab: Wikidata in RDF

Topics

Wikidata in RDF:

  • retrieve primary triples about a Wikidata entity
  • load the semantic data and metadata into GraphDB
  • visualise the semantic data and metadata

Motivation: So far you have built your own knowledge graph and worked on a small graph you were given. This week we will look at how to retrieve knowledge graphs from Wikidata, which can then be merged with your own graph to provide additional context. This is not a trivial problem because Wikidata most likely contains a lot more data, and in particular metadata, than you need.

Useful materials

Tasks

Getting ready: In a web browser, go to Wikidata's Query Service (WDQS). Be careful to always use a limit like LIMIT 100 when you test things. Otherwise, you risk being blocked from the query service or, worse, you risk blocking out a whole subdomain.

Emergency data: If Wikidata's Query Service is unavailable, you can load this Turtle file into GraphDB instead, and continue there using Q42 as your example entity. (Remember to rename the file from .tar to .ttl. It is not a .tar file.)

Task: From Wikidata's ordinary UI, find the Q-code of one of the people or entities involved in the Mueller investigation. Use that entity as your reference in the rest of this lab. (The Q-code should look like this: https://www.wikidata.org/entity/Q42, or wd:Q42 in prefixed form.)

Task: Use a DESCRIBE query to retrieve some triples about your entity (remember LIMIT 100, although it is less critical on DESCRIBE queries).
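
As a starting point, a DESCRIBE query could look like the sketch below. wd:Q42 is only a placeholder (the example entity mentioned earlier), so swap in your own Q-code. Note that in SPARQL a LIMIT restricts the solutions matched by the query pattern, not the number of triples in the returned description.

```sparql
# wd: is predeclared in WDQS; Q42 is a placeholder for your own entity
DESCRIBE wd:Q42
LIMIT 100
```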

Task: Use a SELECT query to retrieve the first 100 triples about your entity.
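
One possible shape for this query, again with wd:Q42 standing in for your own entity and with ?p and ?o as arbitrary variable names:

```sparql
SELECT ?p ?o WHERE {
  wd:Q42 ?p ?o .   # every triple that has the entity as its subject
}
LIMIT 100          # keep the result small while testing
```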

Tip: Always save your queries and updates as soon as they succeed. You may need to go back to them later.

Task: Start GraphDB on your local machine. Create a new repository (No inference needed), and activate it. Write a local SELECT query that embeds a <https://query.wikidata.org/bigdata/namespace/wdq/sparql> SERVICE query to retrieve the first 100 triples about your entity to your local machine.

Tip: wd: is a PREFIX for <http://www.wikidata.org/entity/>.

Tip: To make LIMIT work inside a SERVICE query, you have to add another SELECT inside it, like this:

SELECT ... {  # the local query
    SERVICE ... {  # the remote service
        SELECT ... {
            ...
        } LIMIT 100  # this limit works on the remote service
    }
}  # a limit here would work on your local service, 
   # but is not strictly necessary when you already have an inner LIMIT
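
Filled in, the skeleton might look like the following sketch. It assumes wd:Q42 as a placeholder entity and reuses the WDQS endpoint URL given in the task. The prefix has to be declared because GraphDB parses the whole query before forwarding the inner part.

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>

SELECT ?p ?o WHERE {                  # the local query, run by GraphDB
  SERVICE <https://query.wikidata.org/bigdata/namespace/wdq/sparql> {
    SELECT ?p ?o WHERE {              # the sub-query, run by Wikidata
      wd:Q42 ?p ?o .
    }
    LIMIT 100                         # applied on the remote service
  }
}
```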

Task: Change the SELECT query to an INSERT query that adds the Wikidata triples to your local repository. Use a local ASK and/or SELECT query to check that the triples have actually been added.
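
A sketch of what the update could look like, assuming wd:Q42 as the placeholder entity. The INSERT template rebuilds each triple from the variables bound by the remote sub-query.

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>

INSERT {
  wd:Q42 ?p ?o .                      # written to the local repository
}
WHERE {
  SERVICE <https://query.wikidata.org/bigdata/namespace/wdq/sparql> {
    SELECT ?p ?o WHERE {
      wd:Q42 ?p ?o .
    }
    LIMIT 100
  }
}
```

The update and the check are separate requests. A minimal check, run afterwards as an ordinary query, could be:

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>

# true if at least one triple about the entity is stored locally
ASK { wd:Q42 ?p ?o . }
```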

Task: Go back to the Wikidata Query Service (WDQS). (You can run the rest of the lab using a remote SERVICE inside GraphDB, but using WDQS might give you better error messages etc.)

Primary Wikidata statements use the prefix wd: for resources and wdt: for predicates. Use a FILTER statement to only SELECT primary triples in this sense.

These PREFIXes are built into WDQS, but you will need them if you run inside GraphDB:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
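
One way to express "primary triple" as a FILTER is to compare the string form of the predicate and object IRIs with the two namespaces. This is only a sketch with wd:Q42 as the placeholder entity, and other formulations (for example with REGEX) would work as well.

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?p ?o WHERE {
  wd:Q42 ?p ?o .
  FILTER (
    STRSTARTS(STR(?p), STR(wdt:)) &&              # predicate is a direct property
    isIRI(?o) && STRSTARTS(STR(?o), STR(wd:))     # object is a Wikidata entity
  )
}
LIMIT 100
```

STR(wdt:) expands the bare prefix to its namespace IRI and turns it into a string, which is what STRSTARTS needs.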

Task: Use Wikidata's in-built SERVICE wikibase:label to get labels for all the object resources. (Autocompletion with Ctrl-Space will help you set up the service.)

Afterwards, instead of SELECT * or SELECT ?p ?o, you can write SELECT ?p ?o ?oLabel to see the labels for all resource objects.
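
With the label service added, the query might look like this sketch for WDQS, where the wd:, wdt:, wikibase: and bd: prefixes are predeclared. wd:Q42 is a placeholder, and the language code "en" is an assumption you can change.

```sparql
SELECT ?p ?o ?oLabel WHERE {
  wd:Q42 ?p ?o .
  FILTER (
    STRSTARTS(STR(?p), STR(wdt:)) &&
    isIRI(?o) && STRSTARTS(STR(?o), STR(wd:))
  )
  # the service binds ?oLabel for the variable ?o
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 100
```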

Task: You now have labels for all the resource objects, but you have no primary triples with literal values. Edit your query (by relaxing the FILTER expression) so it also returns triples where the object has DATATYPE xsd:string.
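
A possible relaxed FILTER is sketched below for WDQS (predeclared prefixes, wd:Q42 as placeholder). The predicate test stays the same, while the object test becomes an either/or.

```sparql
SELECT ?p ?o ?oLabel WHERE {
  wd:Q42 ?p ?o .
  FILTER (
    STRSTARTS(STR(?p), STR(wdt:)) &&
    (
      (isIRI(?o) && STRSTARTS(STR(?o), STR(wd:))) ||   # resource objects
      (isLiteral(?o) && DATATYPE(?o) = xsd:string)     # plain string literals
    )
  )
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 100
```

The isLiteral(?o) guard is there because DATATYPE raises an error when applied to an IRI.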

Task: You still do not have the "fingerprint" triples, i.e., the label, aliases and description of your reference entity. Wikidata uses special properties like rdfs:label, skos:altLabel and schema:description for these. Relax the FILTER expression again so it also returns triples with these three predicates.

PREFIXes you may need:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX schema: <http://schema.org/>
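
To see the fingerprint triples in isolation before merging them into your existing FILTER, a small query like this sketch can help (wd:Q42 is a placeholder). The IN test can then be combined with your primary-triple condition using ||.

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX schema: <http://schema.org/>

SELECT ?p ?o WHERE {
  wd:Q42 ?p ?o .
  # keep only label, alias and description triples
  FILTER (?p IN (rdfs:label, skos:altLabel, schema:description))
}
LIMIT 100
```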

Task: Now you may have too many fingerprint triples! Try to restrict the FILTER expression again so that, when the predicate is rdfs:label, skos:altLabel or schema:description, the object must have LANG "en".
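
Putting the pieces together, one possible combined FILTER is sketched here for WDQS (predeclared prefixes, wd:Q42 as placeholder). It accepts a triple if it is either a primary triple or an English fingerprint triple.

```sparql
SELECT ?p ?o WHERE {
  wd:Q42 ?p ?o .
  FILTER (
    (
      # primary triples with entity or plain-string objects
      STRSTARTS(STR(?p), STR(wdt:)) &&
      (
        (isIRI(?o) && STRSTARTS(STR(?o), STR(wd:))) ||
        (isLiteral(?o) && DATATYPE(?o) = xsd:string)
      )
    ) ||
    (
      # fingerprint triples, English only
      ?p IN (rdfs:label, skos:altLabel, schema:description) &&
      LANG(?o) = "en"
    )
  )
}
LIMIT 100
```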

Task: Go back to your local GraphDB, and run your Wikidata SELECT query inside a SERVICE statement like you did before. (You must declare all PREFIXes now.)
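
For the local version, the wrapper could look like the sketch below. The namespace IRIs are the standard ones, wd:Q42 is a placeholder, and the FILTER is just one possible formulation, so substitute your own. If you keep the label service inside the sub-query, you would also need PREFIX wikibase: <http://wikiba.se/ontology#> and PREFIX bd: <http://www.bigdata.com/rdf#>.

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX schema: <http://schema.org/>

SELECT ?p ?o WHERE {
  SERVICE <https://query.wikidata.org/bigdata/namespace/wdq/sparql> {
    SELECT ?p ?o WHERE {
      wd:Q42 ?p ?o .
      FILTER (
        ( STRSTARTS(STR(?p), STR(wdt:)) &&
          ( (isIRI(?o) && STRSTARTS(STR(?o), STR(wd:))) ||
            (isLiteral(?o) && DATATYPE(?o) = xsd:string) ) ) ||
        ( ?p IN (rdfs:label, skos:altLabel, schema:description) &&
          LANG(?o) = "en" )
      )
    }
    LIMIT 100
  }
}
```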

Task: Change the SELECT query to an INSERT query that adds the Wikidata triples to your local repository. Go to Explorer -> Graph view and enter the URI (Q-code) of your reference entity. You can click the cogwheel to extend the number of relations shown for the reference node.

If you have more time...

Task: An earlier task returned labels for all the resource objects, but predicates have labels too. Unfortunately, the wikibase:label SERVICE only provides labels for entities with a wd: prefix. You must therefore REPLACE all wdt: prefixes of properties with wd: prefixes and BIND the new URI AS a new variable, for example ?pw.

Tip: Use the STR(...) and IRI(...) functions carefully: URIs and prefixes are not strings, and strings are not URIs.

Afterwards, instead of SELECT ?p ?o ?oLabel, you can write SELECT ?p ?pwLabel ?o ?oLabel to see the labels for both predicates and resource objects.
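
The conversion could be sketched as follows for WDQS (predeclared prefixes, wd:Q42 as placeholder). STR turns the predicate IRI into a string, REPLACE swaps the namespace, and IRI turns the result back into a resource that the label service can look up.

```sparql
SELECT ?p ?pwLabel ?o ?oLabel WHERE {
  wd:Q42 ?p ?o .
  FILTER (
    STRSTARTS(STR(?p), STR(wdt:)) &&
    isIRI(?o) && STRSTARTS(STR(?o), STR(wd:))
  )
  # e.g. wdt:P31 becomes wd:P31
  BIND (IRI(REPLACE(STR(?p), STR(wdt:), STR(wd:))) AS ?pw)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 100
```

REPLACE treats its second argument as a regular expression. The dots in the namespace then match any character, which is harmless here because the literal namespace still matches.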

Task: Now you can go back to the SELECT statement that returned primary triples with only resource objects (not literal objects or fingerprints). Extend it so it also includes primary triples "one step out", i.e., triples where the subjects are objects of triples involving your reference entity.
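
One way to go "one step out" is to let the subject vary over the reference entity and its neighbours. The following is only a sketch for WDQS (predeclared prefixes, wd:Q42 as placeholder, ?p1 as a hypothetical helper variable), and a UNION of two complete patterns would be an equally valid design.

```sparql
SELECT DISTINCT ?s ?p ?o WHERE {
  # ?s is either the reference entity itself ...
  { BIND (wd:Q42 AS ?s) }
  UNION
  # ... or an entity it points to through a primary triple
  {
    wd:Q42 ?p1 ?s .
    FILTER (
      STRSTARTS(STR(?p1), STR(wdt:)) &&
      isIRI(?s) && STRSTARTS(STR(?s), STR(wd:))
    )
  }
  # primary triples with resource objects for every such ?s
  ?s ?p ?o .
  FILTER (
    STRSTARTS(STR(?p), STR(wdt:)) &&
    isIRI(?o) && STRSTARTS(STR(?o), STR(wd:))
  )
}
LIMIT 100
```

Because the result now has three columns, a matching INSERT template would be ?s ?p ?o rather than a pattern with a fixed subject.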

In your local GraphDB, use an INSERT statement to add the triples to your local repository. Use Explorer and Graph view to visualise the extended graph. You can click on nodes and Extend them to see their neighbouring nodes.