Lab: SHACL

Latest revision as of 13:13, 19 March 2024

Topics

  • Validating RDF graphs with SHACL
  • Running pySHACL

Useful materials

SHACL:

  • Chapter 5 SHACL (https://book.validatingrdf.com/bookHtml011.html) in Validating RDF (https://book.validatingrdf.com/index.html), available online
  • Interactive, online SHACL Playground (https://shacl.org/playground/)
  • Lab presentation containing a short overview of SHACL and pySHACL (https://docs.google.com/presentation/d/1weO9SzssxgYp3g_44X1LZsVtL0i6FurQ3KbIKZ8iriQ/)

pySHACL:

  • pySHACL at PyPi.org (https://pypi.org/project/pyshacl/); after installation, go straight to "Python Module Use".

Tasks

Task: Go to the interactive, online SHACL Playground. Copy and paste the Turtle triples below into the Data Graph text field, and click Update.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex: <http://example.org/> .

ex:Paul_Manafort 
    a ex:PersonUnderInvestigation ;
    foaf:name 
        "Paul Manafort"@en ;  
    ex:hasBusinessPartner ex:Rick_Gates .

ex:Rick_Gates 
    a ex:PersonUnderInvestigation ;
    foaf:name 
        "Rick Gates"@en ;  
    skos:altLabel 
        "Richard William Gates III"@en ;  
    ex:chargedWith 
        ex:ForeignLobbying ,  
        ex:MoneyLaundering ,
        ex:TaxEvasion ;
    ex:pleadedGuilty 
        ex:Conspiracy, [
                a ex:Lying ;
                ex:wasLyingTo ex:FBI 
            ] .

ex:ForeignLobbying a ex:Offense .  
ex:MoneyLaundering a ex:Offense .  
ex:TaxEvasion a ex:Offense .

The example is based on Exercises 1 and 2. Take some time to look at it in Turtle and also in JSON-LD, using the drop-down menu next to the Data Graph heading.

Task: Write shapes graphs in Turtle (recommended) or JSON-LD for each of the constraints below. Keep copies of your shapes graphs in a separate file; you will need them later. Each time you have entered a shapes graph into the Shapes Graph text field, click Update to validate the contents of the Data Graph.

You can use the following prefixes:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex: <http://example.org/> .

Constraints:

  • Every person under investigation has exactly one name.
  • The object of an ex:chargedWith property must be a URI.
  • The object of an ex:chargedWith property must be an offense (an instance of ex:Offense).
  • All person names must be language-tagged (hint: rdf:langString is a datatype!).

Fix the detected errors in the data graph as you go along; this makes the validation output easier to read.

Task: Write a Python program using rdflib and pySHACL, which:

  1. parses the Turtle example above into a data_graph (tip: you can either save it to file, or parse directly from a string using graph.parse(data=turtle_data, format='ttl')),
  2. parses the contents of a shape_graph you made in the previous task (for example checking that every person under investigation has exactly one name),
  3. uses pySHACL's validate method to apply the shape_graph constraints to the data_graph, and
  4. prints out the validation result (a boolean value, a results_graph, and a result_text).

If you have more time

Task: Add the Turtle triples below (from Exercises 3-5) to your data_graph.

ex:investigation_162 a ex:Indictment ;
    ex:american "unknown" ;
    ex:cp_date "2018-02-23"^^xsd:date ;
    ex:cp_days 282 ;
    ex:indictment_days 166 ;
    ex:investigation ex:russia ;
    ex:investigation_days 659 ;
    ex:investigation_end "unknown" ;
    ex:investigation_start "2017-05-17"^^xsd:date ;
    foaf:name "Rick Gates" ;
    ex:investigatedPerson ex:Rick_Gates ;
    ex:outcome ex:guilty_plea ;
    ex:overturned false ;
    ex:pardoned false ;
    ex:president ex:Donald_Trump .

Extend your shapes graph for each of these constraints:

  • The only allowed values for ex:american are true, false or unknown.
  • The value of a property that counts days must be an integer.
  • The value of a property that indicates a start date must be xsd:date.
  • The value of a property that indicates an end date must be xsd:date or unknown (tip: you can use sh:or (...) ).
  • Every indictment must have exactly one FOAF name for the investigated person.
  • Every indictment must have exactly one investigated person property, and that person must have the type ex:PersonUnderInvestigation.
  • No URIs may contain hyphens ('-').
  • Presidents must be identified with URIs.

Task: When you run SHACL on large data graphs, the results_graph and result_text will report the same error many times (but for different nodes). Write a SPARQL query to print out each distinct sh:resultMessage in the results_graph.

Task: Modify the above query so it prints out each sh:resultMessage in the results_graph once, along with the number of times that message has been repeated in the results.